Illumina IGN Webinar Series: Webinar Two - Expanding your Current WGS Knowledge
VMware releases a new project that allows Hadoop to be deployed in a virtualized environment.
http://www.datanami.com/datanami/2012-06-26/virtualizing_the_mighty_elephant.html
1.
Ann Hum Genet. 2012 Jun 25. doi: 10.1111/j.1469-1809.2012.00718.x. [Epub ahead of print]
Statistical Tests for Detecting Rare Variants Using Variance-Stabilising Transformations.
Wang K, Fingert JH.
Source
Department of Biostatistics, College of Public Health, The University of Iowa, Iowa City, IA, USA; Department of Ophthalmology and Visual Sciences, Carver College of Medicine, The University of Iowa, IA, USA.
Abstract
Next-generation sequencing holds great promise for detecting rare variants underlying complex human traits. Due to their extremely low allele frequencies, the normality approximation for a proportion no longer works well. Fisher's exact method appears suitable but is conservative. We investigate the utility of various variance-stabilising transformations in single-marker association analysis of rare variants. Unlike a proportion itself, the variance of the transformed proportions no longer depends on the proportion, making the application of such transformations to rare variant association analysis extremely appealing. Simulation studies demonstrate that tests based on such transformations are more powerful than Fisher's exact test while controlling the type I error rate. Based on theoretical considerations and results from simulation studies, we recommend the test based on the Anscombe transformation over tests with other transformations. © 2012 The Authors Annals of Human Genetics © 2012 Blackwell Publishing Ltd/University College London.
PMID: 22724536 [PubMed - as supplied by publisher]
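The key property, that the variance of the transformed proportion is approximately free of the proportion itself, is easy to demonstrate. Below is a minimal Python sketch of a single-marker test built on Anscombe's angular transformation; this is my own illustration of the idea, not the authors' implementation, and the example counts are made up.

```python
# A minimal sketch of a single-marker rare-variant test based on
# Anscombe's variance-stabilising transformation (not the paper's code).
import math
from scipy.stats import norm

def anscombe(x, n):
    """Anscombe's angular transform of a binomial count x out of n trials.
    Its variance is approximately 1/(4n + 2), independent of the proportion."""
    return math.asin(math.sqrt((x + 3.0 / 8.0) / (n + 3.0 / 4.0)))

def anscombe_test(x_case, n_case, x_ctrl, n_ctrl):
    """Two-sample z-test on transformed minor-allele counts.
    n_* are allele counts (2 x number of individuals for autosomal SNPs)."""
    z = (anscombe(x_case, n_case) - anscombe(x_ctrl, n_ctrl)) / math.sqrt(
        1.0 / (4.0 * n_case + 2.0) + 1.0 / (4.0 * n_ctrl + 2.0)
    )
    return z, 2.0 * norm.sf(abs(z))  # two-sided p-value

# Made-up example: 7 rare alleles among 1000 cases vs 1 among 1000 controls
print(anscombe_test(7, 2000, 1, 2000))
```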
1.
BMC Bioinformatics. 2012 Jun 22;13(1):145. [Epub ahead of print]
Error-correcting properties of the SOLiD Exact Call Chemistry.
Massingham T, Goldman N.
Abstract
BACKGROUND: The Exact Call Chemistry for the SOLiD Next-Generation Sequencing platform augments the two-base-encoding chemistry with an additional round of ligation, using an alternative set of probes, that allows some mistakes made when reading the first set of probes to be corrected. Additionally, the Exact Call Chemistry allows reads produced by the platform to be decoded directly into nucleotide sequence rather than its two-base 'color' encoding.
RESULTS: We apply the theory of linear codes to analyse the new chemistry, showing the types of sequencing mistakes it can correct and identifying those where the presence of an error can only be detected. For isolated mistakes that cannot be unambiguously corrected, we show that the type of substitution can be determined, and its location can be narrowed down to two or three positions, leading to a significant reduction in the number of plausible alternative reads.
CONCLUSIONS: The Exact Call Chemistry increases the accuracy of the SOLiD platform, enabling many potential miscalls to be prevented. However, single miscalls in the color sequence can produce complex but localised patterns of error in the decoded nucleotide sequence. Analysis of similar codes shows that some exist that, if implemented in alternative chemistries, should have superior performance.
PMID: 22726842 [PubMed - as supplied by publisher]
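For readers unfamiliar with two-base encoding, the toy sketch below (my own illustration, not code from the paper) shows the standard SOLiD color scheme, where each color is the XOR of the 2-bit codes of adjacent bases, and demonstrates how a single miscalled color corrupts every downstream base under naive decoding, which is exactly the failure mode the Exact Call Chemistry is designed to catch.

```python
# Toy illustration of SOLiD two-base 'color' encoding and error propagation.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Color i is the XOR of the 2-bit codes of bases i and i+1."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colors):
    """Decoding needs the known first base; each color then determines
    the next base from the previous one."""
    bases = [first_base]
    for c in colors:
        bases.append(BASE[CODE[bases[-1]] ^ c])
    return "".join(bases)

seq = "ACGTACGGT"
colors = encode(seq)
assert decode(seq[0], colors) == seq
corrupted = colors[:]
corrupted[3] ^= 1                  # a single miscalled color...
print(decode(seq[0], corrupted))   # ...changes every base from position 4 on
```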
EMC - Singapore
Responsibilities:
Greenplum is setting the pace in the Big Data Analytics space. We are growing rapidly and providing solutions to major companies in the industry.
EMC provides the technologies and tools that can help you release the power of your information. We can help you design, build, and manage flexible, scalable, and secure information infrastructures. And with these infrastructures, you'll be able to intelligently and efficiently store, protect, and manage your information so that it can be made accessible, searchable, shareable, and, ultimately, actionable. We believe that information is a business's most important asset. Ideas—and the people who come up with them—are the only real differentiator. Our promise is to help you take that differentiator as far as possible. We will deliver on this promise by helping organizations of all sizes manage more information more effectively than ever before. We will provide solutions that meet and exceed your most demanding business and IT challenges. We will bring your information to life.
An excellent blog post showing how segmental duplications can skew your CNV analysis and SNP calling. The latter was something I wasn't aware of.
Excerpted ...
Alert followers of this blog may recall a cautionary statement I made previously about working with Illumina CNV data: males and females sometimes have different baseline signal intensity levels (this was more of a GenomeStudio software issue than a hardware problem).
To find out if this issue affects the Omni2.5, I ran a simple t-test to compare the Log R Ratio (LRR) intensity values between males and females across the genome. The results are shown in the Manhattan plot below.
http://blog.goldenhelix.com/?p=1153&_cldee=ZXBobGt5QG51cy5lZHUuc2c%3d
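For anyone wanting to replicate this check on their own intensity data, here is a minimal sketch of the per-marker t-test; the data layout and simulated LRR matrices are stand-ins for real GenomeStudio exports, not the blog author's code.

```python
# Per-marker t-test of Log R Ratio values between males and females,
# with -log10(p) ready for a Manhattan plot. Data are simulated stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_markers, n_male, n_female = 1000, 50, 50
lrr_m = rng.normal(0.00, 0.15, size=(n_markers, n_male))    # rows = markers,
lrr_f = rng.normal(0.02, 0.15, size=(n_markers, n_female))  # cols = samples

t, p = stats.ttest_ind(lrr_m, lrr_f, axis=1)  # one test per marker
neg_log10_p = -np.log10(p)
print("markers with p < 1e-4:", int((p < 1e-4).sum()))
print("max -log10(p):", neg_log10_p.max())
```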
BMC Genomics 2012, 13:241 doi:10.1186/1471-2164-13-241
Published: 15 June 2012
Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets.
Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500 K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550 K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs.
Our new methodologies for CNV detection, together with the procedures for evaluating them, will help extract additional information from the large amounts of SNP-genotyping data available for various cohorts and use it to explore structural variants and their impact on complex traits.
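As a flavor of the first contribution, here is a minimal sketch (not the authors' algorithm) of calling copy-number states by fitting a Gaussian mixture to per-probe intensities; the component count, means, and state labels are my assumptions.

```python
# Fit a Gaussian mixture to Log R Ratio-style intensities and read
# copy-number states off the highest-posterior component.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated intensities: mostly 2 copies, plus a deletion and a duplication
lrr = np.concatenate([rng.normal(0.0, 0.15, 800),    # 2 copies
                      rng.normal(-0.6, 0.15, 100),   # 1 copy (deletion)
                      rng.normal(0.4, 0.15, 100)])   # 3 copies (duplication)

gmm = GaussianMixture(n_components=3, random_state=0).fit(lrr.reshape(-1, 1))
order = np.argsort(gmm.means_.ravel())           # sort components by mean
state = {order[0]: 1, order[1]: 2, order[2]: 3}  # assumed copy-number labels
calls = np.array([state[k] for k in gmm.predict(lrr.reshape(-1, 1))])
print("called copy numbers (1, 2, 3):", np.bincount(calls)[1:])
```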
1.
Hum Hered. 2012 Jun 7;73(3):139-147. [Epub ahead of print]
Two-Stage Extreme Phenotype Sequencing Design for Discovering and Testing Common and Rare Genetic Variants: Efficiency and Power.
Kang G, Lin D, Hakonarson H, Chen J.
Source
Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pa., USA.
Abstract
Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated to be a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants of a large number of individuals are genotyped to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score test for gene-based association analyses evaluating the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7th-99th quantile were included in stage I, the two-stage design could achieve the same power as or even higher than the one-stage design if the rare causal variants had large effect sizes. In such a design, fewer than half of the total SNPs, including more than half of the causal SNPs, were discovered; these included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable to identify multiple rare variants having small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies. Copyright © 2012 S. Karger AG, Basel.
PMID: 22678112 [PubMed - as supplied by publisher]
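One stage I quantity worth internalizing is how quickly discovery probability decays with MAF. A back-of-the-envelope sketch (my own, using the plain binomial-sampling approximation for randomly chosen individuals; the paper's extreme-phenotype sampling enriches causal variants beyond these figures):

```python
# Probability that a variant with minor allele frequency p is seen at least
# once when n diploid individuals are sequenced: 1 - (1 - p)^(2n).
# Sample sizes and MAFs below are illustrative.
for n in (50, 100, 500):
    for maf in (0.05, 0.01, 0.001):
        p_discover = 1.0 - (1.0 - maf) ** (2 * n)
        print(f"n={n:4d}  MAF={maf:.3f}  P(discovered)={p_discover:.3f}")
```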
1.
PLoS One. 2012;7(6):e38538. Epub 2012 Jun 4.
Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis.
Germain M, Saut N, Oudot-Mellakh T, Letenneur L, Dupuy AM, Bertrand M, Alessi MC, Lambert JC, Zelenika D, Emmerich J, Tiret L, Cambien F, Lathrop M, Amouyel P, Morange PE, Trégouët DA.
Source
INSERM UMR_S 937, ICAN Institute, Université Pierre et Marie Curie, Paris, France.
Abstract
By applying an imputation strategy based on the 1000 Genomes Project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus, where twelve SNPs exhibited association p-values lower than 2.23 × 10⁻¹¹, and the use of independent case-control samples demonstrated that the culprit variant was a rare variant located ∼1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and not even well imputed in the original GWAS samples. This variant was in fact rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings.
PMID: 22675575 [PubMed - in process]
Although next-generation DNA sequencing technologies have made rare variant association studies feasible and affordable, the development of powerful statistical methods for rare variant association studies is still under way. Most of the existing methods for rare variant association studies compare the number of rare mutations in a group of rare variants (in a gene or a pathway) between cases and controls. However, these methods assume that all causal variants increase disease risk. Recently, several methods that are robust to the direction and magnitude of the effects of causal variants have been proposed. However, they are applicable to unrelated individuals only, whereas family data have been shown to improve power to detect rare variants. In this article, we propose two adaptive weighting methods for rare variant association studies based on family data for quantitative traits. Using extensive simulation studies, we evaluate and compare our proposed methods with two methods based on the weights proposed by Madsen and Browning. Our results show that both proposed methods are robust to population stratification, robust to the direction and magnitude of the effects of causal variants, and more powerful than the methods using weights suggested by Madsen and Browning, especially when both risk and protective variants are present. Genet. Epidemiol. 36:499-507, 2012. © 2012 Wiley Periodicals, Inc.
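For context, the Madsen-Browning weighting that the proposed methods are compared against is simple to state: each variant is down-weighted by the estimated standard deviation of its allele count under the control-sample MAF, so rarer variants contribute more to the burden score. A minimal sketch follows; the genotype matrix layout is my own assumption.

```python
# Madsen-Browning-style weights and a weighted burden score.
import numpy as np

def madsen_browning_weights(geno_ctrl):
    """geno_ctrl: (n_controls x n_variants) matrix of 0/1/2 minor-allele counts."""
    n = geno_ctrl.shape[0]
    q = (geno_ctrl.sum(axis=0) + 1.0) / (2.0 * n + 2.0)  # smoothed control MAF
    return 1.0 / np.sqrt(n * q * (1.0 - q))              # rarer => larger weight

def burden_scores(geno, weights):
    """Weighted sum of minor alleles per individual (the 'genetic burden')."""
    return geno @ weights

ctrl = np.random.default_rng(2).integers(0, 3, size=(200, 10))  # toy genotypes
w = madsen_browning_weights(ctrl)
print(burden_scores(ctrl[:5], w))
```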
Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, including the identification of rare variants. These sequencing data may provide an opportunity to detect IBD at higher resolution than previously possible, potentially enabling the detection of disease-causing loci that were previously undetectable with sparser genetic data.
Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influence the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or larger with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.
We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next-generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.
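The core trick shared by GERMLINE-style detectors is seed-and-extend matching over haplotypes; denser sequencing data supplies more markers per centimorgan, which is what pushes the detectable segment length down. A toy sketch of the idea (not the actual GERMLINE or fastIBD code; parameters are illustrative):

```python
# Seed-and-extend search for long identical stretches between two haplotypes.
def shared_segments(h1, h2, seed=10, min_len=50):
    """h1, h2: strings/lists of alleles at the same ordered marker positions.
    Returns (start, end) marker-index intervals of long exact matches."""
    segments, i = [], 0
    while i + seed <= len(h1):
        if h1[i:i + seed] == h2[i:i + seed]:       # seed match
            lo, hi = i, i + seed
            while lo > 0 and h1[lo - 1] == h2[lo - 1]:
                lo -= 1                            # extend left
            while hi < len(h1) and h1[hi] == h2[hi]:
                hi += 1                            # extend right
            if hi - lo >= min_len:
                segments.append((lo, hi))
            i = hi
        else:
            i += 1
    return segments

# Two toy haplotypes sharing two long stretches separated by mismatches
print(shared_segments("A" * 30 + "CGTT" + "A" * 40, "A" * 74, min_len=20))
```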
1.
Am J Hum Genet. 2012 Jun 8;90(6):1028-45.
Family-based association studies for next-generation sequencing.
Zhu Y, Xiong M.
Source
Human Genetics Center and Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA.
Abstract
An individual's disease risk is determined by the compounded action of both common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also suffer from remarkable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure and a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based functional principal-component analysis (FPCA) with or without smoothing, a generalized T(2), combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through intensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has the correct type I error rates and much more power to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect from other population-based or family-based association analysis methods. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA has a much smaller p value than other statistics. Copyright © 2012 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
PMID: 22682329 [PubMed - in process]
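Of the methods named in the abstract, CMC is the easiest to sketch: rare variants within a gene are collapsed to a single presence/absence indicator before multivariate testing. A minimal illustration of the collapsing step, with the 1% rare-variant cutoff being a common but assumed choice:

```python
# Collapsing step of the CMC (combined multivariate and collapsing) method.
import numpy as np

def cmc_collapse(geno, maf, rare_cutoff=0.01):
    """geno: (n x m) minor-allele counts; maf: per-variant frequencies.
    Returns common-variant columns plus one collapsed rare-variant column."""
    rare = maf < rare_cutoff
    collapsed = (geno[:, rare].sum(axis=1) > 0).astype(float)  # any rare allele?
    return np.column_stack([geno[:, ~rare], collapsed])

geno = np.array([[0, 1, 0, 2], [1, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
maf = np.array([0.30, 0.004, 0.25, 0.008])
print(cmc_collapse(geno, maf))   # two common columns + one collapsed column
```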
1.
Bioinformatics. 2012 Jun 15;28(12):i188-i196.
SEQuel: improving the accuracy of genome assemblies.
Ronen R, Boucher C, Chitsaz H, Pevzner P.
Source
Bioinformatics Graduate Program, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093 and Department of Computer Science, Wayne State University, Detroit, MI 48202, USA.
Abstract
MOTIVATION:
Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.
RESULTS:
SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.
AVAILABILITY:
SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.
CONTACT:
PMID: 22689760 [PubMed - in process]
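The positional de Bruijn graph is the interesting data structure here: unlike a plain de Bruijn graph, two reads only reinforce an edge when they agree on both k-mer content and approximate placement within the contig. A toy sketch of that idea follows; the data layout and position binning are my assumptions, not SEQuel's code.

```python
# Toy positional de Bruijn graph: nodes are (k-mer, binned position) pairs.
from collections import defaultdict

def positional_dbg(reads_with_offsets, k=4, bin_size=2):
    """reads_with_offsets: iterable of (sequence, offset-in-contig) pairs.
    Positions are binned so 'approximately equal' placements collapse."""
    edges = defaultdict(int)
    for seq, off in reads_with_offsets:
        for i in range(len(seq) - k):
            u = (seq[i:i + k], (off + i) // bin_size)
            v = (seq[i + 1:i + 1 + k], (off + i + 1) // bin_size)
            edges[(u, v)] += 1        # edge multiplicity = read support
    return edges

reads = [("ACGTACGT", 0), ("CGTACGTA", 1), ("GTACGTAC", 2)]
for (u, v), n in sorted(positional_dbg(reads).items(), key=lambda x: -x[1])[:3]:
    print(u, "->", v, "support", n)
```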
---------- Forwarded message ----------
From: "Peter Cock"
Date: Jun 10, 2012 6:25 PM
Subject: [Biopython] EU-codefest
Dear Biopythoneers,
Some of you might like to attend an Open-Bio Hackathon in Italy this
summer - 19 and 20 July 2012, in Lodi.
This is about a week after BOSC and the pre-BOSC CodeFest in California
http://www.open-bio.org/wiki/BOSC_2012
Peter
---------- Forwarded message ----------
From: *Pjotr Prins*
Date: Saturday, June 9, 2012
Subject: EU-codefest
Hi Chris and Peter,
Would you mind sending a reminder of the EU-codefest to your lists?
Registration form is up:
http://www.open-bio.org/wiki/EU_Codefest_2012
Three main topics will be worked on during the CodeFest:
NGS and high performance parsers for OpenBio projects.
RDF and semantic web for bioinformatics.
Bioinformatics pipelines definition, execution and distribution.
Other tracks are welcome!
Pj.
_______________________________________________
Biopython mailing list -
http://lists.open-bio.org/mailman/listinfo/biopython
Appropriate choice of the 'exp_cov' (expected coverage) parameter in Velvet is very important for getting an assembly right. In the following figure, we show data from an experiment in which a set of reads taken from a 3 Kb region of a genome was reassembled with varying exp_cov values. The x-axis of the chart shows exp_cov and the y-axis shows the size of the largest scaffold assembled by Velvet.
http://www.homolog.us/blogs/2012/06/08/an-explanation-of-velvet-parameter-exp_cov/
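Reproducing that kind of sweep is straightforward. Below is a minimal sketch, assuming velveth/velvetg are on the PATH and a reads.fq file exists; the k-mer size and exp_cov grid are illustrative, not the blog's actual settings.

```python
# Sweep Velvet's exp_cov and record the largest contig produced each time.
import subprocess

def max_contig_len(fasta):
    """Return the length of the longest sequence in a FASTA file."""
    lengths, cur = [], 0
    with open(fasta) as fh:
        for line in fh:
            if line.startswith(">"):
                lengths.append(cur); cur = 0
            else:
                cur += len(line.strip())
    lengths.append(cur)
    return max(lengths)

# Build the hash table once, then rerun the graph stage per exp_cov value
subprocess.run(["velveth", "asm", "31", "-fastq", "-short", "reads.fq"], check=True)
for exp_cov in (5, 10, 20, 30, 40):
    subprocess.run(["velvetg", "asm", "-exp_cov", str(exp_cov)], check=True)
    print(exp_cov, max_contig_len("asm/contigs.fa"))
```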
There is a new GUI for Velvet called
VAGUE. It is written in JRuby but compiled to Java bytecode and will
run on Mac and Linux. You need to have the latest Velvet binaries (>=
1.2.06) as David has made improvements to Velvet to make VAGUE simpler
to use. You can optionally install velvetk.pl which I announced
recently on this list.
You can look at screenshots and download it from here:
http://bioinformatics.net.au/software.vague.shtml
Enjoy!
--
--Dr Torsten Seemann
--Scientific Director : Victorian Bioinformatics Consortium, Monash
University, AUSTRALIA
--Senior Researcher : VLSCI Life Sciences Computation Centre,
Parkville, AUSTRALIA
--http://www.bioinformatics.net.au/
_______________________________________________
Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users
Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state-of-the-art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position, and thus measures the overall composition of the genome itself.
Results: We have developed the Genome Mappability Analyzer (GMA) to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly, and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found that discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the dark matter of the genome, including known clinically relevant variants that fall in these regions.
Availability: The source code and profiles of several model organisms are available at http://gma-bio.sourceforge.net
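The intuition is easy to prototype: a position's mappability is driven by how uniquely its k-mer occurs in the genome. The toy sketch below is my own simplification; the real GMS also weights alternative mapping locations by base qualities and error probabilities.

```python
# Toy k-mer mappability: score = 1 / (genome-wide count of the k-mer
# starting at that position), so unique k-mers score 1.0 and repeats less.
from collections import Counter

def kmer_mappability(genome, k=20):
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [1.0 / counts[genome[i:i + k]]
            for i in range(len(genome) - k + 1)]

genome = "ACGTACGTTTGCA" * 3      # a small repetitive toy 'genome'
scores = kmer_mappability(genome, k=5)
print(min(scores), max(scores))   # repeats drive scores below 1.0
```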
YT plugging the recent ACRG collaboration with BGI. 14 TB of liver cancer genome data from this collaboration is available in GigaDB:
http://gigadb.org/hepatocellular-carcinoma/
Genomic DNA was purified for at least 30-fold coverage paired-end (PE) sequencing, and PE reads were mapped to the human reference genome (UCSC build hg19) and HBV (NC_003977). Two sequencing libraries with different insert sizes (200 bp and 800 bp) were constructed for each genomic DNA sample. Paired-end, 90 bp read-length sequencing was performed on the HiSeq 2000 sequencer according to the manufacturer's instructions. Raw gene expression profiling data from these human HCC samples have been deposited in GEO under accession number GSE25097.
Raw data
History
May 31, 2012: Data released.
In accordance with our terms of use, please cite this dataset as:
Kan, Z; Zheng, H; Liu, X; Li, S; Barber, TD; Gong, Z; Gao, H; Hao, K; Willard, MD; Xu, J; Hauptschein, R; Rejto, PA; Fernandez, J; Wang, G; Zhang, Q; Wang, B; Chen, R; Wang, J; Lee, NP; Lee, WH; Ariyaratne, PN; Tennakoon, C; Mulawadi, FH; Wong, KF; Liu, AM; Chan, KL; Hu, Y; Chou, WC; Buser, C; Zhou, W; Lin, Z; Peng, Z; Yi, K; Chen, S; Li, L; Fan, X; Yang, J; Ye, R; Ju, J; Wang, K; Estrella, H; Deng, S; Wulur, IH; Liu, J; Ehsani, ME; Zhang, C; Loboda, A; Sung, WK; Aggarwal, A; Poon, RT; Fan, ST; Wang, J; Hardwick, J; Reinhard, C; Dai, H; Li, Y; Luk, JM; Mao, M; the Asian Cancer Research Group (2012): Hepatocellular carcinoma genomic data from the Asia Cancer Research Group. GigaScience. http://dx.doi.org/10.5524/100034
Related manuscript available at:
Accession codes associated with this data:
EMBL-EBI ENA ERP001196