De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0046937
Bioinformatics. 2012 Oct 27. [Epub ahead of print]
Rare variant discovery and calling by sequencing pooled samples with overlaps.
Wang W, Yin X, Pyon YS, Hayes M, Li J.
Source: Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.
Abstract
MOTIVATION: For many complex traits/diseases, it is believed that rare variants account for some of the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers, which is important for downstream applications of association analysis of rare variants. However, existing algorithms for analyzing sequence data from overlapping pools are limited.
RESULTS: We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation and variant sample decoding. The framework can be utilized in combination with any design matrix. We have investigated its performance based on two different overlapping designs, and have compared it with three state-of-the-art methods, by simulating targeted sequencing and by pooling real sequence data. Results on both datasets show that our algorithm has made significant improvements over existing ones. In conclusion, successful discovery of rare variants and identification of variant carriers using overlapping pool strategies critically depends on many steps, from generation of design matrixes to decoding algorithms. The proposed framework in combination with the design matrixes generated based on the Chinese remainder theorem achieves best overall results.
AVAILABILITY: Source code of the program, termed VIP for Variant Identification by Pooling, is available at http://cbc.case.edu/VIP.
CONTACT:
PMID: 23104896 [PubMed - as supplied by publisher]
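The carrier-decoding idea behind overlapping pools can be sketched in a few lines. This is not the VIP algorithm. It is a minimal illustration that assumes an error-free pool readout and a design built from pairwise coprime moduli in the spirit of the Chinese remainder theorem (the moduli 3, 4, 5 and all function names are hypothetical). It uses the simplest possible decoding rule: a sample stays a candidate carrier only if every pool containing it is variant-positive.

```python
from math import prod

# Hypothetical design: pairwise coprime moduli give 3 + 4 + 5 = 12 pools.
MODULI = (3, 4, 5)
# By the Chinese remainder theorem, up to 60 samples get distinct pool patterns.
N_SAMPLES = prod(MODULI)


def pools_for(sample):
    """Return the set of pools, labelled (modulus, residue), holding this sample."""
    return {(m, sample % m) for m in MODULI}


def positive_pools(carriers):
    """Simulate an error-free readout: a pool is positive if it holds any carrier."""
    return set().union(*(pools_for(c) for c in carriers))


def decode(positives, n_samples=N_SAMPLES):
    """Return samples whose pools are all positive (the candidate carriers)."""
    return [s for s in range(n_samples) if pools_for(s) <= positives]


if __name__ == "__main__":
    print(decode(positive_pools([17])))
    print(decode(positive_pools([17, 40])))
```

With a single carrier (sample 17) the decoder returns exactly that sample. With two carriers (17 and 40) eight candidates remain, which suggests why frequency estimation and a more careful decoding step matter once several carriers or sequencing errors are present.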
Dell has sold 1M webscale servers in five years
http://gigaom.com/cloud/dell-has-sold-1m-webscale-servers-in-five-years/
As promised, the source distribution for R 2.15.2 is now available for download from the master CRAN repository. (Binary distributions for Windows, MacOS and Linux will be available from the CRAN mirror network in the coming days.) This latest point-update — codenamed "Trick or Treat" — improves the performance of the R engine and adds a few minor but useful features. Detailed changes can be found in the NEWS file, but highlights of the improvements include:
There is likely to be at least one further update to the 2.15.x series: a round-up of any further changes will probably be released as 2.15.3 shortly before R 2.16.0 is released, most likely around March 2013.
r-announce mailing list: R 2.15.2 is released
Subject: [ensembl-announce] Ensembl release 69 is out!
The latest Ensembl update (e!69) has been released: http://www.ensembl.org/
Here are some highlights:
Human: Human dbSNP has been updated to release 137 and Human somatic variants from COSMIC have been updated to release 60. We now provide the 1000 genomes phase 1 data for the structural variation set, as well as DNA methylation data and ENCODE data.
New species: Ferret (Mustela putorius furo) and Platyfish (Xiphophorus maculatus).
Other news: We are happy to introduce the first inclusion of HAVANA manual curation for Ensembl genes in Pig, the release of the new scrollable region views and a new species home page redesign.
A complete list of the changes in release 69 can be found at http://www.ensembl.org/info/website/news.html
For the latest news on the ensembl project visit our blog at http://www.ensembl.info/
The release blog post can be viewed here: http://www.ensembl.info/blog/2012/10/19/ensembl-69-has-been-released/
Best Regards,
Thomas Maurel
--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK
We loaded 100 GB of data into HDFS in about 1 hour, with the configuration below.
Each node (1 master machine, 2 slave machines):
1. 500 GB hard disk.
2. 4 GB RAM
3. 3 quad-core CPUs.
4. Speed 1333 MHz
Now we are planning to load 1 petabyte of data (a single file) into Hadoop HDFS and a Hive table within 10-20 minutes. For this we need clarification on the following:
1. What system configuration is required for all 3 machines?
2. Hard disk size.
3. RAM size.
4. Mother board
5. Network cable
6. How many Gbps of InfiniBand bandwidth are required?
Do we also need a cloud computing environment for the same setup?
Please advise and help me with this.
Thanks,
P.
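A rough back-of-the-envelope check helps frame the question above. The sketch below uses only the figures stated in the message (100 GB in 1 hour, 3 nodes with 500 GB disks, a 1 PB target in 10-20 minutes). It assumes decimal units, and it assumes HDFS keeps 3 replicas of each block. The function names are illustrative only.

```python
GB = 10**9   # decimal units assumed
PB = 10**15

# Observed on the current cluster: 100 GB loaded in 1 hour.
OBSERVED_RATE = 100 * GB / 3600          # bytes per second
CLUSTER_DISK = 3 * 500 * GB              # 3 nodes x 500 GB each


def required_rate(size_bytes, minutes):
    """Sustained ingest rate (bytes/s) needed to load size_bytes in the given time."""
    return size_bytes / (minutes * 60)


def speedup_needed(size_bytes, minutes):
    """How many times faster than the observed load the target would have to be."""
    return required_rate(size_bytes, minutes) / OBSERVED_RATE


def raw_capacity(size_bytes, replication=3):
    """Raw disk needed if every block is stored `replication` times (assumed 3)."""
    return size_bytes * replication


if __name__ == "__main__":
    for minutes in (10, 20):
        rate = required_rate(PB, minutes)
        print(f"{minutes} min: {rate / 1e12:.2f} TB/s, "
              f"{rate * 8 / 1e9:.0f} Gbit/s aggregate, "
              f"{speedup_needed(PB, minutes):.0f}x the observed rate")
    print(f"raw disk needed: {raw_capacity(PB) / PB:.0f} PB, "
          f"available: {CLUSTER_DISK / 1e12:.1f} TB")
```

Under these assumptions the target is 30,000 to 60,000 times the observed rate, roughly 0.8 to 1.7 TB/s sustained. At 3 replicas, 1 PB needs about 3 PB of raw disk, against the 1.5 TB currently available, so node count is the first thing to resolve, before per-node hardware.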
Emerging sequencing technologies allow common and rare variants to be systematically assayed across the human genome in many individuals. In order to improve variant detection and genotype calling, raw sequence data are typically examined across many individuals. Here, we describe a method for genotype calling in settings where sequence data are available for unrelated individuals and parent-offspring trios and show that modeling trio information can greatly increase the accuracy of inferred genotypes and haplotypes, especially on low to modest depth sequencing data. Our method considers both linkage disequilibrium (LD) patterns and the constraints imposed by family structure when assigning individual genotypes and haplotypes. Using simulations, we show trios provide higher genotype calling accuracy across the frequency spectrum, both overall and at hard-to-call heterozygous sites. In addition, trios provide greatly improved phasing accuracy - improving the accuracy of downstream analyses (such as genotype imputation) that rely on phased haplotypes. To further evaluate our approach, we analyzed data on the first 508 individuals sequenced by the SardiNIA sequencing project. Our results show that our method reduces the genotyping error rate by 50% compared to analysis using existing methods that ignore family structure. We anticipate our method will facilitate genotype calling and haplotype inference for many ongoing sequencing projects.
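The "constraints imposed by family structure" mentioned above can be illustrated with a simple Mendelian consistency check for one biallelic site. This is only a sketch of the hard constraint, not the authors' method, which according to the abstract also models linkage disequilibrium. Genotypes are coded as alternate-allele counts (0, 1, 2), de novo mutations are ignored, and the function names are hypothetical.

```python
from itertools import product

# Alleles (0 = reference, 1 = alternate) a parent with a given genotype can transmit.
GAMETES = {0: {0}, 1: {0, 1}, 2: {1}}


def possible_child_genotypes(father, mother):
    """Return the child genotypes compatible with the two parental genotypes."""
    return {a + b for a, b in product(GAMETES[father], GAMETES[mother])}


def is_mendelian_consistent(father, mother, child):
    """True if the trio's genotypes obey Mendelian inheritance at this site."""
    return child in possible_child_genotypes(father, mother)


if __name__ == "__main__":
    print(possible_child_genotypes(1, 1))    # two heterozygous parents
    print(is_mendelian_consistent(0, 0, 1))  # het child of two hom-ref parents
```

A genotype caller that knows the trio structure can use such constraints to down-weight or exclude inconsistent genotype combinations, which is one way low-depth calls at heterozygous sites can be rescued.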
Bioinformatics. 2012 Oct 11. [Epub ahead of print]
CD-HIT: accelerated for clustering the next generation sequencing data.
Fu L, Niu B, Zhu Z, Wu S, Li W.
Source: Center for Research in Biological Systems, University of California San Diego, La Jolla, California, USA.
Abstract
SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
AVAILABILITY: http://cd-hit.org
CONTACT: liwz@sdsc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PMID: 23060610 [PubMed - as supplied by publisher]
Bioinformatics. 2012 Oct 11. [Epub ahead of print]
CLEVER: Clique-Enumerating Variant Finder.
Marschall T, Costa I, Canzar S, Bauer M, Klau GW, Schliep A, Schönhuth A.
Source: Centrum Wiskunde & Informatica, Amsterdam, the Netherlands, Federal University of Pernambuco, Recife, Brazil, Illumina, Cambridge, UK, Rutgers, The State University of New Jersey, Piscataway, NJ, USA.
Abstract
MOTIVATION: Next-generation sequencing techniques have facilitated large scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals.
RESULTS: Here we present a novel internal segment size based approach, which organizes all, including concordant, reads into a read alignment graph where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance in particular for indels of length 20-100nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular for insert size based approaches. In this size range we outperform even split read aligners. We achieve competitive results also on biological data where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners.
AVAILABILITY: CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com.
CONTACT: tobias.marschall@tu-dortmund.de.
PMID: 23060616 [PubMed - as supplied by publisher]
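To make the "max-cliques" notion above concrete, the sketch below enumerates the maximal cliques of a small graph with the textbook Bron-Kerbosch recursion. This is a swapped-in, generic algorithm and not CLEVER's own enumeration procedure, which the abstract describes as novel. The toy graph and its node names are hypothetical, with an edge meaning that two read alignments do not contradict each other.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph {node: set(neighbours)}."""
    def expand(clique, candidates, excluded):
        # No candidates and no excluded nodes left: the clique cannot grow.
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for v in list(candidates):
            yield from expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            # v has been fully explored; move it from candidates to excluded.
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adj), set())


if __name__ == "__main__":
    # Hypothetical compatibility graph over five read alignments.
    graph = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c"},
        "e": set(),
    }
    for clique in maximal_cliques(graph):
        print(sorted(clique))
```

In this toy graph the maximal cliques are {a, b, c}, {c, d} and {e}. In the setting described by the abstract, each such group would then be tested statistically for evidence of an insertion or deletion.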
Following the release of Bioconductor 2.11, CummeRbund 2.0.0 is now available and is a recommended update for all CummeRbund users.
Version 2.0.0 adds a host of new options/features/bugfixes and is the first stable release version to fully support Cuffdiff2.
=================================================================================
PLoS One. 2012;7(9):e46211. Epub 2012 Sep 27.
Paired-End Sequencing of Long-Range DNA Fragments for De Novo Assembly of Large, Complex Mammalian Genomes by Direct Intra-Molecule Ligation.
Asan, Geng C, Chen Y, Wu K, Cai Q, Wang Y, Lang Y, Cao H, Yang H, Wang J, Zhang X.
Source: BGI-Shenzhen, Shenzhen, Guangdong, China.
Abstract
BACKGROUND: The relatively short read lengths from next generation sequencing (NGS) technologies still pose a challenge for de novo assembly of complex mammal genomes. One important solution is to use paired-end (PE) sequence information experimentally obtained from long-range DNA fragments (>1 kb). Here, we characterize and extend a long-range PE library construction method based on direct intra-molecule ligation (or molecular linker-free circularization) for NGS.
RESULTS: We found that the method performs stably for PE sequencing of 2- to 5- kb DNA fragments, and can be extended to 10-20 kb (and even in extremes, up to ∼35 kb). We also characterized the impact of low quality input DNA on the method, and develop a whole-genome amplification (WGA) based protocol using limited input DNA (<1 µg). Using this PE dataset, we accurately assembled the YanHuang (YH) genome, the first sequenced Asian genome, into a scaffold N50 size of >2 Mb, which is over 100-times greater than the initial size produced with only small insert PE reads (17 kb). In addition, we mapped two 7- to 8- kb insertions in the YH genome using the larger insert sizes of the long-range PE data.
CONCLUSIONS: In conclusion, we demonstrate here the effectiveness of this long-range PE sequencing method and its use for the de novo assembly of a large, complex genome using NGS short reads.
PMID: 23029438 [PubMed - as supplied by publisher]
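The scaffold N50 quoted above is a standard assembly statistic: the length L such that scaffolds of length L or longer together cover at least half of the total assembled bases. A minimal sketch of that definition follows. The function name and the example lengths are illustrative and not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Hypothetical scaffold lengths in kb.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

For the example list the total is 32 and the two longest scaffolds (8 + 8) already cover half of it, so the N50 is 8. Long-range paired ends raise N50 by letting the assembler join contigs across repeats into fewer, longer scaffolds.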