Friday, 28 October 2011

Mental note to self...

I should print the version of each program to the log file every time I run an analysis. Right now I am writing handover documentation and trying to remember which version of BWA I used for xx analysis! Argh!
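A minimal sketch of what I have in mind, in Python. The function names (`tool_version`, `log_versions`) and the log file name are my own placeholders, not part of any pipeline. It simply runs a command, merges stdout and stderr (some tools print their banner to stderr), and keeps the first line that mentions a version. As far as I recall, running `bwa` with no arguments prints a usage message that includes a "Version:" line, but treat that as an assumption and check it for your own install.

```python
import subprocess
import sys
from datetime import datetime


def tool_version(command):
    """Run `command` and return the first output line that mentions a version.

    stdout and stderr are merged because some tools print their banner to
    stderr. Falls back to the first non-empty line, or "unknown" if the
    program printed nothing or could not be started.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,  # usage banners often come with a non-zero exit code
        )
    except OSError:
        return "unknown"
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    for line in lines:
        if "version" in line.lower():
            return line
    return lines[0] if lines else "unknown"


def log_versions(tools, log_path):
    """Append one timestamped, tab-separated line per tool to the log file."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        for name, command in tools.items():
            log.write(f"{stamp}\t{name}\t{tool_version(command)}\n")


if __name__ == "__main__":
    # Hypothetical tool list; the "bwa" entry assumes bwa is on the PATH.
    log_versions(
        {"python": [sys.executable, "--version"], "bwa": ["bwa"]},
        "analysis.log",
    )
```

Calling `log_versions` at the top of every analysis script would leave a dated record next to the results, which is exactly what I am missing now.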

Novel miRNAs discovered using SOLiD, AND Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing.



1. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing.
Angiuoli SV, White JR, Matalka M, White O, Fricke WF.
PLoS One. 2011;6(10):e26624. Epub 2011 Oct 19.
PMID: 22028928 [PubMed - in process]

Abstract

BACKGROUND:

The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.

RESULTS:

We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

CONCLUSIONS:

Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.


2. First Survey of the Wheat Chromosome 5A Composition through a Next Generation Sequencing Approach.
Vitulo N, Albiero A, Forcato C, Campagna D, Dal Pero F, Bagnaresi P, Colaiacovo M, Faccioli P, Lamontanara A, Simková H, Kubaláková M, Perrotta G, Facella P, Lopez L, Pietrella M, Gianese G, Doležel J, Giuliano G, Cattivelli L, Valle G, Stanca AM.
PLoS One. 2011;6(10):e26421. Epub 2011 Oct 18.
PMID: 22028874 [PubMed - in process]
3. The distal hereditary motor neuropathies.
Rossor AM, Kalmar B, Greensmith L, Reilly MM.
J Neurol Neurosurg Psychiatry. 2011 Oct 25. [Epub ahead of print]
PMID: 22028385 [PubMed - as supplied by publisher]
4. Next-generation sequencing identifies novel microRNAs in peripheral blood of lung cancer patients.
Keller A, Backes C, Leidinger P, Kefer N, Boisguerin V, Barbacioru C, Vogel B, Matzas M, Huwer H, Katus HA, Stähler C, Meder B, Meese E.
Mol Biosyst. 2011 Oct 25. [Epub ahead of print]
PMID: 22027949 [PubMed - as supplied by publisher]

Abstract

MicroRNAs (miRNAs) are increasingly envisaged as biomarkers for various tumor and non-tumor diseases. MiRNA biomarker identification is, as of now, mostly performed in a candidate approach, limiting discovery to annotated miRNAs and ignoring unknown ones with potential diagnostic value. Here, we applied high-throughput SOLiD transcriptome sequencing of miRNAs expressed in human peripheral blood of patients with lung cancer. We developed a bioinformatics pipeline to generate profiles of miRNA markers and to detect novel miRNAs with diagnostic information. Applying our approach, we detected 76 previously unknown miRNAs and 41 novel mature forms of known precursors. In addition, we identified 32 annotated and seven unknown miRNAs that were significantly altered in cancer patients. These results demonstrate that deep sequencing of small RNAs bears high potential to quantify miRNAs in peripheral blood and to identify previously unknown miRNAs serving as biomarker for lung cancer.


5. Next generation sequencing in epigenetics: Insights and challenges.
Meaburn E, Schulz R.
Semin Cell Dev Biol. 2011 Oct 19. [Epub ahead of print]
PMID: 22027613 [PubMed - as supplied by publisher]

Wednesday, 26 October 2011

'Junk DNA' defines differences between humans and chimps

http://www.sciencedaily.com/releases/2011/10/111025122615.htm

Researchers at the Georgia Institute of Technology have now determined that the insertion and deletion of large pieces of DNA near genes are highly variable between humans and chimpanzees and may account for major differences between the two species. The research team led by Georgia Tech Professor of Biology John McDonald has verified that while the DNA sequence of genes between humans and chimpanzees is nearly identical, there are large genomic "gaps" in areas adjacent to genes that can affect the extent to which genes are "turned on" and "turned off." The research shows that these genomic "gaps" between the two species are predominantly due to the insertion or deletion (INDEL) of viral-like sequences called retrotransposons that are known to comprise about half of the genomes of both species. The findings are reported in the most recent issue of the online, open-access journal Mobile DNA.

Tuesday, 25 October 2011

20-30x genome coverage is optimal for whole genome assembly of pyrosequencing data


1. Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data.
Finotello F, Lavezzo E, Fontana P, Peruzzo D, Albiero A, Barzon L, Falda M, Di Camillo B, Toppo S.
Brief Bioinform. 2011 Oct 21. [Epub ahead of print]
PMID: 22021898 [PubMed - as supplied by publisher]

Abstract

Next-generation sequencing technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. In this context, an important issue is the need of a careful assessment of the accuracy of the assembly process. Here, we review the efficiency of a panel of assemblers, specifically designed to handle data from GS FLX 454 platform, on three bacterial data sets with different characteristics in terms of reads coverage and repeats content. Our aim is to investigate their strengths and weaknesses in the reconstruction of the reference genomes. In our benchmarking, we assess assemblers' performance, quantifying and characterizing assembly gaps and errors, and evaluating their ability to solve complex genomic regions containing repeats. The final goal of this analysis is to highlight pros and cons of each method, in order to provide the final user with general criteria for the right choice of the appropriate assembly strategy, depending on the specific needs. A further aspect we have explored is the relationship between coverage of a sequencing project and quality of the obtained results. The final outcome suggests that, for a good tradeoff between costs and results, the planned genome coverage of an experiment should not exceed 20-30 ×.
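For planning purposes, the 20-30× figure can be turned into a read count with the usual average-coverage relation, coverage = number of reads × read length / genome size. The sketch below is my own illustration, not something from the paper; the 5 Mb genome and 400 bp read length are made-up example values for a bacterial 454 run.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical 5 Mb bacterial genome
    read_length = 400        # hypothetical average read length in bp
    for target in (20, 30):
        n = reads_needed(target, read_length, genome_size)
        print(f"{target}x coverage needs about {n} reads")
```

This is only the average; real coverage is uneven, so the number is a lower bound for planning rather than a guarantee.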


2. Efficient targeted resequencing of human germline and cancer genomes by oligonucleotide-selective sequencing.
Myllykangas S, Buenrostro JD, Natsoulis G, Bell JM, Ji HP.
Nat Biotechnol. 2011 Oct 23. doi: 10.1038/nbt.1996. [Epub ahead of print]
PMID: 22020387 [PubMed - as supplied by publisher]

Abstract

We describe an approach for targeted genome resequencing, called oligonucleotide-selective sequencing (OS-Seq), in which we modify the immobilized lawn of oligonucleotide primers of a next-generation DNA sequencer to function as both a capture and sequencing substrate. We apply OS-Seq to resequence the exons of either 10 or 344 cancer genes from human DNA samples. In our assessment of capture performance, >87% of the captured sequence originated from the intended target region with sequencing coverage falling within a tenfold range for a majority of all targets. Single nucleotide variants (SNVs) called from OS-Seq data agreed with >95% of variants obtained from whole-genome sequencing of the same individual. We also demonstrate mutation discovery from a colorectal cancer tumor sample matched with normal tissue. Overall, we show the robust performance and utility of OS-Seq for the resequencing analysis of human germline and cancer genomes.


3. Integrating Molecular Mechanisms and Clinical Evidence in the Management of Trastuzumab Resistant or Refractory HER-2+ Metastatic Breast Cancer.
Wong H, Leung R, Kwong A, Chiu J, Liang R, Swanton C, Yau T.
Oncologist. 2011 Oct 21. [Epub ahead of print]
PMID: 22020213 [PubMed - as supplied by publisher]
4. Myogenic conversion and transcriptional profiling of embryonic blastomeres in Caenorhabditis elegans.
Fukushige T, Krause M.
Methods. 2011 Oct 13. [Epub ahead of print]
PMID: 22019720 [PubMed - as supplied by publisher]
5. High-throughput RNA interference screening using pooled shRNA libraries and next generation sequencing.
Sims D, Mendes-Pereira AM, Frankum J, Burgess D, Cerone MA, Lombardelli C, Mitsopoulos C, Hakas J, Murugaesu N, Isacke CM, Fenwick K, Assiotis I, Kozarewa I, Zvelebil M, Ashworth A, Lord CJ.
Genome Biol. 2011 Oct 21;12(10):R104. [Epub ahead of print]
PMID: 22018332 [PubMed - as supplied by publisher]


Saturday, 22 October 2011

BaseSpace - Illumina's version of iCloud for NGS?

http://basespace.illumina.com/home

BaseSpace: securely & easily analyze, store, share #NGS data w/ anyone, anywhere, anytime. Get started with free account

Limitless storage

Friday, 21 October 2011

The sequence read archive: explosive growth of sequencing data.

1. The sequence read archive: explosive growth of sequencing data.
Kodama Y, Shumway M, Leinonen R; on behalf of the International Nucleotide Sequence Database Collaboration.
Nucleic Acids Res. 2011 Oct 18. [Epub ahead of print]
PMID: 22009675 [PubMed - as supplied by publisher]

Abstract

New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key to the progress of reproducible science. The SRA was established as a public repository for next-generation sequence data as a part of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC is composed of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at www.ncbi.nlm.nih.gov/sra from NCBI, at www.ebi.ac.uk/ena from EBI and at trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA and report on updated metadata structures, submission file formats and supported sequencing platforms. We also briefly outline our various responses to the challenge of explosive data growth.


2. Functional Annotation of the Transcriptome of Sorghum bicolor in Response to Osmotic Stress and Abscisic Acid.
Dugas DV, Monaco MK, Olsen A, Klein RR, Kumari S, Ware D, Klein PE.
BMC Genomics. 2011 Oct 18;12(1):514. [Epub ahead of print]
PMID: 22008187 [PubMed - as supplied by publisher]

ABSTRACT:

BACKGROUND:

Higher plants exhibit remarkable phenotypic plasticity allowing them to adapt to an extensive range of environmental conditions. Sorghum is a cereal crop that exhibits exceptional tolerance to adverse conditions, in particular, water-limiting environments. This study utilized next generation sequencing (NGS) technology to examine the transcriptome of sorghum plants challenged with osmotic stress and exogenous abscisic acid (ABA) in order to elucidate genes and gene networks that contribute to sorghum's tolerance to water-limiting environments with a long-term aim of developing strategies to improve plant productivity under drought.

RESULTS:

RNA-Seq results revealed transcriptional activity of 28,335 unique genes from sorghum root and shoot tissues subjected to polyethylene glycol (PEG)-induced osmotic stress or exogenous ABA. Differential gene expression analyses in response to osmotic stress and ABA revealed a strong interplay among various metabolic pathways including abscisic acid and 13-lipoxygenase, salicylic acid, jasmonic acid, and plant defense pathways. Transcription factor analysis indicates that groups of genes may be co-regulated by similar regulatory sequences to which the expressed transcription factors bind. We successfully exploited the data presented here in conjunction with published transcriptome analyses for rice, maize, and Arabidopsis to discover more than 50 differentially expressed, drought-responsive gene orthologs for which no function had been previously ascribed.

CONCLUSIONS:

The present study provides an initial assemblage of sorghum genes and gene networks regulated by osmotic stress and hormonal treatment. We are providing an RNA-Seq data set and an initial collection of transcription factors, which offer a preliminary look into the cascade of global gene expression patterns that arise in a drought tolerant crop subjected to abiotic stress. These resources will allow scientists to query gene expression and functional annotation in response to drought.



Thursday, 20 October 2011

DNA sequencing of maternal plasma to detect Down syndrome: An international clinical validation study


1. Deep impact: deciphering mucosal microbiomes using next-generation sequencing approaches.
[No authors listed]
Mucosal Immunol. 2011 Nov;4(6):586-7. doi: 10.1038/mi.2011.46. No abstract available.
PMID: 22005879 [PubMed - in process]
2. DNA sequencing of maternal plasma to detect Down syndrome: An international clinical validation study.
Palomaki GE, Kloza EM, Lambert-Messerlian GM, Haddow JE, Neveux LM, Ehrich M, van den Boom D, Bombard AT, Deciu C, Grody WW, Nelson SF, Canick JA.
Genet Med. 2011 Oct 14. [Epub ahead of print]
PMID: 22005709 [PubMed - as supplied by publisher]

Abstract

PURPOSE:

Prenatal screening for Down syndrome has improved, but the number of resulting invasive diagnostic procedures remains problematic. Measurement of circulating cell-free DNA in maternal plasma might offer improvement.

METHODS:

A blinded, nested case-control study was designed within a cohort of 4664 pregnancies at high risk for Down syndrome. Fetal karyotyping was compared with an internally validated, laboratory-developed test based on next-generation sequencing in 212 Down syndrome and 1484 matched euploid pregnancies. None had been previously tested. Primary testing occurred at a CLIA-certified commercial laboratory, with cross validation by a CLIA-certified university laboratory.

RESULTS:

Down syndrome detection rate was 98.6% (209/212), the false-positive rate was 0.20% (3/1471), and the testing failed in 13 pregnancies (0.8%); all were euploid. Before unblinding, the primary testing laboratory also reported multiple alternative interpretations. Adjusting chromosome 21 counts for guanine cytosine base content had the largest impact on improving performance.

CONCLUSION:

When applied to high-risk pregnancies, measuring maternal plasma DNA detects nearly all cases of Down syndrome at a very low false-positive rate. This method can substantially reduce the need for invasive diagnostic procedures and attendant procedure-related fetal losses. Although implementation issues need to be addressed, the evidence supports introducing this testing on a clinical basis.
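As a quick sanity check, the rates quoted in the abstract can be re-derived from the counts it gives. This is just my own arithmetic on the published numbers: the false-positive denominator of 1471 is the 1484 euploid pregnancies minus the 13 failed tests (all euploid), and the failure rate is taken over all 212 + 1484 tested pregnancies.

```python
def percent(numerator, denominator, digits):
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100.0 * numerator / denominator, digits)


# Counts taken from the abstract.
down_syndrome = 212
euploid = 1484
failed = 13  # all failed tests were in euploid pregnancies

detection_rate = percent(209, down_syndrome, 1)
false_positive_rate = percent(3, euploid - failed, 2)
failure_rate = percent(failed, down_syndrome + euploid, 1)

print(f"Detection rate:      {detection_rate:.1f}%")
print(f"False-positive rate: {false_positive_rate:.2f}%")
print(f"Test failure rate:   {failure_rate:.1f}%")
```

The three values come out at 98.6%, 0.20% and 0.8%, matching the abstract.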



3. Assessing the Impact of Non-Differential Genotyping Errors on Rare Variant Tests of Association.
Powers S, Gopalakrishnan S, Tintle N.
Hum Hered. 2011 Oct 15;72(3):152-159. [Epub ahead of print]
PMID: 22004945 [PubMed - as supplied by publisher]

Abstract

Background/Aims: We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful. Methods: We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates. Results: Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power. Conclusion: Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.



4. Transcriptome map of mouse isochores.
Arhondakis S, Frousios K, Iliopoulos CS, Pissis SP, Tischler G, Kossida S.
BMC Genomics. 2011 Oct 17;12(1):511. [Epub ahead of print]
PMID: 22004510 [PubMed - as supplied by publisher]

Abstract

BACKGROUND:

The availability of fully sequenced genomes and the implementation of transcriptome technologies have increased the studies investigating the expression profiles for a variety of tissues, conditions, and species. In this study, using RNA-seq data for three distinct tissues (brain, liver, and muscle), we investigate how base composition affects mammalian gene expression, an issue of prime practical and evolutionary interest.

RESULTS:

We present the transcriptome map of the mouse isochores (DNA segments with a fairly homogeneous base composition) for the three different tissues and the effects of isochores' base composition on their expression activity. Our analyses also cover the relations between the genes' expression activity and their localization in the isochore families.

CONCLUSIONS:

This study is the first where next-generation sequencing data are used to associate the effects of both genomic and genic compositional properties to their corresponding expression activity. Our findings confirm previous results, and further support the existence of a relationship between isochores and gene expression. This relationship corroborates that isochores are primarily a product of evolutionary adaptation rather than a simple by-product of neutral evolutionary processes.



Article: Pyicos – a powerful toolkit for the analysis of mapped reads

Pyicos – a powerful toolkit for the analysis of mapped reads
http://rna-seqblog.com/data-analysis/expression-tools/pyicos-a-powerful-toolkit-for-the-analysis-of-mapped-reads/


Tuesday, 18 October 2011

low mem assembler for NGS http://cortexassembler.sourceforge.net/


http://cortexassembler.sourceforge.net/

Cortex is an efficient and low-memory software framework for analysis of genomes using sequence data. There are two main executables, being developed in parallel streams: cortex_con (primary contact Mario Caccamo) is for consensus genome assembly, and cortex_var (primary contact Zamin Iqbal) is for variation and population assembly.

cortex_var
Typical memory use: 1 high coverage human in under 80Gb of RAM, 1000 yeasts in under 64Gb RAM, 10 humans in under 256 Gb RAM


Fwd: What's new for 'next generation sequencing' in PubMed


1. Current genetic methodologies in the identification of disaster victims and in forensic analysis.
Ziętkiewicz E, Witt M, Daca P, Zebracka-Gala J, Goniewicz M, Jarząb B, Witt M.
J Appl Genet. 2011 Oct 15. [Epub ahead of print]
PMID: 22002120 [PubMed - as supplied by publisher]
2. Rapid detection of gene mutations responsible for non-syndromic aortic aneurysm and dissection using two different methods: resequencing microarray technology and next-generation sequencing.
Sakai H, Suzuki S, Mizuguchi T, Imoto K, Yamashita Y, Doi H, Kikuchi M, Tsurusaki Y, Saitsu H, Miyake N, Masuda M, Matsumoto N.
Hum Genet. 2011 Oct 15. [Epub ahead of print]
PMID: 22001912 [PubMed - as supplied by publisher]

Abstract

Aortic aneurysm and/or dissection (AAD) is a life-threatening condition, and several syndromes are known to be related to AAD. In this study, two new technologies, resequencing array technology (ResAT) and next-generation sequencing (NGS), were used to analyze eight genes associated with syndromic AAD in 70 patients with non-syndromic AAD. Eighteen sequence variants were detected using both ResAT and NGS. In addition one of these sequence variants was detected by ResAT only and two additional variants by NGS only. Three of the 18 variants are likely to be pathogenic (in 4.3% of AAD patients and in 8.6% of a subset of patients with thoracic AAD), highlighting the importance of genetic analysis in non-syndromic AAD. ResAT and NGS similarly detected most, but not all, of the variants. Resequencing array technology was a rapid and efficient method for detecting most nucleotide substitutions, but was unable to detect short insertions/deletions, and it is impractical to update custom arrays frequently. Next-generation sequencing was able to detect almost all types of mutation, but requires improved informatics methods.


3. The genomics of autoimmune disease in the era of genome-wide association studies and beyond.
Lessard CJ, Ice JA, Adrianto I, Wiley G, Kelly JA, Gaffney PM, Montgomery CG, Moser KL.
Autoimmun Rev. 2011 Oct 7. [Epub ahead of print]
PMID: 22001415 [PubMed - as supplied by publisher]
4. Sequencing of BAC pools by different next generation sequencing platforms and strategies.
Taudien S, Steuernagel B, Ariyadasa R, Schulte D, Schmutzer T, Groth M, Felder M, Petzold A, Scholz U, Mayer KF, Stein N, Platzer M.
BMC Res Notes. 2011 Oct 14;4(1):411. [Epub ahead of print]
PMID: 21999860 [PubMed - as supplied by publisher]

Abstract

BACKGROUND:

Next generation sequencing of BACs is a viable option for deciphering the sequence of even large and highly repetitive genomes. In order to optimize this strategy, we examined the influence of read length on the quality of Roche/454 sequence assemblies, to what extent Illumina/Solexa mate pairs (MPs) improve the assemblies by scaffolding and whether barcoding of BACs is dispensable.

RESULTS:

Sequencing four BACs with both FLX and Titanium technologies revealed similar sequencing accuracy, but showed that the longer Titanium reads produce considerably less misassemblies and gaps. The 454 assemblies of 96 barcoded BACs were improved by scaffolding 79% of the total contig length with MPs from a non-barcoded library. Assembly of the unmasked 454 sequences without separation by barcodes revealed chimeric contig formation to be a major problem, encompassing 47% of the total contig length. Masking the sequences reduced this fraction to 24%.

CONCLUSION:

Optimal BAC pool sequencing should be based on the longest available reads, with barcoding essential for a comprehensive assessment of both repetitive and non-repetitive sequence information. When interest is restricted to non-repetitive regions and repeats are masked prior to assembly, barcoding is non-essential. In any case, the assemblies can be improved considerably by scaffolding with non-barcoded BAC pool MPs.


5. Chipster: user-friendly analysis software for microarray and other high-throughput data.
Kallio MA, Tuimala JT, Hupponen T, Klemela P, Gentile M, Scheinin I, Koski M, Kaki J, Korpelainen EI.
BMC Genomics. 2011 Oct 14;12(1):507. [Epub ahead of print]
PMID: 21999641 [PubMed - as supplied by publisher]

ABSTRACT:

BACKGROUND:

The growth of high-throughput technologies such as microarrays and next generation sequencing has been accompanied by active research in data analysis methodology, producing new analysis methods at a rapid pace. While most of the newly developed methods are freely available, their use requires substantial computational skills. In order to enable non-programming biologists to benefit from the method development in a timely manner, we have created the Chipster software.

RESULTS:

Chipster (http://chipster.csc.fi/) brings a powerful collection of data analysis methods within the reach of bioscientists via its intuitive graphical user interface. Users can analyze and integrate different data types such as gene expression, miRNA and aCGH. The analysis functionality is complemented with rich interactive visualizations, allowing users to select datapoints and create new gene lists based on these selections. Importantly, users can save the performed analysis steps as reusable, automatic workflows, which can also be shared with other users. Being a versatile and easily extendable platform, Chipster can be used for microarray, proteomics and sequencing data. In this article we describe its comprehensive collection of analysis and visualization tools for microarray data using three case studies.

CONCLUSIONS:

Chipster is a user-friendly analysis software for high-throughput data. Its intuitive graphical user interface enables biologists to access a powerful collection of data analysis and integration tools, and to visualize data interactively. Users can collaborate by sharing analysis sessions and workflows. Chipster is open source, and the server installation package is freely available.



Friday, 14 October 2011

DNAnexus to Host Short Read Archive (SRA) Database in Google Cloud

DNAnexus Raises $15 Million, Teams With Google To Host Massive ...

techcrunch.com/.../dnanexus-raises-15-million-teams-with-google-to-...
1 day ago – Today a company called DNAnexus is announcing that it's raised $15 million from Google Ventures and TPG Biotech to help scientists ...

DNAnexus Secures $15 Million Funding Led by Google Ventures and TPG Biotech

MarketWatch (press release) - Oct 12, 2011
After the National Center for Biotechnology Information (NCBI), the main US host of public genomic data, announced in February 2011 that it would phase-out support of the SRA due to federal funding cuts, DNAnexus selected Google Cloud Storage ...

DNAnexus to Host Short Read Archive (SRA) Database in Google Cloud

Bio-IT World - Kevin Davies - Oct 12, 2011
This new community resource was made publicly available today at sra.dnanexus.com. The relationship grew in part from investment interest in DNAnexus from Google Ventures, which separately announced a funding deal (see below). ...

Major investments show promise of big data in biotech

GigaOm - Derrick Harris - Oct 12, 2011
Cloud-based DNA-sequencing specialist DNAnexus has closed a $15 million second round led by Google Ventures and TPG Biotech. Elsewhere, we learned Wednesday that agribusiness giant Monsanto has ...



Wednesday, 12 October 2011

Exploring giant plant genomes with next-generation sequencing technology.



1. Exploring giant plant genomes with next-generation sequencing technology.
Kelly LJ, Leitch IJ.
Chromosome Res. 2011 Oct 11. [Epub ahead of print]
PMID: 21987187 [PubMed - as supplied by publisher]

Abstract
Genome size in plants is characterised by its extraordinary range. Although it appears that the majority of plants have small genomes, in several lineages genome size has reached giant proportions. The recent advent of next-generation sequencing (NGS) methods has for the first time made detailed analysis of even the largest of plant genomes a possibility. In this review, we highlight investigations that have utilised NGS for the study of plants with large genomes, as well as describing ongoing work that aims to harness the power of these technologies to gain insights into their evolution. In addition, we emphasise some areas of research where the use of NGS has the potential to generate significant advances in our current understanding of how plant genomes evolve. Finally, we discuss some of the future developments in sequencing technology that may further improve our ability to explore the content and evolutionary dynamics of the very largest genomes.

Knime4Bio, a set of custom nodes for the KNIME and other publications


1. Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME.
Lindenbaum P, Le Scouarnec S, Portero V, Redon R.
Bioinformatics. 2011 Oct 7. [Epub ahead of print]
PMID: 21984761 [PubMed - as supplied by publisher]

Abstract

SUMMARY:

Analysing large amounts of data generated by next-generation sequencing (NGS) technologies is difficult for researchers or clinicians without computational skills. They are often compelled to delegate this task to computer biologists working with command line utilities. The availability of easy-to-use tools will become essential with the generalisation of NGS in research and diagnosis. It will enable investigators to handle much more of the analysis. Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.

AVAILABILITY:

http://code.google.com/p/knime4bio/

2. Fetal akinesia: review of the genetics of the neuromuscular causes.
Ravenscroft G, Sollis E, Charles AK, North KN, Baynam G, Laing NG.
J Med Genet. 2011 Oct 7. [Epub ahead of print]
PMID: 21984750 [PubMed - as supplied by publisher]
3. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease.
Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, Boucher G, Ripke S, Ellinghaus D, Burtt N, Fennell T, Kirby A, Latiano A, Goyette P, Green T, Halfvarson J, Haritunians T, Korn JM, Kuruvilla F, Lagacé C, Neale B, Lo KS, Schumm P, Törkvist L; National Institute of Diabetes and Digestive Kidney Diseases Inflammatory Bowel Disease Genetics Consortium (NIDDK IBDGC); United Kingdom Inflammatory Bowel Disease Genetics Consortium; International Inflammatory Bowel Disease Genetics Consortium, Dubinsky MC, Brant SR, Silverberg MS, Duerr RH, Altshuler D, Gabriel S, Lettre G, Franke A, D'Amato M, McGovern DP, Cho JH, Rioux JD, Xavier RJ, Daly MJ.
Nat Genet. 2011 Oct 9. doi: 10.1038/ng.952. [Epub ahead of print]
PMID: 21983784 [PubMed - as supplied by publisher]

Abstract

More than 1,000 susceptibility loci have been identified through genome-wide association studies (GWAS) of common variants; however, the specific genes and full allelic spectrum of causal variants underlying these findings have not yet been defined. Here we used pooled next-generation sequencing to study 56 genes from regions associated with Crohn's disease in 350 cases and 350 controls. Through follow-up genotyping of 70 rare and low-frequency protein-altering variants in nine independent case-control series (16,054 Crohn's disease cases, 12,153 ulcerative colitis cases and 17,575 healthy controls), we identified four additional independent risk factors in NOD2, two additional protective variants in IL23R, a highly significant association with a protective splice variant in CARD9 (P < 1 × 10^-16, odds ratio ≈ 0.29) and additional associations with coding variants in IL18RAP, CUL2, C1orf106, PTPN22 and MUC19. We extend the results of successful GWAS by identifying new, rare and probably functional variants that could aid functional experiments and predictive models.


Friday, 7 October 2011

GnuBio Targets 1K-Base Reads for Commercial Launch of $50K Sequencer | In Sequence | Sequencing | GenomeWeb

http://www.genomeweb.com//node/980647?hq_e=el&hq_m=1108591&hq_l=6&hq_v=4f37903830

Currently at 600 bp, with a very affordable price per run, a very fast turnaround time of 3 hours, and minimal sample prep. This machine will compete directly with the Ion Torrent.
Using fluorescent hexamers to sequence DNA seems quite a complicated process at the primary analysis stage. But from the article it appears that it has very high accuracy and the ability to correctly pick up variants. It will likely be the platform of choice if it launches before the other platforms gain a foothold.

Thursday, 6 October 2011

Whole-transcriptome RNAseq analysis from minute amount of total RNA. etc

1. Whole-transcriptome RNAseq analysis from minute amount of total RNA.
Tariq MA, Kim HJ, Jejelowo O, Pourmand N.
Nucleic Acids Res. 2011 Oct 1;39(18):e120. Epub 2011 Jul 6.
PMID: 21737426 [PubMed - in process] Free Article

Abstract

RNA sequencing approaches to transcriptome analysis require a large amount of input total RNA to yield sufficient mRNA using either poly-A selection or depletion of rRNA. This feature makes it difficult to miniaturize transcriptome analysis for greater efficiency. To address this challenge, we devised and validated a simple procedure for the preparation of whole-transcriptome cDNA libraries from a minute amount (500 pg) of total RNA. We compared a single-sample library prepared by this Ovation® RNA-Seq system with two available methods of mRNA enrichment (TruSeq™ poly-A enrichment and RiboMinus™ rRNA depletion). Using the Ovation® preparation method for a set of eight mouse tissue samples, the RNA sequencing data obtained from two different next-generation sequencing platforms (SOLiD and Illumina Genome Analyzer IIx) yielded negligible rRNA reads (<3.5%) while retaining transcriptome sequencing fidelity. We further validated the Ovation® amplification technique by examining the resulting library complexity, reproducibility, evenness of transcript coverage, 5' and 3' bias and platform-specific biases. Notably, in this side-by-side comparison, SOLiD sequencing chemistry is biased toward higher GC content of transcriptome and Illumina Genome analyzer IIx is biased away from neutral to lower GC content of the transcriptomics regions.

PMCID: PMC3185437

2. Next-generation insights into regulatory T cells: expression profiling and FoxP3 occupancy in Human.
Birzele F, Fauti T, Stahl H, Lenter MC, Simon E, Knebel D, Weith A, Hildebrandt T, Mennerich D.
Nucleic Acids Res. 2011 Oct 1;39(18):7946-60. Epub 2011 Jul 4.
PMID: 21729870 [PubMed - in process] Free Article

Abstract

Regulatory T-cells (Treg) play an essential role in the negative regulation of immune answers by developing an attenuated cytokine response that allows suppressing proliferation and effector function of T-cells (CD4(+) Th). The transcription factor FoxP3 is responsible for the regulation of many genes involved in the Treg gene signature. Its ablation leads to severe immune deficiencies in human and mice. Recent developments in sequencing technologies have revolutionized the possibilities to gain insights into transcription factor binding by ChiP-seq and into transcriptome analysis by mRNA-seq. We combine FoxP3 ChiP-seq and mRNA-seq in order to understand the transcriptional differences between primary human CD4(+) T helper and regulatory T-cells, as well as to study the role of FoxP3 in generating those differences. We show, that mRNA-seq allows analyzing the transcriptomal landscape of T-cells including the expression of specific splice variants at much greater depth than previous approaches, whereas 50% of transcriptional regulation events have not been described before by using diverse array technologies. We discovered splicing patterns like the expression of a kinase-dead isoform of IRAK1 upon T-cell activation. The immunoproteasome is up-regulated in both Treg and CD4(+) Th cells upon activation, whereas the 'standard' proteasome is up-regulated in Tregs only upon activation.


Wednesday, 5 October 2011

Recent Human Evolution Detected in Quebec Town History | Wired Science | Wired.com

http://m.wired.com/wiredscience/2011/10/recent-human-evolution/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Index+3+%28Top+Stories+2%29%29

Though ongoing human evolution is difficult to see, researchers believe they've found signs of rapid genetic changes among the recent residents of a small Canadian town. Between 1800 and 1940, mothers in Ile aux Coudres, Quebec gave birth at steadily younger ages, with the average age of first maternity dropping from 26 to 22. Increased fertility, and thus larger families, could have been especially useful in the rural settlement's early history.

gvfs commands on Ubuntu

gvfs-cat            gvfs-monitor-dir    gvfs-rm
gvfs-copy           gvfs-monitor-file   gvfs-save
gvfs-info           gvfs-mount          gvfs-set-attribute
gvfs-less           gvfs-move           gvfs-trash
gvfs-ls             gvfs-open           gvfs-tree
gvfs-mkdir          gvfs-rename        


I use gvfs-copy in certain situations. 
Mental note to check out the other commands to see what they do...
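
As a starting point for that mental note, here is a minimal Python sketch that reports which of the commands listed above are actually on the PATH of the current machine. The helper name `installed_commands` is made up for this note. On later Ubuntu releases the gvfs-* tools were deprecated in favour of `gio` subcommands (for example `gio copy`), so an empty result there would not be surprising.

```python
import shutil

# Command names copied from the listing above.
GVFS_COMMANDS = [
    "gvfs-cat", "gvfs-copy", "gvfs-info", "gvfs-less", "gvfs-ls",
    "gvfs-mkdir", "gvfs-monitor-dir", "gvfs-monitor-file", "gvfs-mount",
    "gvfs-move", "gvfs-open", "gvfs-rename", "gvfs-rm", "gvfs-save",
    "gvfs-set-attribute", "gvfs-trash", "gvfs-tree",
]


def installed_commands(names):
    """Return {name: full path} for every command found on the PATH."""
    found = {}
    for name in names:
        path = shutil.which(name)  # None when the command is not installed
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    found = installed_commands(GVFS_COMMANDS)
    for name in GVFS_COMMANDS:
        print(f"{name:20s} {found.get(name, 'not found')}")
```

For any command that is found, its man page or `--help` output is the next place to look.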

What's new for 'next generation sequencing' in PubMed




1. Analysis of 16S rRNA Amplicon Sequencing Options on the Roche/454 Next-Generation Titanium Sequencing Platform.
Tamaki H, Wright CL, Li X, Lin Q, Hwang C, Wang S, Thimmapuram J, Kamagata Y, Liu WT.
PLoS One. 2011;6(9):e25263. Epub 2011 Sep 23.
PMID: 21966473 [PubMed - in process]

Abstract

BACKGROUND:

16S rRNA gene pyrosequencing approach has revolutionized studies in microbial ecology. While primer selection and short read length can affect the resulting microbial community profile, little is known about the influence of pyrosequencing methods on the sequencing throughput and the outcome of microbial community analyses. The aim of this study is to compare differences in output, ease, and cost among three different amplicon pyrosequencing methods for the Roche/454 Titanium platform.

METHODOLOGY/PRINCIPAL FINDINGS:

The following three pyrosequencing methods for 16S rRNA genes were selected in this study: Method-1 (standard method) is the recommended method for bi-directional sequencing using the LIB-A kit; Method-2 is a new option designed in this study for unidirectional sequencing with the LIB-A kit; and Method-3 uses the LIB-L kit for unidirectional sequencing. In our comparison among these three methods using 10 different environmental samples, Method-2 and Method-3 produced 1.5-1.6 times more useable reads than the standard method (Method-1), after quality-based trimming, and did not compromise the outcome of microbial community analyses. Specifically, Method-3 is the most cost-effective unidirectional amplicon sequencing method as it provided the most reads and required the least effort in consumables management.

CONCLUSIONS:

Our findings clearly demonstrated that alternative pyrosequencing methods for 16S rRNA genes could drastically affect sequencing output (e.g. number of reads before and after trimming) but have little effect on the outcomes of microbial community analysis. This finding is important for both researchers and sequencing facilities utilizing 16S rRNA gene pyrosequencing for microbial ecological studies.

PMCID: PMC3179495


2. Insight into the heterogeneity of breast cancer through next-generation sequencing.
Russnes HG, Navin N, Hicks J, Borresen-Dale AL.
J Clin Invest. 2011 Oct 3;121(10):3810-8. doi: 10.1172/JCI57088. Epub 2011 Oct 3.
PMID: 21965338 [PubMed - in process]

Abstract

Rapid and sophisticated improvements in molecular analysis have allowed us to sequence whole human genomes as well as cancer genomes, and the findings suggest that we may be approaching the ability to individualize the diagnosis and treatment of cancer. This paradigmatic shift in approach will require clinicians and researchers to overcome several challenges including the huge spectrum of tumor types within a given cancer, as well as the cell-to-cell variations observed within tumors. This review discusses how next-generation sequencing of breast cancer genomes already reveals insight into tumor heterogeneity and how it can contribute to future breast cancer classification and management.


3. Using next generation sequencing to identify yellow fever virus in Uganda.
McMullan LK, Frace M, Sammons SA, Shoemaker T, Balinandi S, Wamala JF, Lutwama JJ, Downing RG, Stroeher U, Macneil A, Nichol ST.
Virology. 2011 Sep 30. [Epub ahead of print]

Abstract

In October and November 2010, hospitals in northern Uganda reported patients with suspected hemorrhagic fevers. Initial tests for Ebola viruses, Marburg virus, Rift Valley fever virus, and Crimean Congo hemorrhagic fever virus were negative. Unbiased PCR amplification of total RNA extracted directly from patient sera and next generation sequencing resulted in detection of yellow fever virus and generation of 98% of the virus genome sequence. This finding demonstrated the utility of next generation sequencing and a metagenomic approach to identify an etiological agent and direct the response to a disease outbreak.

Copyright © 2011. Published by Elsevier Inc.


PMID: 21962764 [PubMed - as supplied by publisher]
4. A blueprint for advancing genetics-based cancer therapy.
Sellers WR.
Cell. 2011 Sep 30;147(1):26-31.
PMID: 21962504 [PubMed - in process]
5. Unraveling the Chinese hamster ovary cell line transcriptome by next-generation sequencing.
Becker J, Hackl M, Rupp O, Jakobi T, Schneider J, Szczepanowski R, Bekel T, Borth N, Goesmann A, Grillari J, Kaltschmidt C, Noll T, Pühler A, Tauch A, Brinkrolf K.
J Biotechnol. 2011 Sep 17. [Epub ahead of print]
PMID: 21945585 [PubMed - as supplied by publisher]

Abstract

The pyrosequencing technology from 454 Life Sciences and a novel assembly approach for cDNA sequences with the Newbler Assembler were used to achieve a major step forward to unravel the transcriptome of Chinese hamster ovary (CHO) cells. Normalized cDNA libraries originating from several cell lines and diverse culture conditions were sequenced and the resulting 1.84 million reads were assembled into 32,801 contiguous sequences, 29,184 isotigs, and 24,576 isogroups. A taxonomic classification of the isotigs showed that more than 70% of the assembled data is most similar to the transcriptome of Mus musculus, with most of the remaining isotigs being homologous to DNA sequences from Rattus norvegicus. Mapping of the CHO cell line contigs to the mouse transcriptome demonstrated that 9124 mouse transcripts, representing 6701 genes, are covered by more than 95% of their sequence length. Metabolic pathways of the central carbohydrate metabolism and biosynthesis routes of sugars used for protein N-glycosylation were reconstructed from the transcriptome data. All relevant genes representing major steps in the N-glycosylation pathway of CHO cells were detected. The present manuscript represents a data set of assembled and annotated genes for CHO cells that can now be used for a detailed analysis of the molecular functioning of CHO cell lines.


Tuesday, 4 October 2011

Copy number variation analysis in the great apes reveals species-specific patterns of structural variation [RESEARCH]

Copy number variants (CNVs) are increasingly acknowledged as an important source of evolutionary novelties in the human lineage. However, our understanding of their significance is still hindered by the lack of primate CNV data. We performed intraspecific comparative genomic hybridizations to identify loci harboring copy number variants in each of the four great apes: bonobos, chimpanzees, gorillas, and orangutans. For the first time, we could analyze differences in CNV location and frequency in these four species, and compare them with human CNVs and primate segmental duplication (SD) maps. In addition, for bonobo and gorilla, patterns of CNV and nucleotide diversity were studied in the same individuals. We show that CNVs have been subject to different selective pressures in different lineages. Evidence for purifying selection is stronger in gorilla CNVs overlapping genes, while positive selection appears to have driven the fixation of structural variants in the orangutan lineage. In contrast, chimpanzees and bonobos present high levels of common structural polymorphism, which is indicative of relaxed purifying selection together with the higher mutation rates induced by the known burst of segmental duplication in the ancestor of the African apes. Indeed, the impact of the duplication burst is noticeable by the fact that bonobo and chimpanzee share more CNVs with gorilla than expected. Finally, we identified a number of interesting genomic regions that present high-frequency CNVs in all great apes, while containing only very rare or even pathogenic structural variants in humans.

OpenHelix Blog non programmatic ways of obtaining information from a list of SNPs

Obtaining information about SNPs

This question came up a while back on BioStar: how to get information about a list of SNPs. It got me thinking, what are the various ways to obtain a file of information about a list of SNPs (I'm assuming no programming skills, just a web or other simple query)? The obvious way for me is the UCSC Table Browser. Our tutorial (free) has an exercise that does just that. The question, for a given gene find data about all the SNPs annotated for that region, is simply answered.

What other ways are there? Turns out there are quite a few, and I've started to list them here. For all of these I've used "clock" in human as my query and pulled a list of information about the SNPs in that region. In no particular order:

UCSC Table Browser (many species): See (tutorial - free)
Genome Variation Server (GVS, human only): click "gene name" > select populations & parameters > select "display SNP summary" > add or remove columns of data needed. (tutorial - subscription)
Ensembl BioMart (many species): choose Database (Ensembl Variation) > choose dataset (homo sapiens dbSNP 132) > choose filter (chromosome and start/end base pair) > choose attributes (name, strand, etc). (tutorial - subscription)
Varietas (human): choose genes > type in gene name > click search (a video tip from the blog)
F-SNP (human): click search > choose "query by gene" > type in "clock" > submit (a video tip from the blog)
dbSNP, of course (many species): choose SNP database > choose limits (chromosome & location) > click search (tutorial - subscription)
SNPVar (various): in the comments below, Glenn details how to obtain a list using this NCBI feature. More here about getting SNPs in a gene at NCBI.

GC-Content Normalization for RNA-Seq Data.

EDASeq – Another Tool to Fight Bias

Major technology-related artifacts and biases affect RNA-Seq expression data. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. Researchers at UC Berkeley focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis.

They propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Their methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression.

The normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. The resulting normalized counts (or raw counts and associated normalization offsets) can then be supplied seamlessly to other R packages for differential expression analysis, such as DESeq or edgeR.

  • Risso D, Schwartz K, Sherlock G, Dudoit S. (2011) GC-Content Normalization for RNA-Seq Data. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 291. [abstract]
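
To make the idea of within-lane, gene-level GC-content normalization a bit more concrete, here is a rough Python sketch of one simple variant: genes are split into GC-content bins and the counts in each bin are rescaled so that every bin has the same median. This is only an illustration under my own assumptions (equal-sized bins, median scaling, the function name `gc_bin_median_normalize`). It is not the EDASeq implementation, and the post above does not spell out the three approaches in this detail.

```python
from statistics import median


def gc_bin_median_normalize(counts, gc, n_bins=10):
    """Rescale the counts of one lane so every GC-content bin shares a median.

    counts: read counts for one lane/sample, one value per gene.
    gc:     GC fraction (0..1) of each gene, in the same order as counts.
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")
    if not counts:
        return []
    # Order genes by GC content and split them into roughly equal-sized bins.
    order = sorted(range(len(counts)), key=lambda i: gc[i])
    n_bins = max(1, min(n_bins, len(order)))
    bins = [order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
            for b in range(n_bins)]
    bin_medians = [median(counts[i] for i in idx) for idx in bins]
    # Common target: the median of the non-zero per-bin medians.
    positive = [m for m in bin_medians if m > 0]
    if not positive:
        return [float(c) for c in counts]
    target = median(positive)
    normalized = [float(c) for c in counts]
    for idx, m in zip(bins, bin_medians):
        if m > 0:  # bins with a zero median are left unscaled
            for i in idx:
                normalized[i] = counts[i] * target / m
    return normalized


if __name__ == "__main__":
    # Toy lane: the three high-GC genes have counts inflated twofold.
    counts = [10, 12, 8, 20, 24, 16]
    gc = [0.30, 0.32, 0.31, 0.60, 0.62, 0.61]
    print(gc_bin_median_normalize(counts, gc, n_bins=2))
    # prints [15.0, 18.0, 12.0, 15.0, 18.0, 12.0]
```

In the toy example the low-GC and high-GC genes end up on the same scale, which is the effect a GC-content correction is after. Real data would need more genes per bin and some care with zero counts.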

EDASeq – Another Tool to Fight Bias is a post from RNA-Seq Blog.

Sunday, 2 October 2011

Gee Fu: a sequence version and web-services database tool for genomic assembly, genome feature and NGS data

Summary: Scientists now use high-throughput sequencing technologies and short-read assembly methods to create draft genome assemblies in just days. Tools and pipelines like the assembler, and the workflow management environments make it easy for a non-specialist to implement complicated pipelines to produce genome assemblies and annotations very quickly. Such accessibility results in a proliferation of assemblies and associated files, often for many organisms. These assemblies get used as a working reference by lots of different workers, from a bioinformatician doing gene prediction or a bench scientist designing primers for PCR. Here we describe Gee Fu, a database tool for genomic assembly and feature data, including next-generation sequence alignments. Gee Fu is an instance of a Ruby-On-Rails web application on a feature database that provides web and console interfaces for input, visualization of feature data via AnnoJ, access to data through a web-service interface, an API for direct data access by Ruby scripts and access to feature data stored in BAM files. Gee Fu provides a platform for storing and sharing different versions of an assembly and associated features that can be accessed and updated by bench biologists and bioinformaticians in ways that are easy and useful for each.

Availability: http://tinyurl.com/geefu

Contact: dan.maclean@tsl.ac.uk

Transcriptome Viewer – Comprehensive Genome-wide Map of Human Gene Expression Activity

MediSapiens Ltd has released the most comprehensive map of human gene expression yet for public use. The data is available through a graphical tool, Transcriptome Viewer, allowing exploration of the expression activity of genes across chromosomes in tens of healthy human tissues.

The data on which this tool is built (over 300 million manually curated data points) were collected from public scientific sources; according to MediSapiens, it is the largest fully integrated gene expression collection in the world.

The fully integrated gene expression information is available online at: http://www.medisapiens.com/transcriptome-viewer-overview/

Transcriptome Viewer – Comprehensive Genome-wide Map of Human Gene Expression Activity is a post from RNA-Seq Blog.

A powerful and flexible approach to the analysis of RNA sequence count data

Motivation: A number of penalization and shrinkage approaches have been proposed for the analysis of microarray gene expression data. Similar techniques are now routinely applied to RNA sequence transcriptional count data, although the value of such shrinkage has not been conclusively established. If penalization is desired, the explicit modeling of mean–variance relationships provides a flexible testing regimen that 'borrows' information across genes, while easily incorporating design effects and additional covariates.

Results: We describe BBSeq, which incorporates two approaches: (i) a simple beta-binomial generalized linear model, which has not been extensively tested for RNA-Seq data and (ii) an extension of an expression mean–variance modeling approach to RNA-Seq data, involving modeling of the overdispersion as a function of the mean. Our approaches are flexible, allowing for general handling of discrete experimental factors and continuous covariates. We report comparisons with other alternate methods to handle RNA-Seq data. Although penalized methods have advantages for very small sample sizes, the beta-binomial generalized linear model, combined with simple outlier detection and testing approaches, appears to have favorable characteristics in power and flexibility.

Availability: An R package containing examples and sample datasets is available at http://www.bios.unc.edu/research/genomic_software/BBSeq

Contact: yzhou@bios.unc.edu; fwright@bios.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
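
For my own reference, a small Python sketch of the beta-binomial distribution that the Results paragraph above builds on. The mean/overdispersion parametrization used here (mean proportion `mu`, overdispersion `phi` strictly between 0 and 1) is a common textbook choice and an assumption on my part. The abstract does not say how BBSeq parametrizes the model, and this sketch does not fit a GLM or model the overdispersion as a function of the mean.

```python
from math import exp, inf, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, mu, phi):
    """log P(K = k) for a beta-binomial count out of n trials.

    mu:  mean proportion, 0 < mu < 1.
    phi: overdispersion, 0 < phi < 1 (phi -> 0 approaches the binomial).
    """
    if not 0 <= k <= n:
        return -inf
    # Convert (mu, phi) to the usual Beta shape parameters.
    alpha = mu * (1 - phi) / phi
    beta = (1 - mu) * (1 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def betabinom_variance(n, mu, phi):
    """Variance: the binomial variance inflated by 1 + (n - 1) * phi."""
    return n * mu * (1 - mu) * (1 + (n - 1) * phi)


if __name__ == "__main__":
    n, mu, phi = 20, 0.3, 0.1
    pmf = [exp(betabinom_logpmf(k, n, mu, phi)) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    print(f"total probability: {sum(pmf):.6f}")
    print(f"mean: {mean:.4f}  variance: {betabinom_variance(n, mu, phi):.4f}")
```

The variance formula shows the point of the model: for the same mean, the spread grows with `phi`, which is what lets it absorb the extra sample-to-sample variability seen in count data.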

The Interactive Vim Tutorial Teaches You How to Use Vim, the Fast, Mouseless Text Editor [Text Editors]


Vim has long been praised as one of the best text editors around, mostly for its completely mouseless navigation. However, it can be very confusing for beginners. This interactive tutorial gets you started so you can edit text files with blinding speed.

Q&A: Yale University's Mark Gerstein on the Real Cost of Sequencing


A recent study by scientists at Yale University suggests that the actual cost of sequencing may be much higher than some current estimates indicate since those figures may not factor in the analysis costs that are necessary for a successful sequencing project.

In the paper, published in Genome Biology last month, Yale's Mark Gerstein and colleagues consider costs that weren't taken into account in a survey conducted by the National Human Genome Research Institute that pegged the cost per genome as of March 2011 to be a little over $10,000.

Gerstein and colleagues note that the NHGRI survey, which analyzed data from the Large-Scale Genome Sequencing Program, omitted so-called "non-production activities," such as costs for the development of computational tools to improve sequencing pipelines or downstream analysis; quality assessment and quality control; technology development to improve sequencing pipelines; management of individual sequencing projects; informatics equipment; and downstream analyses such as sequence assembly, sequence alignment, identifying variants, and the interpretation of results.

They estimate that the cost of downstream analysis for a whole-genome sequencing project could add as much as $100,000 to the overall costs.

BioInform spoke with Gerstein earlier this month. What follows is an edited version of the conversation.

Single molecule sequencing gives more endogenous DNA seq than Illumina GAIIx

True single-molecule DNA sequencing of a pleistocene horse bone [METHOD]

Second-generation sequencing platforms have revolutionized the field of ancient DNA, opening access to complete genomes of past individuals and extinct species. However, these platforms are dependent on library construction and amplification steps that may result in sequences that do not reflect the original DNA template composition. This is particularly true for ancient DNA, where templates have undergone extensive damage post-mortem. Here, we report the results of the first "true single molecule sequencing" of ancient DNA. We generated 115.9 Mb and 76.9 Mb of DNA sequences from a permafrost-preserved Pleistocene horse bone using the Helicos HeliScope and Illumina GAIIx platforms, respectively. We find that the percentage of endogenous DNA sequences derived from the horse is higher among the Helicos data than Illumina data. This result indicates that the molecular biology tools used to generate sequencing libraries of ancient DNA molecules, as required for second-generation sequencing, introduce biases into the data that reduce the efficiency of the sequencing process and limit our ability to fully explore the molecular complexity of ancient DNA extracts. We demonstrate that simple modifications to the standard Helicos DNA template preparation protocol further increase the proportion of horse DNA for this sample by threefold. Comparison of Helicos-specific biases and sequence errors in modern DNA with those in ancient DNA also reveals extensive cytosine deamination damage at the 3' ends of ancient templates, indicating the presence of 3'-sequence overhangs. Our results suggest that paleogenomes could be sequenced in an unprecedented manner by combining current second- and third-generation sequencing approaches.

The Phenotype-Genotype Integrator

http://www.ncbi.nlm.nih.gov/gap/PheGenI
The Phenotype-Genotype Integrator (PheGenI) merges NHGRI genome-wide association study (GWAS) catalog data with several databases housed at the National Center for Biotechnology Information (NCBI), including Gene, dbGaP, OMIM, GTEx and dbSNP. This phenotype-oriented resource, intended for clinicians and epidemiologists interested in following up results from GWAS, can facilitate prioritization of variants to follow up, study design considerations, and generation of biological hypotheses. Users can search based on chromosomal location, gene, SNP, or phenotype and view and download results including annotated tables of SNPs, genes and association results, a dynamic genomic sequence viewer, and gene expression data. PheGenI is still under active development. Currently, the phenotype search terms are based on MeSH and will be enhanced with additional options in the future.

http://www.youtube.com/ncbinlm#p/c/1/XS3p924nWCA
A tutorial introducing PheGenI, which aggregates genome-wide association study results with genetic variations, genes and gene expression differences for a variety of phenotypic traits.

Datanami, Woe be me