Showing posts with label RNA-seq. Show all posts

Monday, 12 December 2011

How much coverage / throughput for my RNA-seq?

One of the earliest questions to bug anyone planning an RNA-seq experiment has to be the throughput: how many reads do I need?

If you are dealing with human samples, you have the benefit of extensive publications with example coverages and some papers that test the limits of detection. All of this info is nicely summarised here in experimental design considerations in RNA-Seq.

Bashir et al. concluded that more than 90% of the transcripts in human samples are adequately covered with just one million sequence reads. Wang et al. showed that 8 million reads are sufficient to reach RNA-Seq saturation for most samples.

The ENCODE consortium has also published Guidelines for Experiments, within which you can read RNA Standards v1.0 (May 2011) and RNA-seq Best Practices (2009):


Experiments whose purpose is to evaluate the similarity between the
transcriptional profiles of two polyA+ samples may require only modest depths of
sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are
mappable to the genome or known transcriptome). Experiments whose purpose is
discovery of novel transcribed elements and strong quantification of known
transcript isoforms requires more extensive sequencing. The ability to detect
reliably low copy number transcripts/isoforms depends upon the depth of
sequencing and on a sufficiently complex library.
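
As a rough illustration of the ENCODE numbers above: if 20-25M of 30M reads are mappable, the mappable fraction is about 67-83%, so the raw read count to order can be backed out from a target number of mapped reads. This is only a minimal Python sketch. The function name and the 20M target are my own assumptions, not part of the ENCODE guidelines.

```python
import math


def raw_reads_needed(target_mapped_reads, mappable_fraction):
    """Raw reads to sequence so that roughly `target_mapped_reads` map.

    `mappable_fraction` is the share of reads expected to align to the
    genome or known transcriptome (an assumption you must supply).
    """
    if not 0 < mappable_fraction <= 1:
        raise ValueError("mappable_fraction must be in (0, 1]")
    return math.ceil(target_mapped_reads / mappable_fraction)


# Mappable fractions implied by the ENCODE example (20-25M of 30M reads).
low_fraction = 20e6 / 30e6
high_fraction = 25e6 / 30e6

# Hypothetical target of 20M mapped reads, at the pessimistic and
# optimistic ends of that range.
for fraction in (low_fraction, high_fraction):
    print(f"mappable {fraction:.0%}: order about "
          f"{raw_reads_needed(20_000_000, fraction):,} raw reads")
```

The real mappable fraction depends on library quality, read length and the aligner, so treat the result as a planning figure only.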


The RNA-Seq Blog also covers this issue in How Many Reads are Enough?, where they cite an article on RNA-seq in chicken lungs:

The analysis from the current study demonstrated that 30 M (75 bp) reads is sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes.

There are also papers showing that RNA-seq gives reproducible results when sequenced from the same RNA-seq library, which means that if coverage isn't enough, it is possible to sequence more from the same library without affecting your results. The real issue then becomes whether you have budgeted for additional sequencing.
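
One way to judge whether more sequencing of the same library is worth the money is to project a saturation curve from a pilot run. Below is a minimal Python sketch, not a method from any of the papers cited here. It assumes reads are drawn independently in proportion to the per-gene counts seen in the pilot, and it ignores genes the pilot missed entirely. The function name and the pilot counts are made up for illustration.

```python
def expected_detected_fraction(pilot_counts, depth):
    """Expected fraction of pilot-detected genes hit by >= 1 read at `depth`.

    Assumption: each read lands on gene i with probability
    pilot_counts[i] / sum(pilot_counts), independently of other reads,
    so P(gene i is seen at least once) = 1 - (1 - p_i) ** depth.
    """
    total = sum(pilot_counts)
    if total == 0:
        return 0.0
    expressed = [c for c in pilot_counts if c > 0]
    detected = sum(1.0 - (1.0 - c / total) ** depth for c in expressed)
    return detected / len(expressed)


# Hypothetical per-gene read counts from a small pilot run.
pilot_counts = [5000, 1200, 300, 40, 8, 2, 1]

for depth in (1_000, 10_000, 100_000, 1_000_000):
    fraction = expected_detected_fraction(pilot_counts, depth)
    print(f"{depth:>9,} reads: {fraction:.1%} of pilot-detected genes expected")
```

If the curve has already flattened at your current depth, extra runs mostly add reads to genes you already see. If it is still climbing, that is the case for budgeting more sequencing.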



References
Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research 38, 4570-4578 (2010).

Maher, C.A., Palanisamy, N., Brenner, J.C., Cao, X., Kalyana-Sundaram, S., Luo, S., Khrebtukova, I., Barrette, T.R., Grasso, C., Yu, J., Lonigro, R.J., Schroth, G., Kumar-Sinha, C. & Chinnaiyan, A.M. Chimeric transcript discovery by paired-end transcriptome sequencing. Proceedings of the National Academy of Sciences of the United States of America 106, 12353-12358 (2009).

Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC Genomics 11, 385 (2010).

Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57-63 (2009).

Wang, Y., Ghaffari, N., Johnson, C.D., Braga-Neto, U.M., Wang, H., Chen, R. & Zhou, H. Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics, Proceedings of the Eighth Annual MCBIOS Conference: Computational Biology and Bioinformatics for a New Decade, College Station, TX, USA, 1-2 April 2011 (2011). [article]

Thursday, 8 December 2011

A reference transcriptome for the cauliflower coral – Pocillopora damicornis | RNA-Seq Blog

http://rna-seqblog.com/transcrioptome-sequenced/a-reference-transcriptome-for-the-cauliflower-coral-pocillopora-damicornis/
By identifying changes in coral gene expression that are triggered by particular environmental stressors, we can begin to characterize coral stress responses at the molecular level, which should lead to the development of more powerful diagnostic tools for evaluating the health of corals in the field.
With the goal of identifying genetic variants that are more or less resilient in the face of particular stressors, a team led by researchers at Boston and Stanford Universities performed deep mRNA sequencing of the cauliflower coral, Pocillopora damicornis, a geographically widespread Indo-Pacific species that exhibits a great diversity of colony forms and is able to thrive in habitats subject to a wide range of human impacts. They isolated RNA from colony fragments ("nubbins") exposed to four environmental stressors (heat, desiccation, peroxide, and hypo-saline conditions) or control conditions. The RNA was pooled and sequenced using the 454 platform. Both the raw reads (n = 1,116,551) and the assembled contigs (n = 70,786; mean length = 836 nucleotides) were deposited in a new publicly available relational database called PocilloporaBase (www.PocilloporaBase.org).
P. damicornis now joins the handful of coral species for which extensive transcriptomic data are publicly available. Through PocilloporaBase (www.PocilloporaBase.org), one can obtain assembled contigs and raw reads and query the data according to a wide assortment of attributes including taxonomic origin, PFAM motif, KEGG pathway, and GO annotation.
  • Traylor-Knowles N et al. (2011) Production of a reference transcriptome and a transcriptomic database (PocilloporaBase) for the cauliflower coral, Pocillopora damicornis. BMC Genomics 12, 585. [article]

Saturday, 17 September 2011

FAQ - How to do RNA-seq bioinformatics analysis on Galaxy

This is one of the top questions posted on the Galaxy User mailing list.
I have reposted the summary links here for convenience.

Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")
http://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

FAQ to help with troubleshooting (if needed):
http://usegalaxy.org/u/jeremy/p/transcriptome-analysis-faq

For visualization, an update that allows the use of a user-specified
fasta reference genome is coming out very soon. For now, you can view
annotation by creating a custom genome build, but the actual reference
will not be included. Use "Visualization -> New Track Browser" and
follow the instructions for "Is the build not listed here? Add a Custom
Build".

Help for using the tool is available here:
http://galaxyproject.org/Learn/Visualization
 

Currently, RNA-seq analysis for SOLiD data is available only on the Galaxy test server:
http://test.g2.bx.psu.edu/

Please note that there are quotas associated with the test server:
http://galaxyproject.org/wiki/News/Galaxy%20Public%20Servers%20Usage%20Quotas


[Credit : Jennifer Jackson ]
http://usegalaxy.org
http://galaxyproject.org/Support


Another helpful resource (though not Galaxy-related) is the how-to written by Matthew Young:
http://seqanswers.com/wiki/How-to/RNASeq_analysis
and the discussion of this wiki page at SEQanswers:
http://seqanswers.com/forums/showthread.php?t=7068

There is also this review paper in Genome Biology: RNA-seq Review

Stephen mentions this tutorial as well in this blog.


Dr David Matthews has posted a starter thread on the mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with TopHat on Galaxy.

RNA seq analysis workflow on Galaxy (Bristol workflow)


His post and the discussion thread are here:
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404 

(kevin: waiting for the next common question to come: is there Ion Torrent support on Galaxy?)

Sunday, 11 September 2011

Differential expression in RNA-seq: A matter of d... [Genome Res. 2011] - PubMed - NCBI


http://www.ncbi.nlm.nih.gov/pubmed/21903743
Abstract Next Generation Sequencing (NGS) technologies are revolutionizing genome research and in particular, their application to transcriptomics (RNA-seq) is increasingly being used for gene expression profiling as a replacement for microarrays. However, the properties of RNA-seq data have not been yet fully established and additional research is needed for understanding how these data respond to differential expression analysis. In this work we set out to gain insights into the characteristics of RNA-seq data analysis by studying an important parameter of this technology: the sequencing depth. We have analyzed how sequencing depth affects the detection of transcripts and their identification as differentially expressed, looking at aspects such as transcript biotype, length, expression level and fold-change. We have evaluated different algorithms available for the analysis of RNA-seq and proposed a novel approach -NOISeq-that differs from existing methods in that it is data-adaptive and non-parametric. Our results reveal that most existing methodologies suffer from a strong dependency on sequencing depth for their differential expression calls and that this results in a considerable number of false positives that increases as the number of reads grows. In contrast, our proposed method models the noise distribution from the actual data, can therefore better adapt to the size of the dataset and is more effective in controlling the rate of false discoveries. This work discusses the true potential of RNA-seq for studying regulation at low expression ranges, the noise within RNA-seq data and the issue of replication.

PMID: 21903743 [PubMed - as supplied by publisher]

Wednesday, 13 July 2011

RNA-seq on the Ion Torrent PGM

OK, I must admit that with the 314 chip, 500k reads seems to be stretching the limits of usability for RNA-seq. But looking at Life Tech's presentation, New to RNA-seq; how it compares to microarrays, it does make sense to use the PGM over microarrays for certain reasons and certain samples, e.g. bacterial or viral transcriptomes. Given that the PGM also gives better dynamic range than microarrays at a price that's not too far from a microarray, it does make sense to beef up one's data with a run or two of Ion Torrent.

At USD $595 for a 316 chip run, "all included" as quoted, they do make it very attractive for microarray users to switch over. Granted, you might need a couple of runs to make sense of human samples.
Though I'd be wary of hidden costs not anticipated in their calculations.
What's interesting is that they claim no platform bias between SOLiD and PGM runs. No details are given, but I assume they ran PGM runs to match SOLiD throughput and compared the output?
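
For a back-of-envelope feel of what "a couple of runs" means, here is a minimal Python sketch. It uses the roughly 500k reads per 314 chip run mentioned above. The read targets (1M and 8M) are illustrative figures I picked, and the function name is hypothetical. Cost is left out because the $595 quote is for the 316 chip, whose read yield isn't given here.

```python
def runs_needed(target_reads, reads_per_run):
    """Smallest whole number of runs that yields at least `target_reads`."""
    if target_reads <= 0 or reads_per_run <= 0:
        raise ValueError("target_reads and reads_per_run must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return -(-target_reads // reads_per_run)


READS_PER_314_RUN = 500_000  # figure quoted in this post

# Illustrative read targets, not recommendations.
for target in (1_000_000, 8_000_000):
    print(f"{target:,} reads -> {runs_needed(target, READS_PER_314_RUN)} run(s)")
```

This ignores reads lost to filtering and mapping, so the real number of runs would be somewhat higher.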

Would you consider the PGM for your RNA-seq?
Post in the comments please...

Sunday, 13 March 2011

RNA-seq of a 12GB dataset in less than 7 hours

Folks at CLCbio have updated the CLC Genomics Machine benchmarks with newer datasets and the new hardware configuration. There are two RNA-Seq data sets and a full genome mapping data set, and more will come. All the benchmarks are now using the CLC Genomics Server rather than the Assembly Cell.

Saturday, 12 March 2011

The Tea Transcriptome


Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from C. sinensis was analyzed at an unprecedented depth.
Shi C et al. (2011) Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds. BMC Genomics [Epub ahead of print]. [article]
The Tea Transcriptome is a post from: RNA-Seq Blog

Statistics Meeting Focused on RNA-Seq Data Handling from RNA-Seq Blog


Joint Statistical Meeting July 30th-August 4th, 2011
Miami Beach, FL
Sharing Information Across Genes to Estimate Overdispersion in RNA-seq Data 
Steven Peder Lund, Iowa State University; Dan Nettleton, Iowa State University
Differential expression analysis for paired RNA-seq data 
Lisa M Chung, Department of Epidemiology and Public Health, Yale University ; John Ferguson, Department of Epidemiology and Public Health, Yale University ; Hongyu Zhao, Yale University
How to characterize dynamic bayesian networks across multiple species from time series mRNA-Seq count gene expression profiles: An intelligent Dynamic Bayesian Networks (IDBNs) 
Sunghee Oh, Yale University; Hongyu Zhao, Yale University; James P. Noonan, Yale University
Statistical methods for the analysis of next-generation sequencing data 
Karthik Devarajan, Fox Chase Cancer Center
MEAN-VARIANCE MODELING OF RNA-SEQ TRANSCRIPTIONAL COUNT DATA 
Yihui Zhou, Univ North Carolina; Fred Andrew Wright, Univ North Carolina
On Differential Gene Expression Using RNA-Seq Data 
Ju Hee Lee, University of Texas, MD Anderson Cancer Center; Yuan Ji, MD Anderson Cancer Centre – University of Texas; Shoudan Liang, University of Texas, MD Anderson Cancer Center; Guoshuai Cai, University of Texas, MD Anderson Cancer Center; Peter Mueller, MD Anderson Cancer Center
A Bayesian nonparametric method for differential expression analysis of RNA-seq data 
Yiyi Wang, Department of Statistics, Texas A&M University; David B. Dahl, Department of Statistics, Texas A&M University
Statistical strategy for eQTL mapping using RNA-seq data 
Wei Sun, University of North Carolina, Chapel Hill
Joint analyses of high-throughput DNA and RNA-seq data from cancer samples 
Su Yeon Kim, University of California, Berkeley; Terence Speed, University of California, Berkeley
Significance Analysis of time-series gene expression profiles: via differential/trajectory models in temporal mRNA-Seq data 
Sunghee Oh, Yale University; Hongyu Zhao, Yale University; James P. Noonan, Yale University
Model-Based Clustering for RNA-Seq Data 
Yaqing Si, Iowa State University; Peng Liu, Iowa State University
An integrative approach to comparing and normalizing gene expression data generated from RNA-seq, microarray, and RT-PCR technologies 
Zhaonan Sun, Department of Statistics, Purdue University; Yu Zhu, Department of Statistics, Purdue University
Normalization, testing, and false discovery rate estimation for RNA-sequencing data 
Jun Li, Department of Statistics, Stanford University; Daniela Witten, University of Washington; Iain M Johnstone, Stanford University; Robert Tibshirani, Dept of Health Research and Policy, & Statistics, Stanford University

A pipeline for RNA-seq data processing and quality assessment from Bioinformatics - current issue


Summary: We present an R based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.
Availability: The R package is available at www.ebi.ac.uk/tools/rcloud with online documentation at www.ebi.ac.uk/Tools/rwiki/, also available as supplementary material.

Datanami, Woe be me