Kevin's GATTACA World: usegalaxy


Saturday, 17 September 2011

FAQ - Howto do RNA-seq Bioinformatics analysis on Galaxy

This is one of the top questions posted on the Galaxy User mailing list.
I have reposted the summary links here for convenience.

Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")
http://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

FAQ to help with troubleshooting (if needed):
http://usegalaxy.org/u/jeremy/p/transcriptome-analysis-faq

For visualization, an update that allows the use of a user-specified
fasta reference genome is coming out very soon. For now, you can view
annotation by creating a custom genome build, but the actual reference
will not be included. Use "Visualization -> New Track Browser" and
follow the instructions for "Is the build not listed here? Add a Custom
Build".

Help for using the tool is available here:
http://galaxyproject.org/Learn/Visualization

Currently, RNA-seq analysis for SOLiD data is available only on the Galaxy test server:
http://test.g2.bx.psu.edu/

Please note that there are quotas associated with the test server:
http://galaxyproject.org/wiki/News/Galaxy%20Public%20Servers%20Usage%20Quotas

[Credit: Jennifer Jackson]
http://usegalaxy.org
http://galaxyproject.org/Support

Another helpful resource (though not Galaxy-related) is
http://seqanswers.com/wiki/How-to/RNASeq_analysis written by Matthew Young,
along with the discussion of this wiki page on the SEQanswers forum:
http://seqanswers.com/forums/showthread.php?t=7068

See also this review paper in Genome Biology: RNA-seq Review

Stephen mentions this tutorial as well in this blog

Dr David Matthews has posted a starter thread on the mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with TopHat on Galaxy.

RNA seq analysis workflow on Galaxy (Bristol workflow)

His post and the discussion thread are here:
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404

(kevin: waiting for the next common question: is there Ion Torrent support on Galaxy?)

Sunday, 13 March 2011

Using Galaxy for NGS sample submission and tracking for service providers

Over at the Blue Collar Bioinformatics blog:
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.

Thursday, 24 February 2011

RNA seq analysis workflow on Galaxy (Bristol workflow)

Dr David Matthews has posted a starter thread on the mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with TopHat on Galaxy.

His post and the discussion thread are here:
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404

I thought I'd write to start a discussion of a workflow for people doing RNA-seq that I have found very useful and that addresses some issues in mapping mRNA-derived RNA-seq paired-end data to the genome using TopHat. Here is the approach I use (I have a human mRNA sample deep sequenced with 56 bp paired-end reads on an Illumina, generating 29 million reads):

Bristol Method

1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for each sequence read
2. In samtools, filter to keep alignments where the read is mapped (i.e. "read is unmapped" is not set), "mate is mapped" and "mate is mapped in a proper pair"
3. Use "group" to group the filtered sam file on c1 (the read name, i.e. the "bio-sequencer" read number) and set an operation to count on c1 as well. This provides a list of the reads and how many times they map to the human genome. Because you have filtered the set for reads that have a mate pair, there will be an even number for each read. For most of the reads the number will be 2 (indicating that the forward read maps once and the reverse read maps once, in a proper pair), but for reads that map ambiguously the number will be a larger multiple of 2. Counting these up, I find that 18 million reads map once, 1.3 million map twice, 400,000 reads map 3 times and so on, until you get down to 1 read mapping 30 times, 1 read mapping 31 times and so on.
4. Filter the list to remove any reads with a count of more than 2.
5. Use "compare two datasets" to match your new list of reads that map only twice against your sam file, pulling out all the alignments for those reads (i.e. the mate pairs).
6. You'll need to sort the sam file before you can use it with other applications like IGV.
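
For readers who prefer a script to the Galaxy tools, steps 2 to 6 above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the workflow from the thread: the function names (`is_proper_mapped_pair`, `bristol_filter`) are made up for illustration, the flag test assumes step 2 is meant to keep reads that are mapped with a mapped mate in a proper pair, and the final sort orders by reference name and position as a simplification (a real coordinate sort, for example with `samtools sort`, follows the order of the header).

```python
from collections import Counter

# SAM flag bits, as defined in the SAM specification
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_proper_mapped_pair(flag):
    """Step 2 (assumed reading): read mapped, mate mapped, proper pair."""
    return bool(flag & PROPER_PAIR) and not (flag & (UNMAPPED | MATE_UNMAPPED))


def bristol_filter(sam_lines):
    """Return (header, kept_alignments, counts) for an iterable of SAM lines."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # keep header lines untouched
            continue
        fields = line.split("\t")
        if is_proper_mapped_pair(int(fields[1])):
            records.append(fields)

    # Step 3: group on column 1 (read name) and count alignments per name
    counts = Counter(fields[0] for fields in records)

    # Steps 4-5: keep only names seen exactly twice (one hit per mate)
    unique_names = {name for name, n in counts.items() if n == 2}
    kept = [fields for fields in records if fields[0] in unique_names]

    # Step 6 (simplified): sort by reference name, then by position
    kept.sort(key=lambda fields: (fields[2], int(fields[3])))
    return header, ["\t".join(fields) for fields in kept], counts
```

Something like `bristol_filter(open("accepted_hits.sam"))` would then give the header, the uniquely mapped proper pairs, and the per-read counts; the file name here is only a placeholder.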

What you end up with is a sam file where all the reads map to one site only and all the reads map as a proper pair. This may seem similar to setting tophat to ignore non-unique reads. However, it is not. This approach gives you 10-15% more reads. I think it is because if tophat finds (for example) that the forward read maps to one site but the reverse read maps to two sites it throws away the whole read. By filtering the sam file to restrict it to only those mappings that make sense you increase the number of unique reads by getting rid of irrational mappings.

Has anyone else found this? Does this make sense to anyone else? Am I making a huge mistake somewhere?

A nice aspect of this (or at least I think so!) is that by filtering in this manner you can also create a sam file of non-unique mappings which you can monitor. This can be useful if one or more genes have a problem of generating a lot of non-unique maps, which may cause problems in accurately estimating their expression. You also get a list of how many multi hits you have in your data, so you know the scale of the problem.
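
The split into unique and non-unique mappings, and the tally of multi hits, could look roughly like the following Python sketch. It assumes the input is a list of SAM alignment lines (no header) that have already been filtered for proper pairs, and the function name `split_by_uniqueness` is hypothetical.

```python
from collections import Counter


def split_by_uniqueness(alignments):
    """Split proper-pair SAM lines into unique and non-unique sets.

    Returns (unique, multi, hits), where hits maps the number of mapping
    sites per read pair to the number of read names with that many sites.
    """
    def name_of(line):
        return line.split("\t", 1)[0]

    # Alignments per read name; a uniquely mapped pair contributes 2 lines
    counts = Counter(name_of(line) for line in alignments)

    unique = [line for line in alignments if counts[name_of(line)] == 2]
    multi = [line for line in alignments if counts[name_of(line)] > 2]

    # Two lines per site, so count // 2 is the number of sites for a pair
    hits = Counter(n // 2 for n in counts.values())
    return unique, multi, hits
```

The `multi` list can be written out as its own sam file for monitoring, and `hits` gives the scale of the multi-hit problem at a glance.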

Best Wishes,

David.

__________________________________

Dr David A. Matthews

Senior Lecturer in Virology

Room E49

Department of Cellular and Molecular Medicine,

School of Medical Sciences

University Walk,

University of Bristol
