Showing posts with label coverage. Show all posts
Showing posts with label coverage. Show all posts

Thursday, 28 March 2013

Fancy new interactive visualization of coverage depth for exome sequencing. http://t.co/3a61F4OdKY

Fancy new interactive visualization of coverage depth for exome sequencing.

This has to appeal to the biologists ...

Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes


To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:
  awk '{ print >> "my." $1 ".bed" }' my.bed


BSD's AWK 
 awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed





credit: Assaf Gordon on BedTools mailing list for pointing this out






On a side note, 
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: https://github.com/pezmaster31/bamtools ) .


There is a SAMtools filter option as well. 
Anyone benchmarked both to see which runs faster? 




this discussion arose from someone trying to compute the mean coverage 

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt


it doesn't scale too well with the no. of entries in the bed file 
1,000 lines takes a few minutes
1,200,000 lines " it's been running for 12 hours and still not done yet " on a Mac Pro with speed of 2.66 GHz and 8 GB of Memory.


So splitting the file by chromosome helps to parallelize the process although the job might scale linearly

Thursday, 29 December 2011

FAQ - How to get average depth of coverage for targeted seq regions using BedTools

Question
If I have a list of targeted resequencing regions and a bam file is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist but that seems messy. In addition is there any way to
adjust the minimum number of hits at a location for it to be included in
the fraction of bases in B that had non-zero coverage into bases in B
that had more-than-X coverage?

Reply
There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (https://github.com/arq5x/filo
) to compute the mean.  For example, here's a 5 column BED of the UCSC simple repeat track:

$ head simRep.chr22.bed
22      16052163        16052199        trf     65
22      16059465        16059496        trf     53
22      16060479        16060524        trf     90
22      16060479        16060592        trf     120
22      16060517        16060609        trf     129
22      16071925        16071962        trf     74
22      16078616        16078686        trf     86
22      16079642        16079731        trf     61
22      16079644        16079696        trf     59
22      16079648        16079724        trf     52

I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22      17825583        17825839        trf     372     1       2
22      17825583        17825839        trf     372     2       2
22      17825583        17825839        trf     372     3       2
22      17825583        17825839        trf     372     4       2
22      17825583        17825839        trf     372     5       2
22      17825583        17825839        trf     372     6       2
22      17825583        17825839        trf     372     7       3
22      17825583        17825839        trf     372     8       3
22      17825583        17825839        trf     372     9       3
22      17825583        17825839        trf     372     10      3

Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22      17825583        17825839        trf     372     2.96875
22      16506905        16558419        trf     20099   7.6543852156695271205
22      16506905        16558419        trf     23903   7.6543852156695271205
22      16506941        16537880        trf     14401   7.47002165551569241586
22      16515001        16516375        trf     832     4.88427947598253275885
22      17039134        17049375        trf     5493    6.38755980861244054836
22      17039134        17049375        trf     4544    6.38755980861244054836
22      17039134        17049375        trf     5071    6.38755980861244054836
22      18087713        18088319        trf     759     5.42904290429042912791
22      19529662        19529964        trf     469     5.91721854304635730415

Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:

$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean

>  In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?

There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.

Best,
Aaron

Monday, 12 December 2011

How much coverage / throughput for my RNA-seq?

One of the earliest questions to bug anyone planning an RNA-seq experiment has to be the throughput (how many reads do I need?)

If you are dealing with human samples, you have the benefit of extensive publications with example coverages and some papers that test the limits of detection. All of this info is nicely summarised here in experimental design considerations in RNA-Seq.

Bashir et al. have  concluded that more than 90% of the transcripts in human samples are adequately covered with just one million  sequence reads.  Wang et al. showed that 8 million reads are sufficient to reach RNA-Seq saturation for most  samples

The ENCODE consortium also has published a Guidelines for Experiments within you can read RNA Standards v1.0 (May 2011) and also RNA-seq Best Practices (2009)


Experiments whose purpose is to evaluate the similarity between the
transcriptional profiles of two polyA+ samples may require only modest depths of
sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are
mappable to the genome or known transcriptome, Experiments whose purpose is
discovery of novel transcribed elements and strong quantification of known
transcript isoforms requires more extensive sequencing. The ability to detect
reliably low copy number transcripts/isoforms depends upon the depth of
sequencing and on a sufficiently complex library.


RNA-seq blog also covers this issue in How Many Reads are Enough? Where they cited an article on RNA-seq in chicken lungs

The analysis from the current study demonstrated that 30 M (75 bp) reads is sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes.

There are also papers that showed that RNA-seq gives reproducible results when sequenced from the same RNA-seq library which means that if coverage isn't enough, it is possible to sequence more using the same library and not have it affect your results. The real issue then becomes whether  you have planned for additional sequencing with your budget.



References
Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by  SpliceMap. Nucleic acids research 38, 4570-4578 (2010).

Maher, C.A., Palanisamy, N., Brenner, J.C., Cao, X., Kalyana-Sundaram, S., Luo, S., Khrebtukova, I., Barrette, T.R.,  Grasso, C., Yu, J., Lonigro, R.J., Schroth, G., Kumar-Sinha, C. & Chinnaiyan, A.M. Chimeric transcript discovery by  paired-end transcriptome sequencing. Proceedings of the National Academy of Sciences of the United States of America   106, 12353-12358 (2009).

Bashir, A., Bansal, V. & Bafna, V. Designing deep sequencing experiments: detecting structural variation and estimating  transcript abundance. BMC genomics 11, 385 (2010).

Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews 10, 57-63 (2009).


Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H. (2011) Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade, College Station, TX, USA. 1-2 April 2011. [article

Monday, 4 July 2011

genomeCoverageBed to look at coverage of your WGS

BEDTools is a very useful set of programs for looking at your NGS sequencing result. 


one of which I use regularly is 


    $ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv


so you only need your bam file and a genome file (tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome.
This file can be done automatically with 

   $  samtools faidx reference_genome.fasta

which generates a .fai file which contains the info required. (I very nearly went and wrote a bioperl script to generate this, luckily I remembered the contents of the fai file.




Update: 
Do read the helpful comments by Micheal for using samtools idxstats on the bam file. And the shortcomings of looking at the coverage this way. 
He has a helpful recent post here 

Accurate genome-wide read depth calculation (how-to)



I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacteria reseq project. I like the coverage plot in IGV but am not sure if there are opensource tools to do the same. 
Any tips to share?


Other references
Discussion at BioStar

Datanami, Woe be me