Kevin's GATTACA World: genomeCoverageBed to look at coverage of your WGS

Monday, 4 July 2011

genomeCoverageBed to look at coverage of your WGS

BEDTools is a very useful set of programs for looking at your NGS sequencing result.

one of which I use regularly is

$ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv

so you only need your bam file and a genome file (tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome.
This file can be done automatically with

$ samtools faidx reference_genome.fasta

which generates a .fai file which contains the info required. (I very nearly went and wrote a bioperl script to generate this, luckily I remembered the contents of the fai file.

Update:
Do read the helpful comments by Micheal for using samtools idxstats on the bam file. And the shortcomings of looking at the coverage this way.
He has a helpful recent post here

Accurate genome-wide read depth calculation (how-to)

I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacteria reseq project. I like the coverage plot in IGV but am not sure if there are opensource tools to do the same.
Any tips to share?

Other references
Discussion at BioStar

4 comments:

Unknown4 July 2011 at 14:11
You can also just do a samtools idxstats on the indexed BAM file and the first two columns are the chromosome and size.

I have an issue with genomeCoverageBed used this way, though: It will include all the heterochromatin (centromeres) in its calculation.

Is it really fair to include that in a calculation of coverage if those sequences aren't even present in the reference genome you aligned to?

I guess that kind of leads down the path of asking whether genome-wide coverage is a particularly meaningful statistic at all, and if it's not, what replacement statistic is.

A simple solution is to use coverageBed instead, and utilize the "gaps" track from UCSC (Tables->All Tracks->Gap) for your particular genome. This track basically just has blocks where there is no genomic assembly in that version of the genome (and therefore, those positions inherently are not mappable since the reference isn't even known). Instead of using the whole genome, calculate coverage across everything that isn't in a gap. Of course, not everything outside the gaps is mappable, but a much greater percent of that region is and, at the very least, those bases are present in the reference genome.
ReplyDelete
Replies
Jermdemo4 August 2011 at 01:51
This is one of those cases where Bioconductor is especially useful because you have access to all the genome metadata contained in BSGenomes:
http://jermdemo.blogspot.com/2010/12/chromosome-bias-in-r-my-notebook.html
ReplyDelete
Replies
Konstantin Okonechnikov8 June 2012 at 20:22
BAM QC tool from QualiMap (http://qualimap.org) provides a coverage plot. See example output here: http://qualimap.bioinfo.cipf.es/samples/ERR089819_result/qualimapReport.html
ReplyDelete
Replies
jack18 October 2012 at 01:51
Thanks okKo for suggesting excellent tool. Thanks kevin for maintaing blog, it is really helpful for newbies.
ReplyDelete
Replies

Add comment

Kevin's GATTACA World

Monday, 4 July 2011

genomeCoverageBed to look at coverage of your WGS

Accurate genome-wide read depth calculation (how-to)

4 comments:

Datanami, Woe be me

Analytics code

Contributors