Monday 4 July 2011

genomeCoverageBed to look at coverage of your WGS

BEDTools is a very useful set of programs for looking at your NGS sequencing result. 


one of which I use regularly is 


    $ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv


so you only need your bam file and a genome file (tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome.
This file can be done automatically with 

   $  samtools faidx reference_genome.fasta

which generates a .fai file which contains the info required. (I very nearly went and wrote a bioperl script to generate this, luckily I remembered the contents of the fai file.




Update: 
Do read the helpful comments by Micheal for using samtools idxstats on the bam file. And the shortcomings of looking at the coverage this way. 
He has a helpful recent post here 

Accurate genome-wide read depth calculation (how-to)



I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacteria reseq project. I like the coverage plot in IGV but am not sure if there are opensource tools to do the same. 
Any tips to share?


Other references
Discussion at BioStar

4 comments:

  1. You can also just do a samtools idxstats on the indexed BAM file and the first two columns are the chromosome and size.

    I have an issue with genomeCoverageBed used this way, though: It will include all the heterochromatin (centromeres) in its calculation.

    Is it really fair to include that in a calculation of coverage if those sequences aren't even present in the reference genome you aligned to?

    I guess that kind of leads down the path of asking whether genome-wide coverage is a particularly meaningful statistic at all, and if it's not, what replacement statistic is.

    A simple solution is to use coverageBed instead, and utilize the "gaps" track from UCSC (Tables->All Tracks->Gap) for your particular genome. This track basically just has blocks where there is no genomic assembly in that version of the genome (and therefore, those positions inherently are not mappable since the reference isn't even known). Instead of using the whole genome, calculate coverage across everything that isn't in a gap. Of course, not everything outside the gaps is mappable, but a much greater percent of that region is and, at the very least, those bases are present in the reference genome.

    ReplyDelete
  2. This is one of those cases where Bioconductor is especially useful because you have access to all the genome metadata contained in BSGenomes:
    http://jermdemo.blogspot.com/2010/12/chromosome-bias-in-r-my-notebook.html

    ReplyDelete
  3. BAM QC tool from QualiMap (http://qualimap.org) provides a coverage plot. See example output here: http://qualimap.bioinfo.cipf.es/samples/ERR089819_result/qualimapReport.html

    ReplyDelete
  4. Thanks okKo for suggesting excellent tool. Thanks kevin for maintaing blog, it is really helpful for newbies.

    ReplyDelete

Datanami, Woe be me