Wednesday, 29 August 2012

region based calling in SAMtools Mpileup

Ok my new favourite feature about samtools mpileup is that you can make use of the bam indexing to specify regions you wish to do mpileup on (now I can easily max out that 250 core system in shared HPC resources! )

Calling SNPs/INDELs in small regions (see )
  1. splitchr -l 500000 | xargs -i \  
  2.   echo samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list \| bcftools \  
  3.   view -bcvg - \> part-{}.bcf

Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.

No comments:

Post a Comment

Datanami, Woe be me