Wednesday, 25 August 2010

howto do BWA mapping in colorspace

Here's what I use for bwa alignment (without removing PCR dups).
You can replace the paths with your own and put into a bash script for automation
comments or corrections welcome!


#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta

#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample

#aln using 4 threads
#-l 25        seed length
#-k 2         mismatches allowed in seed
#-n 10      total mismatches allowed

bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai

#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam

#creates bam file from pre-generated .fai file

samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam

#sorts bam file

samtools sort Sample.bwa.hg18.sam.bam{,.sorted}

#From a sorted BAM alignment, raw SNP and indel calls are acquired by:

samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup

#resultant output should be further filtered by:

/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup



#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf


Do note the helpful comments below! Repost here for clarity.

Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.




Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.


Also:


1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".


2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."


The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).


2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.


3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)


4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.


5) View the alignments with samtools tview.

12 comments:

  1. I agree with this pipeline, but... Why don't you use -l (seed option) in the alignment step (bwa aln)?

    ReplyDelete
  2. Hmmm from this in the manual "For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’." so with the thinking that SOLiD reads are short (50 bp) I assumed that I need not specify seed length?

    I have not experimented with this but how does it affect your alignment? what seed length do you use for SOLiD data?

    ReplyDelete
  3. I had supposed that you normally use the default value for colorspace. I think it is correct.
    For illumina alignment I use 80% of the read length.

    ReplyDelete
  4. Based on the 'highly popular' seed and extend of the Bioscope software and the knowledge that NGS reads quality taper off at the 3' end I might do some experiments with giving a seed length of 25 to see how it works out.
    Thanks!

    ReplyDelete
  5. I would up the "-n" option to be more sensitive in mapping, sine SOLiD has a high raw color space error rate. Also, add in "-e" if you want to detect longer indels.

    ReplyDelete
  6. What value would you suggest for -n and -e?

    Btw I haven't noticed indels in fragment mapping at all. I would presume there would be some small indels at least. am I doing something wrong to miss it?

    ReplyDelete
  7. Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.

    ReplyDelete
  8. Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

    Also:

    1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

    2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."

    The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

    2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

    3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

    4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

    5) View the alignments with samtools tview.

    ReplyDelete
  9. I have one question:

    besides the content that shown in .sam file(alignment of the best match, and number of suboptimal/all hits), seems it also contains some information of the suboptimal hits, is it possible to look at the details of these hits.
    with the command "bwa samse -n INT", I can only get the position where they mapped and number of mismatch.

    ReplyDelete
  10. I'm not finding any intron-skipping alignments (i.e. with 'N' in the CIGAR string) even with -e 100000. Any idea how to get round this?

    ReplyDelete
  11. This discussion is quite helpful to me. I am using BWA for zebrafish microRNA analysis. I aligned my SOLiD reads (after adaptor trimming) to zebrafish genome using $ bwa aln -n 3 -t 2 -c and didn't get any matching reads to known microRNAs.

    Can anyone suggest what settings I should modify.

    Thanks, Neel

    ReplyDelete
  12. Hi Folks,

    great tutorial!

    I have got a questions:

    When I create the raw pileup and the final pileup I can't view them. Samtools is telling me "this is not a bam file, header...". Without calling it works fine, so the the files before pileup... So something went wrong in these 2 steps.

    So, what is your /data/public/bowtie-color-index/hg18.fasta

    Is it hg18 to colorspace using bowtie? I used the colorspaced hg18 created by bwa. This doesn't seem to work.

    Got a hint for me?

    With best!

    ReplyDelete

Datanami, Woe be me