Kevin's GATTACA World: howto do BWA mapping in colorspace

Wednesday, 25 August 2010

howto do BWA mapping in colorspace

Here's what I use for bwa alignment (without removing PCR dups).
You can replace the paths with your own and put into a bash script for automation
comments or corrections welcome!

#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta

#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample

#aln using 4 threads
#-l 25        seed length
#-k 2         mismatches allowed in seed
#-n 10      total mismatches allowed

bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai

#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam

#creates bam file from pre-generated .fai file

samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam

#sorts bam file

samtools sort Sample.bwa.hg18.sam.bam{,.sorted}

#From a sorted BAM alignment, raw SNP and indel calls are acquired by:

samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup

#resultant output should be further filtered by:

/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup

#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

Do note the helpful comments below! Repost here for clarity.

Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.

Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

Also:

1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."

The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

5) View the alignments with samtools tview.

12 comments:

bioinformatico25 August 2010 at 17:55
I agree with this pipeline, but... Why don't you use -l (seed option) in the alignment step (bwa aln)?
ReplyDelete
Replies
Kevin25 August 2010 at 18:04
Hmmm from this in the manual "For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’." so with the thinking that SOLiD reads are short (50 bp) I assumed that I need not specify seed length?

I have not experimented with this but how does it affect your alignment? what seed length do you use for SOLiD data?
ReplyDelete
Replies
bioinformatico25 August 2010 at 18:18
I had supposed that you normally use the default value for colorspace. I think it is correct.
For illumina alignment I use 80% of the read length.
ReplyDelete
Replies
Kevin25 August 2010 at 18:27
Based on the 'highly popular' seed and extend of the Bioscope software and the knowledge that NGS reads quality taper off at the 3' end I might do some experiments with giving a seed length of 25 to see how it works out.
Thanks!
ReplyDelete
Replies
Anonymous26 August 2010 at 05:51
I would up the "-n" option to be more sensitive in mapping, sine SOLiD has a high raw color space error rate. Also, add in "-e" if you want to detect longer indels.
ReplyDelete
Replies
Kevin27 August 2010 at 10:36
What value would you suggest for -n and -e?

Btw I haven't noticed indels in fragment mapping at all. I would presume there would be some small indels at least. am I doing something wrong to miss it?
ReplyDelete
Replies
Anonymous28 August 2010 at 05:41
Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.
ReplyDelete
Replies
Anonymous28 August 2010 at 07:54
Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

Also:

1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."

The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

5) View the alignments with samtools tview.
ReplyDelete
Replies
Anonymous1 September 2010 at 02:45
I have one question:

besides the content that shown in .sam file(alignment of the best match, and number of suboptimal/all hits), seems it also contains some information of the suboptimal hits, is it possible to look at the details of these hits.
with the command "bwa samse -n INT", I can only get the position where they mapped and number of mismatch.
ReplyDelete
Replies
Anonymous16 September 2010 at 07:47
I'm not finding any intron-skipping alignments (i.e. with 'N' in the CIGAR string) even with -e 100000. Any idea how to get round this?
ReplyDelete
Replies
Unknown10 December 2010 at 04:31
This discussion is quite helpful to me. I am using BWA for zebrafish microRNA analysis. I aligned my SOLiD reads (after adaptor trimming) to zebrafish genome using $ bwa aln -n 3 -t 2 -c and didn't get any matching reads to known microRNAs.

Can anyone suggest what settings I should modify.

Thanks, Neel
ReplyDelete
Replies
md3 May 2011 at 17:24
Hi Folks,

great tutorial!

I have got a questions:

When I create the raw pileup and the final pileup I can't view them. Samtools is telling me "this is not a bam file, header...". Without calling it works fine, so the the files before pileup... So something went wrong in these 2 steps.

So, what is your /data/public/bowtie-color-index/hg18.fasta

Is it hg18 to colorspace using bowtie? I used the colorspaced hg18 created by bwa. This doesn't seem to work.

Got a hint for me?

With best!
ReplyDelete
Replies

Add comment

Kevin's GATTACA World

Wednesday, 25 August 2010

howto do BWA mapping in colorspace

12 comments:

Datanami, Woe be me

Analytics code

Contributors