Here's what I use for bwa alignment (without removing PCR dups).
You can replace the paths with your own and put into a bash script for automation
comments or corrections welcome!
#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta
#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample
#aln using 4 threads
#-l 25 seed length
#-k 2 mismatches allowed in seed
#-n 10 total mismatches allowed
bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai
#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam
#creates bam file from pre-generated .fai file
samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam
#sorts bam file
samtools sort Sample.bwa.hg18.sam.bam{,.sorted}
#From a sorted BAM alignment, raw SNP and indel calls are acquired by:
samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup
#resultant output should be further filtered by:
/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup
#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
Do note the helpful comments below! Repost here for clarity.
Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.
Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.
Also:
1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".
2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."
The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).
2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.
3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)
4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.
5) View the alignments with samtools tview.
You can replace the paths with your own and put into a bash script for automation
comments or corrections welcome!
#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta
#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample
#aln using 4 threads
#-l 25 seed length
#-k 2 mismatches allowed in seed
#-n 10 total mismatches allowed
bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai
#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam
#creates bam file from pre-generated .fai file
samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam
#sorts bam file
samtools sort Sample.bwa.hg18.sam.bam{,.sorted}
#From a sorted BAM alignment, raw SNP and indel calls are acquired by:
samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup
#resultant output should be further filtered by:
/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup
#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
Do note the helpful comments below! Repost here for clarity.
Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.
Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.
Also:
1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".
2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."
The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).
2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.
3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)
4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.
5) View the alignments with samtools tview.
I agree with this pipeline, but... Why don't you use -l (seed option) in the alignment step (bwa aln)?
ReplyDeleteHmmm from this in the manual "For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’." so with the thinking that SOLiD reads are short (50 bp) I assumed that I need not specify seed length?
ReplyDeleteI have not experimented with this but how does it affect your alignment? what seed length do you use for SOLiD data?
I had supposed that you normally use the default value for colorspace. I think it is correct.
ReplyDeleteFor illumina alignment I use 80% of the read length.
Based on the 'highly popular' seed and extend of the Bioscope software and the knowledge that NGS reads quality taper off at the 3' end I might do some experiments with giving a seed length of 25 to see how it works out.
ReplyDeleteThanks!
I would up the "-n" option to be more sensitive in mapping, sine SOLiD has a high raw color space error rate. Also, add in "-e" if you want to detect longer indels.
ReplyDeleteWhat value would you suggest for -n and -e?
ReplyDeleteBtw I haven't noticed indels in fragment mapping at all. I would presume there would be some small indels at least. am I doing something wrong to miss it?
Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.
ReplyDeleteAlso, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.
ReplyDeleteAlso:
1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".
2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."
The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).
2) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.
3) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)
4) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.
5) View the alignments with samtools tview.
I have one question:
ReplyDeletebesides the content that shown in .sam file(alignment of the best match, and number of suboptimal/all hits), seems it also contains some information of the suboptimal hits, is it possible to look at the details of these hits.
with the command "bwa samse -n INT", I can only get the position where they mapped and number of mismatch.
I'm not finding any intron-skipping alignments (i.e. with 'N' in the CIGAR string) even with -e 100000. Any idea how to get round this?
ReplyDeleteThis discussion is quite helpful to me. I am using BWA for zebrafish microRNA analysis. I aligned my SOLiD reads (after adaptor trimming) to zebrafish genome using $ bwa aln -n 3 -t 2 -c and didn't get any matching reads to known microRNAs.
ReplyDeleteCan anyone suggest what settings I should modify.
Thanks, Neel
Hi Folks,
ReplyDeletegreat tutorial!
I have got a questions:
When I create the raw pileup and the final pileup I can't view them. Samtools is telling me "this is not a bam file, header...". Without calling it works fine, so the the files before pileup... So something went wrong in these 2 steps.
So, what is your /data/public/bowtie-color-index/hg18.fasta
Is it hg18 to colorspace using bowtie? I used the colorspaced hg18 created by bwa. This doesn't seem to work.
Got a hint for me?
With best!