Monday, 25 February 2013

[Bio-bwa-help] A new alignment algorithm merged into master (beta phase)

---------- Forwarded message ----------
From: Heng Li
Date: Mon, Feb 25, 2013 at 1:34 PM
Subject: [Bio-bwa-help] A new alignment algorithm merged into master (beta phase)

Bwa-mem (mem for maximal exact match) is a new algorithm that essentially seeds alignments with the fastmap/fermi-exact algorithm and then extends the seeds with Smith-Waterman (SW). It combines key features of bwa-backtrack, the first algorithm, and bwa-sw, and aims to replace both for high-quality 100bp-100kbp sequences. I made this move because bwa-backtrack will fail to deliver satisfactory results for 150bp+ reads (which a few centers have already observed), while bwa-sw is relatively slow and does not achieve the accuracy that I think is possible given longer reads. I recommend that current bwa-backtrack users keep an eye on the progress of bwa-mem: you will have to change the mapper when HiSeq reads reach 150bp+.

At present, bwa-mem has the basic elements of a typical aligner. On a couple of simulated data sets, it runs at a speed similar to bowtie2 and bwa-backtrack, is twice as fast as bwa-sw, and is more accurate. It can also achieve the same accuracy as bowtie2/bwa-sw in half the computing time. There are, though, still a few important items on the TODO list: fine-tuning the algorithm for better performance; testing on more data sets; testing on BAC-sized long sequences; fixing bugs. As I have merged bwa-mem into the master branch, I need your feedback to push it forward. This is also a good time to request features in bwa-mem, while I am actively working on it.

Thank you,


PS: I do not plan to add BAM support to bwa-mem, but because bwa-mem reads fastq only once, this is less of a concern. You can run:

samtools bam2fq reads.bam | bwa mem -p ref.fa -

to map paired-end reads. Here '-p' indicates that the input is interleaved fasta/q. You can also put BAM optional tags (e.g. barcodes) in the fasta/q comment. With "-C", mem/bwasw will copy these tags to the final SAM output. Bwa-mem also supports more advanced piping such as:

bwa mem ref.fa '<bzip2 -dc read1.bz2' '<bzip2 -dc read2.bz2'

which is equivalent to

bwa mem ref.fa <(bzip2 -dc read1.bz2) <(bzip2 -dc read2.bz2)

but does not require bash support. The former still works when you launch bwa from tcsh or outside a shell.
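
For paired-end reads that already sit in two separate fastq files, the interleaved layout that '-p' expects can be produced on the fly. The following is only a minimal sketch: the function name interleave_fastq is hypothetical (it is not part of bwa or samtools), and it assumes unwrapped fastq with exactly four lines per record and mates listed in the same order in both files.

```shell
# Hypothetical helper, not part of bwa: interleave two fastq files
# record by record and write the result to stdout.
# Assumes 4-line (unwrapped) fastq records and identical read order.
interleave_fastq() {
    awk -v mate2="$2" '
        { print }                      # pass through each line of the first file
        NR % 4 == 0 {                  # a full record from the first file is done
            for (i = 0; i < 4; i++) {  # emit the matching record from the second file
                if ((getline line < mate2) > 0) print line
            }
        }
    ' "$1"
}

# Possible usage, mirroring the command above:
#   interleave_fastq read1.fq read2.fq | bwa mem -p ref.fa -
```

Because the header lines are copied unchanged, any tags placed in the fastq comment remain available to "-C".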

