Thursday, 22 March 2012

Multi-threaded BAM compression and sorting The multi-threaded sort/merge/view is available at the "mt" branch:

---------- Forwarded message ----------
From: Heng Li
Date: Thu, Mar 22, 2012 at 11:32 AM
Subject: [Samtools-help] Multi-threaded BAM compression and sorting

To convert coordinated sorted BAM back to FASTQ, the recommended way is to sort BAM by name and then convert the name sorted BAM to fastq. This is important because some mapper such as BWA assumes the input is random. They may have some troubles if we directly convert a coordinate sorted BAM with Picard's bam2fastq. While novosort, it does not sort by name. As I need to do BAM=>FASTQ for some huge BAMs, I added multi-threading to "sort", "merge" and "view".

This is not a full parallelization in that not all the steps are parallelized. Thus the efficiency is not scaled linearly with the number of threads. It is not recommended to use more than 8 threads. With 4 threads, time on sorting is reduced to 40% according to limited test. It may save you half a day if you have a huge BAM.

All my changes are naive and simple. It is possible to speed up sorting and compression further, but so far as I can see, this needs quite a lot of code restructuring and development time. For coordinate sorting, novosort scales much better with the number of threads (though I do not know if multi-threaded novosort is free to use beyond 15 days). Nils' multi-threaded bgzip should also do better on compression. These are sophisticated implementations. Mine is not.

The multi-threaded sort/merge/view is available at the "mt" branch:

The samtools/bgzf APIs stay the same except a few new functions to enable threading. In addition to multithreading, there are a few other improvements to sorting (some are based on Nils'):

1) @HD-SO tag is properly set (finally).

2) The compression level can be changed on the command line (-l).

3) Coordinate sorting considers strand as part of the key.

4) Improved alpha-numeric comparison between query names. The previous version was slower and did not work when there is a large integer.

5) Supporting "K/M/G" with option "-m". The maximum memory is estimated a little more accurately.

6) I kept claiming samtools sort was stable (i.e. the relative order of two records having the same coordinate are retained), but this was not true. The new sort is truly stable. This also means under the same compression level, sort always produces exactly the same output. For endusers, stable sorting is largely irrelevant. This just makes me feel more comfortable, "in theory".


 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

contributed by iceman  (see comments )
relevant links to an alternative implementation: 

1 comment:

  1. My name's in there, so here are the relevant links to an alternative implementation:


Datanami, Woe be me