My Weblog on Bioinformatics, Genome Science and Next Generation Sequencing
Wednesday, 27 October 2010
de novo assembly of large genomes
Here's an informative post by Ewan Birney on the velvet user list about de novo assembly of large genomes.
Velvet's algorithms in theory work for any genome size. However, the engineering aspects of Velvet, in particular its memory consumption, mean it's unable to handle read sets beyond a certain size. Where that limit falls of course depends on how much real memory your machine has. I know we have "routinely" (ie, for multiple strains) done Drosophila-sized genomes (~120MB) on a 125GB machine. I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory consumption depends not just on the size of the genome but also on how error-prone your reads are (though sheer size is important).

Beyond this there are a variety of strategies. "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs, can be made using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published, well understood, from the BC genome centre). Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions, and thus provide an improved, more read-pair-aware graph. This can be iterated, and in at least some cases the Curtain approach gets close to what Velvet can produce alone (in the scenarios where Velvet can be run on a single-memory machine, so Curtain's performance can be compared).

SOAPdenovo from the BGI is responsible for a number of the published assemblies (eg, Panda, YH), although, like many assemblers, tuning it seems quite hard, and I would definitely be asking the BGI guys for advice. A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but is not quite released yet.

In all the above cases I know of successes, but also quite a few failures, and untangling data quality/algorithm/choice of parameters/running bugs is really complex. So: whereas assemblies <100MB are "routine", assemblies of 100MB-500MB are currently "challenging", and >500MB are theoretically doable, and have been done by specific groups, but I think they are still at the leading edge of development, and one should not be confident of success for "any particular genome".
Thanks to Ewan for letting me reproduce his post here.
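To make Ewan's point about memory and read errors concrete: in a de Bruijn graph assembler like Velvet, every distinct k-mer in the read set becomes a node of the graph, so a single substitution error can introduce up to k spurious k-mers that the genome itself never contains. Here is a minimal Python sketch of that effect (a toy simulation I wrote for illustration, not code from any assembler) that counts distinct k-mers in error-free versus 1%-error reads from a simulated genome:

import random

def kmers(seq, k):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def debruijn_nodes(reads, k):
    """Collect the distinct k-mers (de Bruijn graph nodes) in a read set."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    return nodes

def mutate(read, error_rate):
    """Introduce random substitution errors at the given per-base rate."""
    bases = "ACGT"
    out = []
    for base in read:
        if random.random() < error_rate:
            out.append(random.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(100000))
k, read_len, coverage = 31, 100, 30
n_reads = coverage * len(genome) // read_len

starts = [random.randrange(len(genome) - read_len) for _ in range(n_reads)]
clean = [genome[s:s + read_len] for s in starts]
noisy = [mutate(r, 0.01) for r in clean]  # 1% substitution errors

print("nodes, error-free reads:", len(debruijn_nodes(clean, k)))
print("nodes, 1% error reads:  ", len(debruijn_nodes(noisy, k)))

On this toy 100kb genome at 30x coverage, the error-free reads produce roughly a genome's worth of distinct k-mers, while the 1%-error reads produce several times as many, since at k=31 almost every read carries an error and each error spawns a run of novel k-mers. That multiplication of nodes, scaled up to a mammalian-sized genome, is why memory blows up with read quality as well as with genome size.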