Wednesday, 27 October 2010

de novo assembly of large genomes

Here's an informative post by Ewan Birney on the velvet-users list about de novo assembly of large genomes.

Velvet's algorithms in theory work for any size. However, the engineering aspects
of Velvet, in particular memory consumption, mean it's unable to handle read sets
above a certain size. This of course depends on how big a real memory machine
you have.

I know we have "routinely" (i.e., for multiple strains) done Drosophila-sized genomes
(~120MB) on a 125GB machine.

I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory
size is not just about the size of the genome but also how error prone your reads
are (though sheer size is important).

Beyond this there are a variety of strategies:

  "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs, can
be made using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published,
well understood, from the BC genome centre).

   Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a
smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions
and thus provide an improved, more read-pair-aware graph. This can be iterated and, in
at least some cases, the Curtain approach gets close to what Velvet can produce alone
(in the scenarios where Velvet can be run on a single memory machine to understand
Curtain's performance).

   SOAPdenovo from the BGI is responsible for a number of the published assemblies
(e.g., Panda, YH) although, like many assemblers, tuning it seems quite hard, and I would
definitely be asking the BGI guys for advice.

   A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.

In all of the above cases I know of successes, but also quite a few failures, and untangling
data quality/algorithm/choice of parameters/running bugs is really complex. So, whereas
assemblies < 100MB are "routine", currently assemblies 100MB-500MB are "challenging" and
>500MB are theoretically doable, and have been done by specific groups, but I think they are still
at the leading edge of development and one should not be confident of success for
"any particular genome".

Thanks to Ewan for letting me reproduce his post here.
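
As an aside of my own (not from Ewan's post): the point about error-prone reads driving memory use can be seen with a toy simulation. A de Bruijn assembler has to store roughly one node per distinct k-mer, and every sequencing error can create up to k k-mers that do not exist in the genome. The sketch below is plain Python, not Velvet code; the random genome, read length, coverage and 1% error rate are all made-up assumptions chosen only for illustration.

```python
import random


def distinct_kmers(reads, k):
    """Return the set of distinct k-mers seen across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def simulate_reads(genome, read_len, n_reads, error_rate, rng):
    """Sample reads uniformly from the genome, substituting bases at error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace with one of the three other bases (a substitution error).
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


rng = random.Random(0)  # fixed seed so the toy run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # hypothetical 5 kb genome
k = 21

clean = simulate_reads(genome, 50, 2000, 0.0, rng)   # error-free reads
noisy = simulate_reads(genome, 50, 2000, 0.01, rng)  # assumed 1% substitution rate

clean_count = len(distinct_kmers(clean, k))
noisy_count = len(distinct_kmers(noisy, k))
print("distinct k-mers, clean reads:", clean_count)
print("distinct k-mers, noisy reads:", noisy_count)
```

Both read sets sample the same genome at the same coverage, yet the noisy set should produce noticeably more distinct k-mers. That is the sense in which memory depends on read quality as well as genome size.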

Velvet-users mailing list

Based on the mailing list, Cortex seems very promising for de novo assembly of human reads using a reasonable amount of RAM (128 GB). I know I'll be watching out for it on SourceForge!
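
For readers who haven't met the term, here is a minimal sketch of what a "raw" de Bruijn graph is: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, with no use of read pairs. This is my own toy Python illustration, not how Cortex, ABySS or Velvet actually store their graphs (real assemblers use far more compact structures and also handle reverse complements, which this ignores). The function name and the example reads are invented.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two overlapping toy reads; "ACG" is followed by different bases in each,
# so the graph branches at that node.
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, 4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Nodes with more than one successor (here `ACG`) are where an assembler has to decide which path to follow, and that is the kind of ambiguity read-pair information helps to resolve.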
