
Tuesday, 22 March 2011

de novo assembly of Illumina CEO genome in 11.5 h - new version of Ray

Kevin: You can't ignore an email with that subject header... but 512 compute cores? I shall have a chat with my HPC vendor...
I'm also waiting for the public release of Cortex: http://sourceforge.net/projects/cortexassembler/
Strange that courses teaching the software are available but the software itself ain't...
http://www.ebi.ac.uk/training/onsite/NGS_120510.html


Velvet and Curtain seem promising for de novo assembly as well.

Ray 1.3.0 is now available online.
http://sourceforge.net/projects/denovoassembler/files/Ray-1.3.0.tar.bz2

The most important change is the correction of a major bug that caused a
parallel infinite loop on the human genome.

This, along with concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?

1.3.0

2011-03-22

   * Vertices with a coverage of less than 1 are ignored during the
computation of seeds and during the computation of extensions.
   * Computation of library outer distances relies on the virtual
communicator.
   * Expiry positions are used to toss away reads that are out of range.
   * When only one choice is given during the extension and some reads
are in range, the sole choice is picked.
   * Fixed a bug for empty reads.
   * A read is not added to the active set if it is marked on a
repeated vertex and its mate has not been encountered yet.
   * Grouped messages in the extension of seeds.
   * Reads marked on repeated vertices are cached during the extension.
   * Paths are cached in the computation of fusions.
   * Fixed an infinite loop in the extension of seeds.
   * When fetching read markers for a vertex, a list of mates to
meet is sent if the vertex is repeated, in order to reduce communication.
   * Updated the Instruction Manual.
   * Added a version of the logo without text.


I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.
http://dskernel.blogspot.com/2011/03/de-novo-assembly-of-illumina-ceo-genome.html


Version 1.2.4 of Ray also incorporated new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:
http://boisvert.info/dropbox/recomb-seq-2011-talk.pdf


Sébastien Boisvert
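
Kevin: For those of us not fluent in assembler changelog-speak, the first item in that list (ignoring low-coverage vertices before computing seeds) is a fairly simple idea. Here is a rough Python sketch of the concept, purely for illustration; Ray itself is C++/MPI and this is not its actual code, and the k-mer size and threshold below are placeholder values of my own.

from collections import Counter

def kmers(read, k):
    # Yield every k-mer of a read (illustrative helper only).
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def coverage_filtered_vertices(reads, k=21, min_coverage=2):
    # Count how often each k-mer (de Bruijn graph vertex) is observed,
    # then drop vertices whose coverage falls below the threshold before
    # seed computation. min_coverage is an adjustable stand-in, not a
    # Ray parameter.
    coverage = Counter()
    for read in reads:
        coverage.update(kmers(read, k))
    return {vertex for vertex, count in coverage.items() if count >= min_coverage}

if __name__ == "__main__":
    toy_reads = ["ACGTACGTACGTACGTACGTACGT",
                 "CGTACGTACGTACGTACGTACGTA",
                 "TTTTTTTTTTTTTTTTTTTTTTTA"]
    print(len(coverage_filtered_vertices(toy_reads, k=21, min_coverage=2)))

Seeds are then built only from the vertices that survive the filter, which keeps sequencing errors from spawning seeds of their own.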

Wednesday, 27 October 2010

de novo assembly of large genomes

Here's an informative post by Ewan Birney on the velvet-users mailing list about de novo assembly of large genomes.

Velvet's algorithms in theory work for any size. However, the engineering aspects
of Velvet, in particular memory consumption, mean it's unable to handle read sets
beyond a certain size. This of course depends on how big a real-memory machine
you have.

I know we have "routinely" (i.e., for multiple strains) done Drosophila-sized genomes
(~120 MB) on a 125 GB machine.

I've heard of Velvet being used in the 200-300 MB region, but rarely further. Memory
size is not just about the size of the genome but also how error-prone your reads
are (though sheer size is important).


Beyond this there are a variety of strategies:

  "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs can
be made using Cortex (unpublished, from Mario Cacamo and Zam Iqbal) or ABySS (published,
well understood, from the BC genome centre).

   Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a
smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions
and thus provide an improved, more read-pair-aware graph. This can be iterated, and in
at least some cases the Curtain approach gets close to what Velvet can produce alone
(in scenarios where Velvet can be run on a single-memory machine, so that Curtain's
performance can be compared).


   SOAP de novo from the BGI is responsible for a number of the published assemblies
(e.g., Panda, YH), although, like many assemblers, tuning it seems quite hard, and I would
definitely be asking the BGI guys for advice.

   A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.

In all the above cases I know of successes, but also quite a few failures, and untangling
data quality / algorithm / choice of parameters / running bugs is really complex. So, whereas
assemblies <100 MB are "routine", assemblies of 100-500 MB are currently "challenging", and
>500 MB are theoretically doable, and have been done by specific groups, but I think they
are still at the leading edge of development and one should not be confident of success for
"any particular genome".


Thanks to Ewan for letting me reproduce his post here.


Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users
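
Ewan's point about "raw" de Bruijn graphs, and about memory depending on how error-prone the reads are, is easier to see with a toy example. Below is a minimal, hypothetical Python sketch of building such a graph by linking overlapping k-mers; it is only a conceptual illustration, not how Velvet, Cortex or ABySS are actually implemented (they use compact, specialised data structures). The thing to notice is that a single base error introduces a run of spurious k-mers, so noisy read sets need more memory than the genome size alone would suggest.

from collections import defaultdict

def build_de_bruijn(reads, k=21):
    # Nodes are (k-1)-mers; every k-mer in a read adds an edge from its
    # (k-1)-mer prefix to its (k-1)-mer suffix.
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

if __name__ == "__main__":
    clean = ["ACGTACGTGGCCAACGTTACGTAC"] * 30                  # 30x coverage, no errors
    noisy = clean[:15] + ["ACGTACGTGGCCTACGTTACGTAC"] * 15     # one substitution in half the reads
    print("graph nodes without errors:", len(build_de_bruijn(clean, k=9)))
    print("graph nodes with errors:   ", len(build_de_bruijn(noisy, k=9)))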



Based on the mailing list, Cortex seems very promising for de novo assembly of human reads using reasonable amounts of RAM (128 GB). I know I'll be watching out for it on SourceForge!
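
The Curtain idea Ewan describes above (partition the reads using an initial de Bruijn graph, assemble each partition with Velvet, and iterate) can also be sketched in a few lines. The binning rule below (assign each read to whichever graph component holds most of its k-mers) is my own simplification for illustration, not Curtain's actual algorithm, and actually running Velvet on each bin is left out.

from collections import defaultdict

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def component_finder(reads, k):
    # Union-find over k-mers: k-mers that appear consecutively in a read
    # are merged into the same rough graph component.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            parent[find(a)] = find(b)
    return find

def partition_reads(reads, k=21):
    # Bin each read by the component that holds most of its k-mers, so each
    # bin can then be assembled separately (e.g., one Velvet run per bin).
    find = component_finder(reads, k)
    bins = defaultdict(list)
    for read in reads:
        counts = defaultdict(int)
        for km in kmers(read, k):
            counts[find(km)] += 1
        if counts:
            bins[max(counts, key=counts.get)].append(read)
    return list(bins.values())

if __name__ == "__main__":
    toy_reads = ["ACGTACGTACGTGGCCAATT", "GTACGTACGTGGCCAATTCC",   # overlapping pair, region 1
                 "TTTTCCCCGGGGAAAATTTT", "TTCCCCGGGGAAAATTTTGG"]   # overlapping pair, region 2
    for i, group in enumerate(partition_reads(toy_reads, k=9)):
        print("bin", i, group)

Each bin is far smaller than the full read set, which is the whole point: the per-bin assemblies fit in memory even when the complete graph would not.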
