Showing posts with label ray. Show all posts

Friday, 15 July 2011

Performance of Ray @ Assemblathon 2

Ray is one of the assemblers that I watch closely but sadly lack the time to experiment with. Here's the candid email from Sébastien to the Ray mailing list on Ray's performance on the test data:

For those who follow Assemblathon 2, my last run on my testbed (Illumina data from BGI and from Illumina UK for the Bird/Parrot):

(all mate-pairs failed detection because of the many peaks in each library, I will modify Ray to consider that)



Total number of unfiltered Illumina TruSeq v3 sequences: 3 072 136 294, that is ~3 G sequences!


512 compute cores (64 computers * 8 cores/computer = 512)


Typical communication profile for one compute core:


[1,0]:Rank 0: sent 249841326 messages, received 249840303 messages.


Yes, each core sends an average of 250 M messages during the 18 hours !




Peak memory usage per core: 2.2 GiB


Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB


The peak occurs at around 3 hours and drops to 1.1 GiB per node immediately afterwards, because the pool of defragmentation groups for k-mers occurring once is freed.



The compute cluster I use has 3 GiB per compute core. So using 2048 compute cores would give me 6144 GiB of distributed memory.
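The memory figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes every core reaches its 2.2 GiB peak at the same moment, which the email does not state, and the variable names are mine.

```python
# Figures quoted in the email above.
cores = 512
peak_gib_per_core = 2.2
cluster_gib_per_core = 3

# Upper bound on the aggregate peak, assuming all cores peak simultaneously.
aggregate_peak_gib = cores * peak_gib_per_core
print(f"aggregate peak across {cores} cores: {aggregate_peak_gib:.1f} GiB")

# Headroom left on each core at the peak.
headroom_gib = cluster_gib_per_core - peak_gib_per_core
print(f"headroom per core at peak: {headroom_gib:.1f} GiB")

# Distributed memory available if the run were spread over 2048 cores.
memory_2048_gib = 2048 * cluster_gib_per_core
print(f"distributed memory on 2048 cores: {memory_2048_gib} GiB")
```

The aggregate comes out slightly above the reported 1100 GiB, which is what you would expect if the reported value is rounded or if not all cores peak together.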




Number of contigs: 550764

Total length of contigs: 1672750795
Number of contigs >= 500 nt: 501312
Total length of contigs >= 500 nt: 1656776315
Number of scaffolds: 510607
Total length of scaffolds: 1681345451
Number of scaffolds >= 500 nt: 463741
Total length of scaffolds >= 500: 1666464367
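To put the contig and scaffold counts in perspective, here is a small sketch that derives mean lengths and the share of sequences of at least 500 nt from the numbers above. The dictionary layout and names are my own, not part of Ray's output.

```python
# Counts and total lengths copied from the report above.
stats = {
    "contigs": {"n": 550764, "length": 1672750795,
                "n_500": 501312, "length_500": 1656776315},
    "scaffolds": {"n": 510607, "length": 1681345451,
                  "n_500": 463741, "length_500": 1666464367},
}

summary = {}
for kind, s in stats.items():
    summary[kind] = {
        # Average length over all sequences of this kind.
        "mean_length": s["length"] / s["n"],
        # Fraction of sequences that are at least 500 nt long.
        "fraction_500": s["n_500"] / s["n"],
        # Fraction of the assembled bases held in those sequences.
        "bases_in_500": s["length_500"] / s["length"],
    }
    print(kind, {k: round(v, 3) for k, v in summary[kind].items()})
```

Note that the mean is a blunt measure for assemblies; the email gives no N50, so it cannot be reconstructed here.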

k-mer length: 31

Lowest coverage observed: 1
MinimumCoverage: 42
PeakCoverage: 171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
Estimated genome length: 1226739694 nucleotides
Percentage of vertices with coverage 1: 83.7771 %
DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt
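The estimated genome length is exactly half the number of k-mers with at least MinimumCoverage. A plausible reading, and it is only my inference from the two numbers, is that each genomic position is counted once per strand; the email does not describe how Ray computes the estimate.

```python
# Values copied from the report above.
kmers_at_min_coverage = 2453479388
reported_genome_length = 1226739694

# Assumption: every position contributes one k-mer per strand,
# so halving the k-mer count gives the genome length.
estimated_genome_length = kmers_at_min_coverage // 2

print(estimated_genome_length == reported_genome_length)
```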

[1,0]: Sequence partitioning: 1 hours, 54 minutes, 47 seconds

[1,0]: K-mer counting: 5 hours, 47 minutes, 20 seconds
[1,0]: Coverage distribution analysis: 30 seconds
[1,0]: Graph construction: 2 hours, 52 minutes, 27 seconds
[1,0]: Edge purge: 57 minutes, 55 seconds
[1,0]: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
[1,0]: Detection of assembly seeds: 16 minutes, 7 seconds
[1,0]: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
[1,0]: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
[1,0]: Merging of redundant contigs: 15 minutes, 45 seconds
[1,0]: Generation of contigs: 1 minutes, 41 seconds
[1,0]: Scaffolding of contigs: 54 minutes, 3 seconds
[1,0]: Total: 18 hours, 3 minutes, 50 seconds
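To see where the 18 hours go, the timing lines above can be parsed and ranked. The sketch below strips the "[1,0]:" prefix by hand and uses helper names of my own; it also reuses the sent-message count for rank 0 quoted earlier in the email to get a rough message rate.

```python
import re

LOG = """\
Sequence partitioning: 1 hours, 54 minutes, 47 seconds
K-mer counting: 5 hours, 47 minutes, 20 seconds
Coverage distribution analysis: 30 seconds
Graph construction: 2 hours, 52 minutes, 27 seconds
Edge purge: 57 minutes, 55 seconds
Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
Detection of assembly seeds: 16 minutes, 7 seconds
Estimation of outer distances for paired reads: 6 minutes, 26 seconds
Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
Merging of redundant contigs: 15 minutes, 45 seconds
Generation of contigs: 1 minutes, 41 seconds
Scaffolding of contigs: 54 minutes, 3 seconds
Total: 18 hours, 3 minutes, 50 seconds
"""

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}


def to_seconds(duration):
    """Convert text such as '1 hours, 54 minutes, 47 seconds' to seconds."""
    parts = re.findall(r"(\d+) (hours|minutes|seconds)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


steps = {}
for line in LOG.splitlines():
    name, duration = line.split(": ", 1)
    steps[name] = to_seconds(duration)

total = steps.pop("Total")

# Rank the steps by their share of the wall-clock time.
for name, seconds in sorted(steps.items(), key=lambda item: -item[1]):
    print(f"{name:48s} {100 * seconds / total:5.1f} %")

# Time not attributed to any listed step.
unaccounted = total - sum(steps.values())
print(f"unaccounted: {unaccounted} seconds")

# Rough message rate for rank 0, using its sent-message count from the email.
messages_per_second = 249841326 / total
print(f"rank 0 sent about {messages_per_second:.0f} messages per second")
```

K-mer counting, bidirectional extension of seeds and graph construction together account for roughly two thirds of the run.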


10 largest scaffolds:


257646

266905
268737
272828
281502
294105
294106
296978
333171
397201


# average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes

# Message passing interface rank Name Latency in microseconds
0 r107-n24 138
1 r107-n24 140
2 r107-n24 140
3 r107-n24 140
4 r107-n24 141
5 r107-n24 141
6 r107-n24 140
7 r107-n24 140
8 r107-n25 140
9 r107-n25 139
10 r107-n25 138
11 r107-n25 139
  Sébastien
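The latency table from the email is easy to summarise per machine. A minimal sketch, with the rows copied from the email and the grouping logic my own:

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the table: MPI rank, node name, latency in microseconds.
TABLE = """\
0 r107-n24 138
1 r107-n24 140
2 r107-n24 140
3 r107-n24 140
4 r107-n24 141
5 r107-n24 141
6 r107-n24 140
7 r107-n24 140
8 r107-n25 140
9 r107-n25 139
10 r107-n25 138
11 r107-n25 139
"""

# Group the latencies by node name.
latencies_by_node = defaultdict(list)
for row in TABLE.splitlines():
    rank, node, latency = row.split()
    latencies_by_node[node].append(int(latency))

# Mean latency per node, in microseconds.
node_means = {node: mean(values) for node, values in latencies_by_node.items()}
for node, value in node_means.items():
    print(f"{node}: {value} microseconds over {len(latencies_by_node[node])} ranks")
```

Only the first twelve ranks are shown in the email, so this says nothing about the other 500 cores.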

Tuesday, 22 March 2011

de novo assembly of Illumina CEO genome in 11.5 h - new ver of Ray

Kevin: You can't ignore an email with that subject header... but 512 compute cores? I shall have a chat with my HPC vendor...
I am also waiting for the public release of Cortex: http://sourceforge.net/projects/cortexassembler/
Strange that courses that teach the software are available but the software ain't...
http://www.ebi.ac.uk/training/onsite/NGS_120510.html


Velvet and Curtain seem promising for de novo assembly as well.

Ray 1.3.0 is now available online.
http://sourceforge.net/projects/denovoassembler/files/Ray-1.3.0.tar.bz2

The most important change is the correction of a major bug that caused
a parallel infinite loop on the human genome.

This, along with concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?

1.3.0

2011-03-22

   * Vertices with less than 1 of coverage are ignored during the
computation of seeds and during the computation of extensions.
   * Computation of library outer distances relies on the virtual
communicator.
   * Expiry positions are used to toss away reads that are out-of-range
   * When only one choice is given during the extension and some reads
are in-range, then the sole choice is picked up.
   * Fixed a bug for empty reads.
   * A read is not added in the active set if it is marked on a
repeated vertex and its mate was not encountered yet.
   * Grouped messages in the extension of seeds.
   * Reads marked on repeated vertices are cached during the extension.
   * Paths are cached in the computation of fusions.
   * Fixed an infinite loop in the extension of seeds.
   * When fetching read markers for a vertex, send a list of mates to
meet if the vertex is repeated in order to reduce the communication.
   * Updated the Instruction Manual
   * Added a version of the logo without text.


I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.
http://dskernel.blogspot.com/2011/03/de-novo-assembly-of-illumina-ceo-genome.html


Version 1.2.4 of Ray also incorporated new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:
http://boisvert.info/dropbox/recomb-seq-2011-talk.pdf


Sébastien Boisvert

Datanami, Woe be me