Excerpted from GenomeWeb:
According to a Genome Research paper describing the method, the secret to the String Graph Assembler's reduced memory footprint is that it uses compressed data structures to "exploit the redundancy" in sequence reads and to "substantially lower the amount of memory required to perform de novo assembly."
SGA relies on an algorithm its developers published in 2010 that constructs an assembly string graph from a so-called “full-text minute-space” index, or FM-index, which enables searching over a compressed representation of a text. Unlike other short-read assemblers that rely on the de Bruijn graph model, which breaks reads up into k-mers, the string graph model “keeps all reads intact and creates a graph from overlaps between reads,” the Sanger team wrote.
This approach makes even more sense now that sequencing instruments like the Pacific Biosciences RS are generating longer reads.
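To make the FM-index idea a bit more concrete, here is a minimal Python sketch of the standard backward-search technique over a Burrows-Wheeler transform. It is not SGA's code: the class name `FMIndex` and its methods are made up for illustration, the suffix array is built naively, and the count tables are stored uncompressed, whereas a real FM-index compresses and samples them to save memory.

```python
from collections import Counter


class FMIndex:
    """Toy FM-index (hypothetical illustration, not SGA's implementation)."""

    def __init__(self, text):
        # Append a sentinel that sorts before every other character.
        if not text.endswith("$"):
            text += "$"
        # Naive suffix array: sort suffix start positions lexicographically.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # BWT: the character preceding each sorted suffix (wraps to '$').
        self.bwt = "".join(text[i - 1] for i in self.sa)
        # C[c] = number of characters in the text strictly smaller than c.
        counts = Counter(text)
        self.C, total = {}, 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i] = occurrences of c in bwt[:i] (uncompressed here).
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def _interval(self, pattern):
        # Backward search: extend the match one character at a time,
        # from the last character of the pattern to the first.
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0, 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        """Number of occurrences of pattern in the indexed text."""
        lo, hi = self._interval(pattern)
        return hi - lo

    def locate(self, pattern):
        """Sorted start positions of pattern in the indexed text."""
        lo, hi = self._interval(pattern)
        return sorted(self.sa[i] for i in range(lo, hi))


fm = FMIndex("ACGTACGTAC")
print(fm.count("AC"), fm.locate("CGT"))
```

Each step of the search only touches the count tables, never the original text, which is why the index can answer queries over a compressed representation.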
In the paper, Durbin and co-developer Jared Simpson report that SGA successfully assembled 1.2 billion human genome sequence reads using 54 GB of memory. This was compared with SOAPdenovo, which required 118 GB for the same task.
Results from comparisons of SGA with Velvet, ABySS, and SOAPdenovo using a C. elegans dataset showed that its assembled contigs covered 95.9 percent of the reference genome while the other three programs covered 94.5 percent, 95.6 percent, and 94.8 percent respectively.
Furthermore, SGA required only 4.5 gigabytes of memory to assemble the C. elegans dataset compared to 14.1 GB, 23 GB, and 38.8 GB required for ABySS, Velvet, and SOAPdenovo respectively.
SGA is slower than its counterparts, however. SGA took 1,427 CPU hours to complete a human genome assembly, while SOAPdenovo required 479 CPU hours.
"We explicitly trade off a bit longer CPU time for lower memory usage," Simpson, a doctoral student in Durbin's lab, told BioInform. "We feel that fits better into most of the clusters that are available right now."
On the other hand, SGA is parallelizable, so its most compute-intensive activities — error-correcting reads and building the FM-index of corrected reads — can be distributed across a compute cluster to reduce run time, the researchers explain in the paper.
He explained that while de Bruijn assemblers like Velvet require a separate processing step after completing the genome assembly, SGA does not, and so potentially avoids "some of the errors or incomplete analysis" that can occur in the extra processing step.
This reduced error risk plus its lower memory requirement ensures that tools like SGA have a "future," Durbin said.
For their next steps, Durbin and Simpson are adapting SGA to work with longer read data from the Roche 454 sequencer and the Life Technologies Ion Torrent Personal Genome Machine. They are also exploring ways of discovering variants using the program.
The approach could also be used to analyze metagenomic data, the researchers said in the paper.
Read the full article here http://www.genomeweb.com/informatics/sanger-teams-de-novo-assembler-adopts-compressed-approach-reduce-memory-footprin
1,427 CPU hours is a lot more than 479 CPU hours (~3x), but when you can parallelize the work it's definitely a worthwhile tradeoff. It's more likely that one has access to a lot of lower-memory clusters than to a single cluster with a lot of memory, and that large-memory cluster is likely to be hogged by a 479 CPU hour job anyway. I wonder if this might encourage investigators to revisit existing data with de novo assembly. Of course, it would be great if the NGS data were made public; then other groups could do the de novo assembly comparison for them.
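For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch using only the human-genome figures quoted above. The helper `ideal_wall_clock_hours` and the worker counts are my own assumptions: it pretends the whole run scales perfectly across workers, which the article does not claim (only the error-correction and indexing stages are described as distributable), so treat the wall-clock numbers as a best case rather than a measurement.

```python
# Figures quoted in the article for the human genome assembly.
cpu_hours = {"SGA": 1427, "SOAPdenovo": 479}
memory_gb = {"SGA": 54, "SOAPdenovo": 118}

# How much more CPU time SGA uses, and how much less memory.
cpu_ratio = cpu_hours["SGA"] / cpu_hours["SOAPdenovo"]
mem_ratio = memory_gb["SOAPdenovo"] / memory_gb["SGA"]


def ideal_wall_clock_hours(total_cpu_hours, workers):
    """Best-case wall-clock time, ASSUMING perfect linear scaling.

    Hypothetical helper: real runs have serial stages and overhead,
    so actual times will be longer than this.
    """
    return total_cpu_hours / workers


print(f"SGA uses {cpu_ratio:.2f}x the CPU hours of SOAPdenovo")
print(f"SOAPdenovo uses {mem_ratio:.2f}x the memory of SGA")
for workers in (1, 3, 8, 32):  # illustrative worker counts only
    hours = ideal_wall_clock_hours(cpu_hours["SGA"], workers)
    print(f"{workers:>2} workers: about {hours:.0f} hours (ideal case)")
```

So SGA costs roughly three times the CPU time for a bit more than half the memory saved, and under the idealized assumption it only takes a handful of modest nodes to bring the elapsed time back under 479 hours.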