Low cost short read sequencing technology has revolutionised genomics, though it is only just
becoming practical for the high quality de novo assembly of a novel large genome. We describe
the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in
de novo assembly methods when applied to current sequencing technologies. In a collaborative
effort teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown,
simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel
haplotype aware assessments of coverage, contiguity, structure, base calling and copy number
were made. We establish that within this benchmark (1) it is possible to assemble the genome to
a high level of coverage and accuracy, and that (2) large differences exist between the
assemblies, suggesting room for further improvements in current methods. The simulated
benchmark, including the correct answer, the assemblies and the code that was used to evaluate
the assemblies is now public and freely available from http://www.assemblathon.org/.
excerpted from Introduction
As the field of sequencing has changed, so has the field of sequence assembly; for a recent
review see Miller et al. (2010). In brief, using Sanger sequencing, contigs were initially built using
overlap or string graphs (Myers 2005) (or data structures closely related to them), in tools such
as Phrap (http://www.phrap.org/), GigAssembler (Kent and Haussler 2001), Celera (Myers et al.
2000; Venter et al. 2001), ARACHNE (Batzoglou et al. 2002), and Phusion (Mullikin and Ning
2003), which were used for numerous high quality assemblies such as human (Lander et al.
2001) and mouse (Mouse Genome Sequencing Consortium et al. 2002). However, these
programs were not generally efficient enough to handle the volume of sequences produced by the
While some maintained the overlap graph approach, e.g. Edena (Hernandez et al. 2008) and
Newbler (http://www.454.com/), others used word look-up tables to greedily extend reads, e.g.
SSAKE (Warren et al. 2007), SHARCGS (Dohm et al. 2007), VCAKE (Jeck et al. 2007) and
OligoZip (http://linux1.softberry.com/berry.phtml?topic=OligoZip). These word look-up tables were
then extended into de Bruijn graphs to allow for global analyses (Pevzner et al. 2001), e.g. Euler
(Chaisson and Pevzner 2008), AllPaths (Butler et al. 2008) and Velvet (Zerbino and Birney 2008).
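The step from word look-up tables to de Bruijn graphs can be illustrated with a minimal sketch. It is not the implementation used by any of the tools cited above; the function names, the toy reads and the choice of k = 4 are assumptions made purely for illustration, and real assemblers add error correction, reverse complements and far more compact data structures.

```python
from collections import defaultdict


def kmer_table(reads, k):
    """Word look-up table: map each k-mer to its number of occurrences in the reads."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in kmer_table(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguously(graph, start):
    """Walk from `start` while the current node has exactly one successor.

    The walk stops at a branch (e.g. a repeat), a dead end, or a node already visited.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical, error-free toy reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
table = kmer_table(reads, k=4)
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguously(graph, "ATG")
```

In this toy case every node has a single successor, so the walk recovers the full sequence. A repeated (k-1)-mer would create a branch in the graph, which is the kind of ambiguity that the global analyses mentioned above are designed to resolve.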
As projects grew in scale, further engineering was required to fit large whole genome datasets into
memory, e.g. ABySS (Simpson et al. 2009), Meraculous (in submission), SOAPdenovo (Li et al.
2010) and Cortex (in submission). Now, as improvements in sequencer technology are extending the
length of "short reads", the overlap graph approach is being revisited, albeit with optimized
programming techniques, e.g. SGA (Simpson and Durbin 2010), as are greedy contig extension
In general, most sequence assembly programs are multi stage pipelines, dealing with correcting
measurement errors within the reads, constructing contigs, resolving repeats (i.e. disambiguating
false positive alignments between reads) and scaffolding contigs in separate phases. Since a
number of solutions are available for each task, several projects have been initiated to explore the
parameter space of the assembly problem, in particular in the context of short read sequencing
(Phillippy et al. 2008; Hubis et al. 2011; Alkan et al. 2011; Narzisi and Mishra 2011; Zhang et al. 2011; Lin et al. 2011).