Monday, 19 September 2011

Assemblathon 1: A competitive assessment of de novo short read assembly methods


    Low cost short read sequencing technology has revolutionised genomics, though it is only just
    becoming practical for the high quality de novo assembly of a novel large genome. We describe
    the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in
    de novo assembly methods when applied to current sequencing technologies. In a collaborative
    effort, teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown,
    simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling and copy number
    were made. We establish that within this benchmark (1) it is possible to assemble the genome to
    a high level of coverage and accuracy, and that (2) large differences exist between the
    assemblies, suggesting room for further improvements in current methods. The simulated
    benchmark, including the correct answer, the assemblies and the code that was used to evaluate
    the assemblies, is now public and freely available.

    Excerpted from the Introduction

    As the field of sequencing has changed, so has the field of sequence assembly; for a recent
    review see Miller et al. (2010). In brief, using Sanger sequencing, contigs were initially built using
    overlap or string graphs (Myers 2005) (or data structures closely related to them), in tools such
    as Phrap, GigAssembler (Kent and Haussler 2001), Celera (Myers et al.
    2000) (Venter et al. 2001), ARACHNE (Batzoglou et al. 2002), and Phusion (Mullikin and Ning
    2003), which were used for numerous high quality assemblies such as human (Lander et al.
    2001) and mouse (Mouse Genome Sequencing Consortium et al. 2002). However, these
    programs were not generally efficient enough to handle the volume of sequences produced by
    second generation sequencing technologies, which led to the development of new assembly software.

    While some maintained the overlap graph approach, e.g. Edena (Hernandez et al. 2008) and
    Newbler, others used word look-up tables to greedily extend reads, e.g.
    SSAKE (Warren et al. 2007), SHARCGS (Dohm et al. 2007), VCAKE (Jeck et al. 2007) and
    OligoZip. These word look-up tables were
    then extended into de Bruijn graphs to allow for global analyses (Pevzner et al. 2001), e.g. Euler
    (Chaisson and Pevzner 2008), AllPaths (Butler et al. 2008) and Velvet (Zerbino and Birney 2008).
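    The core idea behind these de Bruijn graph assemblers can be sketched in a few lines: reads
    are decomposed into overlapping k-mers, each (k-1)-mer becomes a node, each k-mer an edge, and
    contigs are recovered by walking unbranched paths. This is a minimal illustration, not code
    from any of the assemblers above; the function names are invented for the example.

```python
# Minimal de Bruijn graph assembly sketch: nodes are (k-1)-mers,
# edges are k-mers, contigs come from unambiguous path walks.
from collections import defaultdict

def build_debruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it extends to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

reads = ["ATGGCG", "TGGCGT", "GGCGTG"]  # overlapping reads of ATGGCGTG
graph = build_debruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))  # reconstructs ATGGCGTG
```

    Real assemblers additionally collapse unbranched paths, prune error-induced tips and bubbles,
    and fit the graph into memory for billions of k-mers, which is where the engineering effort
    described below comes in.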
    As projects grew in scale, further engineering was required to fit large whole genome datasets into
    memory (ABySS (Simpson et al. 2009), Meraculous (in submission), SOAPdenovo (Li et al.
    2010), Cortex (in submission)). Now, as improvements in sequencer technology are extending the
    length of "short reads", the overlap graph approach is being revisited, albeit with optimized
    programming techniques, e.g. SGA (Simpson and Durbin 2010), as are greedy contig extension
    approaches.

    In general, most sequence assembly programs are multi-stage pipelines, dealing with correcting
    measurement errors within the reads, constructing contigs, resolving repeats (i.e. disambiguating
    false positive alignments between reads) and scaffolding contigs in separate phases. Since a
    number of solutions are available for each task, several projects have been initiated to explore the
    parameter space of the assembly problem, in particular in the context of short read sequencing
    (Phillippy et al. 2008; Hubis et al. 2011; Alkan et al. 2011; Narzisi and Mishra 2011;
    Zhang et al. 2011; Lin et al. 2011).
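    The first stage of such a pipeline, read error correction, is often done by k-mer spectrum
    methods: k-mers seen only rarely across all reads are assumed to contain sequencing errors and
    are replaced by a frequent k-mer one substitution away, when exactly one exists. The sketch
    below illustrates that idea only; the names and the `min_count` threshold are invented for
    the example, not taken from any published corrector.

```python
# K-mer spectrum error correction sketch: rare k-mers are repaired by the
# unique high-frequency k-mer at Hamming distance one, if there is one.
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count=2):
    """Replace bases inside rare k-mers when one unambiguous fix exists."""
    read = list(read)
    for i in range(len(read) - k + 1):
        kmer = "".join(read[i:i + k])
        if counts[kmer] >= min_count:
            continue  # k-mer is trusted; leave it alone
        candidates = []
        for j in range(k):
            for base in "ACGT":
                trial = kmer[:j] + base + kmer[j + 1:]
                if trial != kmer and counts[trial] >= min_count:
                    candidates.append((j, base))
        if len(candidates) == 1:  # correct only when the fix is unambiguous
            j, base = candidates[0]
            read[i + j] = base
    return "".join(read)

reads = ["ATGGCGT", "ATGGCGT", "ATGACGT"]  # third read carries a G->A error
counts = kmer_counts(reads, k=4)
print(correct_read("ATGACGT", counts, k=4))  # prints ATGGCGT
```

    Later pipeline stages (contig construction, repeat resolution, scaffolding) then operate on the
    corrected reads, which is why the phases are kept separate and can be swapped independently.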
