Showing posts with label assembler.

Wednesday, 26 September 2012

Next-generation Phylogenomics Using a Target Restricted Assembly Method.


Very interesting to turn the assembly problem backwards ... though I suppose it has limited applications outside of phylogenomics, since you need to have the protein sequences available in the first place.

I am not sure if there are tools that can easily extract mini-assemblies from BAM files, i.e. extract the aligned reads in a region in their entirety (instead of trimmed to the region you specify). 
That would be nice / useful when looking at assemblies in specific regions and trying to add new reads or info to them. (Do we need a phrap/consed for NGS de novo assembly?)
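For what it's worth, pysam's fetch() already returns whole read records (not region-trimmed sequences) for any alignment overlapping a window; the idea can be sketched in plain Python on toy data (names and tuples below are mine, purely for illustration):

```python
# Toy sketch: select reads by overlap with a region, but keep each read's
# full sequence rather than trimming it to the region boundaries.
# Alignments are hypothetical (name, ref_start, ref_end, sequence) tuples.

def reads_overlapping(alignments, start, end):
    """Return whole reads whose alignment overlaps [start, end)."""
    return [(name, seq) for name, ref_start, ref_end, seq in alignments
            if ref_start < end and ref_end > start]

alignments = [
    ("r1", 100, 150, "A" * 50),   # fully inside the window below
    ("r2", 140, 190, "C" * 50),   # straddles the right edge - kept whole
    ("r3", 300, 350, "G" * 50),   # outside the window - dropped
]
print(reads_overlapping(alignments, 120, 160))
```

The reads selected this way could then be handed to a local assembler as-is, which is exactly the "mini-assembly from a BAM" idea above.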


Mol Phylogenet Evol. 2012 Sep 18. pii: S1055-7903(12)00364-8. doi: 10.1016/j.ympev.2012.09.007. [Epub ahead of print]

Next-generation Phylogenomics Using a Target Restricted Assembly Method.

Source

Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, Champaign, IL 61820, USA. Electronic address: kjohnson@inhs.uiuc.edu.

Abstract

Next-generation sequencing technologies are revolutionizing the field of phylogenetics by making available genome scale data for a fraction of the cost of traditional targeted sequencing. One challenge will be to make use of these genomic level data without necessarily resorting to full-scale genome assembly and annotation, which is often time and labor intensive. Here we describe a technique, the Target Restricted Assembly Method (TRAM), in which the typical process of genome assembly and annotation is in essence reversed. Protein sequences of phylogenetically useful genes from a species within the group of interest are used as targets in tblastn searches of a data set from a lane of Illumina reads for a related species. Resulting blast hits are then assembled locally into contigs and these contigs are then aligned against the reference "cDNA" sequence to remove portions of the sequences that include introns. We illustrate the Target Restricted Assembly Method using genomic scale datasets for 20 species of lice (Insecta: Psocodea) to produce a test phylogenetic data set of 10 nuclear protein coding gene sequences. Given the advantages of using DNA instead of RNA, this technique is very cost effective and feasible given current technologies.
Copyright © 2012. Published by Elsevier Inc.

PMID: 23000819 [PubMed - as supplied by publisher]
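The final TRAM step - aligning the locally assembled contigs against the reference cDNA and discarding the intronic portions - can be illustrated with a toy exact-match version (my own sketch, not the authors' code; real data would need a proper spliced aligner or BLAST):

```python
# Toy version of TRAM's intron-removal step (a sketch under the assumption
# that "remove portions that include introns" means: keep only the parts of
# an assembled contig that match the reference cDNA).

def keep_cdna_matching_blocks(contig, cdna, k=8):
    """Greedily keep maximal substrings of `contig` (length >= k) found in `cdna`."""
    kept, i = [], 0
    while i < len(contig):
        if len(contig) - i >= k and contig[i:i + k] in cdna:
            j = i + k
            # extend the block as far as it still matches the cDNA
            while j <= len(contig) and contig[i:j] in cdna:
                j += 1
            kept.append(contig[i:j - 1])
            i = j - 1
        else:
            i += 1  # treated as intronic sequence - discarded
    return "".join(kept)

exon1, exon2 = "ACGTACGTAC", "GGCCTTAAGG"
cdna = exon1 + exon2
contig = exon1 + "TTTTTTTT" + exon2   # assembled contig with an intron inside
print(keep_cdna_matching_blocks(contig, cdna))  # ACGTACGTACGGCCTTAAGG
```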

CORTEX update contains scripts for assembly of large numbers of samples with large genomes - i.e. for the 1000 Genomes project.

cortex_var is a tool for genome assembly and variation analysis from sequence data. You can use it to discover and genotype variants on single or multiple haploid or diploid samples. If you have multiple samples, you can use Cortex to look specifically for variants that distinguish one set of samples (eg phenotype=X, cases, parents, tumour) from another set of samples (eg phenotype=Y, controls, child, normal). See our Nature Genetics paper and the documentation for detailed descriptions.

The Cortex paper is now out in Nature Genetics! 

cortex_var features

  • Variant discovery by de novo assembly - no reference genome required
  • Supports multicoloured de Bruijn graphs - have multiple samples loaded into the same graph in different colours, and find variants that distinguish them.
  • Capable of calling SNPs, indels, inversions, complex variants, small haplotypes
  • Extremely accurate variant calling - see our paper for base-pair-resolution validation of entire alleles (rather than just breakpoints) of SNPs, indels and complex variants by comparison with fully sequenced (and finished) fosmids - a level of validation beyond that demanded of any other variant caller we are aware of - currently cortex_var is the most accurate variant caller for indels and complex variants.
  • Capable of aligning a reference genome to a graph and using that to call variants
  • Support for comparing cases/controls or phenotyped strains
  • Typical memory use: 1 high coverage human in under 80Gb of RAM, 1000 yeasts in under 64Gb RAM, 10 humans in under 256 Gb RAM
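The multicoloured-graph idea from the feature list can be sketched in a few lines of Python (a toy of the concept, nothing like cortex_var's actual implementation): each k-mer is tagged with the samples ("colours") it occurs in, and k-mers private to one colour point at sample-distinguishing variation.

```python
# Toy coloured de Bruijn graph node set: kmer -> set of colours containing it.
from collections import defaultdict

def build_coloured_graph(samples, k=4):
    """samples: dict colour_name -> list of reads."""
    graph = defaultdict(set)
    for colour, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                graph[read[i:i + k]].add(colour)
    return graph

samples = {
    "case":    ["ACGTTGCA"],
    "control": ["ACGTAGCA"],   # differs from "case" at one base
}
graph = build_coloured_graph(samples, k=4)
# k-mers seen only in the cases - they span the variant position
case_only = {kmer for kmer, colours in graph.items() if colours == {"case"}}
print(sorted(case_only))  # ['CGTT', 'GTTG', 'TGCA', 'TTGC']
```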
23rd August 2012: Bugfix release v1.0.5.11. Get it here. The main change in this release is in the scripts/1000genomes directory, which I have not advertised previously. It contains scripts for running Cortex on large numbers (tens, hundreds) of samples with large genomes - i.e. for the 1000 Genomes project. These allow collaborators across the world to reliably run a consistent Cortex pipeline on human populations. However, this is the first time people other than me have done this, so I expect there may be some smoothing-out of issues in the near future. You can see a PDF describing the pipeline here. I've had enough people ask me about running Cortex on lots of samples with big genomes that I thought people would find it useful to see the process. This release is a bugfix for a script in that 1000 Genomes directory, plus fixes for a few potential bugs-in-waiting (array overflow errors) in Cortex itself.

Friday, 15 July 2011

Performance of Ray @ Assemblathon 2

Ray is one of the assemblers that I watch closely but sadly lack the time to experiment with. Here's the candid email from Sébastien to the Ray mailing list on Ray's performance on the test data:

For those who follow Assemblathon 2, my last run on my testbed (Illumina data from BGI and from Illumina UK for the Bird/Parrot):

(all mate-pairs failed detection because of the many peaks in each library, I will modify Ray to consider that)
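The failure he describes comes from assuming a single insert-size peak per library; a minimal detector of that kind (my illustration of the failure mode, not Ray's code) picks the modal outer distance and is easily fooled by multimodal libraries:

```python
# Toy single-peak outer-distance estimator (illustration only).
from collections import Counter

def detect_outer_distance(distances):
    """Return the modal paired-read outer distance - assumes one peak."""
    counts = Counter(distances)
    peak, _ = counts.most_common(1)[0]
    return peak

# a clean single-peak library behaves...
print(detect_outer_distance([300, 301, 300, 299, 300]))  # 300
# ...but a library with two peaks collapses to the taller one,
# misrepresenting the rest of the pairs
print(detect_outer_distance([300, 300, 300, 5000, 5000, 5000, 5000]))  # 5000
```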



Total number of unfiltered Illumina TruSeq v3 sequences: 3,072,136,294 - that is ~3 G sequences!


512 compute cores (64 computers * 8 cores/computer = 512)


Typical communication profile for one compute core:


[1,0]:Rank 0: sent 249841326 messages, received 249840303 messages.


Yes, each core sends an average of 250 M messages during the 18 hours !




Peak memory usage per core: 2.2 GiB


Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB


The peak occurs around 3 hours and goes down to 1.1 GiB per node immediately afterwards because the pool of defragmentation groups for k-mers occurring once is freed.



The compute cluster I use has 3 GiB per compute core. So using 2048 compute cores would give me 6144 GiB of distributed memory.




Number of contigs: 550764

Total length of contigs: 1672750795
Number of contigs >= 500 nt: 501312
Total length of contigs >= 500 nt: 1656776315
Number of scaffolds: 510607
Total length of scaffolds: 1681345451
Number of scaffolds >= 500 nt: 463741
Total length of scaffolds >= 500 nt: 1666464367

k-mer length: 31

Lowest coverage observed: 1
MinimumCoverage: 42
PeakCoverage: 171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
Estimated genome length: 1226739694 nucleotides
Percentage of vertices with coverage 1: 83.7771 %
DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt
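As a sanity check on the log above, the genome-length estimate is simply the count of k-mers at or above MinimumCoverage divided by two (presumably one copy per strand - my reading of the numbers, not Ray's documented formula):

```python
# Reproduce the "Estimated genome length" line from the solid k-mer count above.
solid_kmers = 2_453_479_388   # k-mers with at least MinimumCoverage
print(solid_kmers // 2)       # 1226739694, matching the reported estimate
```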

[1,0]: Sequence partitioning: 1 hours, 54 minutes, 47 seconds

[1,0]: K-mer counting: 5 hours, 47 minutes, 20 seconds
[1,0]: Coverage distribution analysis: 30 seconds
[1,0]: Graph construction: 2 hours, 52 minutes, 27 seconds
[1,0]: Edge purge: 57 minutes, 55 seconds
[1,0]: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
[1,0]: Detection of assembly seeds: 16 minutes, 7 seconds
[1,0]: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
[1,0]: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
[1,0]: Merging of redundant contigs: 15 minutes, 45 seconds
[1,0]: Generation of contigs: 1 minutes, 41 seconds
[1,0]: Scaffolding of contigs: 54 minutes, 3 seconds
[1,0]: Total: 18 hours, 3 minutes, 50 seconds


10 largest scaffolds:


257646

266905
268737
272828
281502
294105
294106
296978
333171
397201


# average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes

# MPI rank    Node name    Latency (microseconds)
0 r107-n24 138
1 r107-n24 140
2 r107-n24 140
3 r107-n24 140
4 r107-n24 141
5 r107-n24 141
6 r107-n24 140
7 r107-n24 140
8 r107-n25 140
9 r107-n25 139
10 r107-n25 138
11 r107-n25 139
  Sébastien

Wednesday, 13 April 2011

ZORRO is a hybrid sequencing technology assembler: tested with Solexa and 454

Typos in the header aside... you have to love the name!

waiting for the name to become a verb... "I zorroed the NGS reads the other day and I had a fantastic assembly!" lol..



Here goes: http://lge.ibi.unicamp.br/zorro/


Overview

ZORRO is a hybrid sequencing technology assembler. It takes two sets of pre-assembled contigs and merges them into a more contiguous and consistent assembly. We have already tested Zorro with Illumina Solexa and 454 data from organisms with genomes varying from 3 Mb to 100 Mb. The main characteristic of Zorro is the treatment before and after assembly to avoid errors.
The ZORRO project is maintained by Gustavo Lacerda, Ramon Vidal and Marcelo Carazzole and was first used in this yeast assembly: Genome structure of a Saccharomyces cerevisiae strain widely used in bioethanol production.
ZORRO needs to be better documented and has not undergone enough testing. If you want to discuss the pipeline you can join the mailing list: zorro-google group
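At its core, merging two contig sets comes down to overlap detection; a toy exact suffix/prefix merge (my sketch of the general idea - ZORRO itself does far more, including its pre- and post-assembly error treatment) looks like:

```python
def merge_if_overlap(a, b, min_olap=4):
    """Merge contig b onto contig a when a suffix of a equals a prefix of b."""
    for olap in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-olap:] == b[:olap]:
            return a + b[olap:]
    return None  # no usable overlap between the two contigs

print(merge_if_overlap("ACGTACGT", "ACGTTTTT"))  # ACGTACGTTTTT
print(merge_if_overlap("ACGTACGT", "CCCCCCCC"))  # None
```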

Zorro: The Complete Series

p.s. the typo is here
"ZORROthe masked assember "

Tuesday, 22 March 2011

de novo assembly of Illumina CEO genome in 11.5 h - new ver of Ray

Kevin: You can't ignore an email with that subject header... but 512 compute cores? Shall have a chat with my HPC vendor..
Also, I am waiting for the public release of Cortex http://sourceforge.net/projects/cortexassembler/
Strange that courses that teach the software are available but the software ain't...
http://www.ebi.ac.uk/training/onsite/NGS_120510.html


Velvet and Curtain seem promising for de novo assembly as well.

Ray 1.3.0 is now available online.
http://sourceforge.net/projects/denovoassembler/files/Ray-1.3.0.tar.bz2

The most important change is the correction of a major bug that caused a
parallel infinite loop on the human genome.

This, along with concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?

1.3.0

2011-03-22

   * Vertices with less than 1 of coverage are ignored during the
computation of seeds and during the computation of extensions.
   * Computation of library outer distances relies on the virtual
communicator.
   * Expiry positions are used to toss away reads that are out-of-range
   * When only one choice is given during the extension and some reads
are in-range, then the sole choice is picked up.
   * Fixed a bug for empty reads.
   * A read is not added in the active set if it is marked on a
repeated vertex and its mate was not encountered yet.
   * Grouped messages in the extension of seeds.
   * Reads marked on repeated vertices are cached during the extension.
   * Paths are cached in the computation of fusions.
   * Fixed an infinite loop in the extension of seeds.
   * When fetching read markers for a vertex, send a list of mates to
meet if the vertex is repeated in order to reduce the communication.
   * Updated the Instruction Manual
   * Added a version of the logo without text.


I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.
http://dskernel.blogspot.com/2011/03/de-novo-assembly-of-illumina-ceo-genome.html


Version 1.2.4 of Ray incorporated also new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:
http://boisvert.info/dropbox/recomb-seq-2011-talk.pdf


Sébastien Boisvert
