Showing posts with label assembly. Show all posts

Wednesday, 4 November 2020

de novo assembly of Ion Torrent Reads

I am intrigued that this genome assembly guide includes mention of SOLiD and Ion Torrent, although not much information is given on how to use them for genome assembly.

That said, perhaps the SPAdes plugin provided with the sequencer solves most of everyone's immediate needs. Wondering how to improve on evaluating the assemblies.



Wednesday, 24 July 2013

Illumina produces 3k of 8500 bp reads on HiSeq using Moleculo Technology

Keith blogged in Jan 2013 about how super-long-read sequencing methods would be a threat to Illumina. Today, Illumina openly acknowledges the shortcomings of its short reads for various applications like
  • assembly of complex genomes (polyploid, containing excessive long repeat regions, etc.), 
  • accurate transcript assembly, 
  • metagenomics of complex communities, 
  • and phasing of long haplotype blocks.


The reason?
This latest set of data released on BaseSpace
Read length distribution of synthetic long reads for a D. melanogaster library
The data set, available as a single project in BaseSpace, can be accessed here.

image source: http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

With the integration of Moleculo, they have managed to generate ~30 Gb of raw sequence data. They have refrained from talking about the 'key analysis metrics' available in the PDF report. Perhaps it's much easier to let the blogosphere and data scientists dissect the new data themselves.

Am wondering when a 454 versus Illumina long reads side-by-side comparison will pop up.

UPDATE:

Can't find the 'key analysis metrics' in the PDF report files. Perhaps they're still being uploaded? *shrugs*
Please update me if you see them; otherwise I'll just have to run something on the data myself.


These are the files that I have now

total 512M
 259M Jul 18 01:01 mol-32-2832.fastq.gz
  44K Jul 24  2013 FastTrackLongReads_dmelanogaster_281c.pdf
 149K Jul 24  2013 mol-32-281c-scaffolds.txt
  44K Jul 24  2013 FastTrackLongReads_dmelanogaster_2832.pdf
 151K Jul 24  2013 mol-32-2832-scaffolds.txt
 253M Jul 24  2013 mol-32-281c.fastq.gz

md5sums
6845fc3a4da9f93efc3a52f288e2d7a0  FastTrackLongReads_dmelanogaster_281c.pdf
02f5de4f7e15bbcd96ada6e78f659fdb  FastTrackLongReads_dmelanogaster_2832.pdf
586599bb7fca3c20ba82a82921e8ba3f  mol-32-281c-scaffolds.txt
b25010e9e5e13dc7befc43b5dff8c3d6  mol-32-281c.fastq.gz
6822cfbd3eb2a535a38a5022c1d3c336  mol-32-2832-scaffolds.txt
873f09080cdf59ed37b3676cddcbe26f  mol-32-2832.fastq.gz
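Checksums like these can be verified in one step with `md5sum -c`. A minimal, self-contained sketch (using a throwaway file rather than the actual downloads, which you'd substitute in):

```shell
# create a demo file standing in for one of the downloads
printf 'ACGTACGT\n' > demo.fastq
# record its checksum in a manifest (same two-column format as the list above)
md5sum demo.fastq > demo.md5
# verify; prints "demo.fastq: OK" and exits non-zero on any mismatch
md5sum -c demo.md5
```

Saving the md5sums above into a file and running `md5sum -c` against it checks all six downloads at once.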


I have run FastQC (v0.10.1) on both samples; the images below are from 281c.
You can download the full HTML reports here:
https://www.dropbox.com/sh/5unu3zba9u21ywj/JT4HdkzfOP/mol-32-281c_fastqc.zip
https://www.dropbox.com/s/mpxa5wx51iqmiz3/mol-32-2832_fastqc.zip
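FastQC aside, a quick read-length distribution can be pulled straight from a fastq.gz with awk — a rough stand-in for the length plot above (demo data here; point `zcat` at the Moleculo files instead):

```shell
# build a tiny demo fastq.gz (two reads, lengths 8 and 4)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz
# sequence lines are every 4th line starting at line 2; tally each length
zcat demo.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c
```

The output is a count per read length, which is enough to eyeball whether the synthetic long reads really reach into the multi-kb range.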

Reading about the Moleculo sample prep method, it seems like a rather ingenious way to stitch barcoded short reads into a single long contig. If that is the case, then I am not sure the base quality scores here are meaningful anymore, since each 'read' is really a mini-assembly. This presumably also removes any quantitative value from read counts, so accurate quantification of long RNA molecules or splice variants wouldn't be possible. Nevertheless, it's an interesting development on the Illumina platform. Looking forward to seeing more news about it.













Other links

Illumina Long-Read Sequencing Service
Moleculo technology: synthetic long reads for genome phasing, de novo sequencing
CoreGenomics: Genome partitioning: my moleculo-esque idea
Moleculo and Haplotype Phasing - Next Generation Technologist
Abstract: Production Of Long (1.5kb – 15.0kb), Accurate, DNA Sequencing Reads Using An Illumina HiSeq2000 To Support De Novo Assembly Of The Blue Catfish Genome (Plant and Animal Genome XXI Conference)
http://www.moleculo.com/ (no info on this page though)
Illumina Announces Phasing Analysis Service for Human Whole-Genome Sequencing - MarketWatch
Patent information on the Long Read technology
https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US20130079231.pdf









Monday, 25 February 2013

Michael Schatz: Assembling Crop Genomes With SMS

PDF of the presentation on Feb 22, 2013 AGBT, Marco Island, FL

http://schatzlab.cshl.edu/presentations/2013-02-20.AGBT.Assembling%20Crop%20Genomes.pdf

if you need an intro

"In a talk during the evening session, Mike Schatz, an assistant professor at Cold Spring Harbor Laboratory, spoke about “Assembling Crop Genomes with Single Molecule Sequencing.” Crops are important to sequence — 15 crops represent 90% of the world’s food, Schatz said — but are notoriously difficult to study because of their large genome size, high repeat content, and higher ploidy. Along with Sergey Koren and Adam Phillippy, he has built a pipeline to create hybrid genome assemblies using PacBio long reads combined with shorter-read sequence — either CCS reads from PacBio or data from another sequencing platform. In an example he offered of a rice strain, an attempted genome assembly using just Illumina reads yielded an N50 contig of 16Kb, but adding PacBio long reads to that boosted the N50 contig to 25Kb. Ultimately, Schatz said, he expects that as PacBio's readlength improves, this kind of approach could routinely generate megabase-size contigs or even pull plant chromosomes into single contigs.

For more information on Mike Schatz’s work using SMRT Sequencing, check out this case study describing an automated pipeline for genome finishing with PacBio long reads."

source: http://blog.pacificbiosciences.com/2013/02/notes-from-agbt-long-read-sequence-data.html

He includes a snippet of code to answer this question from Twitter:
'What's the longest single contig from a de Bruijn assembler without PE or a jumping library?'


# generate a 100 Mbp random "genome"
$ perl -e 'print ">random\n"; @D=split //,"ACGT"; for (1..100000000){print $D[int(rand(4))];} print "\n"' | fold > random.fa
# simulate 50M error-free 100 bp single-end reads with wgsim
$ wgsim -r 0 -e 0 -N 50000000 -1 100 -2 1 random.fa random.reads.fq /dev/null
# assemble with SOAPdenovo (k=63), then report contig lengths
$ SOAPdenovo-63mer all -s random.cfg -K 63 -o random.63
$ getlengths random.63.contig
1 99999990

Wednesday, 26 September 2012

Next-generation Phylogenomics Using a Target Restricted Assembly Method.


Very interesting to turn the assembly problem backwards ... though it has limited applications outside of phylogenomics I suppose since you need to have the protein sequences avail in the first place. 

I am not sure if there are tools that can easily extract mini-assemblies from BAM files, i.e. extract aligned reads in their entirety instead of trimmed to the region you specify,
which would be nice/useful when looking at assemblies in particular regions and trying to add new reads or information to them. (Do we need a phrap/consed for NGS de novo assembly?)
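For what it's worth, a crude version of that region extraction can be sketched in plain awk over a SAM file (a real pipeline would use `samtools view`, and a proper implementation would also consult the CIGAR for reads spanning into the region); note the full SEQ field comes out untrimmed:

```shell
# toy SAM records: two aligned reads on chr1 (tab-separated mandatory fields)
printf 'r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >  demo.sam
printf 'r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\n'         >> demo.sam
# keep reads whose alignment start (POS, field 4) falls inside chr1:90-200 and
# emit them as FASTA; each read is printed whole, not clipped to the region
# (crude: only checks POS, so reads starting before the region are missed)
awk -F'\t' '$3 == "chr1" && $4 >= 90 && $4 <= 200 {print ">"$1"\n"$10}' demo.sam
```

The resulting FASTA could then be fed straight into a local assembler as the mini-assembly input.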


Mol Phylogenet Evol. 2012 Sep 18. pii: S1055-7903(12)00364-8. doi: 10.1016/j.ympev.2012.09.007. [Epub ahead of print]

Next-generation Phylogenomics Using a Target Restricted Assembly Method.

Source

Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, Champaign, IL 61820, USA. Electronic address: kjohnson@inhs.uiuc.edu.

Abstract

Next-generation sequencing technologies are revolutionizing the field of phylogenetics by making available genome scale data for a fraction of the cost of traditional targeted sequencing. One challenge will be to make use of these genomic level data without necessarily resorting to full-scale genome assembly and annotation, which is often time and labor intensive. Here we describe a technique, the Target Restricted Assembly Method (TRAM), in which the typical process of genome assembly and annotation is in essence reversed. Protein sequences of phylogenetically useful genes from a species within the group of interest are used as targets in tblastn searches of a data set from a lane of Illumina reads for a related species. Resulting blast hits are then assembled locally into contigs and these contigs are then aligned against the reference "cDNA" sequence to remove portions of the sequences that include introns. We illustrate the Target Restricted Assembly Method using genomic scale datasets for 20 species of lice (Insecta: Psocodea) to produce a test phylogenetic data set of 10 nuclear protein coding gene sequences. Given the advantages of using DNA instead of RNA, this technique is very cost effective and feasible given current technologies.
Copyright © 2012. Published by Elsevier Inc.
PMID: 23000819 [PubMed - as supplied by publisher]

Wednesday, 5 September 2012

[pub] SEED: efficient clustering of next-generation sequences.


Bioinformatics. 2011 Sep 15;27(18):2502-9. Epub 2011 Aug 2.

SEED: efficient clustering of next-generation sequences.

Source

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION:

Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

RESULTS:

Here, we introduce SEED - an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When used as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true clusters, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

AVAILABILITY:

The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

CONTACT:

thomas.girke@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
PMID: 21810899 [PubMed - indexed for MEDLINE]
PMCID: PMC3167058
Free PMC Article

Friday, 12 August 2011

is 12 million 90 bp transcriptome reads enough for transcriptome assembly?

I posted a PubMed link recently in which the authors "report the use of next-generation massively parallel sequencing technologies and de novo transcriptome assembly to gain a comprehensive overview of the H. brasiliensis transcriptome. The sequencing output generated more than 12 million reads with an average length of 90 nt. In total 48,768 unigenes (mean size = 436 bp, median size = 328 bp) were assembled through de novo transcriptome assembly."

Do you think such an assembly is truly useful for research, or would higher coverage have been better?
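For what it's worth, the implied depth can be sanity-checked with a little awk, assuming the quoted read count, average read length, and unigene stats, and that reads distribute evenly across the assembly:

```shell
awk 'BEGIN {
  bases = 12e6 * 90        # "more than 12 million" reads x ~90 nt (lower bound)
  asm   = 48768 * 436      # unigene count x mean unigene size (bp)
  printf "assembly ~%.1f Mb, average coverage ~%.0fx\n", asm / 1e6, bases / asm
}'
```

That works out to roughly a 21 Mb unigene space at ~50x average coverage, so the question is arguably less about raw depth and more about how unevenly it lands across highly and lowly expressed transcripts.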

Saturday, 16 July 2011

International Crowdsourcing Initiative to Combat the E. Coli Breakout in Germany

Guest Post: International Crowdsourcing Initiative to Combat the E. Coli Breakout in Germany

June 8, 2011
Editor’s Note: In light of the recent E. coli outbreak in Germany, NGS Leaders invited Joyce Peng from BGI to comment on the organization’s efforts to understand the culprit. Below, Joyce describes BGI’s efforts to rally the international community in combating the outbreak. – Eric Glazer
In response to the recent E. coli outbreak in Germany, BGI and its collaborators at the University Medical Centre Hamburg-Eppendorf have released their third version of the assembled genome, which includes new data from this E. coli O104 (ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482/Escherichia_coli_TY-2482.contig.20110606.fa.gz ). The FTP site contains a file that provides the PCR primer sequences which researchers have used to create diagnostic kits for rapid identification of this highly infectious bacterium.

Thursday, 7 July 2011

OpGen Touts Technology's Ability to Improve De Novo Assembly, Correct Errors in Finished Genomes


By Monica Heger
When paired with next-gen sequencing technology, OpGen's Argus optical mapping technology can correct errors in assembled genomes and help close gaps, a company official said last week at a presentation during a one-day conference of BGI users in Rockville, Md.
Trevor Wagner, a senior scientist at OpGen, presented data on how the company has used the Argus platform to find errors in microbial assemblies from the Human Microbiome Project, as well as in finished genomes, and to close introduced gaps in sequenced human genomes.
While the platform has mostly been used for smaller genomes like bacteria and microbes, Wagner said that the company is now also moving into mammalian and plant genomes, using a "hybrid approach" that combines next-gen sequencing with single-molecule restriction maps.
Full article

Tuesday, 22 March 2011

de novo assembly of Illumina CEO genome in 11.5 h - new ver of Ray

Kevin: You can't ignore an email with that subject header... but 512 compute cores? Shall have a chat with my HPC vendor.
Also am waiting for the public release of Cortex http://sourceforge.net/projects/cortexassembler/
Strange that courses that teach the software are available but the software ain't...
http://www.ebi.ac.uk/training/onsite/NGS_120510.html


Velvet and Curtain seem promising for de novo assembly as well.

Ray 1.3.0 is now available online.
http://sourceforge.net/projects/denovoassembler/files/Ray-1.3.0.tar.bz2

The most important change is the correction of a major bug that caused
parallel infinite loop on the human genome.

This, along concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?

1.3.0

2011-03-22

   * Vertices with less than 1 of coverage are ignored during the
computation of seeds and during the computation of extensions.
   * Computation of library outer distances relies on the virtual
communicator.
   * Expiry positions are used to toss away reads that are out-of-range
   * When only one choice is given during the extension and some reads
are in-range, then the sole choice is picked up.
   * Fixed a bug for empty reads.
   * A read is not added in the active set if it is marked on a
repeated vertex and its mate was not encountered yet.
   * Grouped messages in the extension of seeds.
   * Reads marked on repeated vertices are cached during the extension.
   * Paths are cached in the computation of fusions.
   * Fixed an infinite loop in the extension of seeds.
   * When fetching read markers for a vertex, send a list of mates to
meet if the vertex is repeated in order to reduce the communication.
   * Updated the Instruction Manual
   * Added a version of the logo without text.


I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.
http://dskernel.blogspot.com/2011/03/de-novo-assembly-of-illumina-ceo-genome.html


Version 1.2.4 of Ray incorporated also new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:
http://boisvert.info/dropbox/recomb-seq-2011-talk.pdf


Sébastien Boisvert

Tuesday, 16 November 2010

Pending release of Contrail, Hadoop de novo assembler?

Jermdemo on Twitter

Just noticed the source code for Contrail, the first Hadoop based de-novo assembler, has been uploaded http://bit.ly/96pSbw 26 days ago


Oh the suspense!

Wednesday, 27 October 2010

de novo assembly of large genomes

Here's an informative post by Ewan Birney on the Velvet user list about de novo assembly of large genomes.

Velvet's algorithms in theory work for any size. However, the engineering aspects
of Velvet, in particular memory consumption, means it's unable to handle read sets
of a particular size. This of course depends on how big a real memory machine
you have.

I know we have "routinely" (ie, for multiple strains) done Drosophila sized genomes
(~120MB) on a 125GB machine.

I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory
size is not just about the size of the genome but also how error prone your reads
are (though sheer size is important).


Beyond this there are a variety of strategies:

  "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs can
be made using Cortex (unpublished, from Mario Cacamo and Zam Iqbal) or ABySS (published,
well understood, from the BC genome centre).

   Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a
smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions
and thus provide an improved, more read-pair-aware graph. This can be iterated and in
at least some cases, the Curtain approach gets close to what Velvet can produce alone
(in the scenarios where Velvet can be run on a single memory machine to understand
Curtain's performance)


   SOAP de novo from the BGI is responsible for a number of the published assemblies
(eg, Panda, YH) although like many assemblers, tuning it seems quite hard, and I would
definitely be asking the BGI guys for advice.

   A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.

In all above the cases I know of successes, but also quite a few failures, and untangling
data quality/algorithm/choice of parameters/running bugs is really complex. So - whereas
assemblies < 100MB are "routine", currently assemblies 100MB-500MB are "challenging" and
>500MB are theoretically doable, and have been done by specific groups, but I think still
are at the leading edge of development and one should not be confident of success for
"any particular genome".


Thanks Ewan for letting me reproduce his post here


Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users



Cortex seems very promising for de novo assembly of human reads using reasonable amounts of RAM (128 GB), based on the mailing list. I know I'll be watching out for it on SourceForge!

Monday, 11 October 2010

Installing SOLiD™ System de novo Accessory Tools 2.0 with Velvet and MUMmer

howto install on CentOS 5.4 

 wget http://solidsoftwaretools.com/gf/project/denovo/ #just to keep a record
 wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_0.7.55.tgz
 wget http://downloads.sourceforge.net/project/mummer/mummer/3.22/MUMmer3.22.tar.gz
 tar zxvf denovo2.tgz
 cp velvet_0.7.55.tgz denovo2 #you can use mv if you don’t mind downloading again
 cp MUMmer3.22.tar.gz denovo2
 cd denovo2
 tar zxvf velvet_0.7.55.tgz
 tar zxvf MUMmer3.22.tar.gz
 cd MUMmer3.22/src/kurtz #this was the part where I deviated from instructions
 gmake mummer #Might be redundant but running gmake at root dir gave no binary
 gmake |tee gmake-install.log
Next step:
download the example data to run through the pipeline
http://solidsoftwaretools.com/gf/project/ecoli50x50/
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.qual.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.qual.gz

Description
This is a 50X50 mate-pair library from DH10B produced by the SOLiD™ system. The set includes .csfasta and .qual files for F3 and R3. The insert size of the library is 1300 bp and it is about 600X coverage of the DH10B genome. The results from the MP library in the de novo documents were generated from this dataset.



YMMV



pitfalls for SAET for de novo assembly

Spotted in manual for

SOLiD™ System de novo Accessory Tools 2.0


Usage of pre-assembly error correction: This is an optional tool which was
demonstrated to increase contigs length in de novo assembly by factor of 2 to 3. Do not use this tool if coverage is less than 20x. Overcorrection and under-correction are equally bad for de novo assembly; therefore use balanced number of local and global rounds of error correction. For example, the pipeline will use 1 global and 3 local rounds if reads are 25bp long, and 2 global and 5 local rounds if reads are 50bp long.


Is it just me? I would think it would be trivial for the correction tool to correct only when coverage is > 20x. Not sure why you would need human intervention.
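That gating could indeed be automated in a thin wrapper; a hypothetical sketch (the `saet` flags below are invented placeholders for illustration, not the real tool's CLI):

```shell
# Hypothetical automation of the manual's advice: skip correction below 20x
# coverage, and pick the round counts by read length.
readlen=50
coverage=30   # estimated input coverage, assumed known up front
if [ "$coverage" -lt 20 ]; then
  echo "skipping SAET: coverage below 20x"      # manual: do not correct <20x
elif [ "$readlen" -le 25 ]; then
  echo "would run: saet --global 1 --local 3"   # 25 bp reads: 1 global, 3 local
else
  echo "would run: saet --global 2 --local 5"   # 50 bp reads: 2 global, 5 local
fi
```

Swapping the `echo`s for the real invocation would remove the human from the loop entirely.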

Wednesday, 26 May 2010

Coral Transcriptomics-a budget NGS approach?

Was surprised I didn't blog about this earlier.
Dr Mikhail Matz is a researcher in the field of coral genomics; his approach to de novo transcriptomics for an organism whose genome is unavailable is worth a look.


his compute cluster is basically
"two Dell PowerEdge 1900 servers joined together with ROCKS clustering software v5.0. Each server had: two Intel Quad Core E5345 (2.33 Ghz, 1333 Mhz FSB, 2x4MB L2 Cache) CPU’s and 16 GB of 667 Mhz DDR2 RAM. The cluster had a combined total of 580 GB disk space."




Tools used are
- Blast executables from NCBI, including blast, blastcl3, and blastclust
- Washington University blast (Wu-blast)
- ESTate sequence clustering software
- Perl

He admits that the assembled transcriptome might be incomplete (~40,000 contigs with five-fold average sequencing coverage; see Figure 2 for the size distribution of the assembled contigs).
But it is "good enough" to use as a reference transcriptome to align SOLiD reads accurately and to generate the coverage that 454 can't give for the same amount of grant money.

the results are published in BMC Genomics

Not sure if you have heard of just-in-time inventory, but I think "good enough" science takes a bit of daring to spend that money to ask those what-ifs.


Friday, 26 February 2010

Thursday, 24 December 2009

De novo assembly with ABI SOLiD reads

Ran a trial run with sample data!

Workflow
perl solid_denovo_preprocessor.pl --run fragment --file fixed-saet/reads.csfasta
./velveth_de output_directory/ 21 -short /home/kev/bin/source/solid-denovo-acc-tools/output/doubleEncoded_input.de -strand_specific

./velvetg_de output_directory/ -read_trkg yes -amos_file yes
./solid_denovo_postprocessor.pl --afgfile Velvet_asm.afg --csfasta colorspace_input.csfasta --run fragment

denovoadp sample_input 200 > sample_output
Voila! Base space!
Doing more testing; will update the post with remarks later.

@Victor,
you may be able to find denovoadp here, but it's a new version and I have yet to test it:
SOLiD™ System de Novo Assembly Tools 2.0 - The tools in this project provide the ability to create de novo assemblies from SOLiD™ colorspace reads.
