Showing posts with label Next Generation Sequencing. Show all posts

Wednesday, 2 October 2013

Biome | Q&A with Rich Roberts on single-molecule sequencing technology

The exciting part of single-molecule sequencing, for me, was the ability to sequence low-abundance transcripts and to obtain phased haplotypes for human sequencing. A high error rate nullifies any advantage in these areas. But in the end it is a tool, and what matters is how you use it to get meaningful results.


"As the previously rapid climb in cost efficiency brought about by next-generation sequencing plateaus, the failure of single-molecule sequencing to deliver might leave some genomics aficionados despondent about the prospects for their field. But a recentCorrespondence article in Genome Biology saw Nobel laureate Richard Roberts, together with Cold Spring Harbor’s Mike Schatz and Mauricio Carneiro of the Broad Institute, argue that the latest iteration of Pacific Biosciences’ SMRT platform is a powerful tool, whose value should be reassessed by a skeptical community.
In this Q&A, Roberts tells us why he thinks there’s a need for re-evaluation, and what sparked his interest in genomics in the first place."

http://www.biomedcentral.com/biome/rich-roberts-discusses-single-molecule-sequencing-technology/


Correspondence

The advantages of SMRT sequencing

Roberts RJ, Carneiro MO and Schatz MC
Genome Biology 2013, 14:405


Friday, 28 June 2013

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Each time I change jobs, I go through the adventure (and sometimes pain) of relearning the computing resources available to me: personal (my own machine), lab (a small shared pool), and the entire institute/company/school (usually not enough to go around).
Depending on the scope of the job, the number of cores needed, and how long it will run, I then decide which of the three resources to use.
Sometimes grant money appears magically and my boss asks what I need to buy (OK, to be honest, this is rare). Hence it's always nice to keep a lookout on what's available on the market and who is using what for which task, so that one day when grant money magically appears, I won't be stumped for an answer.

Excerpted from the provisional PDF are three points with which I fully agree:

Three GiB of RAM per core is not enough
You wouldn't believe the number of things I tried in order to outsmart the 'system' just to squeeze enough RAM for my jobs, like hunting for parallel queues, which often have a larger RAM allocation, or running small test jobs to make sure everything works before scaling up, only to have the full job fail after two days due to insufficient RAM.
MPI is not widely used in NGS analysis
Many of the queues in the university's shared resource have ample resources for my jobs but are reserved for MPI jobs, so I can't touch those at all.
A central file system helps keep redundancy to a minimum
Balancing RAM against compute cores to make job splitting efficient was one thing. The other pain in the aXX was having to move files off the compute node as soon as the job finished and clear all intermediate files. There were times when the job might have failed, but because I deleted the intermediate files in the last step of the pipeline bash script, I couldn't be sure it had run to completion. In the end I had to rerun the job and keep the intermediate files (a defensive pattern for this is sketched below).
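
For what it's worth, here is a minimal sketch (my own, not from the paper) of the defensive pattern I now prefer: keep every intermediate file until a completion sentinel confirms the run finished, copy the results to the central file system, and only then clear the scratch space. The commands, paths and file patterns are placeholders.

```python
#!/usr/bin/env python3
"""Minimal pipeline driver that keeps intermediate files until the run is
verified complete. Commands and paths are hypothetical placeholders."""

import shutil
import subprocess
from pathlib import Path

WORKDIR = Path("work")        # scratch space on the compute node (assumed)
FINAL = Path("results")       # shared/central file system (assumed)
DONE_FLAG = WORKDIR / "PIPELINE_DONE"

STEPS = [
    # (description, command) -- placeholder commands for illustration only
    ("align reads", ["bwa", "mem", "ref.fa", "reads.fq"]),
    ("sort alignments", ["samtools", "sort", "-o", "aln.sorted.bam", "aln.bam"]),
]

def run_pipeline():
    WORKDIR.mkdir(exist_ok=True)
    for desc, cmd in STEPS:
        print(f"[pipeline] {desc}: {' '.join(cmd)}")
        # check=True aborts on the first failed step, leaving all
        # intermediates in WORKDIR for post-mortem debugging.
        subprocess.run(cmd, cwd=WORKDIR, check=True)
    DONE_FLAG.touch()          # sentinel written only if every step succeeded

def clean_up():
    # Delete intermediates only after the sentinel confirms completion
    # and the final outputs have been copied off the node.
    if not DONE_FLAG.exists():
        print("[pipeline] no completion sentinel; keeping intermediates")
        return
    FINAL.mkdir(exist_ok=True)
    for bam in WORKDIR.glob("*.sorted.bam"):
        shutil.copy2(bam, FINAL / bam.name)
    shutil.rmtree(WORKDIR)

if __name__ == "__main__":
    try:
        run_pipeline()
    finally:
        clean_up()
```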


Anyway, for more info you can check out the article below:

http://www.gigasciencejournal.com/content/2/1/9/abstract

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Samuel Lampa, Martin Dahlö, Pall I Olason, Jonas Hagberg and Ola Spjuth
GigaScience 2013, 2:9 doi:10.1186/2047-217X-2-9
Published: 25 June 2013

Abstract (provisional)

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

The complete article is available as a provisional PDF. The fully formatted PDF and HTML versions are in production.

Wednesday, 26 September 2012

Next-generation Phylogenomics Using a Target Restricted Assembly Method.


Very interesting to turn the assembly problem backwards... though I suppose it has limited applications outside of phylogenomics, since you need the target protein sequences available in the first place.

I am not sure if there are tools that can easily extract mini-assemblies from BAM files, i.e. pull out aligned reads in their entirety rather than trimmed to the region you specify (a sketch of what I mean follows). That would be nice/useful when inspecting assemblies in particular regions and trying to add new reads or information to them. (Do we need a phrap/consed for NGS de novo assembly?)
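
For what it's worth, here is a minimal sketch of the kind of extraction I have in mind, using pysam (an assumption on my part; the file names and region are hypothetical). fetch() yields every read overlapping the region in full, not trimmed to it, so dumping those reads to FASTQ gives the raw material for a local mini-assembly.

```python
#!/usr/bin/env python3
"""Extract full-length reads overlapping a region from a BAM file and write
them as FASTQ for local re-assembly. Paths and region are illustrative."""

import pysam

BAM = "sample.bam"                    # assumed coordinate-sorted and indexed
REGION = ("chr1", 100_000, 105_000)   # hypothetical region of interest

with pysam.AlignmentFile(BAM, "rb") as bam, open("region_reads.fq", "w") as out:
    for read in bam.fetch(*REGION):
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        # query_sequence/query_qualities are the complete read as stored in
        # the BAM record, not clipped to the requested region.
        seq = read.query_sequence
        quals = read.query_qualities
        if seq is None or quals is None:
            continue
        qual_str = "".join(chr(q + 33) for q in quals)
        out.write(f"@{read.query_name}\n{seq}\n+\n{qual_str}\n")
```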


Mol Phylogenet Evol. 2012 Sep 18. pii: S1055-7903(12)00364-8. doi: 10.1016/j.ympev.2012.09.007. [Epub ahead of print]

Next-generation Phylogenomics Using a Target Restricted Assembly Method.

Source

Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, Champaign, IL 61820, USA. Electronic address: kjohnson@inhs.uiuc.edu.

Abstract

Next-generation sequencing technologies are revolutionizing the field of phylogenetics by making available genome scale data for a fraction of the cost of traditional targeted sequencing. One challenge will be to make use of these genomic level data without necessarily resorting to full-scale genome assembly and annotation, which is often time and labor intensive. Here we describe a technique, the Target Restricted Assembly Method (TRAM), in which the typical process of genome assembly and annotation is in essence reversed. Protein sequences of phylogenetically useful genes from a species within the group of interest are used as targets in tblastn searches of a data set from a lane of Illumina reads for a related species. Resulting blast hits are then assembled locally into contigs and these contigs are then aligned against the reference "cDNA" sequence to remove portions of the sequences that include introns. We illustrate the Target Restricted Assembly Method using genomic scale datasets for 20 species of lice (Insecta: Psocodea) to produce a test phylogenetic data set of 10 nuclear protein coding gene sequences. Given the advantages of using DNA instead of RNA, this technique is very cost effective and feasible given current technologies.
Copyright © 2012. Published by Elsevier Inc.

PMID: 23000819 [PubMed - as supplied by publisher]
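
To make the abstract's workflow a bit more concrete, here is a rough sketch of how a TRAM-like run could be wired together; the tool choices (BLAST+ tblastn and CAP3 for the local assembly), the e-value cutoff and the file names are my assumptions for illustration, not the authors' implementation.

```python
#!/usr/bin/env python3
"""Rough sketch of a TRAM-like target restricted assembly: search target
proteins against raw Illumina reads with tblastn, pull the matching reads,
and assemble them locally. Tools, cutoffs and file names are illustrative."""

import subprocess
from pathlib import Path

TARGET_PROTEINS = "targets.faa"   # phylogenetically useful genes (protein sequences)
READS_FASTA = "reads.fasta"       # one lane of Illumina reads, converted to FASTA
HITS_TSV = "hits.tsv"

# 1. Build a nucleotide BLAST database from the reads and run tblastn.
subprocess.run(["makeblastdb", "-in", READS_FASTA, "-dbtype", "nucl"], check=True)
subprocess.run(
    ["tblastn", "-query", TARGET_PROTEINS, "-db", READS_FASTA,
     "-evalue", "1e-5", "-outfmt", "6", "-out", HITS_TSV],
    check=True,
)

# 2. Collect IDs of reads hit by any target (subject ID is column 2 of outfmt 6).
hit_ids = {line.split("\t")[1]
           for line in Path(HITS_TSV).read_text().splitlines() if line.strip()}

# 3. Write the hit reads to their own FASTA file.
subset = Path("hit_reads.fasta")
with open(READS_FASTA) as src, open(subset, "w") as dst:
    keep = False
    for line in src:
        if line.startswith(">"):
            keep = line[1:].split()[0] in hit_ids
        if keep:
            dst.write(line)

# 4. Assemble the hit reads locally (CAP3 writes hit_reads.fasta.cap.contigs).
# The contigs would then be aligned to a reference cDNA to trim intron-derived
# segments before building the phylogenetic data matrix.
subprocess.run(["cap3", str(subset)], check=True)
```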

Wednesday, 5 September 2012

[pub] SEED: efficient clustering of next-generation sequences.


Bioinformatics. 2011 Sep 15;27(18):2502-9. Epub 2011 Aug 2.

SEED: efficient clustering of next-generation sequences.

Source

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION:

Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

RESULTS:

Here, we introduce SEED - an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When used as a preprocessing utility for genome and transcriptome assembly, it reduced the time and memory requirements of the Velvet assembler by 60-85% and 21-41%, respectively, while generating assemblies with longer contigs and 12-27% larger N50 values than non-preprocessed data. Compared with other stand-alone clustering tools, SEED showed similar or better clustering performance with up to 10-fold better time and memory efficiency, and the results also demonstrated its utility for clustering transcriptome data from unsequenced organisms.

AVAILABILITY:

The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

CONTACT:

thomas.girke@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
PMID: 21810899 [PubMed - indexed for MEDLINE]
PMCID: PMC3167058
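
To give a feel for the block spaced seed idea above, here is a toy sketch (in no way the SEED implementation): reads are bucketed by the bases at a fixed set of seed positions, one read per bucket acts as the virtual center, and members within a small Hamming distance of it join the cluster. The seed pattern and thresholds are invented for illustration.

```python
#!/usr/bin/env python3
"""Toy spaced-seed clustering of short reads (illustrative only, not SEED).
Reads sharing the bases at the seed positions land in the same hash bucket;
within a bucket, reads within MAX_MISMATCHES of the chosen center cluster."""

from collections import defaultdict

# Care-positions of a spaced seed over a 20 bp prefix (made-up pattern).
SEED_POSITIONS = (0, 2, 3, 5, 8, 9, 12, 14, 17, 19)
MAX_MISMATCHES = 3

def seed_key(read: str) -> str:
    """Hashable key built from the bases at the spaced-seed positions."""
    return "".join(read[i] for i in SEED_POSITIONS)

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def cluster(reads: list[str]) -> dict[str, list[str]]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        buckets[seed_key(read)].append(read)

    clusters: dict[str, list[str]] = {}
    for members in buckets.values():
        center = members[0]   # first read acts as the "virtual center"
        kept = [r for r in members if hamming(r, center) <= MAX_MISMATCHES]
        clusters[center] = kept
    return clusters

if __name__ == "__main__":
    demo = [
        "ACGTACGTACGTACGTACGT",
        "ACGTACGTACGTACGAACGT",   # one mismatch away from the first read
        "TTTTTTTTTTTTTTTTTTTT",
    ]
    for center, members in cluster(demo).items():
        print(center, "->", len(members), "read(s)")
```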

Tuesday, 15 May 2012

NATURE BIOTECHNOLOGY | Performance comparison of benchtop high-throughput sequencing platforms


Performance comparison of benchtop high-throughput sequencing platforms

Nature Biotechnology 30, 434–439 (2012). doi:10.1038/nbt.2198

Abstract

Three benchtop high-throughput sequencing instruments are now available. The 454 GS Junior (Roche), MiSeq (Illumina) and Ion Torrent PGM (Life Technologies) are laser-printer sized and offer modest set-up and running costs. Each instrument can generate data required for a draft bacterial genome sequence in days, making them attractive for identifying and characterizing pathogens in the clinical setting. We compared the performance of these instruments by sequencing an isolate of Escherichia coli O104:H4, which caused an outbreak of food poisoning in Germany in 2011. The MiSeq had the highest throughput per run (1.6 Gb/run, 60 Mb/h) and lowest error rates. The 454 GS Junior generated the longest reads (up to 600 bases) and most contiguous assemblies but had the lowest throughput (70 Mb/run, 9 Mb/h). Run in 100-bp mode, the Ion Torrent PGM had the highest throughput (80–100 Mb/h). Unlike the MiSeq, the Ion Torrent PGM and 454 GS Junior both produced homopolymer-associated indel errors (1.5 and 0.38 errors per 100 bases, respectively).
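
Using the per-hour throughput figures quoted above, a quick back-of-the-envelope script shows how long each instrument would need to generate 40x coverage of a roughly 5 Mb E. coli genome; the genome size and coverage target are my assumptions, and library prep and fixed per-run time are ignored.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sequencing time for 40x coverage of a ~5 Mb genome,
using the per-hour throughput figures quoted in the abstract."""

GENOME_MB = 5.0     # assumed E. coli genome size (Mb)
COVERAGE = 40       # assumed target depth

throughput_mb_per_h = {
    "MiSeq": 60,            # 1.6 Gb/run, 60 Mb/h
    "454 GS Junior": 9,     # 70 Mb/run, 9 Mb/h
    "Ion Torrent PGM": 90,  # midpoint of the quoted 80-100 Mb/h (100 bp mode)
}

required_mb = GENOME_MB * COVERAGE
for platform, rate in throughput_mb_per_h.items():
    print(f"{platform:>16}: {required_mb / rate:5.1f} h to generate {required_mb:.0f} Mb")
```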


Sunday, 6 May 2012

Slides: 2012: Trends from the Trenches


2012: Trends from the Trenches

by Chris Dagdigian on Apr 26, 2012
Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA 


Thursday, 5 April 2012

CONTRA: copy number analysis for targeted res... [Bioinformatics. 2012] - PubMed - NCBI

http://www.ncbi.nlm.nih.gov/pubmed/22474122
Bioinformatics. 2012 Apr 2. [Epub ahead of print]

CONTRA: copy number analysis for targeted resequencing.

Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, Tothill RW, Halgamuge SK, Campbell IG, Gorringe KL.

Source

Bioinformatics Core Facility, Victorian Breast Cancer Research Consortium Cancer Genetics Laboratory and Molecular Genomics Core Facility, Peter MacCallum Cancer Centre, VIC 3002, Australia; Dept. of Mechanical Engineering, Sir Peter MacCallum Dept. of Oncology, and Dept. of Pathology, University of Melbourne, Parkville, VIC 3010, Australia.

Abstract

MOTIVATION:

In light of the increasing adoption of targeted resequencing as a cost-effective strategy to identify disease-causing variants, a robust method for copy number variation (CNV) analysis is needed to maximize the value of this promising technology.

RESULTS:

We present a method for CNV detection for targeted resequencing data, including whole-exome capture data. Our method calls copy number gains and losses for each target region based on normalized depth of coverage. Our key strategies include the use of base-level log-ratios to remove GC-content bias, correction for an imbalanced library size effect on log-ratios, and the estimation of log-ratio variations via binning and interpolation. Our methods are made available via CONTRA (COpy Number Targeted Resequencing Analysis), a software package that takes standard alignment formats (BAM/SAM) and outputs in variant call format (VCF4.0), for easy integration with other next-generation sequencing analysis packages. We assessed our methods using samples from seven different target enrichment assays, and evaluated our results using simulated data and real germline data with known CNV genotypes.

Availability and implementation: Source code and sample data are freely available under GNU license (GPLv3) at http://contra-cnv.sourceforge.net/
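
As a feel for the core quantity CONTRA works with, here is a minimal sketch (my own illustration, not the CONTRA code) of a library-size-normalized log2 ratio of depth per target region between a test and a control sample; the depths, region names and the +/-0.5 call threshold are made up.

```python
#!/usr/bin/env python3
"""Toy per-region copy-number log-ratio: normalize depths by library size,
then take log2(test/control) for each target region. Illustrative only."""

import math

# Hypothetical mean depths per target region (test sample vs. matched control).
regions = {
    "BRCA1_exon2": (180.0, 95.0),
    "BRCA1_exon3": (60.0, 110.0),
    "TP53_exon4": (100.0, 100.0),
}

# Library-size correction: scale each sample so depths are comparable.
test_total = sum(t for t, _ in regions.values())
ctrl_total = sum(c for _, c in regions.values())

for name, (test_depth, ctrl_depth) in regions.items():
    norm_test = test_depth / test_total
    norm_ctrl = ctrl_depth / ctrl_total
    log_ratio = math.log2(norm_test / norm_ctrl)
    # +/-0.5 is an arbitrary illustrative threshold, not CONTRA's caller.
    call = "gain" if log_ratio > 0.5 else "loss" if log_ratio < -0.5 else "neutral"
    print(f"{name}: log2 ratio = {log_ratio:+.2f} ({call})")
```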

Wednesday, 29 February 2012

Translational Genomics Research Institute, personalised genomics to improve chemotherapy, cloud computing for pediatric cancer


I think it's fantastic that this is happening right now. Given that the cost of sequencing and computing is still relatively high, I can see how the first wave of personalized medicine will be led by non-profit organizations. I am personally curious how this will pan out: will it ultimately be cost-effective for the patients, and will they be able to quantify that?
Kudos to Dell for being part of this exercise, though I wonder if they could have donated more to the data center, or alternatively set up a mega cloud center and donated compute resources instead, since I think the infrastructure and knowledge gleaned will be useful for their marketing and sales.




http://www.hpcinthecloud.com/hpccloud/2012-02-29/cloud_computing_helps_fight_pediatric_cancer.html

Cloud technology is being used to speed computation, as well as manage and store the resulting data. Cloud also enables the high degree of collaboration that is necessary for science research at this level. The scientists have video-conferences where they work off of "tumor boards" to make clinical decisions for the patients in real-time. Before they'd have to ship hard drives to each other to have that degree of collaboration and now the data is always accessible through the cloud platform.


"We expect to change the way that the clinical medicine is delivered to pediatric cancer patients, and none of this could be done without the cloud," Coffin says emphatically. "With 12 cancer centers collaborating, you have to have the cloud to exchange the data."


Dell relied on donations to build the initial 8.2 teraflop high-performance machine. A second round of donations has meant a doubling in resources for this important work, up to an estimated 13 teraflops of sustained performance.


"Expanding on the size of the footprint means we can treat more and more patients in the clinic trial so this is an exciting time for us. This is the first pediatric clinic trial using genomic data ever done. And Dell is at the leading edge driving this work from an HPC standpoint and from a science standpoint."


The donated platform is comprised of Dell PowerEdge Blade Servers, PowerVault Storage Arrays, Dell Compellent Storage Center arrays and Dell Force10 Network infrastructure. It features 148 CPUs, 1,192 cores, 7.1 TB of RAM, and 265 TB Disk (Data Storage). Dell Precision Workstations are available for data analysis and review. TGen's computation and collaboration capacity has increased by 1,200 percent compared to the site's previous clinical cluster. In addition, the new system has reduced tumor mapping and analysis time from a matter of months to days.

Wednesday, 22 February 2012

Amazon S3 for temporary storage of large datasets?

Just did a rough calculation with the AWS calculator, and the numbers are quite scary!

For a hypothetical 50 TB dataset (I haven't found the maximum single S3 object size yet; I seem to recall it's 1 GB),
it costs $4,160.27 to store it for a month!

To transfer it out costs another $4,807.11!

Over 3 years, the cost of storage alone is roughly $149,000, which I guess could instead pay for an enterprise storage solution with zero transfer costs.

At this point in time, I guess one can't really use AWS S3 for sequence archival. I wonder if data deduplication could help reduce cloud storage costs... surely, byte for byte, BAM files should be quite similar... no? (A rough tiered-pricing calculation is sketched below.)
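
For what it's worth, here is a minimal sketch of how such an estimate can be reproduced; the tiered per-GB rates below are placeholders loosely approximating 2012-era S3 pricing and are my assumptions, not figures from the AWS calculator.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope S3 cost estimate for a large dataset under tiered
per-GB pricing. Rates below are illustrative placeholders, not AWS's actual
price list; plug in current prices from the AWS calculator."""

DATASET_TB = 50
GB = DATASET_TB * 1024

# (tier size in GB, $ per GB-month) -- assumed tiers for illustration.
STORAGE_TIERS = [(1024, 0.125), (float("inf"), 0.080)]
# (tier size in GB, $ per GB transferred out) -- assumed tiers for illustration.
EGRESS_TIERS = [(1, 0.000), (10 * 1024, 0.120), (40 * 1024, 0.090), (float("inf"), 0.070)]

def tiered_cost(total_gb: float, tiers) -> float:
    """Walk the pricing tiers, charging each slice of data at its tier rate."""
    remaining, cost = total_gb, 0.0
    for size, rate in tiers:
        slice_gb = min(remaining, size)
        cost += slice_gb * rate
        remaining -= slice_gb
        if remaining <= 0:
            break
    return cost

monthly_storage = tiered_cost(GB, STORAGE_TIERS)
one_off_egress = tiered_cost(GB, EGRESS_TIERS)
print(f"Storage for one month : ${monthly_storage:,.2f}")
print(f"Transfer out (once)   : ${one_off_egress:,.2f}")
print(f"Storage for 36 months : ${monthly_storage * 36:,.2f}")
```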


Saturday, 18 February 2012

Oxford Nanopore megaton announcement: “Why do you need a machine?” – exclusive interview for this blog!


http://pathogenomics.bham.ac.uk/blog/2012/02/oxford-nanopore-megaton-announcement-why-do-you-need-a-machine-exclusive-interview-for-this-blog/

Woke up this morning to a whole bunch of excited tweets about Oxford Nanopore, and I can totally understand why. This is the real democratization of DNA sequencing. Move over benchtop/desktop sequencers, make way for 'laptop sequencers'!

Hmmm, or a cluster of sequencers on your compute cluster...!

With a USB-powered sequencer and a pipette to load the dsDNA, you might have your sequence reads streamed as FASTQ directly to your laptop.
No known limit to read length.
4% sequencing error (the good thing is that the form of the error is known and therefore correctable).


Do read the URL above for more info; here's the excerpted executive summary for the impatient:

Executive Summary
  • Nanopore have announced a strand sequencing method, made possible by a heavily modified biological nanopore and an industrially-fabricated polymer
  • DNA passes through the nanopore and tri-nucleotides in contact with the pore are detected through electrochemistry
  • Demonstrated 2x50kb sense & anti-sense of same molecules (lambda phage) – no theoretical read length limit
  • Can sequence direct from blood without need for sample preparation
  • Two products announced:
    • MinIon – USB disposable sequencer for ~ $900 has 512 nanopores – target 150mb/hour
    • MinIon can run at 120-1000 bases/minute per pore for up to 6 hours
    • GridIon – two versions of rack-mountable sequencer with 2000 nanopores (2nd half 2012), 8000 nanopores (2013)
    • GridIons can be racked in parallel, 20 could do a whole human genome in 15 minutes
    • Each GridIon can do "tens of gigabases" over 24 hours
  • Both machines commercially available 2nd half 2012
  • Sequencing can be paused, sample recovered, replaced, started again
  • Accuracy is 96%, errors are deletions, error profile will improve through software


Check out Forbes interview with 454 / PGM inventor Jon Rothberg

"Rothberg noted that Ion Torrent’s new machine, the Proton, the company showed three completed human genomes yesterday at AGBT. More importantly, he had the machine – not a mock-up or a design – on the stage. “That’s where you need to be to ship mid-year,” he writes."


Over at Genomes Unzipped, Oxford Nanopore CTO Clive Brown related how sequencing library prep can be as simple as diluting rabbit's blood with water. Now that is impressive!




This post is getting too long because I keep updating it. 
Over at Bio-IT World, there's an interview with Clive Brown that includes other interesting details.
The first is the opening paragraph, which is amusing in light of the comments from ONT's rivals:
"Clive Brown, vice president of development and informatics for Oxford Nanopore Technologies (ONT), a.k.a “the most honest guy in all of next-gen sequencing,” as dubbed by The Genome Center's David Dooling, is hoping to catch lightning in a bottle again. "


Oxford Nanopore has not yet revealed details of its future platform, but in early 2009 it published a lovely paper in Nature Nanotechnology showing that its alpha-hemolysin nanopores can discriminate between the four bases of DNA (not to mention a fifth, methyl C).




Getting methylation information directly from your sequencing, sans complicated sample prep? That has to be another selling point.


I'm not sure whether Oxford Nanopore is truly vaporware. However, gauging by the excitement across the blogosphere and the hit rates for those first to blog about it, I think Nanopore is upping the ante for the next 'it' sequencer.
Maybe we can only survive two more AGBTs like this before AGBT fizzles out, as new sequencing technologies fade and our advances in computation trail behind the ability to generate ever more data.
Maybe we will see scientists start attending Big Data tech conferences, or AGBT's main draw will become fancy new software to assemble, align and make sense of all the data being generated...




This picture tells quite a story (Wordle constructed from 3,386 tweets and retweets tagged #AGBT with @s removed).
No prizes for guessing the winner ... 

Friday, 17 February 2012

a tour of various bioinformatics functions in Avadis NGS

Not affiliated with Avadis, but this might be useful for you.




We are hosting an online seminar series on the alignment and analysis of genomics data from “benchtop” sequencers, i.e. MiSeq and Ion Torrent. Our webinar panelists will give a tour of various bioinformatics functions in Avadis NGS that will enable researchers and clinicians to derive biological insights from their benchtop sequencing data.

Seminar #1: MiSeq Data Analysis

Avadis NGS 1.3 provides special support for analyzing data generated by MiSeq™ sequencers. In this webinar, we will describe how the data in a MiSeq generated “run folder” is automatically loaded into the Avadis NGS software during small RNA alignment and DNA variant analysis. This is especially helpful in processing the large number of files generated when the TruSeq™ Amplicon Kits are used. We will describe how to use the Quality Control steps in Avadis NGS to check if the amplicons have sufficient coverage in all the samples. Regions with unexpected coverages can easily be identified using the new region list clustering feature. Webinar attendees will learn how to use the “Find Significant SNPs” feature to quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.


Seminar #2: Ion Torrent Data Analysis

Avadis NGS 1.3 includes a new aligner – COBWeb – that is fully capable of aligning the long, variable-length reads generated by Ion Torrent sequencers. In this webinar, we will show the pre-alignment QC plots and illustrate how they can be used to set appropriate alignment parameters for aligning Ion Torrent reads. For users who choose to import the BAM format files generated by the Ion Torrent Server, we will describe the steps needed for importing amplicon sequencing data into Avadis NGS. Users of the Ion AmpliSeq™ Cancer Panel will learn how to easily import the targeted mutation list and verify the genotype call at the mutation sites. We will also show the new “Find Significant SNPs” feature which helps quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.


Free registration - http://www.avadis-ngs.com/webinar

Datanami, Woe be me