Wednesday, 5 September 2012

[pub] SEED: efficient clustering of next-generation sequences.


Bioinformatics. 2011 Sep 15;27(18):2502-9. Epub 2011 Aug 2.

Source

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION:

Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

RESULTS:

Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance than similar tools. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
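
To make the idea of hash-based neighbor lookup concrete, here is a minimal Python sketch. It is not SEED's implementation and does not reproduce block spaced seeds. It assumes equal-length reads, ignores overhanging residues, and swaps in a simpler pigeonhole trick: if two reads differ by at most three mismatches and each is cut into four blocks, at least one block must match exactly, so a hash table keyed on blocks finds every candidate neighbor. The function names (`hamming`, `block_keys`, `cluster_reads`) and the greedy choice of the first unassigned read as a cluster center are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(seq, n_blocks):
    """Yield (block_index, block_text) pairs that partition seq into n_blocks."""
    size = len(seq) // n_blocks
    for i in range(n_blocks):
        start = i * size
        # The last block absorbs any remainder so the whole read is covered.
        end = len(seq) if i == n_blocks - 1 else start + size
        yield i, seq[start:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering sketch (assumption: NOT the SEED algorithm itself).

    Each not-yet-assigned read becomes a center and collects every
    unassigned read of the same length within max_mismatches of it.
    Returns a list of clusters, each a list of read indices.
    """
    n_blocks = max_mismatches + 1  # pigeonhole: one block must match exactly

    # Hash table: (read length, block index, block text) -> read indices.
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for key in block_keys(seq, n_blocks):
            index[(len(seq),) + key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for cid, center in enumerate(reads):
        if assigned[cid]:
            continue
        assigned[cid] = True
        members = [cid]

        # Candidates share at least one exact block with the center.
        candidates = set()
        for key in block_keys(center, n_blocks):
            candidates.update(index[(len(center),) + key])

        # Verify each candidate with a full mismatch count.
        for rid in sorted(candidates):
            if not assigned[rid] and hamming(center, reads[rid]) <= max_mismatches:
                assigned[rid] = True
                members.append(rid)
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "ACGAACGTACGT"]
    print(cluster_reads(demo))
```

The hash lookup keeps the expensive mismatch count to reads that already share a block with the center, which is the general reason seed-style indexing scales better than all-against-all comparison. How SEED actually selects virtual centers and lays out its seeds is described in the paper, not in this sketch.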

AVAILABILITY:

The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

CONTACT:

thomas.girke@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
PMID: 21810899 [PubMed - indexed for MEDLINE]
PMCID: PMC3167058