Showing posts with label cluster.

Wednesday, 5 September 2012

[pub] SEED: efficient clustering of next-generation sequences.


Bioinformatics. 2011 Sep 15;27(18):2502-9. Epub 2011 Aug 2.

SEED: efficient clustering of next-generation sequences.

Source

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION:

Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

RESULTS:

Here, we introduce SEED-an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler by 60-85% and 21-41%, respectively, for the datasets used in this study. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS sequences similar to the true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

AVAILABILITY:

The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

CONTACT:

thomas.girke@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
PMID: 21810899 [PubMed - indexed for MEDLINE]
PMCID: PMC3167058 (free full text at PubMed Central)
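
To make the "virtual center" idea in the abstract a bit more concrete, here is a toy Python sketch of center-based read clustering. It is my own illustration, not the actual SEED implementation (which uses block spaced seeds and hash tables to find candidate neighbours quickly); the function names and the brute-force neighbour scan are just for readability.

# Toy center-based clustering of equal-length reads (illustration only, not SEED).
# SEED additionally allows up to three overhanging residues and uses block spaced
# seeds + hash tables instead of this brute-force neighbour scan.

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(1 for x, y in zip(a, b) if x != y)

def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering: each unassigned read becomes a virtual center and
    collects all remaining reads within max_mismatches of it."""
    clusters = []
    unassigned = list(reads)
    while unassigned:
        center = unassigned.pop(0)
        members, rest = [center], []
        for read in unassigned:
            if len(read) == len(center) and hamming(read, center) <= max_mismatches:
                members.append(read)
            else:
                rest.append(read)
        unassigned = rest
        clusters.append(members)
    return clusters

if __name__ == "__main__":
    reads = ["ACGTACGTAA", "ACGTACGTAT", "ACGTACCTAA", "TTTTTTTTTT"]
    for i, members in enumerate(cluster_reads(reads)):
        print("cluster", i, members)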

Friday, 15 July 2011

Is a big ass server a want or a need?

"Big-Ass Servers™ and the myths of clusters in bioinformatics"

A topic title like that has to catch your attention ...

I think that it is useful to have loads of RAM and loads of cores for one person's use. But when the machine is shared (on the university's HPC), you have a hard time juggling resources in a fair manner, especially in bioinformatics, where walltimes and RAM requirements are only known after the analysis is done. An HPC engineer once told me that HPC for biologists means selfish hogging of resources. I can only shrug and concede her point.

I don't know if there's a better way to do the things I currently solve with more RAM and faster disks, but I do know that finding one would probably cost more in development time.


That said, cloud computing is having trouble keeping up with I/O-bound work like bioinformatics. Smaller cloud computing services are all trying to show that they have faster interconnects, but you can't really beat a BAS sitting on your local network.

Performance of Ray @ Assemblathon 2

Ray is one of the assemblers that I watch closely but sadly lack the time to experiment with. Here's a candid email from Sébastien to the Ray mailing list on Ray's performance on the test data:

For those who follow Assemblathon 2, my last run on my testbed (Illumina data from BGI and from Illumina UK for the Bird/Parrot):

(all mate-pairs failed detection because of the many peaks in each library, I will modify Ray to consider that)



Total number of unfiltered Illumina TruSeq v3 sequences: 3 072 136 294, that is ~3 G sequences!


512 compute cores (64 computers * 8 cores/computer = 512)


Typical communication profile for one compute core:


[1,0]:Rank 0: sent 249841326 messages, received 249840303 messages.


Yes, each core sends an average of 250 M messages during the 18 hours !




Peak memory usage per core: 2.2 GiB


Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB


The peak occurs around 3 hours and goes down to 1.1 GiB per node immediately because the pool of defragmentation groups for k-mers occurring once is freed.



The compute cluster I use has 3 GiB per compute core. So using 2048 compute cores would give me 6144 GiB of distributed memory.




Number of contigs: 550764

Total length of contigs: 1672750795
Number of contigs >= 500 nt: 501312
Total length of contigs >= 500 nt: 1656776315
Number of scaffolds: 510607
Total length of scaffolds: 1681345451
Number of scaffolds >= 500 nt: 463741
Total length of scaffolds >= 500: 1666464367

k-mer length: 31

Lowest coverage observed: 1
MinimumCoverage: 42
PeakCoverage: 171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
Estimated genome length: 1226739694 nucleotides
Percentage of vertices with coverage 1: 83.7771 %
DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt

[1,0]: Sequence partitioning: 1 hours, 54 minutes, 47 seconds

[1,0]: K-mer counting: 5 hours, 47 minutes, 20 seconds
[1,0]: Coverage distribution analysis: 30 seconds
[1,0]: Graph construction: 2 hours, 52 minutes, 27 seconds
[1,0]: Edge purge: 57 minutes, 55 seconds
[1,0]: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
[1,0]: Detection of assembly seeds: 16 minutes, 7 seconds
[1,0]: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
[1,0]: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
[1,0]: Merging of redundant contigs: 15 minutes, 45 seconds
[1,0]: Generation of contigs: 1 minutes, 41 seconds
[1,0]: Scaffolding of contigs: 54 minutes, 3 seconds
[1,0]: Total: 18 hours, 3 minutes, 50 seconds


10 largest scaffolds:


257646

266905
268737
272828
281502
294105
294106
296978
333171
397201


# average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes

# Message passing interface rank Name Latency in microseconds
0 r107-n24 138
1 r107-n24 140
2 r107-n24 140
3 r107-n24 140
4 r107-n24 141
5 r107-n24 141
6 r107-n24 140
7 r107-n24 140
8 r107-n25 140
9 r107-n25 139
10 r107-n25 138
11 r107-n25 139
  Sébastien
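
Some back-of-envelope arithmetic on the numbers Sébastien quotes above (my own figures, computed from the email, not part of it):

# Back-of-envelope checks on the figures quoted above (my arithmetic, not Sébastien's).
cores = 512
messages_per_core = 250e6        # ~250 M messages sent per core
wallclock_hours = 18
peak_gib_per_core = 2.2

msg_rate = messages_per_core / (wallclock_hours * 3600)
print("messages per second per core: ~%.0f" % msg_rate)                    # ~3858/s

print("peak distributed memory: ~%.0f GiB" % (cores * peak_gib_per_core))  # ~1126 GiB (~1100 GiB quoted)

# Scaling the same 3 GiB/core budget to 2048 cores:
print("2048 cores * 3 GiB =", 2048 * 3, "GiB of distributed memory")       # 6144 GiB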

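For readers who don't stare at assembly reports every day: the contig and scaffold counts and total lengths above, plus the N50 that usually accompanies them, are simple aggregates over the list of contig (or scaffold) lengths. A minimal sketch of how they are typically computed (my own code, not Ray's):

# Minimal contig/scaffold summary statistics (illustration only, not Ray's code).
def assembly_stats(lengths, min_len=500):
    """Return count, total length and N50 for lengths >= min_len."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    running, n50 = 0, 0
    for n in kept:
        running += n
        if running * 2 >= total:   # N50: length at which half the assembly is reached
            n50 = n
            break
    return {"count": len(kept), "total_length": total, "N50": n50}

if __name__ == "__main__":
    # Hypothetical lengths; the real run above has 501,312 contigs >= 500 nt.
    print(assembly_stats([257646, 266905, 268737, 272828, 397201]))
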
Wednesday, 16 December 2009

Petascale Tools and Genomic Evolution

Abstract from post:
Technological advances in high-throughput DNA sequencing have opened up the possibility of determining how living things are related by analyzing the ways in which their genes have been rearranged on chromosomes. However, inferring such evolutionary relationships from rearrangement events is computationally intensive on even the most advanced computing systems available today.

Research recently funded by the American Recovery and Reinvestment Act of 2009 aims to develop computational tools that will utilize next-generation petascale computers to understand genomic evolution. The four-year $1 million project, supported by the National Science Foundation's PetaApps program, was awarded to a team of universities that includes the Georgia Institute of Technology, the University of South Carolina, and The Pennsylvania State University. 

Author: Hmmm, Petascale Tools, that's a new term for me! But I don't really get the details of what the authors plan to do. AFAIK, computational problems are always tackled with the computing power available in the present.
So if the biggest, baddest computer that you have isn't enough for you, you basically have two options:
a) buy more/new computers
b) improve your algorithm

So are they developing tools that will be used on petascale computers which don't exist yet? Or are they developing algorithms for tools that will need petascale computers but can run on present computing power?


Ahhh, the vagaries of grant applications.




Tuesday, 1 December 2009

New Job new distro

Have started a new job!
Basically I am doing Next Generation Sequencing bioinformatics.
Sounds like a mouthful, but I hope it goes well.

The 1st week was spent sourcing a cheap cluster for analysis. But a 'cheap cluster' is an oxymoron!

Playing around with CentOS now. So far, it's less than enjoyable compared to Ubuntu, especially having to download 7 CDs or a single DVD.
I can't understand how making people download so many RPMs is good for bandwidth or convenience.

I miss my Ubuntu box, but setting up an HPC cluster using Ubuntu might be tricky without tech support.

Any advice from those familiar with ABI SOLiD's offline cluster setup?
