Showing posts with label cloud computing.

Wednesday, 29 February 2012

Translational Genomics Research Institute, personalised genomics to improve chemotherapy, cloud computing for pediatric cancer


I think it's fantastic that this is happening right now. Given that the cost of sequencing and computing is still relatively high, I can see how the first wave of personalized medicine will be led by non-profit organizations. I am personally curious how this will pan out and whether it will ultimately be cost-effective for the patients. Would they be able to quantify that?
Kudos to Dell for being a part of this exercise, though I wonder if they could have donated more to the data center, or alternatively set up a mega cloud center and donated compute resources instead, since I think the infrastructure and knowledge gleaned will be useful for their marketing and sales.




http://www.hpcinthecloud.com/hpccloud/2012-02-29/cloud_computing_helps_fight_pediatric_cancer.html

Cloud technology is being used to speed computation, as well as manage and store the resulting data. Cloud also enables the high degree of collaboration that is necessary for science research at this level. The scientists have video-conferences where they work off of "tumor boards" to make clinical decisions for the patients in real time. Before, they'd have to ship hard drives to each other to achieve that degree of collaboration; now the data is always accessible through the cloud platform.


"We expect to change the way that the clinical medicine is delivered to pediatric cancer patients, and none of this could be done without the cloud," Coffin says emphatically. "With 12 cancer centers collaborating, you have to have the cloud to exchange the data."


Dell relied on donations to build the initial 8.2 teraflop high-performance machine. A second round of donations has meant a doubling in resources for this important work, up to an estimated 13 teraflops of sustained performance.


"Expanding on the size of the footprint means we can treat more and more patients in the clinic trial so this is an exciting time for us. This is the first pediatric clinic trial using genomic data ever done. And Dell is at the leading edge driving this work from an HPC standpoint and from a science standpoint."


The donated platform is comprised of Dell PowerEdge Blade Servers, PowerVault Storage Arrays, Dell Compellent Storage Center arrays and Dell Force10 Network infrastructure. It features 148 CPUs, 1,192 cores, 7.1 TB of RAM, and 265 TB Disk (Data Storage). Dell Precision Workstations are available for data analysis and review. TGen's computation and collaboration capacity has increased by 1,200 percent compared to the site's previous clinical cluster. In addition, the new system has reduced tumor mapping and analysis time from a matter of months to days.
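For a sense of scale, a quick back-of-the-envelope check on those figures (my own arithmetic, not from the article) works out to roughly 8 cores per CPU and about 6 GB of RAM per core:

# Rough sanity check of the quoted TGen cluster specs; the figures come from
# the article above, the per-unit values are just my own arithmetic.
cpus, cores, ram_tb, disk_tb = 148, 1192, 7.1, 265

print(f"cores per CPU: {cores / cpus:.1f}")               # about 8 cores per CPU
print(f"RAM per core:  {ram_tb * 1024 / cores:.1f} GB")   # about 6 GB per core
print(f"disk per core: {disk_tb * 1024 / cores:.0f} GB")  # about 228 GB per core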

Thursday, 25 August 2011

Samsung Genome SDS: Not content with just creating your mobile phone

http://www.samsunggenome.com

Samsung has been in the news recently for being in a patent war with Apple. Now they are creating waves of a different kind.

They have just launched a web-based analysis service for Illumina and SOLiD data.

For RNA-seq, SOLiD data is analysed with BioScope, while TopHat/Cufflinks is used for Illumina data.
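Beyond the tool names, I haven't dug into the details, but a minimal sketch of a typical Illumina TopHat/Cufflinks run looks like the following; the Bowtie index, read files and thread count are placeholders, not anything from Samsung's documentation:

"""Minimal sketch of an Illumina RNA-seq run with TopHat/Cufflinks.
Not Samsung's pipeline, just the usual two-step invocation; the Bowtie
index, read files and thread count are placeholders."""
import subprocess

BOWTIE_INDEX = "hg19"                        # assumed pre-built Bowtie index prefix
READS = ["reads_1.fastq", "reads_2.fastq"]   # placeholder paired-end reads
THREADS = "8"

# 1. Spliced alignment of the reads against the genome
subprocess.check_call(["tophat", "-p", THREADS, "-o", "tophat_out",
                       BOWTIE_INDEX] + READS)

# 2. Transcript assembly and FPKM estimation from the accepted alignments
subprocess.check_call(["cufflinks", "-p", THREADS, "-o", "cufflinks_out",
                       "tophat_out/accepted_hits.bam"])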

One edge they have over competitors is the ability to transfer data over the web at 10x the speed of FTP.


They have some preliminary benchmark data here

SOLiD 5500 and 454 data are notably missing from the supported platforms. Nevertheless, it's still worth a browse, I guess.

Wednesday, 10 August 2011

Pauline Ng expects a genome analysis to cost $500.


Pauline Ng is planning open source, open access analytics for the genomes to come.
By Allison Proffitt
August 2, 2011 | SINGAPORE—Pauline Ng’s office is in the Genome building of the Biopolis science park in Singapore, a fitting home for one of the authors of the first published personal genome, that of J. Craig Venter, published in 2007 while Ng was a senior scientist at the J. Craig Venter Institute.
Now Ng leads an expanding group of three bioinformaticists (she’s hiring!) at the Genome Institute of Singapore (GIS). Before her stint at the Venter Institute, Ng worked for Illumina as well as the Fred Hutchinson Cancer Center in Seattle, where she wrote the powerful SIFT algorithm (http://sift-dna.org), a widely used tool to predict the effect of a given amino acid substitution on protein function. 
But sequencing and analysis—today at least—cost the same. “The problem is that right now, companies like Knome are actually charging the same amount for bioinformatics as they are for sequencing. If you sequence more individuals, I’d expect the bioinformatics to go down, but it’s the same price. That means the price is double! If we can make these tools online, accessible for free or at least at cost, I think I can get it to a tenth of the cost.”
Ng plans to do the computation on the Amazon Cloud and, at today’s rates, expects a genome analysis to cost $500. She hopes that these price points will enable doctors and individuals to use genomics. “If we could say, OK, outsource [the sequencing] to these companies. You’re going to get a hard disk. Mail it to Amazon and get your results in a week.”
Ng is not promising a magic cure, and doesn’t even think that this model should be the only one. She just hopes to drive prices down and open the market. “There’s never a guarantee of an answer,” she says. “Even with the software we write, there may not be a guarantee of an answer, but at least…” she pauses and begins again, emphatically. “We can definitely give you the basic annotation and provide the tools that everyone uses. And if it doesn’t work, then you go to an expensive company that really uses the same tools as the academics but with a couple of more bells and whistles. If you try our stuff first, at least you’ve invested only $500 instead of $5,000.” 
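The arithmetic behind her estimate is easy to sketch. Only the $5,000 and $500 figures come from the article; the sequencing price, instance hours and hourly rate below are my own guesses at 2011-era numbers:

# Illustrative arithmetic only: the $5,000 and $500 figures are quoted in the
# article; the sequencing price, instance hours and hourly rate are assumptions.
sequencing_cost     = 5000   # assumed commercial sequencing price ($)
commercial_analysis = 5000   # "charging the same amount for bioinformatics"
print("total with commercial analysis:", sequencing_cost + commercial_analysis)  # 10000

instance_hours   = 200       # assumed compute time for one genome on EC2
hourly_rate      = 2.00      # assumed large-instance on-demand rate ($/hour)
storage_and_misc = 100       # assumed S3/EBS storage and transfer costs ($)
cloud_analysis   = instance_hours * hourly_rate + storage_and_misc
print("DIY cloud analysis estimate:", cloud_analysis)   # 500.0, a tenth of the commercial fee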


Friday, 15 July 2011

Is a big ass server a want or a need?

"Big-Ass Servers™ and the myths of clusters in bioinformatics"

a topic title like that has to catch your attention ... 

I think that it is useful to have loads of RAM and loads of cores for one person's use. But when the machine is shared (on the university's HPC cluster), you have a hard time juggling resources fairly, especially in bioinformatics, where walltimes and RAM requirements are only known after the analysis. An HPC engineer once told me that HPC for biologists means selfish hogging of resources. I can only shrug and concede her point.

I don't know if there's a better way to do the things I do than with more RAM and faster disks, but I do know that finding one would probably cost more in development time.


That said, cloud computing is having trouble keeping up with I/O-bound workloads like bioinformatics. Smaller cloud computing services are all trying to show that they have faster interconnects, but you can't really beat a BAS on a local network.

Thursday, 7 July 2011

BGI Announces Cloud Genome Assembly Service

I am very excited about cloud solutions for de novo assembly, as it is quite computationally intensive, and with parameter tweaking you have a massively parallel problem that just begs for compute cores. I do wonder if there's a need for a cloud solution for resequencing pipelines, especially when they involve BWA, which can be run rather efficiently on a desktop or an in-house cluster. Only whole-genome resequencing might require more compute hours, but I would think that any center doing WGS regularly would have, at the very minimum, a cluster capable of genome resequencing just to store the data before it is analyzed.
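To illustrate why parameter tweaking begs for cores: each parameter combination (k-mer size, coverage cut-off, and so on) is an independent assembly job, so a sweep parallelizes trivially. Here's a toy sketch of a Velvet-style k-mer sweep; this is just the generic pattern, not how Hecate or any BGI pipeline actually works:

"""Toy parameter sweep for de novo assembly: every k-mer value is an
independent job, so they can be farmed out to separate cores or nodes.
Velvet is used only as an example of a k-mer-sensitive assembler."""
import subprocess
from multiprocessing import Pool

READS = "reads.fastq"        # placeholder input
KMERS = range(21, 64, 6)     # odd k-mer sizes to sweep

def assemble(k):
    outdir = f"asm_k{k}"
    # hash the reads, then build contigs for this k-mer size
    subprocess.check_call(["velveth", outdir, str(k), "-fastq", "-short", READS])
    subprocess.check_call(["velvetg", outdir])
    return outdir

if __name__ == "__main__":
    # each worker runs one full assembly; mind the RAM, assemblers are hungry
    with Pool(processes=4) as pool:
        print(pool.map(assemble, KMERS))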

Anyway let's see if BGI will change the computational cloud scene ...


By Allison Proffitt 
July 6, 2011 | SHENZHEN, CHINA—At the BGI Bioinformatics Software Release Conference today, researchers announced two new Cloud-based software-as-a-service offerings for next-gen data analysis. Hecate and Gaea (named for Greek gods) are “flexible computing” solutions for de novo assembly and genome resequencing.
These are “cloud-based services for genetic researchers” so that researchers don’t need to “purchase your own cloud clusters,” said Evan Xiang, part of the flexible computing group at BGI Shenzhen. Hecate will do de novo assembly, and Gaea will run the SOAP2, BWA, Samtools, DIndel, and BGI’s realSFS algorithms. Xiang expects an updated version of Gaea to be released later this year with more algorithms available. ... full article
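For context, a basic single-sample resequencing workflow of the kind such a service wraps, built from standard BWA/SAMtools commands, looks roughly like the sketch below; file names are placeholders, the Dindel/realSFS variant-calling stage is left out, and this is not BGI's actual pipeline:

"""Sketch of a basic resequencing workflow: align with BWA, then convert,
sort and index with SAMtools.  File names are placeholders and the
variant-calling step is omitted; not BGI's pipeline."""
import subprocess

REF, READS = "ref.fa", "reads.fastq"     # placeholder reference and reads

def run(cmd, stdout=None):
    if stdout is None:
        subprocess.check_call(cmd)
    else:
        with open(stdout, "wb") as out:
            subprocess.check_call(cmd, stdout=out)

run(["bwa", "index", REF])                                        # build the reference index
run(["bwa", "aln", "-t", "4", REF, READS], stdout="reads.sai")    # gapped alignment
run(["bwa", "samse", REF, "reads.sai", READS], stdout="aln.sam")  # single-end SAM output
run(["samtools", "view", "-b", "-o", "aln.bam", "aln.sam"])       # SAM -> BAM
run(["samtools", "sort", "-o", "aln.sorted.bam", "aln.bam"])      # coordinate sort
run(["samtools", "index", "aln.sorted.bam"])                      # index for random access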

Thursday, 24 February 2011

Maybe we have to sequence everybody! Every fish! BGI Cloud

Bio-IT World ran an interesting article with this quote:


“The data are growing so fast, the biologists have no idea how to handle this data,” says Li. “I think the Cloud will be the solution. We have to sequence more and more data. Maybe we have to sequence everybody! Every fish! The data keep growing and we need a lot of compute power to process.”
For Chen, there are three priorities for BGI Cloud:
  • Connectivity: With partners across China and the world, “we’ve connected all the people and resources—the sequencers, the samples, the ideas, the compute power, and the storage together to make a greater contribution.”
  • Scalability: Calling the explosion in next-gen sequencing (NGS) a “data tsunami,” Chen says BGI aims to provide the parallel computing resources to help users manage and process these datasets. “If you can’t do the analysis, it’s pointless. We use distributed computing technology in the bioinformatics area. We’re confident we can solve the scalability problem.”
  • Reproducibility: Chen says bioinformatics researchers are happy to show their data and their pet program—SOAP, BWA, and so on. “That’s fine. But analysis is very complicated. The methodology he is actually using is a homemade pipeline. It’s very difficult to reproduce that result. We built this platform not only to solve the capability and connectivity of computing, we want to resolve the problems in reproducing designs and procedures.”
With new NGS gene assembly and SNP calling programs such as Hecate and Gaea about to be released (see, “In the Name of Gods”), Li says it was essential to develop a “run-time environment, a Web-based platform for Cloud storage and reference data, with a feature-rich GUI, and effective bioinformatics analysis software.”


Kevin: It would be interesting to see how Amazon and other cloud providers, together with Galaxy (usegalaxy.org), will take to BGI's offering for producing reproducible data analyses (commercial software providers aside). Their offering also comes at a strange time, when NCBI is discontinuing the SRA. Might BGI Cloud fill the void that the SRA leaves behind?
Everyone is trying to come up with a 'standard' workflow that everyone will adopt, but I feel that the ecology of bioinformatics is such that there's always another 'better' way to tweak an analysis. 'Custom analysis' is a pet phrase of a lot of bench biologists.
Every bioinformatician will know and remember their treasure trove of throw-away scripts that worked beautifully, but only once, for that one set of data.
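One cheap habit that helps: have the throw-away script log the exact command line, parameters and tool version next to its output, so at least the recipe survives even if the script doesn't. A minimal sketch, assuming the tool accepts a --version flag (adjust per tool; the example command is a placeholder):

"""Log the exact command, parameters and tool version alongside an analysis,
so a one-off run can at least be re-described later.  Assumes the tool
accepts a --version flag; the example command is a placeholder."""
import datetime
import json
import subprocess

def run_logged(cmd, logfile="analysis_log.jsonl"):
    # ask the tool for its version first (not every tool supports --version)
    version = subprocess.run(cmd[:1] + ["--version"],
                             capture_output=True, text=True).stdout.strip()
    subprocess.check_call(cmd)                     # run the actual analysis
    record = {"timestamp": datetime.datetime.now().isoformat(),
              "command": cmd,
              "tool_version": version}
    with open(logfile, "a") as fh:                 # append one JSON record per run
        fh.write(json.dumps(record) + "\n")

# e.g. run_logged(["samtools", "flagstat", "aln.sorted.bam"])   # placeholder usage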

Wednesday, 18 August 2010

Playing with NFS & GlusterFS on Amazon cc1.4xlarge EC2 instance types

I wish I had time to do stuff like what they do at BioTeam.
Benchmarking the Amazon cc1.4xlarge EC2 instance.

These are the questions they aimed to answer

We are asking very broad questions and testing assumptions along the lines of:
  • Does the hot new 10 Gigabit non-blocking networking fabric backing the new instance types really mean that “legacy” compute farm and HPC cluster architectures which make heavy use of network filesharing are now possible?
  • How does filesharing between nodes look and feel on the new network and instance types?
  • Are the speedy ephemeral disks on the new instance types suitable for bundling into NFS shares or aggregating into parallel or clustered distributed filesystems?
  • Can we use the replication features in GlusterFS to mitigate some of the risks of using ephemeral disk for storage? 
  • Should the shared storage built from ephemeral disk be assigned to “/scratch” or other non-critical duties due to the risks involved? What can we do to mitigate the risks?
  • At what scale is NFS the easiest and most suitable sharing option? What are the best NFS server and client tuning parameters to use? 
  • When using parallel or cluster filesystems like GlusterFS, what rough metrics can we use to figure out how many data servers to dedicate to a particular cluster size or workflow profile?
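A crude way to get first numbers for a couple of those questions is a simple sequential-write check pointed at an NFS mount versus a GlusterFS mount. The sketch below is nowhere near a proper benchmark (iozone or bonnie++ would be), and the mount paths are placeholders:

"""Crude sequential-write throughput check for comparing mounted filesystems
(e.g. an NFS share vs a GlusterFS volume built on ephemeral disk).
Mount paths are placeholders; not a substitute for iozone/bonnie++."""
import os
import time

def write_throughput(path, size_mb=1024, block_mb=8):
    block = os.urandom(block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as fh:
        for _ in range(size_mb // block_mb):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())        # make sure the data really reaches the server
    return size_mb / (time.time() - start)

for mount in ("/mnt/nfs_share", "/mnt/gluster_share"):   # placeholder mount points
    print(mount, f"{write_throughput(mount + '/bench.tmp'):.1f} MB/s")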

Friday, 14 May 2010

Lincoln Stein makes his case for moving genome informatics to the Cloud

Matthew Dublin summarizes Lincoln's paper in "Making the Case for Cloud Computing & Genomics" on GenomeWeb.

excerpt "....
Stein walks the reader through a nice explanation of what exactly cloud computing is, the benefits of using a compute solution that grows and shrinks as needed, and makes an attempt at tackling the question of the cloud's economic viability when compared to purchasing and managing local compute resources.
The takeaway is that Moore's Law and its effect on sequencing technology will soon force researchers to analyze their mountains of sequencing data in a paradigm where the software comes to the data rather than the current, and opposite, approach. Stein says that this means now, more than ever, cloud computing is a viable and attractive option..... "

I've yet to read it (my weekend bedtime story); I will post comments here.

Tuesday, 13 April 2010

ABI's BioScope pipeline a possible cloud service?

Hmmm, saw an obscure reference to possible cloud hosting of ABI's BioScope software in a poster on the main page.

BioScope™ 1.2: An Applications Framework for SOLiD™ Sequence Data Analysis. (PDF, 172 KB) Suri, P., et al. (AGBT 2010)

"...Similar performance has been observed for BioScope™software deployed on cloud (SOLiDBioScope.com)."

But googling for the link yielded nothing.

Monday, 1 February 2010

CloudBurst and Contrail: using cloud computing to speed up your NGS data analysis

I've been having problems of all sorts trying to do de novo assembly of transcriptome data on my cluster. It's possible that insufficient RAM is the problem... apparently at BGI they have 512 GB RAM beasts.

I think it might be worthwhile to explore changes in computing algorithms rather than hardware upgrades. After all, there comes a point where the cost far exceeds the "worthiness" of an experiment.

Contrail: Assembly of Large Genomes using Cloud Computing

[excerpt .... Preliminary results show Contrail’s contigs are of similar size and quality to those generated by Velvet when applied to small (bacterial) genomes, but provides vastly superior scaling capabilities when applied to large genomes....]

CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
[excerpt ...CloudBurst's running time scales linearly with the number of reads mapped, and with near linear speedup as the number of processors increases. In a 24-processor core configuration, CloudBurst is up to 30 times faster than RMAP executing on a single core...]
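The MapReduce angle is easy to picture with a toy example: mappers emit k-mer seeds from the reference and from the reads, the shuffle groups identical seeds, and reducers turn shared seeds into candidate alignments to extend. The sketch below is only an in-memory stand-in for that idea; CloudBurst's real implementation runs on Hadoop with its own seed-and-extend logic and record formats.

"""Toy illustration of seed-based read mapping expressed as map/shuffle/reduce.
Everything is in memory and exact-match only; CloudBurst's real implementation
runs on Hadoop and extends seeds into full alignments."""
from collections import defaultdict

K = 8
reference = "ACGTTGCAACGTAGCTAGCTTACGATCGATCG"     # toy reference sequence
reads = {"r1": "TTGCAACG", "r2": "TACGATCG"}       # toy 8 bp reads

# "map": emit (seed, origin) pairs from the reference and from each read
pairs = [(reference[i:i + K], ("ref", i)) for i in range(len(reference) - K + 1)]
pairs += [(seq[:K], ("read", name)) for name, seq in reads.items()]

# "shuffle": group everything by seed
groups = defaultdict(list)
for seed, origin in pairs:
    groups[seed].append(origin)

# "reduce": a seed shared by a read and the reference is a candidate hit
for seed, origins in groups.items():
    ref_pos = [p for kind, p in origins if kind == "ref"]
    hit_reads = [n for kind, n in origins if kind == "read"]
    if ref_pos and hit_reads:
        print(seed, "maps", hit_reads, "to reference positions", ref_pos)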

Wednesday, 16 December 2009

48 cores in a single Chip??

This is not a dream. I WOULD like to get my hands on one of these!
abstract:
The SCC has 48 cores hooked together in a network that mimics cloud computing at the chip level, and supports highly parallel "scale-out" programming models. Intel Labs expects to build 100 or more experimental chips for use by dozens of industrial and academic research collaborators around the world, with the goal of developing new software applications and programming models for future many-core processors.
For more information, see Exploring programming models with the Single-chip Cloud Computer research prototype


To Intel: Would you like my mailing address?
