Showing posts with label storage. Show all posts

Sunday, 6 May 2012

Slides: 2012: Trends from the Trenches



by Chris Dagdigian on Apr 26, 2012
Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA 


Wednesday, 22 February 2012

Amazon S3 for temporary storage of large datasets?

Just did a rough calculation with the AWS Simple Monthly Calculator, and the numbers are quite scary!

For a hypothetical 50 TB dataset (a single S3 object maxes out at 5 TB, so a dataset this size would be split across multiple objects),
it costs $4,160.27 to store it for a month!

Transferring it out costs another $4,807.11!

Over three years, storage alone comes to roughly $149,000, which I guess would pay for an enterprise storage solution outright, with zero transfer costs.
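The arithmetic above is easy to sketch. The per-GB rates below are illustrative assumptions chosen to roughly reproduce the calculator's figures, not AWS's actual (tiered) price list:

```python
# Back-of-envelope S3 cost sketch for a 50 TB dataset.
# NOTE: both rates are assumed flat rates for illustration only;
# real S3 pricing is tiered and changes over time.
STORAGE_RATE = 0.08125   # assumed $/GB-month
EGRESS_RATE = 0.09       # assumed $/GB transferred out

dataset_gb = 50 * 1024   # 50 TB expressed in GB

monthly_storage = dataset_gb * STORAGE_RATE
egress_once = dataset_gb * EGRESS_RATE
three_year_storage = monthly_storage * 36

print(f"Monthly storage:  ${monthly_storage:,.2f}")   # ~$4,160/month
print(f"One-time egress:  ${egress_once:,.2f}")
print(f"3-year storage:   ${three_year_storage:,.2f}")  # ~$150K over 3 years
```

The point stands regardless of the exact rate: at tens of terabytes, cloud storage billed monthly adds up to enterprise-array money within a few years.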

At this point in time, I don't think AWS S3 is really viable for sequence archival. I wonder whether data deduplication could help reduce cloud storage costs... surely BAM files from similar samples share a lot of content at the byte level, no?
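As a thought experiment, block-level deduplication hashes fixed-size chunks and stores each unique chunk once. A minimal sketch (the data here is synthetic, standing in for sequence files):

```python
import hashlib
import os

def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Fraction of storage saved by keeping each unique fixed-size chunk once."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return 1 - len(unique) / len(chunks)

# Highly repetitive uncompressed data dedups very well:
print(dedup_ratio(b"ACGT" * 100_000))    # ~0.98: nearly every chunk repeats

# Random bytes, standing in for compressed data, barely dedup at all:
print(dedup_ratio(os.urandom(400_000)))  # 0.0: every chunk is unique
```

One caveat to my own question above: BAM files are gzip-style compressed, and compression tends to scramble byte-level similarity, so fixed-block dedup would likely gain far less on BAM than on uncompressed SAM or FASTQ.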


Monday, 6 February 2012

Public Poll: What data storage solutions are you using now?


Just a public poll I created, curious as to what systems are out there and what is being used day to day. Results are viewable immediately, and the order of the choices is randomized.


Which data storage platform are you using?

- BlueArc (http://www.bluearc.com/)
- Isilon scale-out NAS (http://www.isilon.com/)
- Panasas (http://www.panasas.com/)
- DataDirect Networks (DDN) (http://www.ddn.com/)
- Amazon Web Services (http://aws.amazon.com/)
- In-house solutions using commodity hardware
- IBM GPFS (http://www-03.ibm.com/systems/software/gpfs/)
- Other: (please specify)

Friday, 8 October 2010

Re-Defining Storage for the Next Generation

Do have a go at this article, which quotes David Dooling, assistant director of informatics at The Genome Center at Washington University, among others.
Looking ahead, as genome sequencing costs fall, more data than ever will be generated, and the article rightly states that every postdoc with a different analysis method will keep a copy of a canonical dataset. Personally, I think this is a call for tools and data to be moved to the cloud.
Getting the data up there in the first place is a choke point, but using the cloud would most definitely force everyone to work from a single copy of shared data.
Google solved the problem of tackling large datasets over slow interconnects with the MapReduce paradigm.
There are tools available that make use of this already, but they are not yet popular. I still get weird stares when I tell people about Hadoop filesystems. Sigh. More education is needed!
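The MapReduce idea in miniature: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. A toy Python sketch counting reads per chromosome over hypothetical SAM-style records (names and data are made up for illustration):

```python
from collections import defaultdict

# Hypothetical SAM-style records: (read_name, chromosome, position)
records = [
    ("r1", "chr1", 100), ("r2", "chr1", 250),
    ("r3", "chr2", 75),  ("r4", "chr1", 900),
]

def map_phase(record):
    _, chrom, _ = record
    yield (chrom, 1)                 # emit one count per aligned read

def shuffle(pairs):
    groups = defaultdict(list)       # group values by key, as the framework would
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)          # total reads per chromosome

pairs = [p for rec in records for p in map_phase(rec)]
results = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(results)  # {'chr1': 3, 'chr2': 1}
```

The framework's value is that map and reduce run on the nodes holding the data, so only the small grouped intermediates cross the slow interconnect.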

In summary, my take would be: keep data on a local Hadoop filesystem, with redundancy, for analysis, and move a copy to your favourite cloud for archival, sharing, and possibly analysis as well (MapReduce is available on Amazon too).
Another issue is whether researchers are keeping data out of sentimentality or real scientific need.
I have kept copies of the PAGE gel scans from my BSc days archived some place where the sun doesn't shine, but honestly I can't foresee myself ever going back to that data. Part of the reason I kept them is that I spent a lot of time and effort getting those scans.

Storage, large datasets, and heavy computational needs are not new problems for the world; they are new to biologists, however. I am afraid that, because of miscommunication, a lot of researchers are going to rush to over-spec their clusters and storage when the money would be better spent on sequencing. I am sure that would make some IT vendors very happy, though, especially in this financial downturn for IT companies.

I don't know if I am missing anything though.. comments welcome!

Friday, 30 April 2010

BioTeam Inc. - Slides from HPC Trends Talk

BioITWorld – Slides from HPC Trends Talk

Love the storage war stories from 2009-2010! They're sooo true!
An excerpt from the slides:

#1 - Unchecked Enterprise Architects
 •  Scientist: “My work is priceless, I must be able to access it at all times”
 •  Storage Guru: “Hmmm… you want H/A, huh?”
 •  System delivered:
    -  Small (< 50 TB) Enterprise FC SAN
    -  Asynchronous replication to remote DR site
    -  Can’t scale, can’t do NFS easily
    -  ~$500K/year in support & operational costs



 •  Lessons learned:
    -  Corporate storage architects may not fully understand the needs of HPC and research informatics users
    -  End-users may not be precise with terms: “extremely reliable” means “no data loss”, not 99.999% uptime at a cost of millions
    -  When true costs are explained, many research users will trade a small amount of uptime or availability for more capacity or capabilities

Tuesday, 2 March 2010

Image files from NGS sequencers

Good points raised in this article about how keeping your image files can be very expensive!

Do you keep them?
Post your comments!
