Do have a go at this article citing David Dooling, assistant director of informatics at The Genome Center at Washington University, and a few others.
Looking ahead, as genome sequencing costs fall, there's going to be more data generated than ever. And the article rightly states that every postdoc with a different analysis method will have their own copy of a canonical dataset. Personally, I think this is a call for tools and data to be moved to the cloud.
Getting the data up there in the first place is a choke point, but using the cloud will most definitely force everyone to use a single copy of shared data.
Google tackled the problem of processing large datasets over slow interconnects with the MapReduce paradigm.
There are tools available that make use of this already, but they are not popular yet. I still get weird stares when I tell people about Hadoop filesystems. Sigh. More education is needed!
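For anyone who hasn't met MapReduce before, here's a toy sketch of the idea in plain Python. It is not Hadoop and not Google's implementation, just the map / shuffle / reduce steps run on one machine. The k-mer counting task, the function names and the two reads are all made up by me for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def kmer_counts(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical toy reads, not real sequencing data.
reads = ["ACGTAC", "GTACGT"]
counts = kmer_counts(reads)
print(counts)
```

The point is that each read is mapped independently and each k-mer is reduced independently, so on a real cluster those steps can run on whichever node already holds the data instead of dragging the data across a slow link.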
In summary, my take on the matter would be to have a local Hadoop FS for storing data with redundancy for analysis, and to move a copy of the data to your favourite cloud for archival, sharing and possibly analysis as well (MapReduce is available on Amazon too).
Another issue is whether researchers are keeping data out of sentimentality or because there's a real scientific need.
I have kept copies of my PAGE gel scans from my BSc days archived in some place where the sun doesn't shine, but honestly, I can't foresee myself going back to the data. Part of the reason I kept them was that I spent a lot of time and effort getting them.
Storage, large datasets and computational needs are not new problems for the world. They are new to biologists, however. I am afraid that because of miscommunication, a lot of researchers out there are going to rush to overspec their clusters and storage when the money could be better spent on sequencing. I am sure that would make some IT vendors very happy though, especially in this financial downturn for IT companies.
I don't know if I am missing anything though. Comments welcome!