Sunday 30 May 2010

The State of Computational Genomics

It's good to know that things are moving ahead at Sean Eddy's lab.
This blog post makes some very good recommendations, though I do wonder (*chuckle*) how much is helped by the fact that he is "paid by a dead billionaire, not so much by federal tax dollars."

Just kidding! His recommendations clearly come from his vast experience. So read on!

My fav parts:


Plan explicitly for sustainable exponential growth. We keep using metaphors of data “tsunamis” or “explosions”, but these metaphors are misleading. Big data in biology is not an unexpected disastrous event that we have to clean up after. The volume of data will continue to increase exponentially for the foreseeable future. We must make sober plans for sustainable exponential growth (this is not an oxymoron).
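
(An aside of mine, not part of the quote: the arithmetic behind "sustainable exponential growth" is simply that the storage budget stays roughly flat as long as cost per terabyte falls about as fast as data volume grows. A tiny Python sketch, with growth rates and prices that are assumptions for illustration only:)

    # Illustrative only: assumed growth and price numbers, not real projections.
    data_growth_per_year = 2.0   # assume data volume doubles each year
    cost_drop_per_year = 0.5     # assume cost per TB halves each year

    volume_tb, cost_per_tb = 100.0, 50.0   # hypothetical starting point
    for year in range(1, 6):
        volume_tb *= data_growth_per_year
        cost_per_tb *= cost_drop_per_year
        budget = volume_tb * cost_per_tb
        print("year %d: %7.0f TB at $%5.2f/TB -> $%.0f" % (year, volume_tb, cost_per_tb, budget))
    # The budget stays flat because 2.0 * 0.5 == 1.0. If data grows faster
    # than prices fall, the budget itself grows exponentially, and that is
    # what has to be planned for.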

Datasets should be “tiered” into different appropriate representations of different volumes. Currently we tend to default to archiving raw data – which not only maximizes storage/communication challenges, but also impedes downstream analyses that require processed datasets (genome assemblies, sequence alignments, or short reads mapped to a reference genome, rather than raw reads).


Data structures for histograms of short reads mapped to a reference genome coordinate system for a particular ChIP-seq or RNA-seq experiment. Many analyses of ChIP-seq and RNA-seq data don’t need the actual reads, only the histogram.
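
(Another aside of mine rather than part of the quote: the "histogram" he describes is just per-base read depth on the reference coordinate system. It can be built from mapped read start/end positions with a difference array and a running sum, and it is far smaller than the reads themselves. A minimal sketch with made-up toy values:)

    # Minimal sketch: per-base coverage from mapped read intervals.
    # Intervals are 0-based, half-open (start, end); values are toy examples.

    def coverage_histogram(read_intervals, chrom_length):
        diff = [0] * (chrom_length + 1)
        for start, end in read_intervals:
            diff[start] += 1   # a read begins covering here
            diff[end] -= 1     # and stops covering here
        depth, running = [], 0
        for d in diff[:chrom_length]:   # running sum turns deltas into depth
            running += d
            depth.append(running)
        return depth

    reads = [(0, 5), (3, 8), (3, 8)]          # three short reads, 20 bp reference
    print(coverage_histogram(reads, 20))      # [1, 1, 1, 3, 3, 2, 2, 2, 0, 0, ...]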

Reduce waste on subscaled computing infrastructure. NIH-funded investigators with large computing needs are typically struggling. Many are building inefficient computing clusters in their individual labs. This is the wrong scale of computing, and it is wasting NIH money. Several competing forces are at work:


Clusters have short lives: typically about three years. They are better accounted for as recurring consumables, not a one-time capital equipment expense. There are many stories of institutions being surprised when an expensive cluster needs to be thrown away, and the same large quantity of money spent to buy a new one.


In actual experience, computational biologists simply do not use remote computing facilities, preferring to use local ones. HPC experts in other fields tend to assume that this reflects a lack of education in HPC, but many computational genomicists have extensive experience in HPC. I assert that it is in fact an inherent structural issue: computing in biology simply has a different workflow in how it manipulates datasets. For example, a strikingly successful design decision in our HHMI Janelia Farm HPC resource was to make the HPC filesystem the same as the desktop filesystem, minimizing data transfer operations internally.
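
(To make the data-movement point concrete, a back-of-envelope of my own; the dataset size and link speeds are assumed, not from the post. Shipping a dataset to a remote facility can take longer than the analysis itself, which is why a shared filesystem that looks local is such a win.)

    # Back-of-envelope: transfer time = dataset size / effective bandwidth.
    # The 5 TB dataset and link speeds below are assumptions for illustration.

    def transfer_hours(size_tb, gbit_per_s):
        bits = size_tb * 1e12 * 8               # terabytes -> bits
        return bits / (gbit_per_s * 1e9) / 3600

    for label, gbit in [("1 Gbit/s campus link", 1.0), ("10 Gbit/s link", 10.0)]:
        print("5 TB over a %s: %.1f hours" % (label, transfer_hours(5, gbit)))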

NIH funding mechanisms are good at funding individual investigators, or at one-time capital equipment expenses, or at charging services (there are line items on grant application forms that seem to have originated in the days of mainframe computing!), or at large (regional or national) technology resources. Institutions have generally struggled to find appropriate ways to fund equipment, facilities, and personnel for mid-scale technology cores at the department or institute level, and computing clusters are probably the most dysfunctional example.

Cloud computing deserves development, but is not yet a substitute for local infrastructure. Cloud computing will have an increasing impact. It offers the prospect of offloading the difficult infrastructural critical mass issues of clustered computing to very large facilities — including commercial clouds such as Amazon EC2 and Microsoft Azure, but also academic clouds, perhaps even clouds custom-built for genomics applications.

Make better integrated informatics and analysis plans in NHGRI big science projects. NHGRI planning of big science projects is generally a top-down, committee-driven process that excels at the big picture goal but is less well-suited for arriving at a fully detailed and internally consistent experimental design. This is becoming a weakness now that NHGRI is moving beyond data sets of simple structure and enduring utility (such as the Human Genome Project), and into large science projects that ask more focused, timely questions. Without more detailed planning up front, informatics and analysis is reactive and defensive, rather than proactive and efficient. The result tends to be a default into “store everything, who knows what we’ll need to do!”, which of course exacerbates all our data problems. Three suggestions:
  1. A large project should have a single “lead project scientist” who takes responsibility for overall direction and planning. This person takes input from the committee-driven planning process, but is responsible for synthesizing a well-defined plan.
  2. That plan should be written down, at a level of detail comparable to an R01 application. Forcing a consensus written plan will enable better cost/benefit analysis and more advance planning for informatics and analysis.
  3. That plan should be peer reviewed by outside experts, including informatics and analysis experts, before NHGRI approves funding.
 
