Friday, 28 June 2013

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Each time when I change jobs, I will have to go through the adventure (and sometimes pain) to relearn about the computing resources available to me (personal), lab (small sharing pool), and the entire institute/company/school (Not enought to go around usually).
Depending on the job scope / number of cores / length of the job I would then setup the computing resources to run on either of the 3 resources available to me.
Sometimes, grant money appears magically and I am asked by my boss what do I need to buy (ok TBH this is rare). Hence it's always nice to keep a lookout on what's available on the market and who's using what to do what. So that one day when grant money magically appears, I won't be stumped for an answer.

excerpted from the provisional PDF are three points which I agree fully

Three GiB of RAM per core is not enough
you won't believe the number of things I tried to do to outsmart the 'system' just to squeeze enough ram for my jobs. Like looking for parallel queues which often have a bigger amount of RAM allocation. Doing tests for small jobs to make sure it runs ok before scaling it up and have it fail after two days due to insufficient RAM.
MPI is not widely used in NGS analysis
A lot of the queues in the university shared resource has ample resources for my jobs but were reserved for MPI jobs. Hence I can't touch those at all.
A central ﬁle system helps keep redundancy to a minimum
balancing RAM / compute cores to make the job splitting efficient was one thing. The other pain in the aXX was having to move files out of the compute node as soon as the job is done and clear all intermediate files. There were times where the job might have failed but as I deleted the intermediate files in the last step of the pipeline bash script, I wasn't able to be sure it ran to completion. In the end I had to rerun the job and keeping the intermediate files

anyway for more info you can check out the below

http://www.gigasciencejournal.com/content/2/1/9/abstract

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Samuel Lampa, Martin Dahlö, Pall I Olason, Jonas Hagberg and Ola Spjuth

For all author emails, please log on.

GigaScience 2013, 2:9 doi:10.1186/2047-217X-2-9

Published: 25 June 2013

Abstract (provisional)

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

Kevin's GATTACA World

Friday, 28 June 2013