Showing posts with label journal.

Tuesday, 28 September 2021

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers | Nature Methods


 


https://github.com/GoekeLab/bioinformatics-workflows    


Friday, 28 June 2013

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Each time I change jobs, I go through the adventure (and sometimes pain) of relearning the computing resources available to me: my own machine, the lab's small shared pool, and the entire institute/company/school cluster (usually not enough to go around).
Depending on the scope of the job, the number of cores it needs, and how long it runs, I then decide which of the three resources to use.
Sometimes grant money appears magically and my boss asks me what I need to buy (OK, to be honest, this is rare). Hence it's always nice to keep a lookout on what's available on the market and who's using what to do what, so that one day when grant money magically appears, I won't be stumped for an answer.

Excerpted from the provisional PDF are three points I agree with fully:

Three GiB of RAM per core is not enough
You wouldn't believe the number of things I tried to outsmart the 'system' just to squeeze enough RAM for my jobs: hunting for parallel queues, which often have a bigger RAM allocation, or test-running small jobs to make sure they ran OK before scaling up, only to have the full job fail after two days due to insufficient RAM (a small pilot-run sketch for estimating memory follows below).
MPI is not widely used in NGS analysis
A lot of the queues in the university's shared resource had ample capacity for my jobs but were reserved for MPI jobs, so I couldn't touch those at all.
A central file system helps keep redundancy to a minimum
Balancing RAM against compute cores to make job splitting efficient was one thing. The other pain in the aXX was having to move files off the compute node as soon as a job finished and to clear all intermediate files. There were times when the job might have failed, but because I had deleted the intermediate files in the last step of the pipeline bash script, I couldn't be sure it had run to completion. In the end I had to rerun the job and keep the intermediate files.
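Side note: a cheap way I could have saved myself those two-day failures is to pilot-run one pipeline step on a subsampled input and record the peak memory of its child processes. A minimal sketch (Unix-only; the command line you pass it is whatever step you want to profile):

import resource
import subprocess
import sys

def peak_child_rss_mb(cmd):
    # Run the command to completion, then ask the OS for the peak
    # resident set size reached by child processes.
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return usage.ru_maxrss / 1024.0

if __name__ == "__main__":
    # e.g. python pilot_rss.py mytool --input subsample.fastq
    print("peak RSS of children: %.1f MB" % peak_child_rss_mb(sys.argv[1:]))

Scale the estimate up by the full-input size with a generous safety margin; memory rarely scales perfectly linearly.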


Anyway, for more info, check out the links below:

http://www.gigasciencejournal.com/content/2/1/9/abstract

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Samuel Lampa, Martin Dahlö, Pall I Olason, Jonas Hagberg and Ola Spjuth
GigaScience 2013, 2:9 doi:10.1186/2047-217X-2-9
Published: 25 June 2013

Abstract (provisional)

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.


Monday, 10 September 2012

[pub]: Genome-Wide Association Analysis of Imputed Rare Variants: Application to Seven Common Complex Diseases


http://onlinelibrary.wiley.com/doi/10.1002/gepi.21675/abstract;jsessionid=0E67E391238867DA8CC7EDD1FAABCE88.d03t01

Genet Epidemiol. 2012 Sep 5. doi: 10.1002/gepi.21675. [Epub ahead of print]

Genome-Wide Association Analysis of Imputed Rare Variants: Application to Seven Common Complex Diseases.

Source: Estonian Genome Centre, University of Tartu, Tartu, Estonia.

Abstract

Genome-wide association studies have been successful in identifying loci contributing effects to a range of complex human traits. The majority of reproducible associations within these loci are with common variants, each of modest effect, which together explain only a small proportion of heritability. It has been suggested that much of the unexplained genetic component of complex traits can thus be attributed to rare variation. However, genome-wide association study genotyping chips have been designed primarily to capture common variation, and thus are underpowered to detect the effects of rare variants. Nevertheless, we demonstrate here, by simulation, that imputation from an existing scaffold of genome-wide genotype data up to high-density reference panels has the potential to identify rare variant associations with complex traits, without the need for costly re-sequencing experiments. By application of this approach to genome-wide association studies of seven common complex diseases, imputed up to publicly available reference panels, we identify genome-wide significant evidence of rare variant association in PRDM10 with coronary artery disease and multiple genes in the major histocompatibility complex (MHC) with type 1 diabetes. The results of our analyses highlight that genome-wide association studies have the potential to offer an exciting opportunity for gene discovery through association with rare variants, conceivably leading to substantial advancements in our understanding of the genetic architecture underlying complex human traits.
© 2012 Wiley Periodicals, Inc.

Tuesday, 4 September 2012

PLoS ONE: Adaptive Ridge Regression for Rare Variant Detection

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044173

Abstract

It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.
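The paper has its own estimation and testing procedure; purely to illustrate the general idea of adaptive (iteratively reweighted) ridge regression, here is a toy numpy sketch in which each variant's penalty weight is re-estimated from the current coefficients, so variants with persistently small effects get penalized harder. It ignores covariates and significance testing, so treat it as a sketch of the concept, not the authors' method.

import numpy as np

def adaptive_ridge(X, y, lam=1.0, n_iter=20, eps=1e-6):
    # X: (samples x variants) genotype matrix, y: continuous trait.
    _, p = X.shape
    w = np.ones(p)                      # per-variant penalty weights
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Solve (X'X + lam * diag(w)) beta = X'y
        A = X.T @ X + lam * np.diag(w)
        beta = np.linalg.solve(A, X.T @ y)
        # Heavier penalty on coefficients that stay near zero.
        w = 1.0 / (beta ** 2 + eps)
        w *= p / w.sum()                # keep the total penalty comparable
    return beta

# Toy usage with simulated rare variants (allele frequency 1%):
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.01, size=(500, 30)).astype(float)
beta_true = np.zeros(30); beta_true[:3] = [1.5, -1.0, 2.0]
y = X @ beta_true + rng.normal(size=500)
print(np.round(adaptive_ridge(X, y), 2)[:5])

The reweighting is what lets effects of either sign survive while noise variants shrink, which matches the abstract's point that no same-direction assumption is needed.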

Thursday, 2 August 2012

Geneticists eye the potential of arXiv

Population biologists turn to pre-publication server to gain wider readership and rapid review of results.
Excerpted:

The preprint server arXiv.org is perhaps best known as the preserve of theoretical physicists and astrophysicists. But 2008 saw an influx of submissions of unpublished manuscripts, or preprints, by condensed-matter physicists who wanted to stake claims to the fast-moving subject of iron-based superconductors called pnictides. Now the life sciences may be on the cusp of their own ‘pnictide moment’, with population geneticists leading the charge.
In the past month, leading research groups have posted to arXiv high-profile papers on the genetic history of southern Africans1 and Europeans2. Other prominent population geneticists have submitted methods-based papers to the server, which is hosted by Cornell University in Ithaca, New York. The number of biology papers on the server is still small in comparison with physical-sciences preprints (see ‘Biology opens up’), but Paul Ginsparg, a theoretical physicist at Cornell who founded arXiv in 1991 (ref. 3), welcomes what he hopes could be a sea change.
“It’s wonderful if biologists are belatedly joining the late twentieth century,” he quips. “Welcome to the party; better late than never.”
...
Another attention-grabbing submission by prominent geneticists, posted on 23 July, compares genomic variation in 22 African populations to suggest an ancient genetic link between people in southern and eastern Africa1. One of the paper’s senior authors, geneticist David Reich of Harvard Medical School in Boston, Massachusetts, publishes routinely in Nature and the Public Library of Science journals, and co-author Carlos Bustamante, of Stanford University School of Medicine in California, is a leader in the field. Reich says that first author Joseph Pickrell, also at Harvard Medical School, suggested using arXiv. Reich and the other co-authors saw no good reason not to post the manuscript there. “It could be an example of the younger generation coming in and finding this sort of thing natural,” says Ginsparg....

http://www.nature.com/news/geneticists-eye-the-potential-of-arxiv-1.11091

Tuesday, 29 May 2012

How Not To Be A Bioinformatician

How Not To Be A Bioinformatician
Source Code for Biology and Medicine 2012, 7:3 doi:10.1186/1751-0473-7-3

Abstract
Although published material exists about the skills required for a successful bioinformatics career, strangely enough no work to date has addressed the matter of how to excel at not being a bioinformatician. A set of basic guidelines and a code of conduct is hereby presented to re-address that imbalance for fellow practitioners whose aim is not to succeed in their chosen bioinformatics field. By scrupulously following these guidelines one can be sure to regress at a highly satisfactory rate.

http://www.scfbm.org/content/pdf/1751-0473-7-3.pdf


LMAO

"Be unreachable and isolated. Configure your contact email to either bounce back or
permanently set it to vacation. Miss key meetings or seminars where other colleagues may be presenting their seminal results and never, ever make any attempt at remembering their names or where they work. Reinvent the wheel. Do not keep up with the literature on current methods of research if you possibly can. "


Was this even necessary to include in the paper?

Tuesday, 15 May 2012

NATURE BIOTECHNOLOGY | Performance comparison of benchtop high-throughput sequencing platforms


Performance comparison of benchtop high-throughput sequencing platforms

Nature Biotechnology 30, 434–439 (2012). doi:10.1038/nbt.2198

Abstract

Three benchtop high-throughput sequencing instruments are now available. The 454 GS Junior (Roche), MiSeq (Illumina) and Ion Torrent PGM (Life Technologies) are laser-printer sized and offer modest set-up and running costs. Each instrument can generate data required for a draft bacterial genome sequence in days, making them attractive for identifying and characterizing pathogens in the clinical setting. We compared the performance of these instruments by sequencing an isolate of Escherichia coli O104:H4, which caused an outbreak of food poisoning in Germany in 2011. The MiSeq had the highest throughput per run (1.6 Gb/run, 60 Mb/h) and lowest error rates. The 454 GS Junior generated the longest reads (up to 600 bases) and most contiguous assemblies but had the lowest throughput (70 Mb/run, 9 Mb/h). Run in 100-bp mode, the Ion Torrent PGM had the highest throughput (80–100 Mb/h). Unlike the MiSeq, the Ion Torrent PGM and 454 GS Junior both produced homopolymer-associated indel errors (1.5 and 0.38 errors per 100 bases, respectively).
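Quick sanity arithmetic on the quoted throughputs (instrument-quoted run times differ in practice, since runs don't proceed at a constant rate):

# Back-of-envelope run times implied by the abstract's numbers.
runs = {"MiSeq": (1600, 60), "454 GS Junior": (70, 9)}  # (Mb/run, Mb/h)
for name, (mb_per_run, mb_per_h) in runs.items():
    print(f"{name}: ~{mb_per_run / mb_per_h:.0f} h per run")
# MiSeq: ~27 h per run; 454 GS Junior: ~8 h per run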


Thursday, 16 February 2012

How Identical are Identical Twins? | Read Through Transcription

How Identical are Identical Twins?
December 19, 2011 by Ramesh Hariharan
We're looking at exome sequencing data on whole peripheral blood DNA of monozygotic twins (this data was generated by our collaborators, Jan Dumanski and his group at Uppsala University in Sweden). Monozygotic twins were earlier thought to be genetically identical; now we know that isn't completely true. How does one identify small mutations (SNPs and small InDels) that are present in one of the twins but not in the other? Or, in general, how does one compare two different samples, for instance to find somatic mutations that are present in a tumor sample but not in the paired normal sample?
...
The 1000 Genomes Project estimated that a child has only around 50 new mutations relative to its parents. Monozygotic twins ought to be closer than that. And we are observing only the exomes (and some neighborhood) of these twins. So the real answer probably lies close to the bottom of the above table. However, as Jan Dumanski points out, much of the 1000 Genomes effort involved sequencing of oligoclonal/monoclonal lymphoblastoid cell lines, not quite directly comparable with whole peripheral blood.

http://blog.avadis-ngs.com/2011/12/how-identical-are-identical-twins/


Gosh, who knew calling SNPs on identical twins could be such a complicated task?
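For the naive first pass (which, as the post implies, is exactly what does not work well), comparing the raw call sets of two single-sample VCFs looks something like the sketch below. The file names are hypothetical, and a site present in only one file is merely a candidate: it may just be low coverage in the other twin rather than a real difference, which is why real pipelines go back to the pileups at discordant sites.

def load_calls(vcf_path):
    # Map (chrom, pos) -> (ref, alt) for every non-header VCF line.
    calls = {}
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            calls[(chrom, pos)] = (ref, alt)
    return calls

twin_a = load_calls("twinA.vcf")  # hypothetical file names
twin_b = load_calls("twinB.vcf")
only_a = {k: v for k, v in twin_a.items() if k not in twin_b}
only_b = {k: v for k, v in twin_b.items() if k not in twin_a}
print(len(only_a), "candidate sites unique to twin A,",
      len(only_b), "unique to twin B")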


Update:
Related links I found:
Different sides of the same coin; twins and epigenetics
http://blogs.dnalc.org/2011/09/23/different-sides-of-the-same-coin-twins-and-epigenetics/


Tuesday, 12 July 2011

A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly


Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest superstring problem, where the number of strings n ranges from thousands to billions. The length of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies, where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n²) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω = ℓ − |ovmax(x, y)| if and only if ω ≤ λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ − 1)(2 log n + log λ)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λn log n) worst-case time and requires O(λ) extra memory. The second one runs in O(λn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
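To make the definition concrete, here is a direct quadratic construction of the overlap edges; this is precisely the naive O(n²) approach whose cost the paper's compact data structure avoids, so it's only a toy illustration of the definitions above.

def ov_max(x, y):
    # Longest suffix of x that is also a prefix of y.
    for k in range(min(len(x), len(y)), 0, -1):
        if x[-k:] == y[:k]:
            return k
    return 0

def overlap_edges(reads, lam):
    # Edge (i, j) with weight w = len(reads[i]) - |ov_max| kept
    # when w <= lam, i.e. the reads overlap sufficiently.
    edges = []
    for i, x in enumerate(reads):
        for j, y in enumerate(reads):
            if i != j:
                w = len(x) - ov_max(x, y)
                if w <= lam:
                    edges.append((i, j, w))
    return edges

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(overlap_edges(reads, lam=2))
# [(0, 1, 2), (1, 2, 2)] -- e.g. "ACGTAC" overlaps "GTACGG" by 4 bases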

Saturday, 30 April 2011

Evaluation of next-generation sequencing software in mapping and assembly.


J Hum Genet. 2011 Apr 28;

Authors: Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per-run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advice on selecting suitable tools for specific biological applications. Journal of Human Genetics advance online publication, 28 April 2011; doi:10.1038/jhg.2011.43.

PMID: 21525877 [PubMed - as supplied by publisher]




Thursday, 14 April 2011

Tumour evolution inferred by single-cell sequencing

Nature 472, 7341 (2011). doi:10.1038/nature09807
Authors: Nicholas Navin, Jude Kendall, Jennifer Troge, Peter Andrews, Linda Rodgers, Jeanne McIndoo, Kerry Cook, Asya Stepansky, Dan Levy, Diane Esposito, Lakshmi Muthuswamy, Alex Krasnitz, W. Richard McCombie, James Hicks & Michael Wigler
Genomic analysis provides insights into the role of copy number variation in disease, but most methods are not designed to resolve mixed populations of cells. In tumours, where genetic heterogeneity is common, very important information may be lost that would be useful for reconstructing evolutionary history. Here we show that with flow-sorted nuclei, whole genome amplification and next generation sequencing we can accurately quantify genomic copy number within an individual nucleus. We apply single-nucleus sequencing to investigate tumour population structure and evolution in two human breast cancer cases. Analysis of 100 single cells from a polygenomic tumour revealed three distinct clonal subpopulations that probably represent sequential clonal expansions. Additional analysis of 100 single cells from a monogenomic primary tumour and its liver metastasis indicated that a single clonal expansion formed the primary tumour and seeded the metastasis. In both primary tumours, we also identified an unexpectedly abundant subpopulation of genetically diverse ‘pseudodiploid’ cells that do not travel to the metastatic site. In contrast to gradual models of tumour progression, our data indicate that tumours grow by punctuated clonal expansions with few persistent intermediates.

Thursday, 17 March 2011

Common numbers / statistics for Uniquely mapped reads?

Was asked if there are commonly reported numbers for
uniquely mapped reads (which are troublesome to define with bowtie)
vs
total mapped reads.

Not sure if the numbers differ across applications,
e.g.
WGS
exome reseq

or human vs other organisms.
Got this figure from a 2009 paper; not sure if anyone collates data like this:
http://bioinformatics.oxfordjournals.org/content/25/7/969.full.pdf
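For what it's worth, here's a rough sketch of how one could count the two numbers from a SAM file, with "uniquely mapped" approximated by a MAPQ cutoff. The cutoff is an aligner-specific heuristic (bowtie's MAPQ semantics differ from BWA's), which is exactly why the number is so troublesome to define.

import sys

def mapping_counts(sam_path, mapq_cutoff=10):
    # Count total, mapped, and "uniquely" mapped records in a SAM file.
    total = mapped = unique = 0
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):          # header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, mapq = int(fields[1]), int(fields[4])
            total += 1
            if flag & 4:                      # FLAG bit 0x4 = unmapped
                continue
            mapped += 1
            if mapq >= mapq_cutoff:           # heuristic for "unique"
                unique += 1
    return total, mapped, unique

total, mapped, unique = mapping_counts(sys.argv[1])
print(f"mapped {mapped}/{total}, 'unique' (MAPQ>=10) {unique}/{total}")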

Saturday, 12 March 2011

Quality control and preprocessing of metagenomic datasets



Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
Availability and Implementation: This open-source application was implemented in Perl and can be used as a stand-alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at http://prinseq.sourceforge.net/
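PRINSEQ itself is a Perl tool with many more options; purely to illustrate the kind of filtering it performs, here is a minimal pass that keeps FASTQ reads above a length and mean-quality cutoff (assuming Phred+33 quality encoding and hypothetical file names, not PRINSEQ's actual code):

def filter_fastq(in_path, out_path, min_len=50, min_mean_q=20):
    kept = seen = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            seen += 1
            seq, qual = record[1].strip(), record[3].strip()
            mean_q = sum(ord(c) - 33 for c in qual) / len(qual)
            if len(seq) >= min_len and mean_q >= min_mean_q:
                fout.writelines(record)
                kept += 1
    print(f"kept {kept} of {seen} reads")

filter_fastq("reads.fastq", "reads.filtered.fastq")  # hypothetical paths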

Improving Detection of Genome Structural Variation



Friday, 11 March 2011

A comparison of single molecule and amplification based sequencing of cancer transcriptomes.

PLoS One. 2011 Mar 1;6(3):e17305.

Sam LT, Lipson D, Raz T, Cao X, Thompson J, Milos PM, Robinson D, Chinnaiyan AM, Kumar-Sinha C, Maher CA.

Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, Michigan, United States of America.

The second wave of next generation sequencing technologies, referred to as single-molecule sequencing (SMS), carries the promise of profiling samples directly without employing polymerase chain reaction steps used by amplification-based sequencing (AS) methods. To examine the merits of both technologies, we examine mRNA sequencing results from single-molecule and amplification-based sequencing in a set of human cancer cell lines and tissues. We observe a characteristic coverage bias towards high abundance transcripts in amplification-based sequencing. A larger fraction of AS reads cover highly expressed genes, such as those associated with translational processes and housekeeping genes, resulting in relatively lower coverage of genes at low and mid-level abundance. In contrast, the coverage of high abundance transcripts plateaus off using SMS. Consequently, SMS is able to sequence lower-abundance transcripts more thoroughly, including some that are undetected by AS methods; however, these include many more mapping artifacts. A better understanding of the technical and analytical factors introducing platform-specific biases in high throughput transcriptome sequencing applications will be critical in cross-platform meta-analytic studies.


PMID: 21390249 [PubMed - in process]

Wednesday, 2 March 2011

Papers on Comparison of microRNA profiling platforms

Systematic Evaluation of Three microRNA Profiling Platforms: Microarray, Beads Array, and Quantitative Real-Time PCR Array

Background

A number of gene-profiling methodologies have been applied to microRNA research. The diversity of the platforms and analytical methods makes the comparison and integration of cross-platform microRNA profiling data challenging. In this study, we systematically analyze three representative microRNA profiling platforms: Locked Nucleic Acid (LNA) microarray, beads array, and TaqMan quantitative real-time PCR Low Density Array (TLDA).


Systematic comparison of microarray profiling, real-time PCR, and next-generation sequencing technologies for measuring differential microRNA expression

Abstract
RNA abundance and DNA copy number are routinely measured in high-throughput using microarray and next-generation sequencing (NGS) technologies, and the attributes of different platforms have been extensively analyzed. Recently, the application of both microarrays and NGS has expanded to include microRNAs (miRNAs), but the relative performance of these methods has not been rigorously characterized. We analyzed three biological samples across six miRNA microarray platforms and compared their hybridization performance. We examined the utility of these platforms, as well as NGS, for the detection of differentially expressed miRNAs. We then validated the results for 89 miRNAs by real-time RT-PCR and challenged the use of this assay as a “gold standard.” Finally, we implemented a novel method to evaluate false-positive and false-negative rates for all methods in the absence of a reference method.

Friday, 25 February 2011

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries

In GenomeWeb, February 24, 2011

Broad Team IDs, Improves PCR Amplification Bias in Illumina Sequencing Libraries

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries

Daniel Aird, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum and Andreas Gnirke
Genome Biology 2011, 12:R18 doi:10.1186/gb-2011-12-2-r18
Published: 21 February 2011
 
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by qPCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
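The GC fraction they stratify loci by (6% to 90%) is simple to compute; a minimal sketch for FASTA input, with a hypothetical file name and assuming plain A/C/G/T sequences:

def gc_fraction(seq):
    # Fraction of G or C bases in an uppercase sequence.
    gc = sum(1 for b in seq if b in "GC")
    return gc / len(seq) if seq else 0.0

def per_record_gc(fasta_path):
    # Yield (record name, GC fraction) for each FASTA record.
    name, chunks = None, []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, gc_fraction("".join(chunks))
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.strip().upper())
    if name is not None:
        yield name, gc_fraction("".join(chunks))

for name, gc in per_record_gc("loci.fasta"):  # hypothetical file
    print(f"{name}\t{gc:.2%}")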
