Thursday, 31 March 2011

samtools upgrade for exome & target region SNP calling

From the samtools help list: a new feature that's very helpful for exome / target-region NGS studies.

BTW, samtools is now able to compute read depth and call SNPs in regions specified by an input BED file. For both "samtools mpileup" (not pileup) and "bcftools view", you may provide the BED via the "-l" option. If the input has two numeric columns, it is parsed as a BED (region list); if it has one numeric column, it is parsed as a position list file, so the "-l" option is backward compatible with old versions. It is also possible to retrieve alignments overlapping a BED file with

samtools view -L in.bed in.bam

For mpileup, using "-l" to call SNPs in target/exome regions can be much faster than doing whole-genome calling and then filtering.
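
The "two numeric columns vs one" rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the email, not samtools' actual parser; the function name `parse_region_line` is made up here, and the coordinate handling assumes the usual conventions (BED is 0-based half-open, position lists are 1-based).

```python
def parse_region_line(line):
    """Return (chrom, start, end) as a 0-based half-open interval, or None.

    Hypothetical helper: mimics the rule "two numeric columns = BED region,
    one numeric column = single position" described in the email.
    """
    fields = line.split()
    if len(fields) < 2 or line.startswith("#"):
        return None                      # blank, comment, or too short
    chrom = fields[0]
    if not fields[1].isdigit():
        return None                      # e.g. a "track" header line
    if len(fields) >= 3 and fields[2].isdigit():
        # Two numeric columns: treat as a BED region (already 0-based, half-open).
        return chrom, int(fields[1]), int(fields[2])
    # One numeric column: treat as a 1-based position (assumed convention).
    pos = int(fields[1])
    return chrom, pos - 1, pos
```

With this sketch, "chr1 100 200" is read as a region and "chr1 150" as the single base at position 150.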


Convert SAM / BAM to fasta / fastq

Probably one of the most frequently asked questions

latest thread in biostar

Samtofastq using Picard

Samtools and awk to make fasta from sam
samtools view filename.bam | awk '{print ">"$1"\n"$10}' > filename.fasta
Biopython and pysam (code contributed by Brad Chapman) 
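
For those who prefer Python without extra libraries, here is a minimal sketch of the same SAM-to-FASTA idea. It is not Brad Chapman's Biopython/pysam code; the function name `sam_to_fasta` is made up for illustration. Unlike the awk one-liner, it skips header lines and reverse-complements reads aligned to the reverse strand (FLAG bit 0x10), since SAM stores those reads already reverse-complemented.

```python
# Translation table for complementing bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_fasta(sam_lines):
    """Yield FASTA records (as strings) from an iterable of SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment line
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if seq == "*":                    # sequence not stored
            continue
        if flag & 0x10:                   # read mapped to the reverse strand
            seq = seq.translate(COMPLEMENT)[::-1]
        yield f">{name}\n{seq}\n"

# Typical use (hypothetical script name sam2fasta.py):
#   samtools view filename.bam | python sam2fasta.py > filename.fasta
# with the script ending in:
#   import sys; sys.stdout.writelines(sam_to_fasta(sys.stdin))
```

Note that paired reads share a name in SAM, so both mates come out with the same FASTA header unless you add a suffix yourself.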

Tuesday, 22 March 2011

de novo assembly of Illumina CEO genome in 11.5 h - new version of Ray

Kevin: You can't ignore an email with that subject header... but 512 compute cores? Shall have a chat with my HPC vendor.
Also, I am waiting for the public release of Cortex.
Strange that courses that teach the software are available but the software ain't...

Velvet and Curtain seem promising for de novo assembly as well.

Ray 1.3.0 is now available online.

The most important change is the correction of a major bug that caused
a parallel infinite loop on the human genome.

This, along with concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?



   * Vertices with less than 1 of coverage are ignored during the
computation of seeds and during the computation of extensions.
   * Computation of library outer distances relies on the virtual
   * Expiry positions are used to toss away reads that are out-of-range
   * When only one choice is given during the extension and some reads
are in-range, then the sole choice is picked up.
   * Fixed a bug for empty reads.
   * A read is not added in the active set if it is marked on a
repeated vertex and its mate was not encountered yet.
   * Grouped messages in the extension of seeds.
   * Reads marked on repeated vertices are cached during the extension.
   * Paths are cached in the computation of fusions.
   * Fixed an infinite loop in the extension of seeds.
   * When fetching read markers for a vertex, send a list of mates to
meet if the vertex is repeated in order to reduce the communication.
   * Updated the Instruction Manual
   * Added a version of the logo without text.

I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.

Version 1.2.4 of Ray also incorporated new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:

Sébastien Boisvert

Cheat Sheets Galore - bioinformatics, biology, Linux, Perl, Python, R

started with Keith's post here

and a thread at

I have soooo many of them! *this is going to be a long post
R (pdf)
Hmmmm where did that go to?
AWK one liners
Sed examples
Linux common tasks

I have these too 
  • IUPAC ambiguity codes for nucleotides:
  • Amino acid single letter codes.

FAQ: SNPs missing when called with more samples

Using mpileup called with 2 different samples, I'm getting this
particular SNP, which has good coverage (DP=55), in only one
sample. This is good.

However, when mpileup was called with 10 samples, the SNP got lost. I'm
just trying to figure out whether the SNP got 'drowned out' by the other 9
samples, which don't have the SNP, and hence wasn't called. Is this how
mpileup works?

Excellent answer by Heng Li 

With more samples, you gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for one sample. If we call SNPs from 10 samples separately and then combine the calls, the FPR would be around 5% (not 10% because more SNPs are found given 10 samples). To retain a low FPR on singletons we have to be more stringent. Nonetheless, with more samples, we can usually get overall better calls than calling SNPs in each sample separately because information between samples is used more effectively.
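
To make the "around 5%, not 10%" arithmetic concrete, here is a toy calculation. Every number in it is made up for illustration (they are not from Heng Li's email), and it assumes false calls never coincide between samples while many true SNPs are shared.

```python
def combined_fpr(n_samples, calls_per_sample, fpr_single, distinct_true_snps):
    """False positive rate after merging per-sample call sets.

    Assumes (for illustration only) that false calls are all distinct across
    samples, so they simply add up, while true SNPs overlap between samples
    and are counted once in `distinct_true_snps`.
    """
    false_calls = n_samples * calls_per_sample * fpr_single
    return false_calls / (false_calls + distinct_true_snps)

# Hypothetical numbers: 100,000 calls per sample at 1% FPR, 10 samples.
# If the 10 samples together hold 190,000 distinct true SNPs:
shared_some = combined_fpr(10, 100_000, 0.01, 190_000)
# If every sample carried exactly the same 99,000 true SNPs:
shared_all = combined_fpr(10, 100_000, 0.01, 99_000)
print(round(shared_some, 3), round(shared_all, 3))
```

The false calls grow tenfold, but the pool of true SNPs also grows as samples are added, which is why the merged rate lands well below 10% in the first case and only approaches it when all samples share the same true SNPs.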

Thursday, 17 March 2011

Common numbers / statistics for Uniquely mapped reads?

I was asked whether there are commonly reported numbers for
  • uniquely mapped reads (which is troublesome to define with bowtie)
  • total mapped reads

I'm also not sure whether the numbers differ for
  • applications, e.g. exome reseq
  • human vs other organisms.

Got this figure from a 2009 paper. Not sure if anyone collates data like this.

Sunday, 13 March 2011

Using Galaxy for NGS sample submission and tracking for service providers

Over at the Blue Collar Bioinformatics blog:
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.

RNA-seq of a 12GB dataset in less than 7 hours

Folks at CLCbio have updated the CLC Genomics Machine benchmarks with newer datasets and the new hardware configuration. There are two RNA-Seq data sets and a full genome mapping data set, and more will come. All the benchmarks are now using the CLC Genomics Server rather than the Assembly Cell.

Script to filter to unique FASTQ reads using a bloom filter in front of a Python set

from the hackmap blog 

a simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tuneable by the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads and number of false positives in the bloom-filter sent to stderr.
usage looks like:

python [script] < in.fastq > out.unique.fastq
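
The idea can be sketched as follows. This is not the hackmap script: it is a two-pass toy version working on an in-memory list, the `BloomFilter` class and `unique_reads` function are written here for illustration, and it assumes "unique" means keeping the first copy of each sequence.

```python
import hashlib
import math

class BloomFilter:
    """Minimal bloom filter: no false negatives, tunable false positive rate."""

    def __init__(self, capacity, error_rate=0.01):
        capacity = max(1, capacity)
        # Standard sizing formulas for bit count and number of hash functions.
        self.num_bits = max(8, int(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if it was (possibly) already present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present

def unique_reads(records, error_rate=0.01):
    """records: list of (header, seq, qual) tuples; returns de-duplicated list."""
    bloom = BloomFilter(len(records), error_rate)
    # Pass 1: only sequences the bloom filter flags as "seen" go into the set.
    candidates = set()
    for _, seq, _ in records:
        if bloom.add(seq):
            candidates.add(seq)
    # Pass 2: unflagged reads are certainly unique; flagged ones get an exact check.
    emitted, out = set(), []
    for rec in records:
        seq = rec[1]
        if seq in candidates:
            if seq in emitted:
                continue
            emitted.add(seq)
        out.append(rec)
    return out
```

The memory saving comes from the set holding only flagged sequences (true duplicates plus a few bloom-filter false positives) rather than every read in the file.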

Updated overview of NGS and SMS platforms

Range of NGS Applications Rises Quickly

Advanced Technological Approach Generates Genomic Data Better, Faster, and Cheaper


PacBio applied its single molecule real-time (SMRT™) DNA sequencing technology to decode two samples from the recent Haitian outbreak and three other strains of V. cholerae and compared them to DNA sequence....

Key advantages of MiSeq are its fast turnaround time, ease of use, and simple sample prep, said David Bentley, Ph.D. Dr. Bentley, chief scientist at Illumina, envisions customers using the system for various types of applications: to check a small amount of sample before running it on HiSeq, to analyze large numbers of poor-quality DNA samples isolated from FFPE tissues, and to detect specific mutations in patient samples from clinical trial populations.


Life Technologies’ Ion Personal Genome Machine (PGM™) is based on Ion Torrent’s semiconductor sequencing chips that translate chemical signals into digital information. The 314 sequencing chip contains an array of 1.3 million wells; each is the site of an individual sequencing reaction. A pH change is detected when incorporation of a new base onto a growing DNA strand produces hydrogen ions.

Complete Genomics, which leverages its human genome sequencing capabilities through a service delivery platform, employs a sequencing method based on DNA nanoball (DNB™) arrays and combinatorial probe-anchor ligation read technology. The company optimized its sequencing technology specifically for the human genome and delivers to its customers annotated sequence data, identifying key sites of sequence variation.


Intelligent Bio-Systems’ three-step sequencing-by-synthesis technology involves amplifying DNA fragments, attaching them to a DNA sequence primer, and then immobilizing them in a high-density array on a glass chip. Fluorescently labeled bases (a different color for A, C, G, and T) are then introduced and attach to the growing DNA strand. The array is scanned and the fluorescent signal emitted by each replicating strand indicates which base was incorporated at the completion of each base addition step.

Saturday, 12 March 2011

RStudio IDE for R ... looks promising!

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.

What You’re Doing Is Rather Desperate

Notes from the life of a bioinformatics researcher

The Tea Transcriptome

Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from C. sinensis was analyzed at an unprecedented depth.
Shi C et al. (2011) Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds. BMC Genomics [Epub ahead of print].
The Tea Transcriptome is a post from: RNA-Seq Blog

Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons [RESOURCES]

Bacterial diversity among environmental samples is commonly assessed with PCR-amplified 16S rRNA gene (16S) sequences. Perceived diversity, however, can be influenced by sample preparation, primer selection, and formation of chimeric 16S amplification products. Chimeras are hybrid products between multiple parent sequences that can be falsely interpreted as novel organisms, thus inflating apparent diversity. We developed a new chimera detection tool called Chimera Slayer (CS). CS detects chimeras with greater sensitivity than previous methods, performs well on short sequences such as those produced by the 454 Life Sciences (Roche) Genome Sequencer, and can scale to large data sets. By benchmarking CS performance against sequences derived from a controlled DNA mixture of known organisms and a simulated chimera set, we provide insights into the factors that affect chimera formation such as sequence abundance, the extent of similarity between 16S genes, and PCR conditions. Chimeras were found to reproducibly form among independent amplifications and contributed to false perceptions of sample diversity and the false identification of novel taxa, with less-abundant species exhibiting chimera rates exceeding 70%. Shotgun metagenomic sequences of our mock community appear to be devoid of 16S chimeras, supporting a role for shotgun metagenomics in validating novel organisms discovered in targeted sequence surveys.

Tracking the astounding pace of digital storage

Ivan Smith maintains a page tracking the price of digital storage over the years. This is one of technology's least appreciated growth stories -- we hear a lot about Moore's Law and the doubling of processing capacity, but storage-density's growth makes the pace of processor improvements look glacial. Every now and then I realize that the 32GB SD card in my camera costs less than the 16k memory upgrade I put in my Apple ][+ in 1980, even without accounting for inflation, and I am croggled. Here are David Isenberg's benchmarks, calculated from Smith's records:
YEAR -- Price of a Gigabyte
1981 -- $300,000
1987 -- $50,000
1990 -- $10,000
1994 -- $1000
1997 -- $100
2000 -- $10
2004 -- $1
2010 -- $0.10
It would be interesting to do the same chart for a megabyte -- you'd go from six figures to fractional pennies in a damned short period.
Cost of Hard Drive Storage Space (via
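
As a quick back-of-the-envelope check on that pace, the table's end points give an average halving time for the price of a gigabyte. The snippet below is just arithmetic on the figures quoted above; the function name is made up here and the calculation ignores inflation and the unevenness between intermediate years.

```python
import math

# Price of a gigabyte in US dollars, as quoted in the table above.
PRICE_PER_GB = {
    1981: 300_000, 1987: 50_000, 1990: 10_000, 1994: 1_000,
    1997: 100, 2000: 10, 2004: 1, 2010: 0.10,
}

def halving_time_years(prices):
    """Average number of years for the price to halve, from first to last entry."""
    first, last = min(prices), max(prices)
    halvings = math.log2(prices[first] / prices[last])
    return (last - first) / halvings

print(f"{halving_time_years(PRICE_PER_GB):.2f} years per halving")
```

On these numbers the price per gigabyte halved on average a little more often than every year and a half.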

Quality control and preprocessing of metagenomic datasets

Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
Availability and Implementation: This open-source application was implemented in Perl and can be used as a stand alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at

OICR's Genomics Pathway-Sequencing Project Highlights Genomic Data's Steady March Into Clinics

The Ontario Institute for Cancer Research has kicked off a project that promises to address some key informatics challenges related to clinical sequencing.
Lincoln Stein, platform leader of informatics and biocomputing at the OICR, described the project in a presentation at the fourth Summit on Translational Bioinformatics in San Francisco this week.
Stein said that the project, conducted in partnership with Canada's Princess Margaret Hospital and dubbed the Genomics Pathway Sequencing project, or GPS, will sequence genes in normal and tumor samples excised from patients.

“The biggest challenge,” Stein said, “is trying to keep the amount of information in this report to a minimum without keeping potentially important information back [from physicians].”

For each individual, OICR researchers will sequence about 1,000 cancer-related genes in control and tumor samples and attempt to identify mutations that are of immediate relevance to the patient’s cancer care, or that are targeted by drugs that are currently being tested in trials.
The genes in the study — which include the usual suspects such as KRAS, p53, and B-Raf — were selected based on suggestions from oncologists in the Toronto area as well as data stored in knowledgebases at the Memorial Sloan Kettering Cancer Center, the Wellcome Trust Sanger Institute, and the National Cancer Institute.
To sequence patient samples, OICR will use Pacific Biosciences' single-molecule sequencing as a first step to discover novel mutations. It will then sequence known mutations on Sequenom's MassArray platform. The sequence variants will be confirmed using a Sanger sequencer housed at the clinical sequencing lab at Princess Margaret hospital.
Stein explained that the OICR selected the PacBio system because “it allows us to do very long reads at high coverage … currently 1,000-base-pair reads in a circular consensus, which gives high accuracy for targeted genes.”
In addition, he said, the platform has a rapid turnaround time, taking only “15 minutes per run to completely analyze … a single typical gene.”

Kevin: Fascinating! Would theirs become the de facto standard for clinical sequencing?

Would the closing of SRA affect you?

Short Read Archive Canned

This email, apparently from NCBI head-honcho David Lipman, was posted in the comments section at Tree of Life:
Dear Staff Members of NCBI,
As you are aware, the federal government as well as NIH is facing a period of budgetary uncertainty that is resulting in ongoing program reviews throughout the government.   At NCBI our senior staff have been giving serious consideration to our own projects and staffing levels in order to prepare for and adjust to new fiscal constraints.
NCBI had received a significant adjustment in its appropriated funding in the proposed FY2011 President’s Budget.  The President’s Budget, however, has not been enacted and we are being required to operate at last year’s (FY2010) level under a Continuing Resolution (CR) from Congress.  Upon the CR’s expiration on March 4, 2011, there is little likelihood the budget picture will improve.  The NIH Office of the Director has provided us with stop-gap funding to alleviate some of our FY2011 and FY2012 funding needs.

Wow, big news. And not just because the fact that NCBI is downsizing is a worry in terms of funding priorities – given the ongoing explosion in production of genomics data.
I, like many others, won’t necessarily be sorry to see the back of the Short Read Archive as it was a bit of a pain to upload to and a massive pain to retrieve from.
But the question is now, what will become the de facto place to find short-read data, or pointers to data? Certainly the SRA was useful – particularly for getting data to test bioinformatics applications against.

Kevin: Like others, I found the SRA a useful place to get demo data for testing software, but I haven't gone back to it lately (partly because my work takes me away from testing more stuff). I think this points to the fact that raw sequencing data might be too expensive for anyone to archive for long.

Statistics Meeting Focused on RNA-Seq Data Handling from RNA-Seq Blog

Joint Statistical Meeting July 30th-August 4th, 2011
Miami Beach, FL
Sharing Information Across Genes to Estimate Overdispersion in RNA-seq Data 
Steven Peder Lund, Iowa State University; Dan Nettleton, Iowa State University
Differential expression analysis for paired RNA-seq data 
Lisa M Chung, Department of Epidemiology and Public Health, Yale University ; John Ferguson, Department of Epidemiology and Public Health, Yale University ; Hongyu Zhao, Yale University
How to characterize dynamic Bayesian networks across multiple species from time series mRNA-Seq count gene expression profiles: An intelligent Dynamic Bayesian Networks (IDBNs) 
Sunghee Oh, Yale University; Hongyu Zhao, Yale University; James P. Noonan, Yale University
Statistical methods for the analysis of next-generation sequencing data 
Karthik Devarajan, Fox Chase Cancer Center
Yihui Zhou, Univ North Carolina; Fred Andrew Wright, Univ North Carolina
On Differential Gene Expression Using RNA-Seq Data 
Ju Hee Lee, University of Texas, MD Anderson Cancer Center; Yuan Ji, MD Anderson Cancer Centre – University of Texas; Shoudan Liang, University of Texas, MD Anderson Cancer Center; Guoshuai Cai, University of Texas, MD Anderson Cancer Center; Peter Mueller, MD Anderson Cancer Center
A Bayesian nonparametric method for differential expression analysis of RNA-seq data 
Yiyi Wang, Department of Statistics, Texas A&M University; David B. Dahl, Department of Statistics, Texas A&M University
Statistical strategy for eQTL mapping using RNA-seq data 
Wei Sun, University of North Carolina, Chapel Hill
Joint analyses of high-throughput DNA and RNA-seq data from cancer samples 
Su Yeon Kim, University of California, Berkeley; Terence Speed, University of California, Berkeley
Significance Analysis of time-series gene expression profiles: via differential/trajectory models in temporal mRNA-Seq data 
Sunghee Oh, Yale University; Hongyu Zhao, Yale University; James P. Noonan, Yale University
Model-Based Clustering for RNA-Seq Data 
Yaqing Si, Iowa State University; Peng Liu, Iowa State University
An integrative approach to comparing and normalizing gene expression data generated from RNA-seq, microarray, and RT-PCR technologies 
Zhaonan Sun, Department of Statistics, Purdue University; Yu Zhu, Department of Statistics, Purdue University
Normalization, testing, and false discovery rate estimation for RNA-sequencing data 
Jun Li, Department of Statistics, Stanford University; Daniela Witten, University of Washington; Iain M Johnstone, Stanford University; Robert Tibshirani, Dept of Health Research and Policy, & Statistics, Stanford University

JCVI Supports Human Microbiome Body Site Experts with Shotgun Data Analysis from JCVI Blog by Johannes Goll

The current survey comprises more than 700 samples from hundreds of individuals taken from up to 16 distinct body sites. Illumina sequencing has yielded more than 20 billion Illumina reads and annotation data produced from the sequences exceeds 10 terabytes. In anticipation of such data volumes, we have developed JCVI Metagenomics Reports (METAREP), an open source tool for high-performance comparative analysis, in 2010. The tool enables users to slice and dice data using a combination of taxonomic and functional/pathway signatures. To demonstrate how the tool can be used by body site experts, we picked and loaded sample data from 17 oral samples and presented a quick tutorial on how users can view, search, browse individual samples and compare multiple samples (see video). The functionality was very well received and body site experts asked JCVI to make all the 700+ samples available. As a result of the Jamboree, JCVI in agreement/collaboration with the HMP Data Analysis and Coordination Center and the rest of the HMP consortium, will soon set up a dedicated HMP METAREP instance that will allow body-site experts and eventually other users to analyze the DAWG data in a user-friendly way via the web.

Tip of the Week: World Tour of Genomics Resources from The OpenHelix Blog

Most weeks our tip is a five-minute movie that quickly introduces you to a new resource, or a cool new function at an established resource. Occasionally we feature one of our full resource tutorial that is being made freely available through resource sponsorship of our training suite. In this week’s tip we provide access to one of our tutorials that is especially near and dear to our heart. It is a World Tour of Genomics Resources in which we explore a variety of publicly-available biomedical, bioinformatics and bioscience databases and other resources.
This tutorial is quite different from our usual ones. Generally we focus on a specific software resource and describe step-by-step how to use its functions such as how to do basic and advanced searches, how to understand and modify displays, where to find specific types of data such as FASTA sequences, etc. and even provide tips on ‘hidden features’ that power users even find useful and informative.  This type of software training is absolutely critical.
But many people need an even earlier step: just the *awareness* that resources are available that might serve their needs. This tutorial fills that niche. We present a sampling of resources, all free to use, from each of 9 categories including: Analysis & Algorithms, Expression, Genome Browsers (for Eukaryotes and for Prokaryotes and Viruses), Genome Variation, Literature, Nucleotides, Pathways and Proteins. After the World Tour, which is the majority of the tutorial, we then describe how to use OpenHelix’s free search and learn portal to find bioscience resources most appropriate for your research needs. From this the tour transitions into a brief discussion of the format of our training materials and how to use them, and then ends with information about other learning resources that we provide.
This tutorial has been wildly popular whenever we’ve done it as a live seminar. At the NIH they actually had to lock the doors because we’d hit the capacity of the room, and people were turned away. In fact, it has been so popular that we decided to produce it as a full tutorial suite and release it as one of our free trainings so that anyone and everyone could learn about the breadth of great public software options available for free use.
In addition to this free tutorial, we also have published a paper entitled “OpenHelix: bioinformatics education outside of a different box” in a special issue of Briefings in Bioinformatics entitled “Special Issue: Education in Bioinformatics”. This paper describes a plethora of sources where researchers can access informal educational sources of learning on publicly available bioinformatics resources. The sources of information include a wide variety of formats including lists of resources, journals that regularly feature tool descriptions, and eLearning resources such as the MIT OpenCourseWare effort. If you know of other such resources that aren’t covered in our tour or paper, comment & let us know about them – we love to learn as much as we love to teach! :)
Quick link to World Tour of Genomics Resources tutorial here.
  • Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI:10.1093/bib/bbq026

A pipeline for RNA-seq data processing and quality assessment from Bioinformatics - current issue

Summary: We present an R based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.
Availability: The R package is available at with online documentation at, also available as supplementary material.

RNA-Seq Analysis Tools from the Broad Institute from RNA-Seq Blog

GenePattern offers a suite of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, and differential expression. The modules have been adapted from widely-used tools. GenePattern also provides pipelines that allow you to perform a number of multi-step RNA-seq analyses automatically.

Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection [RESEARCH]

We search the complete orangutan genome for regions where humans are more closely related to orangutans than to chimpanzees due to incomplete lineage sorting (ILS) in the ancestor of human and chimpanzees. The search uses our recently developed coalescent hidden Markov model (HMM) framework. We find ILS present in ~1% of the genome, and that the ancestral species of human and chimpanzees never experienced a severe population bottleneck. The existence of ILS is validated with simulations, site pattern analysis, and analysis of rare genomic events. The existence of ILS allows us to disentangle the time of isolation of humans and orangutans (the speciation time) from the genetic divergence time, and we find speciation to be as recent as 9–13 million years ago (Mya; contingent on the calibration point). The analyses provide further support for a recent speciation of human and chimpanzee at ~4 Mya and a diverse ancestor of human and chimpanzee with an effective population size of about 50,000 individuals. Posterior decoding infers ILS for each nucleotide in the genome, and we use this to deduce patterns of selection in the ancestral species. We demonstrate the effect of background selection in the common ancestor of humans and chimpanzees. In agreement with predictions from population genetics, ILS was found to be reduced in exons and gene-dense regions when we control for confounding factors such as GC content and recombination rate. Finally, we find the broad-scale recombination rate to be conserved through the complete ape phylogeny.

Improving Detection of Genome Structural Variation

Friday, 11 March 2011

A comparison of single molecule and amplification based sequencing of cancer transcriptomes.

1. PLoS One. 2011 Mar 1;6(3):e17305.

A comparison of single molecule and amplification based sequencing of cancer transcriptomes.
Sam LT, Lipson D, Raz T, Cao X, Thompson J, Milos PM, Robinson D, Chinnaiyan AM, 
Kumar-Sinha C, Maher CA.

Michigan Center for Translational Pathology, University of Michigan, Ann Arbor,
Michigan, United States of America.

The second wave of next generation sequencing technologies, referred to as
single-molecule sequencing (SMS), carries the promise of profiling samples
directly without employing polymerase chain reaction steps used by
amplification-based sequencing (AS) methods. To examine the merits of both
technologies, we examine mRNA sequencing results from single-molecule and
amplification-based sequencing in a set of human cancer cell lines and tissues.
We observe a characteristic coverage bias towards high abundance transcripts in
amplification-based sequencing. A larger fraction of AS reads cover highly
expressed genes, such as those associated with translational processes and
housekeeping genes, resulting in relatively lower coverage of genes at low and
mid-level abundance. In contrast, the coverage of high abundance transcripts
plateaus off using SMS. Consequently, SMS is able to sequence lower- abundance
transcripts more thoroughly, including some that are undetected by AS methods;
however, these include many more mapping artifacts. A better understanding of the
technical and analytical factors introducing platform specific biases in high
throughput transcriptome sequencing applications will be critical in cross
platform meta-analytic studies.

PMID: 21390249 [PubMed - in process]

Monday, 7 March 2011

Solution to unmappable NGS reads - as a web service!

UMARS: Un-MAppable Reads Solution.
Li SC, Chan WC, Lai CH, Tsai KW, Hsu CN, Jou YS, Chen HC, Chen CH, Lin WC.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1:S9.
PMID: 21342592 [PubMed - in process]

Kevin: Don't you just love programs that say what they do?


ABSTRACT

BACKGROUND: Un-MAppable Reads Solution (UMARS) is a user-friendly web service focusing on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads were first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads failed to be mapped back to the reference sequences. Such un-mappable reads are usually imputed to sequencing errors and discarded without further consideration.

METHODS: We are investigating possible biological relevance and possible sources of un-mappable reads. Therefore, we developed UMARS to scan for virus genomic fragments or exon-exon junctions of novel alternative splicing isoforms from un-mappable reads. For mapping un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. Then, we constructed UMARS pipeline as an automatic alignment interface.

RESULTS: By demonstrating the results of two UMARS alignment cases, we show the applicability of UMARS. We first showed that the expected EBV genomic fragments can be detected by UMARS. Second, we also detected exon-exon junctions from un-mappable reads. Further experimental validation also ensured the authenticity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via

CONCLUSIONS: In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.

Next-Generation Sequencing without a Reference: Interview w/ Frank You of U.C. Davis/USDA-ARS

Chanced upon this interview in the SOLiD community, but the thread went missing two days later. Oh well, here's the Google cache link
P.S. If there's a valid reason why it was taken down, please let me know and I will do the same.
Update: the page is up again

Recently we had the chance to talk with Frank You from the Department of Plant Sciences at U. C. Davis and the Genomics and Gene Discovery Research Unit of the USDA-ARS about his publication in BMC Genomics, Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence.*

You are discovering genome wide SNPs in plant species. What is the reason you want to discover these SNPs?

Genome-wide SNPs are important resources for marker-assisted selection in breeding and for constructing the high-density genetic maps required for map-based cloning and whole genome sequencing. In our study, we discovered genome-wide SNPs in Aegilops tauschii, the diploid ancestor of the wheat D genome, with a genome size of 4.02 Gb, of which 90% is repetitive sequences. We are using these SNPs to construct a high-density genetic map for wheat D genome sequencing.

Please briefly explain the challenges you faced, and the solution you came to, in order to do this in the absence of a reference.

We have no reference sequences available for Ae. tauschii SNP discovery. For short reads generated by next-generation sequencing platforms, especially SOLiD™ and Solexa, the major challenge is mapping errors, when short reads in one genotype are mapped to short reads in another genotype in highly repetitive, complex genomes. Our idea is to reduce the complexity of the genome.

It is assumed that most genes are in a single-copy dose in a genome, and sequences of duplicated genes are usually diverged to such an extent that most of their reads do not cluster together. Therefore, the read depth (number of reads of the same nucleotide position) mapped to coding sequences of known genes estimates the expected read depth of all single-copy sequences in a genome. Sequences showing greater read depth are assumed to be from duplicated or repeated sequences. To implement this rationale, shallow genome coverage by long Roche 454 sequences is used to identify genic sequences by homology search against gene databases. Multiple genome coverages of short SOLiD™ or Solexa sequences are then used to estimate the read depth of genic sequences in a population of SOLiD™ or Solexa reads. The estimate is in turn used to identify (annotate) the remaining single-copy Roche 454 reads.
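
The read-depth rationale above can be sketched in a few lines of Python. This is an illustration only, not the pipeline from the paper: the median as the estimate of expected depth, the `max_ratio` cutoff of 2.0, the function name and all the numbers are assumptions chosen for the example.

```python
from statistics import median


def annotate_single_copy(depth_by_seq, genic_ids, max_ratio=2.0):
    """Label sequences as 'single-copy' or 'repeat' from short-read depth.

    depth_by_seq: dict of sequence id -> mean short-read depth on that sequence
    genic_ids:    ids already identified as genic by homology search
    max_ratio:    hypothetical cutoff, as a multiple of the expected depth
    """
    # Depth over known genes estimates the depth of single-copy sequence.
    expected = median(depth_by_seq[s] for s in genic_ids)
    labels = {}
    for seq_id, depth in depth_by_seq.items():
        # Clearly deeper than expected: assume duplicated or repeated sequence.
        labels[seq_id] = "single-copy" if depth <= max_ratio * expected else "repeat"
    return expected, labels


# Made-up depths for three genic 454 reads and two unannotated ones.
depths = {"gene_a": 9.0, "gene_b": 11.0, "gene_c": 10.0, "read_x": 12.0, "read_y": 85.0}
expected, labels = annotate_single_copy(depths, ["gene_a", "gene_b", "gene_c"])
print(expected, labels["read_x"], labels["read_y"])
```

With these toy numbers the expected depth is 10, so a read at depth 12 is kept as single-copy while one at depth 85 is flagged as repeat.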

This combination of Roche 454 and SOLiD™ or Solexa platforms combines the long length of Roche 454 reads with the high coverage of the SOLiD™/Solexa sequencing platforms, thus reducing costs associated with the development of reference sequence. Short SOLiD™ or Solexa reads are mapped and aligned to the Roche 454 reads or contigs with short-read mapping tools. After the annotation of all sequences, SNPs are called and filtered.

An important part of your pipeline is the ability to call SNPs in repeat junctions. What is this, and why is it important?

Transposable elements (TE) make up large proportions of many eukaryotic genomes. For example, they represent ~35% of the rice genome, and ~90% of the hexaploid wheat genome, and significantly contribute to the size, organization and evolution of plant genomes. Repeat junctions (RJs) are created by insertions of TEs into each other, into genes, or into other DNA sequences. Previous studies showed that those repeat junctions are commonly unique and genome-specific. They can therefore be treated as single copy markers in the genome.

The genome specificity of TE junction-based markers makes them particularly useful for mapping of polyploid species including many important crops, such as wheat and cotton. Because repeat junctions are also abundant and randomly distributed along chromosomes, they have a great potential in development of genome-wide molecular markers for high-throughput mapping and diversity studies in large and complex genomes.

In this paper you used both base space and color space data. Did you find any challenges with mixing these data types?

We had difficulty using base space and color space data together in read mapping. No academic command-line-based programs for hybrid read mapping are currently available. Thus, this is still a challenge for hybrid data mapping. Instead, we can perform short read mapping separately for color-space and base-space data, and then merge the results in the pipeline.
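
One way to picture "map separately, then merge" is to combine per-position allele counts from the two mapping runs before calling SNPs. The Python sketch below assumes both runs were aligned against the same reference coordinates (for example the same 454 contigs) and that each run has already been reduced to allele counts per position. The function name, data layout and numbers are hypothetical, not taken from the paper.

```python
from collections import Counter


def merge_allele_counts(base_space, color_space):
    """Sum allele counts per (contig, position) across two mapping runs."""
    merged = {}
    for pos in set(base_space) | set(color_space):
        # A position seen in only one run contributes an empty Counter from the other.
        merged[pos] = Counter(base_space.get(pos, {})) + Counter(color_space.get(pos, {}))
    return merged


# Made-up counts from a base-space run and a color-space run.
base_space = {("contig1", 42): {"A": 3, "G": 4}, ("contig1", 90): {"C": 6}}
color_space = {("contig1", 42): {"A": 2, "G": 3}, ("contig2", 7): {"T": 5}}

merged = merge_allele_counts(base_space, color_space)
print(merged[("contig1", 42)])
```

Merging at the level of counts keeps each mapper working in its native encoding, and only the downstream SNP caller sees the combined evidence.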

Given the highly repetitive sequence you were working with, will this method work even better with less repetitive genomes?

Yes. I expect that the method proposed in this paper will work even better with less repetitive genomes.

*Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. You FM, Huo N, Deal KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OD. BMC Genomics 2011, 12:59 (25 January 2011).

Datanami, Woe be me