Friday, 25 February 2011

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries

In Genomeweb February 24, 2011

Broad Team IDs, Improves PCR Amplification Bias in Illumina Sequencing Libraries

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries

Daniel Aird, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum and Andreas Gnirke
Genome Biology 2011, 12:R18 doi:10.1186/gb-2011-12-2-r18
Published: 21 February 2011
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by qPCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
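Since the bias described above is driven by base composition, the GC fraction of each locus is the key covariate. A minimal sketch of how it can be computed (the function name is my own, not from the paper):

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G/C bases in a DNA sequence (case-insensitive)."""
    s = seq.upper()
    if not s:
        return 0.0
    return (s.count("G") + s.count("C")) / len(s)

# A locus at 50% GC, well inside the 6%-90% range traced in the paper:
print(gc_fraction("GGCCATAT"))  # 0.5
```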

Thursday, 24 February 2011

Maybe we have to sequence everybody! Every fish! BGI Cloud

Bio-IT world ran an interesting article with this quote

“The data are growing so fast, the biologists have no idea how to handle this data,” says Li. “I think the Cloud will be the solution. We have to sequence more and more data. Maybe we have to sequence everybody! Every fish! The data keep growing and we need a lot of compute power to process.”
For Chen, there are three priorities for BGI Cloud:
  • Connectivity: With partners across China and the world, “we’ve connected all the people and resources—the sequencers, the samples, the ideas, the compute power, and the storage together to make a greater contribution.”
  • Scalability: Calling the explosion in next-gen sequencing (NGS) a “data tsunami,” Chen says BGI aims to provide the parallel computing resources to help users manage and process these datasets. “If you can’t do the analysis, it’s pointless. We use distributed computing technology in the bioinformatics area. We’re confident we can solve the scalability problem.”
  • Reproducibility: Chen says bioinformatics researchers are happy to show their data and their pet program—SOAP, BWA, and so on. “That’s fine. But analysis is very complicated. The methodology he is actually using is a homemade pipeline. It’s very difficult to reproduce that result. We built this platform not only to solve the capability and connectivity of computing, we want to resolve the problems in reproducing designs and procedures.”
With new NGS gene assembly and SNP calling programs such as Hecate and Gaea about to be released (see, “In the Name of Gods”), Li says it was essential to develop a “run-time environment, a Web-based platform for Cloud storage and reference data, with a feature-rich GUI, and effective bioinformatics analysis software.”

Kevin: It would be interesting to see how Amazon and other cloud providers, together with Galaxy, will take to BGI's offering for producing reproducible data analysis (commercial software providers aside). Their offering also comes at a strange time, with NCBI discontinuing the SRA. Might the BGI Cloud fill the void the SRA leaves behind?
Everyone is trying to come up with a 'standard' workflow that everyone will adopt, but I feel the ecology of bioinformatics is such that there's always another 'better' way to tweak an analysis. 'Custom analysis' is a pet phrase of many bench biologists.
Every bioinformatician will know and remember their treasure trove of throwaway scripts that worked beautifully, but only once, for that one set of data.

RNA seq analysis workflow on Galaxy (Bristol workflow)

Dr David Matthews has posted a starter thread on the Galaxy mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with TopHat on Galaxy.

His post and the discussion thread are here.

I thought I'd write to start a discussion of a workflow for people doing RNA-seq that I have found very useful; it addresses some issues in mapping mRNA-derived paired-end RNA-seq data to the genome using tophat. Here is the approach I use (I have a human mRNA sample deep-sequenced with 56 bp paired-end reads on an Illumina instrument, generating 29 million reads):

Bristol Method
  1. Align to hg19 (in my case) using tophat, allowing up to 40 hits for each sequence read.
  2. In samtools, filter on "read is unmapped", "mate is mapped" and "mate is mapped in a proper pair".
  3. Use "group" to group the filtered SAM file on c1 (the read-name column) and set an operation to count on c1 as well. This gives a list of the reads and how many times each maps to the human genome. Because you have filtered the set for reads that have a mate pair, the count will be even for each read. For most reads it will be 2 (the forward read maps once and the reverse read maps once, in a proper pair), but for ambiguously mapping reads it will be a higher multiple of 2. Counting these up, I find that 18 million reads map once, 1.3 million map twice, 400,000 map 3 times, and so on, down to 1 read mapping 30 times, 1 read mapping 31 times, and so forth.
  4. Filter the list to remove any reads that map more than 2 times.
  5. Use "compare two datasets" against your new list of reads that map only twice to pull out all the corresponding reads in your SAM file (i.e. the mate pairs).
  6. Sort the SAM file before using it with other applications such as IGV.
What you end up with is a SAM file in which every read maps to one site only, and every read maps as part of a proper pair. This may seem similar to setting tophat to ignore non-unique reads, but it is not: this approach gives you 10-15% more reads. I think that is because if tophat finds, for example, that the forward read maps to one site but the reverse read maps to two sites, it throws away the whole read pair. By filtering the SAM file to keep only those mappings that make sense, you increase the number of unique reads by getting rid of irrational mappings.
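Steps 3-5 of the Bristol method (count how often each read name occurs among the properly paired alignments, then keep only names occurring exactly twice) can be sketched in plain Python over SAM text lines. This is my own illustration of the idea, not Dr Matthews' actual Galaxy workflow:

```python
from collections import Counter

def keep_unique_pairs(sam_lines):
    """Keep only alignments whose read name (SAM column 1) occurs exactly
    twice, i.e. one forward and one reverse mapping forming a unique pair.
    Assumes the input has already been filtered to properly paired reads
    (step 2 of the workflow). Header lines (starting with '@') are dropped."""
    records = [ln for ln in sam_lines if ln and not ln.startswith("@")]
    counts = Counter(ln.split("\t", 1)[0] for ln in records)
    return [ln for ln in records if counts[ln.split("\t", 1)[0]] == 2]

# Toy example: read1 maps uniquely as a pair, read2 maps ambiguously (4 hits).
sam = [
    "@HD\tVN:1.4",
    "read1\t99\tchr1\t100\t50\t56M\t=\t300\t256\tACGT\tIIII",
    "read1\t147\tchr1\t300\t50\t56M\t=\t100\t-256\tACGT\tIIII",
    "read2\t99\tchr1\t500\t3\t56M\t=\t700\t256\tACGT\tIIII",
    "read2\t147\tchr1\t700\t3\t56M\t=\t500\t-256\tACGT\tIIII",
    "read2\t355\tchr2\t900\t3\t56M\t=\t1100\t256\tACGT\tIIII",
    "read2\t403\tchr2\t1100\t3\t56M\t=\t900\t-256\tACGT\tIIII",
]
kept = keep_unique_pairs(sam)
print(len(kept))  # 2 (only the read1 pair survives)
```

Multi-mappers are not lost, of course: inverting the final filter (count != 2) gives the SAM file of non-unique mappings mentioned below.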

Has anyone else found this? Does this make sense to anyone else? Am I making a huge mistake somewhere?

A nice aspect of this (or at least I think so!) is that by filtering in this manner you can also create a sam file of non-unique mappings which you can monitor. This can be useful if one or more genes has a problem of generating a lot of non-unique maps which may give problems accurately estimating its expression. Also, you also get a list of how many multi hits you have in your data so you know the scale of the problem.

Best Wishes,

Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol

Sunday, 20 February 2011

Ion Torrent and DNAStar partnership

Ion Torrent to Provide Assembly and Analysis Solutions from DNAStar

Working with commercial software providers appears to be Life Tech's direction; earlier they announced a partnership with Partek. I think it is a good move, since coming up with in-house software that works is not enough: there have to be more reasons for choosing one platform over another. Unless Life Tech is moving into providing its own software solutions commercially, it is not cost-effective to keep a software development team working on them (unless, of course, you are trying to use free software to close the gap with competitors). NGS has spawned a demand for bioinformatics jobs and software to tackle the deluge of data. With new sequencers and new file formats, I think the scene still has more vibrant yet confusing times to come.

Friday, 18 February 2011

It's official: NCBI to discontinue SRA

What started as floating rumours at AGBT and in posts by various bloggers is now official: it's in the NCBI news.

Monday, 14 February 2011

How much data for Metagenomics?

Interesting read for bioinformaticians doing metagenomics (the title says it all).
Within it, it asks the very pertinent question of how much data is enough. However, the suggested data size of 100 Mbp is for Sanger sequencing. Any readers with comments on a suggested data size for NGS?

How Much Sequence Data?

A common question asked by researchers embarking on their first metagenomic analysis is how much sequence data they should request or allocate for their project. Unlike genome projects, metagenomes have no fixed end point, i.e., a completed genome. ... For example, if a dominant population represents 10% of the total community and 100 Mbp is obtained, then this population is expected to be represented by 10 Mbp, assuming completely random sampling of the community. If the average genome size of individuals in this population is 2 Mbp, then an average of 5x coverage of the composite population genome will be expected. To place this in perspective, 6x to 8x coverage of microbial isolates is a common target to obtain a draft genome suitable for finishing.
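The back-of-envelope calculation in the quote can be written out directly (a sketch of my own, using the quote's numbers):

```python
def expected_coverage(total_bp, population_fraction, avg_genome_size_bp):
    """Expected fold-coverage of a population's composite genome,
    assuming completely random sampling of the community."""
    return total_bp * population_fraction / avg_genome_size_bp

# 100 Mbp sequenced, dominant population at 10%, 2 Mbp average genome:
print(expected_coverage(100e6, 0.10, 2e6))  # 5.0
```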

Wednesday, 9 February 2011

1st feedback from Ion Torrent at AGBT

Definitely exciting! I wish I had been there instead of reading Twitter and blog reports on the event.
During a Life Tech conference workshop, Kevin McKernan, vice president of advanced research R&D, said that PGM customers can expect a 10-fold increase in output about every six months.
Since Life Tech launched the system in December, it has announced its first chip upgrade, from the Ion 314 Chip to the Ion 316 Chip. The upgrade promises to increase the output from 10 to 100 megabases per run and will be available to early-access customers this quarter, and more generally in the second quarter (IS 1/11/2011).
The best internal run today has yielded 300,000 reads 100 base pairs in length of quality Q17, according to Maneesh Jain, Ion Torrent's vice president of marketing and business development, who spoke during a separate Ion Torrent conference workshop. He said that the company plans to "address" RNA-seq as an additional application "later in the year."
The system's read length is currently about 100 base pairs. According to information provided during the Life Tech and Ion Torrent workshops, read length is expected to increase to 200 base pairs in the fourth quarter and to 400 base pairs in 2012.
McKernan said that in a single run to sequence the E. coli genome, the system provided "uniform genome coverage regardless of GC content." Coverage of human genes has also been "very even" and has included areas that were missed by both SOLiD and Illumina sequencing, he said.
The PGM's per-base accuracy also continues to improve. According to McKernan, based on 50-base reads, it was about 98.7 percent at the end of 2010 and has since improved to 99.6 percent. In the second quarter of this year, it is projected to improve further, he added.
The company has also improved the accuracy for homopolymer regions, he said, largely based on better software. While the per base accuracy for a stretch of four identical bases was 94 percent at the end of last year, it has increased to 98 percent for five identical bases today, and is expected to go up to 99 percent during the second half of this year.
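To put the per-base accuracy figures above in perspective, they translate into expected errors per read under a simple independence assumption (my own arithmetic, not from the workshop):

```python
def expected_errors_per_read(per_base_accuracy, read_length):
    """Expected number of erroneous bases in a read, assuming each base
    is called independently with the same per-base accuracy."""
    return read_length * (1.0 - per_base_accuracy)

# 50-base reads at the accuracies quoted above:
print(expected_errors_per_read(0.987, 50))  # ~0.65 errors/read (end of 2010)
print(expected_errors_per_read(0.996, 50))  # ~0.20 errors/read (since improved)
```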

Tuesday, 8 February 2011

Complete Genomics on the $10,000 Human genome

Read the complete article here

By Kevin Davies 
February 7, 2011 | MARCO ISLAND, FL – It is a testament to the remarkable progress in next-generation sequencing and analysis that when neurobiologist Tim Yu described the complete sequencing of 40 human genomes in a successful search for gene mutations that cause autism, it barely registered a ripple from the large audience. 

“We’re still the only company that’s published a 10^-5 error-rate [human] genome,” Reid says (an average of 1 error per 100,000 bases). He asserts that Illumina’s current system consumes $5,000 in reagents, and that the cost swells to $20-25,000 when the full cost of informatics and labor is included.
After claiming last year that CGI had cracked the $1,000 genome threshold for reagent costs, Reid now says that CGI’s all-in cost for a complete human genome is under $10,000. “With all of it added in, we’re below $10,000 now. We’ve got a 2-3X cost advantage [over Illumina], and a 10X quality advantage.”   
CGI currently charges $9,500 per genome for a minimum order of eight genomes. “You can’t pay $20,000 [per genome] any more, even if you try. We just send the money back!”   

Monday, 7 February 2011

Do you think NIH favors large labs that do translational research over smaller ones that don't?

Check out this web poll started by Genomeweb on the question
Do you think NIH favors large labs that do translational research over smaller ones that don't?

Sunday, 6 February 2011

Structural variation in the chicken genome identified by paired-end next-generation DNA sequencing of reduced representation libraries.

BMC Genomics. 2011 Feb 3;12(1):94. [Epub ahead of print]

Structural variation in the chicken genome identified by paired-end next-generation DNA sequencing of reduced representation libraries.


BACKGROUND: Variation within individual genomes ranges from single nucleotide polymorphisms (SNPs) to kilobase, and even megabase, sized structural variants (SVs), such as deletions, insertions, inversions, and more complex rearrangements. Although much is known about the extent of SVs in humans and mice, species in which they exert significant effects on phenotypes, very little is known about the extent of SVs in the 2.5-times smaller and less repetitive genome of the chicken.
RESULTS: We identified hundreds of shared and divergent SVs in four commercial chicken lines relative to the reference chicken genome. The majority of SVs were found in intronic and intergenic regions, and we also found SVs in the coding regions. To identify the SVs, we combined high-throughput short read paired-end sequencing of genomic reduced representation libraries (RRLs) of pooled samples from 25 individuals and computational mapping of DNA sequences from a reference genome.
CONCLUSION: We provide a first glimpse of the high abundance of small structural genomic variations in the chicken. Extrapolating our results, we estimate that there are thousands of rearrangements in the chicken genome, the majority of which are located in non-coding regions. We observed that structural variation contributes to genetic differentiation among current domesticated chicken breeds and the Red Jungle Fowl. We expect that, because of their high abundance, SVs might explain phenotypic differences and play a role in the evolution of the chicken genome. Finally, our study exemplifies an efficient and cost-effective approach for identifying structural variation in sequenced genomes.
PMID: 21291514 [PubMed - as supplied by publisher]

Tuesday, 1 February 2011

Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format.

Bioinformatics. 2011 Jan 28. [Epub ahead of print]

Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format.

Laboratory of Population Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.


SUMMARY: Bambino is a variant detector and graphical alignment viewer for next-generation sequencing data in the SAM/BAM format, which is capable of pooling data from multiple source files. The variant detector takes advantage of SAM-specific annotations, and produces detailed output suitable for genotyping and identification of somatic mutations. The assembly viewer can display reads in the context of either a user-provided or automatically generated reference sequence, retrieve genome annotation features from a UCSC genome annotation database, display histograms of non-reference allele frequencies, and predict protein coding changes caused by SNPs.
AVAILABILITY: Bambino is written in platform-independent Java and available from, along with documentation and example data. Bambino may be launched online via Java Web Start or downloaded and run locally.
