Monday, 5 September 2011

Excerpted: A Guide for Deep Sequencing of Human Genomes (from MassGenomics by Dan Koboldt)

A Guide for Deep Sequencing of Human Genomes

The incredible throughput of current second-generation sequencing platforms makes it possible to sequence a complete human genome to high coverage, in a single instrument run, in less than two weeks. As whole-genome sequencing becomes more routine, it is increasingly important to understand the accuracy of sequence-level analyses, such as SNP detection, and how that accuracy relates to overall sequence depth. Enter a recent study from the lab of Elliott Margulies at NHGRI. As part of the NIH Undiagnosed Diseases Program, the authors generated over 380 gigabases of sequence data from the blood sample of a male patient. This is an astonishing amount of sequence for one sample: roughly 126-fold theoretical redundancy genome-wide.

Perhaps just as importantly, the dataset comprised four runs on two different but related platforms: the Illumina GAIIx and the Illumina HiSeq 2000. Here is a brief summary of the dataset.

Dataset             Total Gbp   Map Rate   Dup. Rate   Mapped Depth   % Genome Callable
GAIIx (14 lanes)    118         95.3%      3.9%        34.2x          88.82%
HiSeq A (8 lanes)   122         94.0%      13.7%       32.7x          90.99%
HiSeq B (8 lanes)   144         92.6%      8.7%        40.4x          93.10%
All (30 lanes)      384         93.9%      13.6%       102x           95.88%
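
For readers who want to sanity-check these numbers, here is a rough Python sketch of the arithmetic. It is my own back-of-the-envelope illustration, not the authors' method. It assumes a haploid genome size of 3.0 Gbp (the post does not state the figure used), and the function names are made up for this example. Theoretical redundancy is just total bases divided by genome size; the second function discounts unmapped and duplicate reads to get a ballpark for mapped depth.

```python
# Assumed haploid human genome size in gigabases. This value is an
# assumption for illustration; the post does not say which size was used.
GENOME_GBP = 3.0

# (total Gbp, map rate, duplication rate), copied from the summary table.
DATASETS = {
    "GAIIx": (118, 0.953, 0.039),
    "HiSeq A": (122, 0.940, 0.137),
    "HiSeq B": (144, 0.926, 0.087),
}


def theoretical_depth(total_gbp, genome_gbp=GENOME_GBP):
    """Fold redundancy if every sequenced base landed uniquely on the genome."""
    return total_gbp / genome_gbp


def approx_usable_depth(total_gbp, map_rate, dup_rate, genome_gbp=GENOME_GBP):
    """Crude estimate: drop unmapped bases, then drop duplicates."""
    return total_gbp * map_rate * (1.0 - dup_rate) / genome_gbp


if __name__ == "__main__":
    print(f"380 Gbp total: {theoretical_depth(380):.1f}x theoretical")
    for name, (gbp, map_rate, dup_rate) in DATASETS.items():
        depth = approx_usable_depth(gbp, map_rate, dup_rate)
        print(f"{name}: ~{depth:.1f}x after mapping and duplicate removal")
```

With these assumptions the estimates land in the same ballpark as the 126-fold figure and the Mapped Depth column, but they will not match exactly, since the real pipeline's genome size and filtering steps are not spelled out here.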

With this impressive dataset in hand, the authors undertook a detailed examination of the technical aspects of sequence analysis (coverage uniformity, platform comparisons, genotyping accuracy, and so on) and sought to answer two questions:

  1. Given a specific amount of sequencing data, what fraction of the genome is "callable"?
  2. How many SNVs can be accurately identified?

The results, I think, will be critically important in the near future, as whole-genome sequencing becomes routine and widely accessible to investigators.
