The incredible throughput of current second-generation sequencing platforms makes it possible to sequence a complete human genome to high coverage, with a single instrument run, in less than 2 weeks. As whole-genome sequencing becomes more routine, it is increasingly important to understand the accuracy of sequence-level analyses, such as SNP detection, and their relationship to overall sequence depth. Enter a recent study from the lab of Elliott Margulies at NHGRI. As part of the NIH Undiagnosed Diseases Program, the authors generated over 380 gigabases of sequence data from the blood sample of a male patient. This is an astonishing amount of sequence for one sample: roughly 126-fold theoretical redundancy genome-wide.
Perhaps just as importantly, the dataset comprised four runs on two different but related platforms: the Illumina GAIIx, and the Illumina HiSeq2000. Here is a brief summary of the dataset.
| Dataset | Total Gbp | Map Rate | Dup. Rate | Mapped Depth | % Genome Callable |
|---|---|---|---|---|---|
| GAIIx (14 lanes) | 118 | 95.3% | 3.9% | 34.2x | 88.82% |
| HiSeq A (8 lanes) | 122 | 94.0% | 13.7% | 32.7x | 90.99% |
| HiSeq B (8 lanes) | 144 | 92.6% | 8.7% | 40.4x | 93.10% |
| All (30 lanes) | 384 | 93.9% | 13.6% | 102x | 95.88% |
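As a back-of-envelope sanity check on these numbers, theoretical redundancy is just total bases divided by genome size, and mapped depth is roughly total bases scaled by the map rate and non-duplicate fraction. A minimal sketch (my own arithmetic, not code from the paper; the ~3.05 Gbp genome size is an assumption):

```python
GENOME_SIZE_GBP = 3.05  # assumed haploid human genome size, in Gbp

def mapped_depth(total_gbp, map_rate, dup_rate, genome_gbp=GENOME_SIZE_GBP):
    """Approximate mapped, non-duplicate depth of coverage."""
    return total_gbp * map_rate * (1 - dup_rate) / genome_gbp

# Full dataset: ~384 Gbp total, 93.9% mapped, 13.6% duplicates
print(round(384 / GENOME_SIZE_GBP))                # → 126 (theoretical redundancy)
print(round(mapped_depth(384, 0.939, 0.136), 1))   # → 102.1 (close to the 102x reported)
```

The per-run rows won't reproduce exactly this way (alignment and duplicate-marking details matter), but the aggregate numbers line up well.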
With this impressive dataset in hand, the authors undertook a detailed examination of the technical aspects of sequence analysis (coverage uniformity, platform comparisons, genotyping accuracy, etc.) and sought to answer two questions:
- Given a specific amount of sequencing data, what fraction of the genome is "callable"?
- How many SNVs can be accurately identified?
The results, I think, will be critically important in the near future, as whole-genome sequencing becomes routine and widely accessible to investigators.