Wednesday, 24 July 2013

Illumina produces 3k of 8500 bp reads on HiSeq using Moleculo Technology

Back in Jan 2013, Keith blogged about how super-long-read sequencing methods would be a threat to Illumina. Illumina can now openly acknowledge the shortcomings of its short reads for various applications like
  • assembly of complex genomes (polyploid, containing excessive long repeat regions, etc.), 
  • accurate transcript assembly, 
  • metagenomics of complex communities, 
  • and phasing of long haplotype blocks.

The reason?
This latest set of data released on BaseSpace:
Read length distribution of synthetic long reads for a D. melanogaster library
The data set, available as a single project in BaseSpace, can be accessed here.

With the integration of Moleculo, they have managed to generate ~30 Gb of raw sequence data. They have refrained from talking about the 'key analysis metrics' that are available in the PDF report. Perhaps it's much easier to let the blogosphere and data scientists dissect the new data themselves.

Am wondering when the 454 versus Illumina Long Reads side-by-side comparison will pop up


I can't find the 'key analysis metrics' in the PDF report files. Perhaps they are still being uploaded? *shrugs*
So please update me if you see them; otherwise I will just have to run something on the data myself.
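
A minimal sketch of what "running something on it" might look like: basic length metrics (read count, total bases, min/max/mean length, N50) computed straight from a FASTQ file. It assumes plain 4-line FASTQ records with no line-wrapped sequences; the function names are my own and not part of any Illumina tool, and these numbers are not necessarily the 'key analysis metrics' the report refers to.

```python
import gzip
import sys
from itertools import islice


def read_lengths(path):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Collect simple length statistics into a dict."""
    lengths = list(lengths)
    if not lengths:
        return {"reads": 0, "bases": 0, "min": 0, "max": 0, "mean": 0.0, "n50": 0}
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "min": min(lengths),
        "max": max(lengths),
        "mean": total / len(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Pass one or more FASTQ paths on the command line.
    for fastq_path in sys.argv[1:]:
        print(fastq_path, summarize(read_lengths(fastq_path)))
```

Run it with one or more FASTQ paths as arguments (gzipped or not) and it prints one summary per file.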

These are the files that I have now

total 512M
 259M Jul 18 01:01 mol-32-2832.fastq.gz
  44K Jul 24  2013 FastTrackLongReads_dmelanogaster_281c.pdf
 149K Jul 24  2013 mol-32-281c-scaffolds.txt
  44K Jul 24  2013 FastTrackLongReads_dmelanogaster_2832.pdf
 151K Jul 24  2013 mol-32-2832-scaffolds.txt
 253M Jul 24  2013 mol-32-281c.fastq.gz

6845fc3a4da9f93efc3a52f288e2d7a0  FastTrackLongReads_dmelanogaster_281c.pdf
02f5de4f7e15bbcd96ada6e78f659fdb  FastTrackLongReads_dmelanogaster_2832.pdf
586599bb7fca3c20ba82a82921e8ba3f  mol-32-281c-scaffolds.txt
b25010e9e5e13dc7befc43b5dff8c3d6  mol-32-281c.fastq.gz
6822cfbd3eb2a535a38a5022c1d3c336  mol-32-2832-scaffolds.txt
873f09080cdf59ed37b3676cddcbe26f  mol-32-2832.fastq.gz
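
If you download the same files, you can compare them against the MD5 sums above. The following is only a sketch: it assumes the checksum lines have been saved to a text file in the same `<digest>  <filename>` layout (the file name `checksums.md5` in the comment is hypothetical), and the helper names are my own.

```python
import hashlib
import os
import sys


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_listing(text):
    """Parse '<hex digest>  <filename>' lines into a {filename: digest} dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*'; drop it if present.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def check(listing_text, directory="."):
    """Return {filename: 'OK' | 'MISMATCH' | 'MISSING'} for every listed file."""
    results = {}
    for name, digest in parse_listing(listing_text).items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = "MISSING"
        elif md5sum(path) == digest:
            results[name] = "OK"
        else:
            results[name] = "MISMATCH"
    return results


if __name__ == "__main__":
    # Usage (hypothetical file name): python verify_md5.py checksums.md5
    for listing_path in sys.argv[1:]:
        with open(listing_path) as handle:
            for name, status in check(handle.read()).items():
                print(f"{status:8s} {name}")
```

The files are looked up in the current directory by default; pass a different `directory` to `check` if they live elsewhere.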

I have run FastQC (v0.10.1) on both samples; the images below are from 281c.
You can download the full HTML report here.

Reading about the Moleculo sample prep method, it seems like it's just a rather ingenious way to stitch barcoded short reads together into a single long contig. If that is the case, then I am not sure the base quality scores here are still meaningful, since each long read is really a mini-assembly. I presume this also removes any quantitative value from the number of reads, so accurate quantification of long RNA molecules or splice variants isn't possible. Nevertheless, it's an interesting development on the Illumina platform. Looking forward to seeing more news about it.

Other links

Illumina Long-Read Sequencing Service
Moleculo technology: synthetic long reads for genome phasing, de novo sequencing
CoreGenomics: Genome partitioning: my moleculo-esque idea
Moleculo and Haplotype Phasing - The Next Generation Technologist
Abstract: Production Of Long (1.5kb – 15.0kb), Accurate, DNA Sequencing Reads Using An Illumina HiSeq2000 To Support De Novo Assembly Of The Blue Catfish Genome (Plant and Animal Genome XXI Conference) (no info on this page though)
Illumina Announces Phasing Analysis Service for Human Whole-Genome Sequencing - MarketWatch
Patent information on the Long Read technology

1 comment:

  1. I'm intrigued by the long tail (head, rather) of short synthetic reads. Can you tell from the files, or from the read names, which synthetic reads came from the same pool? Then after mapping, one can find out whether these are the result of fragmented assemblies of each amplified long molecule. Even without the pooling info, mapping may shed some light on this...

