Kevin's GATTACA World: Illumina


Monday, 14 March 2016

Elephants are resistant to cancer

Whole-genome sequencing of 644 elephant tissue samples using the HiSeq 2500 System identified multiple copies of TP53. Compared to human cells, elephant cells demonstrated increased apoptotic response following DNA damage, which could account for the low incidence of cancer (4.81%) in elephant populations.

Related Links
What elephants can teach scientists about fighting cancer in humans http://www.latimes.com/science/sciencenow/la-sci-sn-elephant-cancer-story-20151007-story.html

How elephants avoid cancer http://www.nature.com/news/how-elephants-avoid-cancer-1.18534

Potential Mechanisms for Cancer Resistance in Elephants and Comparative Cellular Response to DNA Damage in Humans Journal of the American Medical Association, DOI: 10.1001/jama.2015.13134
http://jama.jamanetwork.com/article.aspx?articleid=2456041

Wednesday, 24 July 2013

Illumina produces 3k of 8500 bp reads on HiSeq using Moleculo Technology

Keith blogged in Jan 2013 about how super-long-read sequencing methods would be a threat to Illumina. Illumina can now openly acknowledge the shortcomings of its short reads for applications like

assembly of complex genomes (polyploid, containing excessive long repeat regions, etc.),
accurate transcript assembly,
metagenomics of complex communities,
and phasing of long haplotype blocks.

The reason?
This latest set of data released on BaseSpace.

Read length distribution of synthetic long reads for a D. melanogaster library

The data set, available as a single project in BaseSpace, can be accessed here.

image source: http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

With the integration of Moleculo they have managed to generate ~30 gb of raw sequence data. They have refrained from talking about the 'key analysis metrics' that are available in the pdf report. Perhaps it's much easier to let the blogosphere and data scientists dissect the new data themselves.

I am wondering when the 454 versus Illumina Long Reads side-by-side comparison will pop up.

UPDATE:

Can't find the 'key analysis metrics' in the pdf report files. Perhaps they're still being uploaded? *shrugs*
So please update me if you see them; otherwise I'll just have to run something on the data myself.

These are the files that I have now

total 512M
259M Jul 18 01:01 mol-32-2832.fastq.gz
44K Jul 24 2013 FastTrackLongReads_dmelanogaster_281c.pdf
149K Jul 24 2013 mol-32-281c-scaffolds.txt
44K Jul 24 2013 FastTrackLongReads_dmelanogaster_2832.pdf
151K Jul 24 2013 mol-32-2832-scaffolds.txt
253M Jul 24 2013 mol-32-281c.fastq.gz

md5sums
6845fc3a4da9f93efc3a52f288e2d7a0 FastTrackLongReads_dmelanogaster_281c.pdf
02f5de4f7e15bbcd96ada6e78f659fdb FastTrackLongReads_dmelanogaster_2832.pdf
586599bb7fca3c20ba82a82921e8ba3f mol-32-281c-scaffolds.txt
b25010e9e5e13dc7befc43b5dff8c3d6 mol-32-281c.fastq.gz
6822cfbd3eb2a535a38a5022c1d3c336 mol-32-2832-scaffolds.txt
873f09080cdf59ed37b3676cddcbe26f mol-32-2832.fastq.gz
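
If you have downloaded the same files, a quick way to check them against the md5sums above is a few lines of Python. This is only a sketch: the checksums are copied from the listing above, while the function names and the assumption that the files sit in the current directory are mine.

```python
import hashlib
from pathlib import Path

# Checksums copied from the md5sums listing in this post.
EXPECTED_MD5 = {
    "FastTrackLongReads_dmelanogaster_281c.pdf": "6845fc3a4da9f93efc3a52f288e2d7a0",
    "FastTrackLongReads_dmelanogaster_2832.pdf": "02f5de4f7e15bbcd96ada6e78f659fdb",
    "mol-32-281c-scaffolds.txt": "586599bb7fca3c20ba82a82921e8ba3f",
    "mol-32-281c.fastq.gz": "b25010e9e5e13dc7befc43b5dff8c3d6",
    "mol-32-2832-scaffolds.txt": "6822cfbd3eb2a535a38a5022c1d3c336",
    "mol-32-2832.fastq.gz": "873f09080cdf59ed37b3676cddcbe26f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected filename to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumes the files were saved to the current working directory.
    for name, status in verify_downloads(".").items():
        print(f"{status:8s} {name}")
```

Reading in chunks keeps memory use flat, which matters for the ~250M fastq.gz files.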

I have run FastQC (FastQC v0.10.1) on both samples; the images below are from 281c.
You can download the full HTML reports here:
https://www.dropbox.com/sh/5unu3zba9u21ywj/JT4HdkzfOP/mol-32-281c_fastqc.zip
https://www.dropbox.com/s/mpxa5wx51iqmiz3/mol-32-2832_fastqc.zip
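
For a rough look at the read length distribution without FastQC, something like the sketch below should do. It is not what FastQC does internally; it assumes a plain four-lines-per-record gzipped FASTQ, and the function names and the 1000 bp bin size are my own choices.

```python
import gzip
from collections import Counter


def read_lengths(fastq_gz_path):
    """Yield the sequence length of every record in a gzipped FASTQ file.

    Assumes four lines per record (sequences are not wrapped).
    """
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield len(line.rstrip("\r\n"))


def length_summary(lengths, bin_size=1000):
    """Summarise read lengths and bin them into bin_size-wide buckets."""
    lengths = sorted(lengths)
    if not lengths:
        return {"reads": 0}
    # Each histogram key is the lower edge of its bin.
    histogram = Counter((n // bin_size) * bin_size for n in lengths)
    return {
        "reads": len(lengths),
        "total_bases": sum(lengths),
        "min": lengths[0],
        "max": lengths[-1],
        "median": lengths[len(lengths) // 2],  # upper median for even counts
        "histogram": dict(sorted(histogram.items())),
    }


if __name__ == "__main__":
    import sys

    # Usage: python read_lengths.py mol-32-281c.fastq.gz
    summary = length_summary(read_lengths(sys.argv[1]))
    for key, value in summary.items():
        print(key, value)
```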

Reading about the Moleculo sample prep method, it seems like it's just a rather ingenious way to stitch barcoded short reads together into a single long contig. If that is the case, then I am not sure the base quality scores here are meaningful anymore, since each long read is a mini-assembly. I presume this also takes away any quantitative value from the number of reads, so accurate quantification of long RNA molecules or splice variants isn't possible. Nevertheless it's an interesting development on the Illumina platform. Looking forward to seeing more news about it.

Tuesday, 6 December 2011

Complete Khoisan and Bantu genomes from southern Africa : Article : Nature

http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html

Just attended a very good lecture by Stephan Schuster, entitled "African Genomes: Charting Human Diversity"

He offered unbiased views / charts on the platform differences between 454, GAIIx, HiSeq, SOLiD for NGS sequencing coverage (which I think I should not repeat here). It points to the need to do sequencing on 2 different platforms to get a more accurate SNP list.

He also gave compelling reasons for sequencing the human genome to 20x coverage on 454 in order to complete it (457 gaps in hg19).

Yes, I often forget that the media / lay person thinks that the human genome is 'complete'. It's an often-ignored 'secret' that it actually isn't. Maybe the next marketing ploy (I mean strategy) for emerging sequencing platforms would be to be THE ONE that actually finishes the human genome.

Friday, 4 March 2011

Guide/tutorial for the analysis of RNA-seq data

link in seqanswers

Excellent starting point for those confused about the RNA-seq data analysis procedure.

Hello,

I've written a guide to the analysis of RNA-seq data, for the purpose of differential expression analysis. It currently lives on our internal wiki that can't be viewed outside of our division, although printouts have been used at workshops. It is by no means perfect and very much a work in progress, but a number of people have found it helpful, so I thought it would be useful to have it somewhere more publicly accessible.

I've attached a pdf version of the guide, although really what I was hoping was that someone here could suggest somewhere where it could be publicly hosted as a wiki. This area is so multifaceted and fast-moving that the only way such a guide can remain useful is if it can be constantly extended and updated.

If anyone has any suggestions about potential hosting, they can contact me at myoung @wehi.edu.au

Cheers

Matt

Update: I've put a few extra things on our local Wiki and seeing as people here seem to be finding this useful I thought I'd post an updated version. I'm also an author on a review paper on Differential Expression using RNA-seq which people who find the guide useful, might also find relevant...

RNA-seq Review

Saturday, 15 January 2011

DNAvision offers Human WGS for 7,500 euros

I had imagined exome sequencing would still have a good run for the next 2-3 years, but seeing how commercial service providers are throwing caution to the wind and offering whole genome sequencing at ever-decreasing costs, I think many will soon switch to WGS instead. Exome sequencing kits will have a lot of catching up to do in terms of price and useful data if they are to keep up with quickly plummeting WGS prices.
DNAVision to Offer $10K Human Genome Sequencing Services; Purchases Four SOLiDs
Landed: First Illumina HiSeq Machines Advertised (By Nick Loman on February 10, 2010)

Tuesday, 12 October 2010

Human Whole genome sequencing at 11x coverage

http://genomebiology.com/2010/11/9/R91

Just saw this paper, Sequencing and analysis of an Irish human genome. AFAIK WGS is usually done at 30x coverage. In this paper, the authors “describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information.” I thought it was pretty good considering that they had 99.3% of the reference genome covered at 10.6x coverage. That leaves only about 21 Mbases missing.

For those interested in the tech details

Four single-end and five paired-end DNA libraries were generated and sequenced using a GAII Illumina Genome Analyzer. The read lengths of the single-end libraries were 36, 42, 45 and 100 bp and those of the paired end were 36, 40, 76, and 80 bp, with the span sizes of the paired-end libraries ranging from 300 to 550 bp (± 35 bp). In total, 32.9 gigabases of sequence were generated (Table 1). Ninety-one percent of the reads mapped to a unique position in the reference genome (build 36.1) and in total 99.3% of the bases in the reference genome were covered by at least one read, resulting in an average 10.6-fold coverage of the genome.
...
At 11-fold genome coverage, approximately 99.3% of the reference genome was covered and more than 3 million SNPs were detected, of which 13% were novel and may include specific markers of Irish ancestry.
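
As a back-of-envelope check on the ~21 Mbases figure, the numbers quoted above can be plugged in directly. This is only a sketch: it backs out the genome size from total bases divided by mean coverage, which ignores unmapped reads, and the function names are mine.

```python
def implied_genome_size(total_bases, mean_coverage):
    """Genome size implied by: mean coverage = total bases / genome size."""
    return total_bases / mean_coverage


def uncovered_bases(genome_size, fraction_covered):
    """Number of reference bases not covered by any read."""
    return (1.0 - fraction_covered) * genome_size


# Figures quoted from the paper excerpt.
TOTAL_BASES = 32.9e9       # 32.9 gigabases of sequence
MEAN_COVERAGE = 10.6       # average 10.6-fold coverage
FRACTION_COVERED = 0.993   # 99.3% of reference bases covered

genome_size = implied_genome_size(TOTAL_BASES, MEAN_COVERAGE)
missing = uncovered_bases(genome_size, FRACTION_COVERED)

print(f"implied genome size: {genome_size / 1e9:.2f} Gb")
print(f"uncovered bases:     {missing / 1e6:.1f} Mb")
```

That works out to roughly 3.1 Gb for the genome and a little under 22 Mb uncovered, in line with the estimate above.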

SRMA: tool for Improved variant discovery through local re-alignment of short-read next-generation sequencing data

Have a look at this tool http://genomebiology.com/2010/11/10/R99/abstract
It is a realigner for NGS reads that doesn't use a lot of RAM. Not too sure how it compares to GATK's local realignment around indels, as that is not mentioned, but the authors used reads that were aligned with the popular BWA or BFAST as input. (Bowtie was left out though.)

Excerpted

SRMA was able to improve the ultimate variant calling using a variety of measures on the simulated data from two different popular aligners (BWA and BFAST). These aligners were selected based on their sensitivity to insertions and deletions (BFAST and BWA), since a property of SRMA is that it produces a better consensus around indel positions. The initial alignments from BFAST allow local SRMA re-alignment using the original color sequence and qualities to be assessed as BFAST retains this color space information. This further reduces the bias towards calling the reference allele at SNP positions in ABI SOLiD data, and reduces the false discovery rate of new variants. Thus, local re-alignment is a powerful approach to improving genomic sequencing with next generation sequencing technologies. The alignments to the reference genome were implicitly split into 1Mb regions and processed in parallel on a large computer cluster; the re-alignments from each region were then merged in a hierarchical fashion. This allows for the utilization of multi-core computers, with one re-alignment per core, as well as parallelization across a computer cluster or a cloud. The average peak memory utilization per process was 876Mb (on a single-core), with a maximum peak memory utilization of 1.25GB. On average, each 1Mb region required approximately 2.58 minutes to complete, requiring approximately 86.17 hours total running time for the whole U87MG genome. SRMA also supports re-alignment within user-specified regions for efficiency, so that only regions of interest need to be re-aligned. This is particularly useful for exome-sequencing or targeted re-sequencing data.
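
To picture the "split into 1Mb regions" step from the excerpt, here is a minimal sketch of how a reference could be cut into fixed-size windows that are then handed to separate jobs. This is not SRMA's code; the function names, the 1-based inclusive coordinates and the toy chromosome lengths are all assumptions of mine.

```python
def split_into_regions(chrom_lengths, region_size=1_000_000):
    """Yield (chrom, start, end) windows covering each chromosome.

    Coordinates are 1-based and inclusive; the last window of a
    chromosome is shortened to end at the chromosome's final base.
    """
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            yield chrom, start, end


def region_string(chrom, start, end):
    """Format a window in the common chrom:start-end style."""
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Hypothetical chromosome lengths, for illustration only.
    toy_lengths = {"chrA": 2_500_000, "chrB": 900_000}
    for region in split_into_regions(toy_lengths):
        print(region_string(*region))
```

Each window could then be re-aligned as an independent job and the results merged afterwards, as the excerpt describes.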

Saturday, 5 June 2010

Illumina Cuts Price of Personal Genome Sequencing Service by At Least 60 Percent

For individuals, the new price will be $19,500, while groups of five or more participants using the same ordering physician will pay $14,500 per person. In addition, individuals with serious medical conditions for whom whole-genome sequencing could be of clinical value will pay $9,500 to have their genome sequenced. Read full article

Sunday, 30 May 2010

Cofactor genomics on the different NGS platforms

Original post here

They are a commercial company that offers NGS on ABI and Illumina platforms, and since this is on their company page I guess it's their official stand on what rocks on each platform.

Excerpted.

Applied Biosystems SOLiD 3

The Applied Biosystems SOLiD 3 has the shortest but also the highest quantity of reads. The SOLiD produces up to 240 million 50bp reads per slide per end. As with the Illumina, Mate-Pairs produce double the output by duplicating the read length on each end, and the SOLiD supports a variety of insert lengths like the 454. The SOLiD can also run 2 slides at once to again double the output. SOLiD has the lowest *raw* base qualities but the highest processed base qualities when using a reference due to its 2-base encoding. Because of the number of reads and more advanced library types, we recommend the SOLiD for all RNA and bisulfite sequencing projects.

Solexa/Illumina

The Solexa/Illumina generates shorter reads at 36-75bp but produces up to 160 million reads per run. All reads are of similar length. The Illumina has the highest *raw* quality scores and its errors are mostly base substitutions. Paired-end reads with ~200 bp inserts are possible with high efficiency and double the output of the machine by duplicating the read length on each end. Paired-end Illumina reads are suitable for de novo assemblies, especially in combination with 454. The large number of reads makes the Illumina appropriate for de novo transcriptome studies with simultaneous discovery and quantification of RNAs at qRT-PCR accuracy.

Roche/454 FLX

The Roche/454 FLX with Titanium chemistry generates the longest reads (350-500bp) and the most contiguous assemblies, can phase SNPs or other features into blocks, and has the shortest run times. However, 454 also produces the fewest total reads (~1 million) at the highest cost per base. Read lengths are variable. Errors occur mostly at the ends of long same-nucleotide stretches. Libraries can be constructed with many insert sizes (8kb - 20kb) but at half of the read length for each end and with low efficiency.
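
To put the read counts and read lengths quoted above on the same scale, the raw yield is just reads times read length. A small sketch using the upper-end figures from the excerpt (the dictionary labels and function name are mine, and paired-end or second-slide doubling is left out):

```python
# (reads, read length in bp), upper-end figures from the excerpt above.
PLATFORMS = {
    "SOLiD 3 (per slide, per end)": (240e6, 50),
    "Solexa/Illumina (per run, 75 bp)": (160e6, 75),
    "Roche/454 FLX Titanium (500 bp)": (1e6, 500),
}


def yield_gb(reads, read_length_bp):
    """Raw yield in gigabases: number of reads times read length."""
    return reads * read_length_bp / 1e9


for name, (reads, length) in PLATFORMS.items():
    print(f"{name:36s} {yield_gb(reads, length):6.1f} Gb")
```

On these numbers SOLiD and Illumina both land around 12 Gb, against about 0.5 Gb for 454.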

Friday, 28 May 2010

Illumina: an alternative to 454 in metagenomics?

Check out this BMC Bioinformatics paper entitled "Short clones or long clones? A simulation study on the use of paired reads in metagenomics"

"This paper addresses the problem of taxonomical analysis of paired reads. We describe a new feature of our metagenome analysis software MEGAN that allows one to process sequencing reads in pairs and makes assignments of such reads based on the combined bit scores of their matches to reference sequences. Using this new software in a simulation study, we investigate the use of Illumina paired-sequencing in taxonomical analysis and compare the performance of single reads, short clones and long clones. In addition, we also compare against simulated Roche-454 sequencing runs."

"Our study suggests that a higher percentage of Illumina paired reads than of Roche-454 single reads are correctly assigned to species."

"The gain of long-clone data (75 bp paired reads) over long single-read data (250 bp reads) is still significant at ≈ 4% (not shown)."

of course more importantly
"The authors declare that they have no competing interests."

I am not sure if such a program exists, but I wonder if there is an aligner that takes into account the insert size between mate pairs and paired ends. Theoretically it should improve mapping, but by how much is unknown.

Wednesday, 26 May 2010

A scientific spectator's guide to next-generation sequencing

ROFL
I love the title!

A scientific spectator's guide to next-generation sequencing

Dr Keith not only looks at next gen sequencing but also the emerging technologies of single molecule sequencing. Interesting read!

My fave parts of the review
"Finally, there is the cost per base, generally expressed in a cost per human genome sequenced at approximately 40X coverage. To show one example of how these trade off, the new PacBio machine has a great cost per sample (~U$100) and per run (you can run just one sample) but a poor cost per human genome – you’d need around 12,000 of those runs to sequence a human genome (~U$120K). In contrast, one can buy a human genome on the open market for U$50K and sub U$10K genomes will probably be generally available this year."

"Length is critical to genome sequencing and RNA-seq experiments, but really short reads in huge numbers are what counts for DGE/SAGE and many of the functional tag sequencing methods. Technologies with really long reads tend not to give as many, and with all of them you can always choose a much shorter run to enable the machine to be turned over to another job sooner – if your application doesn’t need long reads."

Wednesday, 7 April 2010

Comparing NGS platforms, 454, Solexa, SOLiD

Inspired by Albert's work at http://ngsbuzz.blogspot.com/

Please post discrepancies or views in comments

Friday, 26 February 2010

Illumina assembly using velvet

UC Davis has a wiki on this.
Covers Single End and Paired End
