
Wednesday, 13 April 2011

ZORRO is a hybrid sequencing technology assembler: tested with Solexa and 454

Typos in the header aside... you have to love the name!

Waiting for the name to become a verb... "I zorroed the NGS reads the other day and I had a fantastic assembly!" lol...



Here goes: http://lge.ibi.unicamp.br/zorro/


Overview

ZORRO is a hybrid sequencing technology assembler. It takes two sets of pre-assembled contigs and merges them into a more contiguous and consistent assembly. We have already tested ZORRO with Illumina Solexa and 454 data from organisms with genomes varying from 3 Mb to 100 Mb. The main characteristic of ZORRO is the treatment before and after assembly to avoid errors.
The ZORRO project is maintained by Gustavo Lacerda, Ramon Vidal and Marcelo Carazzole and was first used in this yeast assembly: Genome structure of a Saccharomyces cerevisiae strain widely used in bioethanol production.
ZORRO needs to be better documented and has not undergone enough testing. If you want to discuss the pipeline, you can join the mailing list: zorro-google group

Zorro: The Complete Series

P.S. The typo is here:
"ZORROthe masked assember "

Friday, 4 March 2011

Guide/tutorial for the analysis of RNA-seq data

Link on SeqAnswers

Excellent starting point for those confused about the RNA-seq data analysis procedure.

Hello,

I've written a guide to the analysis of RNA-seq data, for the purpose of differential expression analysis. It currently lives on our internal wiki that can't be viewed outside of our division, although printouts have been used at workshops. It is by no means perfect and very much a work in progress, but a number of people have found it helpful, so I thought it would be useful to have it somewhere more publicly accessible.

I've attached a pdf version of the guide, although really what I was hoping was that someone here could suggest somewhere where it could be publicly hosted as a wiki. This area is so multifaceted and fast-moving that the only way such a guide can remain useful is if it can be constantly extended and updated.

If anyone has any suggestions about potential hosting, they can contact me at myoung @wehi.edu.au

Cheers

Matt

Update: I've put a few extra things on our local wiki, and seeing as people here seem to be finding this useful, I thought I'd post an updated version. I'm also an author on a review paper on differential expression using RNA-seq, which people who find the guide useful might also find relevant...

RNA-seq Review

Tuesday, 16 November 2010

Uniqueome: a uniquely ... omics word

Spotted this post on the Tree of Life blog

Another good paper, but bad omics word of the day: uniqueome

From "The Uniqueome: A mappability resource for short-tag sequencing
Ryan Koehler, Hadar Issac , Nicole Cloonan,*, and Sean M. Grimmond." Bioinformatics (2010) doi: 10.1093/bioinformatics 
 
Paper does look interesting though!
 
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here we present the “uniqueome”, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data is available for human, mouse, fly, and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data.
Availability: Files, scripts, and supplementary data is available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/
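The underlying computation is easy to picture: slide a k-mer window along the genome and keep the positions whose k-mer occurs exactly once. A brute-force toy version (my own illustration; the actual resource is computed with the ISAS aligner, not like this):

```python
# Toy uniqueome: mark genome positions whose k-mer occurs exactly once.
# Illustrates the concept only; not the paper's method.
from collections import Counter

def uniquely_mappable(genome, k):
    """Return, per start position, whether its k-mer is unique."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [counts[genome[i:i + k]] == 1 for i in range(len(genome) - k + 1)]

genome = "ACGTACGTTT"
print(uniquely_mappable(genome, 4))
# ACGT occurs twice, so positions 0 and 4 come back False.
```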


Tuesday, 12 October 2010

Human whole genome sequencing at 11x coverage

http://genomebiology.com/2010/11/9/R91

Just saw this paper: Sequencing and analysis of an Irish human genome. AFAIK WGS is usually done at 30x coverage. In this paper, the authors “describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information.” I thought it was pretty good considering that they had 99.3% of the reference genome covered at 10.6x coverage. That leaves only about 21 Mbases missing...
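A quick back-of-the-envelope check on that figure (assuming a ~3 Gb reference, which is my round number, not one from the paper):

```python
# Back-of-the-envelope: how much of the reference is uncovered if
# 99.3% is covered by at least one read? Genome size is my rough
# assumption (~3 Gb), not a number taken from the paper.
genome_size = 3.0e9        # approximate human reference, in bases
covered_fraction = 0.993   # reported fraction covered by >= 1 read

missing = genome_size * (1 - covered_fraction)
print(f"~{missing / 1e6:.0f} Mb uncovered")  # -> ~21 Mb
```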

For those interested in the tech details

Four single-end and five paired-end DNA libraries were generated and sequenced using a GAII Illumina Genome Analyzer. The read lengths of the single-end libraries were 36, 42, 45 and 100 bp and those of the paired end were 36, 40, 76, and 80 bp, with the span sizes of the paired-end libraries ranging from 300 to 550 bp (± 35 bp). In total, 32.9 gigabases of sequence were generated (Table 1). Ninety-one percent of the reads mapped to a unique position in the reference genome (build 36.1) and in total 99.3% of the bases in the reference genome were covered by at least one read, resulting in an average 10.6-fold coverage of the genome.
...
At 11-fold genome coverage, approximately 99.3% of the reference genome was covered and more than 3 million SNPs were detected, of which 13% were novel and may include specific markers of Irish ancestry.

SRMA: a tool for improved variant discovery through local re-alignment of short-read next-generation sequencing data

Have a look at this tool: http://genomebiology.com/2010/11/10/R99/abstract
It is a re-aligner for NGS reads that doesn't use a lot of RAM. I'm not sure how it compares to GATK's local realignment around indels, as that isn't mentioned, but the authors used reads aligned with the popular BWA or BFAST as input. (Bowtie was left out, though.)

Excerpted
SRMA was able to improve the ultimate variant calling, using a variety of measures on the simulated data, from two different popular aligners (BWA and BFAST). These aligners were selected based on their sensitivity to insertions and deletions (BFAST and BWA), since a property of SRMA is that it produces a better consensus around indel positions. The initial alignments from BFAST allow local SRMA re-alignment using the original color sequence and qualities to be assessed, as BFAST retains this color-space information. This further reduces the bias towards calling the reference allele at SNP positions in ABI SOLiD data, and reduces the false discovery rate of new variants. Thus, local re-alignment is a powerful approach to improving genomic sequencing with next-generation sequencing technologies.

The alignments to the reference genome were implicitly split into 1 Mb regions and processed in parallel on a large computer cluster; the re-alignments from each region were then merged in a hierarchical fashion. This allows for the utilization of multi-core computers, with one re-alignment per core, as well as parallelization across a computer cluster or a cloud.

The average peak memory utilization per process was 876 Mb (on a single core), with a maximum peak memory utilization of 1.25 GB. On average, each 1 Mb region required approximately 2.58 minutes to complete, requiring approximately 86.17 hours total running time for the whole U87MG genome. SRMA also supports re-alignment within user-specified regions for efficiency, so that only regions of interest need to be re-aligned. This is particularly useful for exome-sequencing or targeted re-sequencing data.
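The 1 Mb split / parallel re-align / merge scheme is easy to sketch. In this minimal Python version, realign_region() is a hypothetical stand-in, not SRMA's real API, and the hierarchical merge collapses to a simple concatenation:

```python
# Sketch of the split / parallel-realign / merge scheme described above.
# realign_region() is a hypothetical placeholder, not SRMA's real API.
from multiprocessing import Pool

REGION_SIZE = 1_000_000  # 1 Mb regions, as in the paper

def regions(chrom_lengths):
    """Yield (chrom, start, end) windows tiling the genome."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, REGION_SIZE):
            yield chrom, start, min(start + REGION_SIZE, length)

def realign_region(region):
    """Placeholder: locally re-align all reads overlapping one window."""
    chrom, start, end = region
    return f"realigned:{chrom}:{start}-{end}"

if __name__ == "__main__":
    chrom_lengths = {"chr21": 46_944_323}  # toy example
    with Pool() as pool:  # one region per core
        chunks = pool.map(realign_region, regions(chrom_lengths))
    merged = list(chunks)  # merge step (trivial in this sketch)
    print(len(merged), "regions re-aligned and merged")
```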

Monday, 16 August 2010

ANNOVAR: an easy way to automate the variant reduction procedure, AKA exome or whole genome sequencing

The authors' description of the software is that
"ANNOVAR is an efficient software tool to utilize up-to-date information to functionally annotate genetic variants detected from diverse genomes."

Whilst I am not a good enough programmer to comment on how efficient the code is, in the spirit of 'why reinvent the wheel', ANNOVAR is definitely an efficient way to get started on exome / whole genome resequencing for variant discovery.

Basically, it is a collection of Perl scripts that can:
1) take in variant information from popular tools like samtools pileup, Complete Genomics, GFF3-SOLiD (?), etc.
2) do annotation, i.e. gene-based, region-based and filter-based annotation (see the sketch below).
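For the curious, a minimal sketch of what a run could look like, driven from Python. The -geneanno / -regionanno / -filter flags follow the ANNOVAR documentation of that era, but the input file and database names below are placeholders, so check the docs before copying:

```python
# Hypothetical driver for ANNOVAR's three classic annotation modes.
# The flags follow the ANNOVAR docs; paths/databases are placeholders.
import subprocess

AVINPUT = "sample.avinput"  # variants converted to ANNOVAR's format
HUMANDB = "humandb/"        # pre-downloaded annotation databases

commands = [
    # Gene-based: which gene/exon does each variant hit?
    ["perl", "annotate_variation.pl", "-geneanno",
     "-buildver", "hg18", AVINPUT, HUMANDB],
    # Region-based: e.g. overlap with conserved elements.
    ["perl", "annotate_variation.pl", "-regionanno", "-dbtype", "mce44way",
     "-buildver", "hg18", AVINPUT, HUMANDB],
    # Filter-based: e.g. drop variants already in dbSNP.
    ["perl", "annotate_variation.pl", "-filter", "-dbtype", "snp130",
     "-buildver", "hg18", AVINPUT, HUMANDB],
]

for cmd in commands:
    subprocess.run(cmd, check=True)
```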

The other nice thing is that the download already comes with excellent examples, so you can get going fast.
There are also annotation datasets available for download from the developers,
though I only see hg18 for now.

This page is a nice summary for beginners doing exome / whole genome resequencing.


Note:
UPDATE (2010Aug11): Users have reported a bug (double-counting of splicing) in the auto_annovar.pl program in the 2010Aug06 version. An updated version is provided here.

Wednesday, 14 July 2010

Shiny new tool to index NGS reads: G-SQZ

This is a long-overdue tool for those trying to do non-typical analyses with their reads.
Finally you can index and compress your NGS reads.

http://www.ncbi.nlm.nih.gov/pubmed/20605925

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe (wtembe@tgen.org).
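The core idea, Huffman-coding each (base, quality) pair and keeping per-read offsets so records can be fetched without decoding from the start, is easy to sketch. This toy version is my own illustration, not the real G-SQZ format:

```python
# Toy illustration of the G-SQZ idea: Huffman-code (base, quality)
# symbols and keep per-read bit offsets so individual reads can be
# decoded without scanning from the start. Not the real G-SQZ format.
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build symbol -> bitstring codes from a frequency table."""
    # Heap entries: (weight, tiebreak, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Each record is a (sequence, quality) pair, as in a FASTQ file.
reads = [("ACGT", "IIH#"), ("ACGG", "IIII")]
symbols = Counter((b, q) for seq, qual in reads for b, q in zip(seq, qual))
codes = huffman_codes(symbols)

# Encode, remembering each read's starting bit offset -- this index is
# what allows selective access without decoding everything before it.
bitstream, index = "", []
for seq, qual in reads:
    index.append(len(bitstream))
    bitstream += "".join(codes[(b, q)] for b, q in zip(seq, qual))
print(index, len(bitstream), "bits")
```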

Read the discussion thread on SeqAnswers for more tips and benchmarks.

I am not affiliated with the author, btw.

Sunday, 30 May 2010

Cofactor Genomics on the different NGS platforms

Original post here

They are a commercial company that offers NGS on ABI and Illumina platforms, and since this is on their company page, I guess it's their official stand on what rocks on each platform.

Excerpted.

Applied Biosystems SOLiD 3

The Applied Biosystems SOLiD 3 has the shortest but also the highest quantity of reads. The SOLiD produces up to 240 million 50bp reads per slide per end. As with the Illumina, Mate-Pairs produce double the output by duplicating the read length on each end, and the SOLiD supports a variety of insert lengths like the 454. The SOLiD can also run 2 slides at once to again double the output. SOLiD has the lowest *raw* base qualities but the highest processed base qualities when using a reference due to its 2-base encoding. Because of the number of reads and more advanced library types, we recommend the SOLiD for all RNA and bisulfite sequencing projects.

Solexa/Illumina

The Solexa/Illumina generates shorter reads at 36-75bp but produces up to 160 million reads per run.  All reads are of similar length.  The Illumina has the highest *raw* quality scores and its errors are mostly base substitutions. Paired-end reads with ~200 bp inserts are possible with high efficiency and double the output of the machine by duplicating the read length on each end. Paired-end Illumina reads are suitable for de novo assemblies, especially in combination with 454. The large number of reads makes the Illumina appropriate for de novo transcriptome studies with simultaneous discovery and quantification of RNAs at qRT-PCR accuracy.

Roche/454 FLX

The Roche/454 FLX with Titanium chemistry generates the longest reads (350-500bp) and the most contiguous assemblies, can phase SNPs or other features into blocks, and has the shortest run times. However, 454 also produces the fewest total reads (~1 million) at the highest cost per base. Read lengths are variable. Errors occur mostly at the ends of long same-nucleotide stretches. Libraries can be constructed with many insert sizes (8kb - 20kb) but at half of the read length for each end and with low efficiency.
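A side note on the "2-base encoding" credited above with raising SOLiD's processed base qualities: each SOLiD color call encodes a pair of adjacent bases, so a true base change alters two adjacent colors while a single mis-called color alters only one, which is what reference-based processing exploits. A toy encoder using the standard di-base table (my own sketch, just to illustrate the encoding):

```python
# Toy SOLiD 2-base (color-space) encoder, my own sketch.
# Each color encodes a transition between adjacent bases; a true SNP
# changes two adjacent colors, while a single mis-called color changes
# only one, which is how reference-based decoding cleans up errors.
COLOR = {
    "AA": "0", "CC": "0", "GG": "0", "TT": "0",
    "AC": "1", "CA": "1", "GT": "1", "TG": "1",
    "AG": "2", "GA": "2", "CT": "2", "TC": "2",
    "AT": "3", "TA": "3", "CG": "3", "GC": "3",
}

def to_color_space(seq, primer="T"):
    """Encode a base sequence as a primer base plus color calls."""
    full = primer + seq
    return primer + "".join(COLOR[full[i:i + 2]] for i in range(len(seq)))

print(to_color_space("ACGT"))  # -> T3131
```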

Friday, 28 May 2010

Illumina: an alternative to 454 in metagenomics?

Check out this BMC Bioinformatics paper entitled "Short clones or long clones? A simulation study on the use of paired reads in metagenomics" 


"This paper addresses the problem of taxonomical analysis of paired reads. We describe a new feature of our metagenome analysis software MEGAN that allows one to process sequencing reads in pairs and makes assignments of such reads based on the combined bit scores of their matches to reference sequences. Using this new software in a simulation study, we investigate the use of Illumina paired-sequencing in taxonomical analysis and compare the performance of single reads, short clones and long clones. In addition, we also compare against simulated Roche-454 sequencing runs."

"Our study suggests that a higher percentage of Illumina paired reads than of Roche-454 single reads are correctly assigned to species."

"The gain of long-clone data (75 bp paired reads) over long single-read data (250 bp reads) is still significant at ≈ 4% (not shown)."

Of course, more importantly:
"The authors declare that they have no competing interests."

I am not sure if such a program exists, but I wonder if there is an aligner that takes into account the insert size between mate pairs and paired ends. Theoretically it should improve mapping, but by how much is unknown.
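For what it's worth, here is roughly what such scoring could look like: add a log-likelihood term for the observed insert size under an assumed normal insert distribution to the alignment scores. Everything here (the insert statistics, candidate hits, scores) is made up for illustration, not taken from any existing aligner:

```python
# Sketch: choose a paired-end placement using the insert size.
# Alignment scores are combined with a Gaussian log-likelihood on the
# observed insert; all numbers here are made-up assumptions.
import math

MEAN_INSERT, SD_INSERT = 200.0, 20.0  # assumed library insert stats

def insert_loglik(insert):
    """Log-likelihood of an insert size under a normal model."""
    z = (insert - MEAN_INSERT) / SD_INSERT
    return -0.5 * z * z - math.log(SD_INSERT * math.sqrt(2 * math.pi))

def best_pair(cands1, cands2):
    """cands: list of (position, alignment_score) per mate."""
    return max(
        ((p1, p2, s1 + s2 + insert_loglik(abs(p2 - p1)))
         for p1, s1 in cands1 for p2, s2 in cands2),
        key=lambda t: t[2],
    )

# The placement at 1190 wins despite a lower alignment score, because
# its implied insert (~190 bp) fits the library distribution.
print(best_pair([(1000, 50.0)], [(1190, 48.0), (1600, 49.0)]))
```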

Wednesday, 26 May 2010

A scientific spectator's guide to next-generation sequencing

ROFL
I love the title!

A scientific spectator's guide to next-generation sequencing

Dr Keith looks not only at next-gen sequencing but also at the emerging single-molecule sequencing technologies. An interesting read!

My fave parts of the review
"Finally, there is the cost per base, generally expressed in a cost per human genome sequenced at approximately 40X coverage. To show one example of how these trade off, the new PacBio machine has a great cost per sample (~U$100) and per run (you can run just one sample) but a poor cost per human genome – you’d need around 12,000 of those runs to sequence a human genome (~U$120K). In contrast, one can buy a human genome on the open market for U$50K and sub U$10K genomes will probably be generally available this year."


"Length is critical to genome sequencing and RNA-seq experiments, but really short reads in huge numbers are what counts for DGE/SAGE and many of the functional tag sequencing methods. Technologies with really long reads tend not to give as many, and with all of them you can always choose a much shorter run to enable the machine to be turned over to another job sooner – if your application doesn’t need long reads."



Friday, 26 February 2010

Datanami, Woe be me