Thursday, 28 October 2010

1000 Genomes Pilot Paper Published


27 OCTOBER 2010
The 1000 Genomes Project Consortium has published the results of the pilot project analysis in the journal Nature, in an article appearing online today. The paper A map of human genome variation from population-scale sequencing is available from the Nature web site and is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence to ensure wide distribution. Please share our paper appropriately.

Wednesday, 27 October 2010

de novo assembly of large genomes

Here's an informative post by Ewan Birney on the velvet user list about de novo assembly of large genomes.

Velvet's algorithms in theory work for any size. However, the engineering aspects
of Velvet, in particular memory consumption, mean it's unable to handle read sets
beyond a particular size. This of course depends on how big a real memory machine
you have.

I know we have "routinely" (ie, for multiple strains) done Drosophila sized genomes
(~120MB) on a 125GB machine.

I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory
size is not just about the size of the genome but also how error prone your reads
are (though sheer size is important).

Beyond this there are a variety of strategies:

  "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs, can
be made using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published,
well understood, from the BC genome centre).

   Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a
smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions,
and thus provide an improved, more read-pair-aware graph. This can be iterated, and in
at least some cases the Curtain approach gets close to what Velvet can produce alone
(in the scenarios where Velvet can be run on a single memory machine to understand
Curtain's performance).

   SOAP de novo from the BGI is responsible for a number of the published assemblies
(eg, Panda, YH) although like many assemblers, tuning it seems quite hard, and I would
definitely be asking the BGI guys for advice.

   A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.

In all the above cases I know of successes, but also quite a few failures, and untangling
data quality/algorithm/choice of parameters/running bugs is really complex. So - whereas
assemblies < 100MB are "routine", currently assemblies 100MB-500MB are "challenging" and
>500MB are theoretically doable, and have been done by specific groups, but I think still
are at the leading edge of development and one should not be confident of success for
"any particular genome".

Thanks to Ewan for letting me reproduce his post here.

Velvet-users mailing list

Cortex seems very promising for de novo assembly of human reads using reasonable amounts of RAM (128 GB), based on the mailing list. I know I'll be watching out for it on Sourceforge!

NGS alignment viewers, assembly viewers -warning pretty graphics ahead

Chanced upon this review of the visualization tools out there that can help you make biologists understand what you just did with their NGS data. I think 'assembly viewers' is the best name for this category of tools, since many of them support not only BAM but other assembly formats as well.
My vote goes to BamView for the most creative name, but for pretty visuals I have to agree with the author that Tablet takes the cake. Tablet DOES have the least obvious-sounding name, which makes it difficult to find on Google.

Tophat adds support for strand-specific RNA-Seq alignment and colorspace

testing Tophat 1.1.2 now

On an 8 GB RAM CentOS box, I managed to align 1 million reads to hg18 in 33 mins and 2 million reads in 59 mins, using 4 threads.
Nice scalability! But it was slower than what I was used to with bowtie. I kept killing my full set of 90 million reads thinking there was something wrong. Guess I need to be more patient and wait for the ~45 hours.
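For the curious, the 45-hour figure is just a linear extrapolation from the two timings above; a quick sketch (the ~30 min per million reads rate is my own rough fit, not a TopHat benchmark):

```python
# Back-of-envelope extrapolation of TopHat run time from the two
# measured points above (1 M reads -> 33 min, 2 M reads -> 59 min).
# Assumes roughly linear scaling in read count, which the two points suggest.

def estimate_minutes(reads_millions, rate_min_per_million=30.0):
    """Estimate alignment wall time assuming ~30 min per million reads."""
    return reads_millions * rate_min_per_million

full_run_hours = estimate_minutes(90) / 60
print(f"Estimated time for 90 M reads: {full_run_hours:.0f} hours")
```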

I do wonder if the process can be mapped to separate nodes to speed up.

Tuesday, 26 October 2010

Throwing the baby out with the bathwater: Non-Synonymous and Synonymous Coding SNPs Show Similar Likelihood and Effect Size of Human Disease Association

I was literally having an 'oh shoot' moment when I saw this news in GenomeWeb.

Synonymous SNPs Shouldn't Be Discounted in Disease, Study Finds

NEW YORK (GenomeWeb News) – Synonymous SNPs that don't change the amino acid sequence encoded by a gene appear just as likely to influence human disease as non-synonymous SNPs that do, according to a paper appearing online recently in PLoS ONE by researchers from Stanford University and the Lucile Packard Children's Hospital.

from the abstract of the paper
The enrichment of disease-associated SNPs around the 80th base in the first introns might provide an effective way to prioritize intronic SNPs for functional studies. We further found that the likelihood of disease association was positively associated with the effect size across different types of SNPs, and SNPs in the 3′untranslated regions, such as the microRNA binding sites, might be under-investigated. Our results suggest that sSNPs are just as likely to be involved in disease mechanisms, so we recommend that sSNPs discovered from GWAS should also be examined with functional studies.

Hmmmm how is this going to affect your carefully crafted pipeline now? 

Monday, 25 October 2010

AB on Ion Torrent

There was a brief mention of the Ion Torrent at the 5500 presentation as well, but nothing of great significance. I do wish the marketing fellows would push Ion Torrent out faster, but I think they are trying to streamline production by testing whether Invitrogen kits can replace the ones at Ion Torrent. I do hope they do not sacrifice compatibility for performance.

For Bioscope, they are going to include base space support (hurray?) presumably so that they can use the same pipeline for analysis of their SMS and Ion Torrent technologies. 

Stay Tuned!

AB releases 4 HQ and PI as 5500xl and 5500 SOLiD

Was lucky to be part of the 1st group to view the specs and info on the new SOLiD 4 HQ.
For reasons unknown, they have renamed it: the 5500xl and 5500 SOLiD System are your familiar 4 HQ and PI.
Or if you prefer formulas:
5500xl = 4 HQ
5500 = PI

One can only fathom their obsession with these 4 digits judging by similarly named instruments,
the AB Sciex Triple Quad 5500 and the AB Sciex QTrap 5500.

Honestly the 5500 numbers are of no numerical significance AFAIK.

Outlook-wise, both look like the PI system.
I DO NOT see the computer cluster anymore; that's something I am curious about.

Finally we are at 75 bp though.
Of notable importance, there is a new Exact Call Chemistry (ECC) module which promises 99.99% accuracy; it is optional as it increases the run time.
The new SOLiD system is co-developed with Hitachi High-Technologies.
Instead of the familiar slides, they use 'flowchips' now, with 6 individual lanes to allow for more mixing of samples of different reads.
for the 5500xl
throughput per day is 20-30 Gb 
per run you have 180 Gb or 2.8 B tags (paired ends or mate pairs)
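A quick back-of-envelope check on those specs; assuming both quoted figures are accurate, they imply a full run takes roughly 6 to 9 days:

```python
# Sanity check on the quoted 5500xl specs: 180 Gb per run at
# 20-30 Gb per day implies a run lasting roughly 6-9 days.

per_run_gb = 180
daily_gb_low, daily_gb_high = 20, 30

longest_days = per_run_gb / daily_gb_low    # slowest quoted throughput
shortest_days = per_run_gb / daily_gb_high  # fastest quoted throughput
print(f"Implied run length: {shortest_days:.0f}-{longest_days:.0f} days")
```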

Contrary to most rumours, the 5500xl is upgradeable from SOLiD 4, although I suspect it is a trade-in program. No mention of the 5500 (which I guess is basically a downgrade).

The specs should be up soon 

Update from seqanswers user truthseqr

Here is the message that has just been posted:
AB is premiering two new instruments at ASHG next week.

Mobile ASHG calendar:

Twitter account: @SOLiDSequencing

SOLiD Community:

More info soon.

Tuesday, 19 October 2010

After a Decade, JGI Retires the Last of Its Sanger Sequencers


Time and tide wait for no man ...

Microsoft Biology Tools

The Microsoft Biology Tools (MBT) are a collection of tools that enable biology and bioinformatics researchers to be more productive in making scientific discoveries. Some of the tools provided here take advantage of the capabilities of the Microsoft Biology Foundation, and are good examples of how you can use MBF to create other tools.

When I last visited the site, no tools were available, and I didn't realise it had finally been released. I will definitely try some of these tools just to satisfy curiosity, but I doubt they will find widespread usage; that's just a personal opinion.

Stuff that caught my eye
BL!P: BLAST in Pivot
False Discovery Rate
Microsoft Research Sequence Assembler
SIGMA: Large Scale Machine Learning Toolkit

Do post comments here if you have tested any of the tools and found them useful!

Tuesday, 12 October 2010

Human Whole genome sequencing at 11x coverage

Just saw this paper, Sequencing and analysis of an Irish human genome. AFAIK WGS is usually done at 30x coverage. In this paper, the authors “describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information.” I thought it was pretty good considering that they had 99.3% of the reference genome covered at 10.6x coverage. That leaves only about 21 Mbases missing...
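The numbers check out with some quick arithmetic (the ~3.1 Gb genome length here is my assumption, not a figure from the paper):

```python
# Reproducing the paper's coverage arithmetic from the quoted figures.
# The ~3.1 Gb reference length is an assumption on my part.

genome_bp = 3.1e9          # approximate human reference length (assumed)
sequenced_bp = 32.9e9      # total sequence generated (from the paper)
fraction_covered = 0.993   # bases covered by at least one read

mean_coverage = sequenced_bp / genome_bp              # ~10.6x raw coverage
missing_mb = genome_bp * (1 - fraction_covered) / 1e6 # ~22 Mb uncovered
print(f"Raw coverage: {mean_coverage:.1f}x, uncovered: ~{missing_mb:.0f} Mb")
```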

For those interested in the tech details

Four single-end and five paired-end DNA libraries were generated and sequenced using a GAII Illumina Genome Analyzer. The read lengths of the single-end libraries were 36, 42, 45 and 100 bp and those of the paired end were 36, 40, 76, and 80 bp, with the span sizes of the paired-end libraries ranging from 300 to 550 bp (± 35 bp). In total, 32.9 gigabases of sequence were generated (Table 1). Ninety-one percent of the reads mapped to a unique position in the reference genome (build 36.1) and in total 99.3% of the bases in the reference genome were covered by at least one read, resulting in an average 10.6-fold coverage of the genome.
At 11-fold genome coverage, approximately 99.3% of the reference genome was covered and more than 3 million SNPs were detected, of which 13% were novel and may include specific markers of Irish ancestry.

Bio-Hackers not the IT kind

Amateur hobbyists are creating home-brew molecular-biology labs, but can they ferment a revolution?

Rob Carlson's path to becoming a biohacker began with a chance encounter on the train in 1996. Carlson, a physics PhD student at the time, was travelling to New York to find a journal article that wasn't available at his home institution, Princeton University in New Jersey. He found himself sitting next to an inquisitive elderly gentleman. Carlson told him about his thesis research on the effects of physical forces on blood cells, and at the end of the journey, the stranger made him an offer. "You should come work for me," said the man, "I'm Dr Sydney Brenner." The name meant little to Carlson, who says he thought: "Yeah, OK. Whatever, 'Dr Sydney Brenner.'"


12 Geneticists unzip their genomes in full public view

A GROUP of 12 genetics experts will expose their DNA to public view today to challenge the common view that such information is so private and sensitive that it should not be widely shared. 

The "DNA dozen" will publish full results of their own genetic tests, including implications for their health, in a controversial initiative to explain the significance of the human genome for medicine and society.
The Genomes Unzipped project aims to demystify the genetic code, showing what it can and cannot reveal about individuals' health and allaying fears about discrimination and privacy.
The participants - 11 British-based scientists and an American genetics lawyer - hope to encourage many more people to share details of their genomes with researchers. This would allow the creation of open-access DNA databases that any scientist could use, enabling a "wisdom of crowds" approach to research that will accelerate discoveries about genetics and health.

Discarding quality scores for small RNA analysis.

Going through CLCbio's Small RNA analysis using Illumina data.
Got to this part:
Make sure the Discard quality scores and Discard read names checkboxes are checked. Information about quality scores and read names are not used
in this analysis anyway, so it will just take up disk space when importing the data.

which got me thinking. The reads are short to begin with, and I would expect more information to always be better. But in some cases, I guess having a 2nd metric is confusing, when there's
1)sequencing error
2)bona fide SNP
3)relatively low quality scores anyway (how would one weight the seq quality in a fair way?)
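For context on point 3, here's a quick reminder of what the discarded scores actually encode; a Phred quality Q maps to an error probability of 10^(-Q/10):

```python
# A Phred score Q encodes a base-call error probability of 10**(-Q/10),
# so Q20 means a 1% chance the call is wrong. For a ~20 bp small-RNA
# read, even one miscalled base can blur the line between "sequencing
# error" and "bona fide SNP".

def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10.0)

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.3f}")
```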

I believe CLCbio uses the BWT for indexing and compressing the genomes to be searched; I am curious how their implementation differs from BWA and Bowtie though.

SRMA: tool for Improved variant discovery through local re-alignment of short-read next-generation sequencing data

Have a look at this tool:
it is a realigner for NGS reads that doesn't use a lot of RAM. Not too sure how it compares to GATK's local realignment around indels, as that is not mentioned, but the authors used reads that were aligned with the popular BWA or BFAST as input. (Bowtie was left out though.)

SRMA was able to improve the ultimate variant calling, using a variety of measures on the simulated data, from two different popular aligners (BWA and BFAST). These aligners were selected based on their sensitivity to insertions and deletions (BFAST and BWA), since a property of SRMA is that it produces a better consensus around indel positions. The initial alignments from BFAST allow local SRMA re-alignment using the original color sequence and qualities to be assessed, as BFAST retains this color space information. This further reduces the bias towards calling the reference allele at SNP positions in ABI SOLiD data, and reduces the false discovery rate of new variants. Thus, local re-alignment is a powerful approach to improving genomic sequencing with next generation sequencing technologies.

The alignments to the reference genome were implicitly split into 1Mb regions and processed in parallel on a large computer cluster; the re-alignments from each region were then merged in a hierarchical fashion. This allows for the utilization of multi-core computers, with one re-alignment per core, as well as parallelization across a computer cluster or a cloud.

The average peak memory utilization per process was 876Mb (on a single core), with a maximum peak memory utilization of 1.25GB. On average, each 1Mb region required approximately 2.58 minutes to complete, requiring approximately 86.17 hours total running time for the whole U87MG genome. SRMA also supports re-alignment within user-specified regions for efficiency, so that only regions of interest need to be re-aligned. This is particularly useful for exome-sequencing or targeted re-sequencing data.
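The 1 Mb region-splitting the authors describe could be sketched roughly like this (the chromosome lengths here are made up for illustration; this is not SRMA's actual code):

```python
# Sketch of the region-splitting scheme described above: tile the genome
# into 1 Mb windows so each can be re-aligned independently on its own
# core, then merged afterwards. Chromosome lengths are illustrative only.

def make_windows(chrom_lengths, window=1_000_000):
    """Yield (chrom, start, end) tuples tiling each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, window):
            yield chrom, start, min(start + window, length)

regions = list(make_windows({"chr1": 2_500_000, "chr2": 1_200_000}))
print(regions)  # chr1 splits into 3 windows, chr2 into 2
```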

Monday, 11 October 2010

What's your BRCA status? Personal Genomics testing.

Do-it-yourself genetic testing
   How to test your BRCA status and why we need to prepare for the personal genomics age.
Genome Biology 2010, 11:404

Interesting read covering issues on personal genomics. Did you know that “the BRCA gene patents, which are held by Myriad Genetics, cover all known cancer-causing mutations in addition to those that might be discovered in the future.” How did that one slip through the patent office?? Not that it really matters “Currently Myriad charges more than $3000 for its tests on the BRCA genes, while sequencing one's entire genome now costs less than $20,000. Furthermore, once an individual's genome has been sequenced, it becomes a resource that can be re-tested as new disease-causing mutations are discovered. “

“Regardless of how easy it might be to test for mutations, the restrictive nature of the BRCA gene patents means that anyone wishing to examine any mutation in BRCA1 or BRCA2 will have to obtain permission from the patent holder Myriad Genetics. This restriction applies even if testing your own genome. If you wanted to look at other genes, you would have to pay license fees for any of them that were protected by patents. In practice, although it may seem absurd, this means that before scanning your own genome sequence, you might be required by law to pay thousands of license fees to multiple patent holders. “

This is complete hogwash! (The concept that I have to pay genome-squatters (see cybersquatters) in the human genome; I would much rather pay for real estate on the moon!)

related posts
US clinics quietly embrace whole-genome sequencing @ Nature News
Commentary on Personal Genomics testing

Genome sizes

Compendium of links for gauging genome sizes (useful for calculating genome coverage)

My fav diagram for getting a relative feel for genome sizes is from Molecular Biology of the Cell;
the image is reproduced here as Figure 1-38, Genome sizes compared.

There are other sources too; also see Comparison of different genome sizes

DOGS - Database Of Genome Sizes last modified Tuesday 16th of May 2006

Google images search for "genome sizes" for other lucky finds.

Molecular Biology of the Cell, Fourth Edition

Installing SOLiD™ System de novo Accessory Tools 2.0 with Velvet and MUMmer

How to install on CentOS 5.4:

 wget #just to keep a record
 tar zxvf denovo2.tgz
 cp velvet_0.7.55.tgz denovo2 #you can use mv if you don’t mind downloading again
 cp MUMmer3.22.tar.gz denovo2
 cd denovo2
 tar zxvf velvet_0.7.55.tgz
 tar zxvf MUMmer3.22.tar.gz
 cd MUMmer3.22/src/kurtz #this was the part where I deviated from instructions
 gmake mummer #Might be redundant but running gmake at root dir gave no binary
 gmake |tee gmake-install.log
Next step:
download the example data to run through the pipeline

This is a 50X50 Mate pair library from DH10B produced by the SOLiD™ system. The set includes .csfasta and .qual files for F3 and R3. The insert size of the library is 1300bp and it is about 600X coverage of the DH10B genome. The results from the MP library in the de novo documents are generated from this dataset.
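Sanity-checking the quoted 600X figure: taking the E. coli DH10B genome as roughly 4.7 Mb (my assumption, not stated in the docs), a 50x50 mate-pair library would need on the order of 28 million pairs:

```python
# Rough check of the 600X coverage claim for the example DH10B data set.
# The ~4.7 Mb genome size is an assumption, not a figure from the docs.

genome_mb = 4.7            # approximate E. coli DH10B genome size (assumed)
bases_per_pair = 50 + 50   # 50x50 mate-pair library
target_coverage = 600

pairs_needed = target_coverage * genome_mb * 1e6 / bases_per_pair
print(f"~{pairs_needed / 1e6:.0f} million mate pairs for {target_coverage}X")
```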



pitfalls for SAET for de novo assembly

Spotted in manual for

SOLiD™ System de novo Accessory Tools 2.0

Usage of pre-assembly error correction: This is an optional tool which was
demonstrated to increase contigs length in de novo assembly by factor of 2 to 3. Do not use this tool if coverage is less than 20x. Overcorrection and under-correction are equally bad for de novo assembly; therefore use balanced number of local and global rounds of error correction. For example, the pipeline will use 1 global and 3 local rounds if reads are 25bp long, and 2 global and 5 local rounds if reads are 50bp long.

Is it just me? I would think it is trivial for the correction tool to correct only when the coverage is > 20x; not sure why you would need human intervention.
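The guard the manual leaves to the user could be automated along these lines (everything here is hypothetical; SAET itself exposes no such option as far as I know):

```python
# Sketch of an automated coverage gate for pre-assembly error correction,
# per the manual's advice: skip correction below 20x, and pick a balanced
# number of global/local rounds from the read length. Hypothetical code,
# not part of SAET.

def should_run_error_correction(total_bases, genome_size, min_coverage=20):
    """Return True only when estimated coverage clears the threshold."""
    return total_bases / genome_size >= min_coverage

def correction_rounds(read_len):
    """Balanced (global, local) rounds per the manual's examples:
    25 bp reads -> (1, 3); 50 bp reads -> (2, 5)."""
    return (1, 3) if read_len <= 25 else (2, 5)

# e.g. 120 Mb of reads over a 5 Mb genome is 24x, so correction is allowed
print(should_run_error_correction(120e6, 5e6), correction_rounds(50))
```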

Friday, 8 October 2010

small RNA analysis for SOLiD data

SOLiD™ System Small RNA Analysis Pipeline Tool (RNA2MAP) is being released as "unsupported software" by Applied Biosystems.

It failed for me at simply producing the silly PBS scripts to run the analysis. I was advised to try running it on another Linux server, by dumb luck :(
I found example scripts, but documentation is brief. Not sure if it's worth the hassle to debug the collection of Perl scripts, or to manually edit the params in the PBS submission scripts, for a tool that is not commonly used.

How are you analysing your small RNA SOLiD reads? Any software to recommend? One tool I found is for 454 and Illumina only.
Partek Genomics Suite is commercial.

The other two listed at the SEQanswers wiki
don't seem to be for SOLiD either.

Re-Defining Storage for the Next Generation

Do have a go at this article citing David Dooling, assistant director of informatics at The Genome Center at Washington University, and a few others.
Looking ahead, as genome sequencing costs fall, there's going to be more data generated than ever. And the article rightly states that every postdoc with a different analysis method will keep a copy of a canonical dataset. Personally, I think this is a call for tools and data to be moved to the cloud.
Getting the data up there in the first place is a choke point,
but using the cloud will most definitely force everyone to use only a single copy of shared data.
Google solved the problem of tackling large datasets with slow interconnects with the MapReduce paradigm.
There are tools available that make use of this already, but they are not popular yet. I still get weird stares when I tell people about Hadoop filesystems. Sigh. More education is needed!
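For the curious, the MapReduce idea boils down to this kind of split/merge, sketched here in plain Python with a toy k-mer count (a real pipeline would use Hadoop streaming or Amazon's Elastic MapReduce, not this snippet):

```python
# Minimal illustration of the map-reduce pattern: each "node" counts
# k-mers in its own shard of reads (map), then the per-shard tallies
# are merged into a global one (reduce). Pure stdlib, toy data.

from collections import Counter
from functools import reduce

def map_phase(reads, k=3):
    """Count k-mers within one shard of reads."""
    return Counter(read[i:i + k]
                   for read in reads
                   for i in range(len(read) - k + 1))

def reduce_phase(partial_counts):
    """Merge per-shard counts into a global tally."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())

chunks = [["ACGTAC"], ["GTACGT"]]   # two shards of reads
total = reduce_phase(map_phase(c) for c in chunks)
print(total.most_common(2))
```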

In summary, my take on the matter would be to have a local Hadoop FS, with redundancy, for storing data for analysis, and to move a copy of the data to your favourite cloud for archival, sharing, and possibly data analysis as well (MapReduce is available on Amazon too).
Another issue is whether researchers are keeping data out of sentimentality or a real scientific need.
I have kept copies of my PAGE gel scans from my BSc days archived in some place where the sun doesn't shine, but honestly, I can't foresee myself going back to the data. Part of the reason I kept them was that I spent a lot of time and effort to get them.

Storage, large datasets, and computational needs are not new problems for the world. They are new to biologists, however. I am afraid that because of miscommunication, a lot of researchers out there are going to rush to overspec their cluster and storage when the money could be better spent on sequencing. I am sure that would make some IT vendors very happy though, especially in this financial downturn for IT companies.

I don't know if I am missing anything though.. comments welcome!

Thursday, 7 October 2010

US clinics quietly embrace whole-genome sequencing @ Nature News

read the full article
As hospitals and insurers battle over coverage for single-gene diagnostic tests, and the US Food and Drug Administration cracks down on the products of personal genomics companies, a growing number of doctors are relying on the sequencing of either the whole genome or of the coding region, known as the exome.
"If one hospital is doing it, you can be sure others will start, because patients will vote with their feet," Elizabeth Worthey, a genomics specialist at the Human and Molecular Genetics Center (HMGC) of the Medical College of Wisconsin in Milwaukee, said at the Personal Genome meeting at Cold Spring Harbor Laboratory in New York last weekend.
In May 2009, the genetic-technology provider Illumina, based in San Diego, California, launched its Clinical Services programme with two of its high-throughput genome analysers. The company now has 15 such devices dedicated to this programme.
Illumina provides the raw sequence data attained from a patient's DNA sample to a physician, who passes it on to a bioinformatics team, which works to crack the patient's condition. However, Illumina is working to develop tools to help physicians navigate genomes and identify genes already associated with diseases, as well as novel ones.
So far, the company has sequenced more than 24 genomes from patients with rare diseases or atypical cancers at the request of physicians at academic medical centres. The standard US$19,500 price tag is typically covered by the patient, by means of a research grant, or with the help of private foundations, although one patient is currently applying for insurance reimbursement. 

I would be really surprised if insurance is able to cover exome sequencing under any conditions. Also see my Commentary on Personal Genomics testing.

Tuesday, 5 October 2010

Genomes: $1K to seq $1 Mil to analyse


October 1, 2010 | It is doubtful that the scientists and physicians who first started talking about the $1,000 genome in 2001 could have imagined that we would be on the verge of that achievement within the decade. As the cost of sequencing continues to freefall, the challenge of solving the data analysis and storage problems becomes more pressing. But those issues are nothing compared to the challenge facing the clinical community who are seeking to mine the genome for clinically actionable information—what one respected clinical geneticist calls “the $1 million interpretation.” From the first handful of published human genome sequences, the size of that task is immense.

LOL, I would like to see where the $1 million is trickling to... it may not cost as much as $1 million to analyze most data, but the infrastructure costs are possibly ~$1 million. I do hope to get a pay increment if the $1 million does trickle down to me though!

Edit: added link to the article, and guess what? Kevin Davies wrote a book!
Very curious now.

Friday, 1 October 2010

The Solexa Story

Yeah, I love universal truths, like things being invented over beer.

BGI Hong Kong has big plans and hopes to sequence all the genetics in the world.

Interesting read! But I do wonder if it's a good thing if all the genetics in the world are sequenced by one monolithic company.
