Friday, 24 December 2010

Exome sequencing reveals mutations in previously unconsidered candidate genes

Excerpted from Bio-IT World 

December 20, 2010 | Doctors at the Medical College of Wisconsin have published in the journal Genetics in Medicine the results of exome sequencing in a seriously ill boy with undiagnosed bowel disease. The study (using 454 sequencing) revealed mutations in a gene called XIAP, which, interestingly, was not among the more than 2,000 candidate genes considered before the DNA sequencing was performed. ....read full article

Monday, 20 December 2010

As R&D Budgets Shrink and Data Grows, Bioinformatics Service Providers Could Gain in Popularity

After this article in Nature ("Singapore's salad days are over"), there's another article in GenomeWeb that talks about shrinking R&D budgets and bioinformatics outsourcing. Are labs worldwide facing budget cuts? Or is it just year-end panic? Hmmmm

Tuesday, 14 December 2010

1000 Genomes Project Tutorial Videos

The 1000 Genomes Project has released the data sets for the pilot projects and for more than 1000 samples for the full-scale project. A tutorial on how to use the data was held at the 2010 American Society of Human Genetics (ASHG) annual meeting on November 3.
Videos for each of the tutorial sessions are now available. The tutorial describes 1000 Genomes Project data, how to access it and how to use it. Each of the speakers and their topics are listed below along with their tutorial videos and PowerPoint slides.

Introduction
Gil McVean, Ph.D.
Professor of Statistical Genetics
University of Oxford





Description of the 1000 Genomes Data
Gabor Marth, D.Sc.
Associate Professor of Biology
Boston College

How to Access the Data
Steve Sherry, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health. Bethesda, Md.

How to Use the Browser
Paul Flicek, Ph.D.
European Molecular Biology Laboratory
Vertebrate Genomics Team
European Bioinformatics Institute (EBI)

Structural Variants
Jan Korbel, Ph.D.
Group Leader, Genome Biology Research Unit
Joint Appointment with EMBL-EBI
European Molecular Biology Laboratory (Heidelberg, Germany)

How to Use the Data in Disease Studies
Jeffrey Barrett, Ph.D.
Team Leader, Statistical and Computational Genetics
Wellcome Trust Sanger Institute
Hinxton, United Kingdom

Friday, 3 December 2010

When playing games is working (if you are a biologist, that is)

Check out this Flash game, Phylo.
If you are thinking it's related to phylogenetics, then bingo.. Kudos for an excellent idea and excellent graphics and interface, but I wish they had a better name and a less verbose introduction for laymen.
Waiting eagerly for the iPhone/iPod version to come out..

from http://phylo.cs.mcgill.ca/eng/about.html


What's Phylo all about?
Though it may appear to be just a game, Phylo is actually a framework for harnessing the computing power of mankind to solve a common problem: multiple sequence alignment.

What is a Multiple Sequence Alignment? A sequence alignment is a way of arranging the sequences of DNA, RNA or protein to identify regions of similarity. These similarities may be consequences of functional, structural, or evolutionary relationships between the sequences.
From such an alignment, biologists may infer shared evolutionary origins, identify functionally important sites, and illustrate mutation events. More importantly, biologists can trace the source of certain genetic diseases.


The Problem Traditionally, multiple sequence alignment algorithms use computationally complex heuristics to align the sequences.
Unfortunately, the use of heuristics does not guarantee a globally optimal alignment, as computing one would be prohibitively expensive. This is due in part to the sheer size of the genome, which consists of roughly three billion base pairs, and to the computational complexity that grows with each additional sequence in an alignment.


Our Approach Humans have evolved to recognize patterns and solve visual problems efficiently.
By abstracting multiple sequence alignment to manipulating patterns consisting of coloured shapes, we have adapted the problem to benefit from human capabilities.
By taking data which has already been aligned by a heuristic algorithm, we allow the user to optimize where the algorithm may have failed.


The Data All alignments were generously made available through the UCSC Genome Browser.
In fact, all alignments contain sections of human DNA which have been speculated to be linked to various genetic disorders, such as breast cancer.
Every alignment is received, analyzed, and stored in a database, where it will eventually be re-introduced into the global alignment as an optimization.
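
To get a feel for what "optimizing" an alignment means numerically: aligners typically judge a candidate multiple alignment column by column with something like a sum-of-pairs score. Here is a minimal sketch (a generic illustration with made-up sequences and score values, not Phylo's actual scoring scheme):

  # toy sum-of-pairs scoring: +1 for a matching pair in a column,
  # -1 for a mismatch, -2 whenever a gap is paired with anything
  MATCH, MISMATCH, GAP = 1, -1, -2

  def column_score(column):
      score = 0
      for i in range(len(column)):
          for j in range(i + 1, len(column)):
              a, b = column[i], column[j]
              if a == '-' or b == '-':
                  score += GAP
              elif a == b:
                  score += MATCH
              else:
                  score += MISMATCH
      return score

  def sp_score(alignment):
      # total score is the sum over all columns
      return sum(column_score(col) for col in zip(*alignment))

  # two candidate layouts of the same three sequences
  aln1 = ["ACGT-A", "AC-TTA", "ACGTTA"]
  aln2 = ["ACGTA-", "ACTT-A", "ACGTTA"]
  print(sp_score(aln1), sp_score(aln2))  # the higher-scoring layout wins

A player shuffling blocks in Phylo is, in effect, searching for a higher-scoring layout than the one the heuristic settled on.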

Tuesday, 30 November 2010

Why can't Bioscope / mapreads write to bam natively?

Spotted this small fact in Bioscope 1.3.1 release notes.

There is significant disk space required for converting ma to BAM
  when the option output.filter=none is used, which roughly needs
  2TB peak disk space for converting a 500 million reads ma file.
  Other options do not need such large peak disk space. The disk
  space required per node is smaller if more jobs are dispatched to
  more nodes.



I would love to see the calculation behind that 2 TB figure. I am glad they moved to BAM in the Bioscope workflow, but I am not entirely sure what the reason is for keeping the .ma file format when they are the only ones using it.
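
Purely as a back-of-envelope guess (these are my own made-up numbers, not ABI's): with output.filter=none every alignment hit of every read is written out, so the intermediate size scales as reads × hits per read × bytes per record, and terabytes arrive quickly.

  # rough guess at how a 500M-read .ma -> BAM conversion could peak at ~2 TB
  reads            = 500e6   # reads in the .ma file
  hits_per_read    = 20      # output.filter=none keeps every hit (assumed)
  bytes_per_record = 200     # one SAM-like text line per hit (assumed)

  peak_bytes = reads * hits_per_read * bytes_per_record
  print(peak_bytes / 1e12, "TB")   # -> 2.0 TB with these assumptions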

Card Trick Leads to New Bound on Data Compression - Technology Review

http://www.technologyreview.com/blog/arxiv/26078/?ref=rss
Excerpted...
Here's a card trick to impress your friends. Give a deck of cards to a pal and ask him or her to cut the deck, draw six cards and list their colours. You then immediately name the cards that have been drawn.
Magic? Not quite. Instead, it's the next best thing: mathematics. The key is to arrange the deck in advance so that the sequence of the card colours follows a specific pattern called a binary De Bruijn cycle. A De Bruijn sequence is a cyclic sequence over an alphabet in which every possible subsequence of a given length appears exactly once.
So when a deck of cards meets this criterion, it uniquely defines any sequence of six consecutive cards. All you have to do to perform the trick is memorise the sequence.
Usually these kinds of tricks come about as the result of some new development in mathematical thinking. Today, Travis Gagie from the University of Chile in Santiago turns the tables. He says that this trick has led him to a new mathematical bound on data compression....


Neat!! I love how maths integrates with life..
I wonder how this will be used 5 years down the road..
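
For the curious, here is the textbook construction of a binary De Bruijn sequence and a check of the property the trick exploits (a sketch only; a real 52-card deck is shorter than the full 64-symbol B(2,6) cycle, so the actual trick relies on an arrangement that preserves the uniqueness of 6-card windows):

  def de_bruijn(k, n):
      # standard Lyndon-word construction of a De Bruijn sequence B(k, n)
      a = [0] * (k * n)
      seq = []
      def db(t, p):
          if t > n:
              if n % p == 0:
                  seq.extend(a[1:p + 1])
          else:
              a[t] = a[t - p]
              db(t + 1, p)
              for j in range(a[t - p] + 1, k):
                  a[t] = j
                  db(t + 1, t)
      db(1, 1)
      return seq

  colours = de_bruijn(2, 6)   # 0 = black, 1 = red; length 2**6 = 64
  positions = {}
  for i in range(len(colours)):
      window = tuple(colours[(i + j) % len(colours)] for j in range(6))
      assert window not in positions   # every 6-colour window occurs exactly once
      positions[window] = i
  print(len(positions))   # 64 distinct windows = the "memorised" lookup table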

The actual paper is here
Ref: arxiv.org/abs/1011.4609: Bounds from a Card Trick

Tuesday, 16 November 2010

Uniqueome, a uniquely ... omics word

Spotted this post on the Tree of Life blog

Another good paper, but bad omics word of the day: uniqueome

From "The Uniqueome: A mappability resource for short-tag sequencing
Ryan Koehler, Hadar Issac , Nicole Cloonan,*, and Sean M. Grimmond." Bioinformatics (2010) doi: 10.1093/bioinformatics 
 
Paper does look interesting though!
 
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here we present the “uniqueome”, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data is available for human, mouse, fly, and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data.
Availability: Files, scripts, and supplementary data are available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/
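
Not having gone past the abstract, my rough mental model of a "uniqueome" is a per-position answer to "does the k-mer starting here map to exactly one place in the genome?". A toy version (exact matches only, ignoring the error threshold and colorspace handling of the real resource):

  from collections import Counter

  def unique_fraction(genome, k):
      # fraction of k-mer start positions whose k-mer occurs exactly once
      counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
      flags = [counts[genome[i:i + k]] == 1 for i in range(len(genome) - k + 1)]
      return sum(flags) / float(len(flags))

  print(unique_fraction("ACGTACGTTTACG", 4))   # 0.6 for this toy sequence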

 

 

Pending release of Contrail, Hadoop de novo assembler?

Jermdemo on Twitter

Just noticed the source code for Contrail, the first Hadoop based de-novo assembler, has been uploaded http://bit.ly/96pSbw 26 days ago


Oh the suspense!

Exome Sequencing hints at sporadic mutations as cause for mental retardation.

1st spotted off Genomeweb
NEW YORK (GenomeWeb News) – De novo mutations that spring up in children but are absent in their parents are likely the culprit in many unexplained cases of mental retardation, according to a Dutch team.
Using exome sequencing in 10 parent-child trios, the researchers found nine non-synonymous mutations in children with mental retardation that were not found in their parents, including half a dozen mutations that appear to be pathogenic. The research, which appeared online yesterday in Nature Genetics, hints at an under-appreciated role for sporadic mutations in mental retardation — and underscores the notion that mental retardation can stem from changes in a wide variety of genes.

I think it's fascinating to find so many new mutations and changes in DNA that may affect one's quality of life simply by sequencing the coding regions (and not even all of them, if I may add). The paper also raises the question of whether deleterious sporadic mutations are previously unsuspected culprits for a whole variety of diseases that carry a genetic risk.
It is certainly more likely that such an event will occur in coding regions, but I do not doubt that for some diseases the non-coding regions (which play a regulatory role) might have the same effect. If it were a clear-cut mutation that results in a dysfunctional protein, and there's no redundancy in the system, the system will likely crash; whereas if it were a change in expression levels, it might lead to a slightly wobbly system that just doesn't function as well.

While everyone else is waiting for whole genome sequencing to drop in price, there are groups already publishing with exome data. I think in six months' time we will see more WGS papers coming up... It's an exciting time for genomics!

See the full paper below

A de novo paradigm for mental retardation Nature Genetics | Letter

 

Thursday, 11 November 2010

When a billion is not a billion

http://www.oxforddictionaries.com/page/114

A more erudite colleague pointed this fact out to me when I mentioned that the human genome has 3 billion base pairs...


In British English, a billion used to be equivalent to a million million (i.e. 1,000,000,000,000), while in American English it has always equated to a thousand million (i.e. 1,000,000,000).

2011 Bioinformatics Conferences

Reposted here for the benefit of others..
Original author:
Posted by Dleon on Nov 7, 2010 11:01:06 PM 




With November here, the launch of SOLiD™ BioScope™ software 1.3 is approaching. And as we prepare for the close of 2010, many of us are looking forward to the bioinformatics conferences to come in 2011. Ten years ago there were just a few bioinformatics and computational biology conferences around the world. My favorite was (and still is) ISMB. But now, there is a large variety of meetings related to bioinformatics all over the world -- something almost every month. To help you navigate the bioinformatics conference circuit, I've put together a short list for roughly the first half of 2011:

January 3-7, 2011
Pacific Symposium on Biocomputing (PSB) 2011
Big Island, Hawaii

February 26-28, 2011
2011 International Conference on Bioscience, Biochemistry and Bioinformatics
Singapore

March 7-9, 2011
AMIA: Summit on Translational Bioinformatics
San Francisco, CA

March 14-18, 2011
Sequencing Data Analysis and Interpretation
San Diego, CA

March 28-31, 2011
Research in Computational Molecular Biology (RECOMB)
Vancouver, BC

April 12-14, 2011
BioIT World
Boston, MA

May 22-27, 2011
The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies
Venice, Italy

July 15-19, 2011
Intelligent Systems for Molecular Biology (ISMB) and 10th European Conference on Computational Biology (ECCB)
Vienna, Austria

I won't participate in all of these conferences, but I hope to attend a few -- both to see how people are responding to the challenge of managing massive amounts of next-gen sequencing data, and to learn about new approaches to integrating and analyzing this data.

If you have a preferred bioinformatics conference that's not listed above, please share it in the comments, and we’ll add it.

Bioscope 1.3 is a whopping 6.6 Gb!! Officially released for download

Downloading v1.3 now. Gosh, it is a whopping 6.6 Gb download (270 Mb for v1.21).
Not sure where the bloat comes from; guessing it's example data. Hope the server doesn't crash under the load.
By the way, reason no. 5 for using Bioscope v1.3 sounds quite flaky...


UPDATE: Argh. The md5sums match my download but I got this error:
error [4462069.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)


UPDATE 2: Finally unzipped the 6.6 Gb file on an XP box using 7-Zip (apparently Linux unzip is finicky with files > 4 Gb).
Guess what's inside? Gzipped tar files. Oh what fun to transfer them back to a Linux box!
BioScope-1.3-9.tar.gz           (Regular, application/x-compressed-tar) size 217743781  mode 0744
BioScope-1.3.rBS130-51653_20101021190735.examples.tar.gz                (Regular, application/x-compressed-tar) size 4206422209 mode 0744
BS130-resources.tar.gz          (Regular, application/x-compressed-tar) size 2632156337 mode 0744

UPDATE 3: 
ABI has updated the downloads to a more reasonable 
208Mb Nov 25 04:24 bioscope1.3.1installer_4464106.tar.gz
md5 checksum is b688a8ae7b620d7b2dc7f68c6ca41783
 
Dear Valued Customer,

It is with great pleasure and excitement that I announce the release and immediate availability of BioScope v1.3

BioScope, the modular SOLiD™ data analysis bioinformatics tool, is designed specifically to optimize the accuracy of your SOLiD™ colorspace data.  In addition to streamlining the construction and maintenance of your SOLiD™ pipelines, BioScope provides a simple web interface allowing non command line users the power of running sophisticated NGS data analysis.

SOLiD™ BioScope provides workflow applications including:
  • Improved MaxMapper Mapping and Pairing
  • BFAST integration
  • Improved SAET Accuracy Enhancement
  • Resequencing Pipelines
    • SNP/diBayes
    • Inversion
    • CNV
    • Small Indel
    • Large Indel
  • Whole Transcriptome
  • Fusion Transcript and Splicing Detection
  • Target Resequencing
  • Support for ChIPSeq
  • Support for Methyl Miner
  • Annotation and Reporting
  • Improved BAM file compatibility
  • Improved BioScope™ Users Guide

Additional details can be found at the following blog:

Also attached is an in-depth article about our new Target Resequencing pipeline in BioScope™.


Please coordinate with your IT admin, bioinformatician, lab manager, and PI to have BioScope v1.3 installed at your site.

To get your free copy of SOLiDBioScope please go to:

Please ensure that you have an activated account on solidsoftwaretools.com before downloading.  If you have problems downloading, please contact Rupert.Yip@lifetech.com.

If this is your first time installing BioScope, we strongly recommend working with the BioScope software installation team to ensure a proper installation and configuration of BioScope.  Please contact Rupert.Yip@lifetech.com to inquire about our free BioScope software installation services.

For information BioScope training please contact your local bioinformatics FAS or go to http://learn.appliedbiosystems.com/solid

Tuesday, 9 November 2010

SOLiD™ BioScope™ Software v1.3 releasing soon

v1.3 is due for release soon! How do I know, other than the fact that you can register for v1.3 video tutorials, e.g. SOLiD™ Targeted ReSeq Data Analysis featuring BioScope 1.3 (1 hour)?
The clue comes from new documentation that is being uploaded to solidsoftwaretools.com.


BioScope™ Software v1.3 adds/enhances support for following:
  •     Targeted Resequencing analysis (enrichment statistics and target filtering)
  •     BFAST integration
  •     Annotation, reporting and statistics generation
  •     Methylation analysis
  •     75 bp read length support
  •     Mapping and Pairing speed improvements

It also fixes a long list of bugs; I won't repeat all of them here,
but the important ones are:

  • Bug – Pairing: In BAM file, readPaired and firstOfPair/secondOfPair flags set incorrectly for reads with missing mates.
  • Bug – diBayes: Defunct java processes continue when bioscope exits.
  • Bug – Mapping: When the last batch of the processing has the number of reads less than the value of the key mapping.np.per.node, the ma file contains duplicated entries.
     

Have fun playing with the new version when it's up!
Here are some important notes:


  It is advised that a user runs BioScope using the user’s own user
  account. Then if Control-C is used to interrupt bioscope.sh which
  spawns many other processes, user can use following OS commands
  to find the pid of the left-over processes, and clean them up.
  ps -efl | grep bioscope.sh | grep username
  ps -efl | grep java_app.sh | grep username
  ps -efl | grep map | grep username
  ps -efl | grep java | grep username
  ps -efl | grep mapreads | grep username
  ps -efl | grep pairing | grep username
  kill -9 PID


Oh, but I would use the "grep java" line with care, since combined with kill -9 it basically takes out every process with "java" in its name.

My suggestion to the team is to keep the PIDs of launched processes in a database table instead of depending on non-unique process names. Ensembl's pipeline uses Perl with less overhead to track jobs, and it is much cleaner to clean up.
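
A minimal sketch of what I mean (hypothetical code, not anything in Bioscope): record every spawned PID in an SQLite table at launch time, then clean up from the table instead of grepping for non-unique process names.

  import os, signal, sqlite3, subprocess

  db = sqlite3.connect("bioscope_jobs.db")
  db.execute("CREATE TABLE IF NOT EXISTS jobs (pid INTEGER, cmd TEXT)")

  def launch(cmd):
      # start a pipeline step and remember its PID
      proc = subprocess.Popen(cmd)
      db.execute("INSERT INTO jobs VALUES (?, ?)", (proc.pid, " ".join(cmd)))
      db.commit()
      return proc

  def cleanup():
      # kill only the processes we actually launched, then forget them
      for pid, cmd in db.execute("SELECT pid, cmd FROM jobs"):
          try:
              os.kill(pid, signal.SIGTERM)
          except OSError:
              pass   # already gone
      db.execute("DELETE FROM jobs")
      db.commit()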

Monday, 8 November 2010

the exponential increase in 'novel' findings

This is hilarious!

Neil Saunders, blogger at What You're Doing is Rather Desperate, posted a photo to his Twitter account last week with the newspaper-style headline style caption: "Findings Increasingly Novel, Scientists Say," which he says is meant to be a "tongue-in-cheek" look at the use of the word "novel" in the titles of papers indexed in PubMed.


Read full article at genomeweb

and Saunders' blog post

Trimming adaptor seq in colorspace (SOLiD)

Needed to do research on small RNA-seq using SOLiD.
Wasn't clear on the adaptor trimming procedure (it's dead easy for base-space FASTQ files, but oh well, SOLiD has directionality, and read lengths don't really matter for small RNA).

novoalign suggests the use of cutadapt as a colorspace adaptor trimming tool;
I was going to script one in Python if it didn't exist.
Check their wiki page.
Check their wiki page

Sadly on CentOS I most probably will get this

If you get this error:
   File "./cutadapt", line 62
    print("# There are %7d sequences in this data set." % stats.n, file=outfile)
                                                                       ^
SyntaxError: invalid syntax
Then your Python is too old. At least Python 2.6 is needed for cutadapt.

Have to dig up how to run two versions of Python side by side on a CentOS box.. 

At ASHG, Ion Torrent Drums Up Interest; Provides Preliminary Specs for PGM

WASHINGTON, DC – Ion Torrent revealed some preliminary specs for its Personal Genome Sequencer, due to be launched later this year, as the Life Technologies business unit presented the instrument to potential customers at its booth at the American Society for Human Genetics meeting this week. The speed of the instrument — a run takes approximately two hours, and several runs can be performed in a day — is what appears to be most attractive to potential customers, Maneesh Jain, Ion Torrent's vice president of marketing and business development, told In Sequence.
The first version of the PGM will sell for $49,500, plus a $16,500 server to analyze the data.
Initially, the machine will produce about 10 megabases of data per run, or about 100,000 reads of 100 base pairs each, using the so-called 314 chip, which has about 1.5 million wells and will cost $250. Reagent kits for template preparation, library preparation, and sequencing will cost another $250, bringing the total consumables cost per run to approximately $500.
In the first half of 2011, Ion Torrent plans to launch the 316 chip, with about 6 million wells, which will increase the output per run to 100 megabases and which will cost about twice as much as the 314. Additional chip upgrades will follow, with details to be revealed next year.
Sample prep, which Jain said takes about a day and can be done in batches of six to eight samples, requires an emulsion PCR protocol, which will be simplified over time. "We focused on the sequencing initially," he said, adding that the next step will be to optimize the sample prep. Life Technologies said previously that sample prep for the PGM would eventually be able to use the EZ Bead system, which was originally developed for the SOLiD system.
Read full article here
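
Doing the quick consumables arithmetic on those launch specs:

  chip_cost     = 250    # 314 chip, per the launch specs
  reagent_cost  = 250    # template prep, library prep and sequencing kits
  run_output_mb = 10     # ~10 megabases per run at launch

  print((chip_cost + reagent_cost) / float(run_output_mb))   # ~$50 per raw megabase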

Wednesday, 3 November 2010

Life Technologies Launches New SOLiD Sequencer to Drive Advances in Cancer Biology and Genetic Disease Research

It's official! The web is crawling with the news reports. Read their press release here 
My previous coverage on the preview launch is here
There's a discussion in the seqanswers forum on the new machine.

The Life Tech cmsXXX.pdf files with the useful specs are out too; you can Google them or search on the website.
The specs 
solid.appliedbiosystems.com/solid5500 

Monday, 1 November 2010

NIH has 4 gene patents!

My voracious reading has led me to a blog post describing the recent events around gene patents.
Interesting snippets include:
NIH holding 4 gene patents (hmmm, I wonder what those are..)

For years, the U.S. Patent Office has taken the position that extracted genes, or “isolated DNA,” can be patented. And, in fact, it has issued thousands of patents on human genes, with perhaps one of every five human genes now under patent.  Patent rights to a gene, of course, give the owner the exclusive right to study, test and experiment on the gene to see how its natural characteristics work.
It has been more than 20 years since the Patent Office began approving patents for human genes in the form of “isolated” DNA.  Prior to that, the Office had issued patents for synthetic DNA, but then moved on to grant monopoly rights on the natural material when extracted directly from the body and not modified.  The Obama Administration, in the brief it filed late Friday in the Federal Circuit, is not challenging patents on synthetic DNA, or on the process of extracting DNA, but only on unmodified genes themselves.
The Patent Office’s long-running approach to genetic patents was challenged in a lawsuit filed in May 2009 by the American Civil Liberties Union and the Public Patent Foundation, contending that locking up genes in the monopoly rights of a patent would inhibit research by other scientists on diseases that might be flagged by the coding or mutations of the genes.   The lawsuit targeted both the Patent Office and Myriad Genetics, specifically because of patents that company was issued on human genes that have been labeled “BRCA1” and “BRCA2.”  Mutations of those two genes are associated with significantly higher risks of breast cancer and ovarian cancer. .....



"U.S. government is actually the co-owner of four of the seven patents that are involved in the case.  It has granted Myriad an exclusive license under those patents — contrary, it said, to NIH’s usual practice of not granting exclusive licenses under DNA patents for “diagnostic applications.”  In the past, NIH and other government agencies have sought and obtained patents for human genes in the form of “isolated genomic DNA,” according to the brief.
The brief did not say which claims under the four patents co-owned by NIH would be invalid under its theory of patentability."

Thursday, 28 October 2010

1000 Genomes Pilot Paper Published

1000 Genomes Pilot Paper Published

27 OCTOBER 2010
The 1000 Genomes Project Consortium has published the results of the pilot project analysis in the journal Nature in an article appearing online today. The paper A map of human genome variation from population-scale sequencing is available from the Nature web site and is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence to ensure wide distribution. Please share our paper appropriately.

Wednesday, 27 October 2010

de novo assembly of large genomes

Here's an informative post by Ewan Birney on the velvet-users list about de novo assembly of large genomes.

Velvet's algorithms in theory work for any size. However, the engineering aspects
of Velvet, in particular memory consumption, means it's unable to handle read sets
of a particular size. This of course depends on how big a real memory machine
you have.

I know we have "routinely" (ie, for multiple strains) done Drosophila sized genomes
(~120MB) on a 125GB machine.

I've heard of Velvet being used into the 200-300MB region, but rarely further. Memory
size is not just about the size of the genome but also how error prone your reads
are (though sheer size is important).


Beyond this there are a variety of strategies:

  "Raw" de Bruijn graphs, without a tremendously aggressive use of read pairs can
be made using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published,
well understood, from the BC genome centre).

   Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a
smart partition of the reads given an initial de Bruijn graph, run Velvet on the paritions
and thus provide an improved more read-pair aware graph. This can be iterated and in
at least some cases, the Curtain approach gets close to what Velvet can produce alone
(in the scenarios where Velvet can be run on a single memory machine to understand
Curtain's performance)


   SOAP de novo from the BGI is responsible for a number of the published assemblies
(eg, Panda, YH) although like many assemblers, tuning it seems quite hard, and I would
definitely be asking the BGI guys for advice.

   A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.

In all above the cases I know of successes, but also quite a few failures, and untangling
data quality/algorithm/choice of parameters/running bugs is really complex. So - whereas
assemblies < 100MB are "routine", currently assemblies 100MB-500MB are "challenging" and
>500MB are theoretically doable, and have been done by specific groups, but I think still
are at the leading edge of development and one should not be confident of success for
"any particular genome".


Thanks Ewan for letting me reproduce his post here


Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users



Cortex seems very promising for de novo assembly of human reads using reasonable amounts of RAM (128 GB), based on the mailing list. I know I will be watching out for it on SourceForge!

NGS alignment viewers, assembly viewers - warning: pretty graphics ahead

Chanced upon this review of the visualization tools out there that can help you make biologists understand what you just did with their NGS data. I think "assembly viewers" is the best name for this category of tools, since many of them support not only BAM but other assembly formats as well.
My vote goes to BamView for the most creative name, but for pretty visuals I have to agree with the author that Tablet takes the cake. Tablet DOES have the least obvious-sounding name, which makes it difficult to find on Google.

Tophat adds support for strand-specific RNA-Seq alignment and colorspace

Hooray!
Testing TopHat 1.1.2 now (up from 1.1.1).

On an 8 Gb RAM CentOS box, I managed to align 1 million reads to hg18 in 33 mins and 2 million reads in 59 mins, using 4 threads.
Nice scalability! But it was slower than I was used to with Bowtie. I kept killing my full set of 90 million reads thinking there was something wrong. Guess I need to be more patient and wait for 45 hours.
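
The naive linear extrapolation that convinced me to just let it run (a rough sketch, assuming runtime scales linearly with read count):

  minutes_per_million = 59 / 2.0      # from the 2-million-read test run
  total_reads_million = 90
  hours = total_reads_million * minutes_per_million / 60
  print(hours)                        # ~44 hours, so "wait 45 hours" is about right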

I do wonder if the process can be mapped to separate nodes to speed up.

Tuesday, 26 October 2010

Throwing the baby out with the bathwater: Non-Synonymous and Synonymous Coding SNPs Show Similar Likelihood and Effect Size of Human Disease Association

I was literally having an 'oh shoot' moment when I saw this news in GenomeWeb

Synonymous SNPs Shouldn't Be Discounted in Disease, Study Finds

NEW YORK (GenomeWeb News) – Synonymous SNPs that don't change the amino acid sequence encoded by a gene appear just as likely to influence human disease as non-synonymous SNPs that do, according to a paper appearing online recently in PLoS ONE by researchers from Stanford University and the Lucile Packard Children's Hospital.

from the abstract of the paper
The enrichment of disease-associated SNPs around the 80th base in the first introns might provide an effective way to prioritize intronic SNPs for functional studies. We further found that the likelihood of disease association was positively associated with the effect size across different types of SNPs, and SNPs in the 3′untranslated regions, such as the microRNA binding sites, might be under-investigated. Our results suggest that sSNPs are just as likely to be involved in disease mechanisms, so we recommend that sSNPs discovered from GWAS should also be examined with functional studies.


Hmmmm how is this going to affect your carefully crafted pipeline now? 

Monday, 25 October 2010

AB on Ion Torrent

There was a brief mention of Ion Torrent at the 5500 presentation as well, but nothing of great significance. I do wish the marketing fellows would push Ion Torrent out faster, but I think they are trying to streamline production by testing whether Invitrogen kits can replace the ones at Ion Torrent. I do hope they do not sacrifice compatibility for performance.

For Bioscope, they are going to include base space support (hurray?) presumably so that they can use the same pipeline for analysis of their SMS and Ion Torrent technologies. 

Stay Tuned!

AB releases 4 HQ and PI as 5500xl and 5500 SOLiD

Was lucky to be part of the first group to view the specs and info on the new SOLiD 4 hq.
For reasons unknown,
they have renamed it to the 5500xl and 5500 SOLiD systems, which are your familiar 4 HQ and PI.
Or if you prefer formulas:
5500xl = 4 hq
5500 = PI

One can only fathom their obsession with these 4 digits judging by similarly named instruments,
the AB Sciex Triple Quad 5500 and the AB Sciex QTrap 5500.



Honestly the 5500 numbers are of no numerical significance AFAIK.


Looks-wise, both resemble the PI system.
I DO NOT see the computer cluster anymore; that's something I am curious about. 

Finally we are at 75 bp though. 
Of notable importance, there is a new Exact Call Chemistry (ECC) module which promises 99.99% accuracy; it is optional as it increases the run time. 
The new SOLiD system is co-developed with Hitachi High-Technologies. 
Instead of the familiar slides, they use 'flowchips' now, with 6 individual lanes to allow for more mixing of samples of different reads. 
For the 5500xl:
throughput per day is 20-30 Gb 
per run you have 180 Gb or 2.8 billion tags (paired ends or mate pairs)


Contrary to most rumours, the 5500xl is upgradeable from the SOLiD 4, although I suspect it is a trade-in program. No mention about the 5500 (which I guess is basically a downgrade).


The specs should be up soon
solid.appliedbiosystems.com/solid5500 




Update from seqanswers from truthseqr 
http://seqanswers.com/forums/showthread.php?t=6761&goto=newpost

Here is the message that has just been posted:
***************
AB is premiering two new instruments at ASHG next week.

Mobile ASHG calendar: http://m.appliedbiosystems.com/ashg/ (http://solid.community.appliedbiosystems.com/)

Twitter account: @SOLiDSequencing (http://twitter.com/SOLiDSequencing)

SOLiD Community: http://solid.community.appliedbiosystems.com/

More info soon at: solid.appliedbiosystems.com/solid5500/ (http://solid.appliedbiosystems.com/solid5500)

Tuesday, 19 October 2010

After a Decade, JGI Retires the Last of Its Sanger Sequencers

After a Decade, JGI Retires the Last of Its Sanger Sequencers

Time and tide wait for no man ...

Microsoft Biology Tools

http://research.microsoft.com/en-us/projects/bio/mbt.aspx

The Microsoft Biology Tools (MBT) are a collection of tools that enable biology and bioinformatics researchers to be more productive in making scientific discoveries. Some of the tools provided here take advantage of the capabilities of the Microsoft Biology Foundation, and are good examples of how you can use MBF to create other tools.


When I last visited the site, no tools were available, and I didn't realise it had finally been released. I will definitely try some of these tools just to satisfy curiosity, but I doubt they will find widespread usage; that's just a personal opinion. 


Stuff that caught my eye
BL!P: BLAST in Pivot
False Discovery Rate
Microsoft Research Sequence Assembler
SIGMA: Large Scale Machine Learning Toolkit

Do post comments here if you have tested any of the tools and found them useful!

Tuesday, 12 October 2010

Human Whole genome sequencing at 11x coverage

http://genomebiology.com/2010/11/9/R91

Just saw this paper, Sequencing and analysis of an Irish human genome. AFAIK WGS is usually done at 30x coverage. In this paper, the authors “describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information.” I thought it was pretty good considering that they had 99.3% of the reference genome covered at 10.6x coverage. That leaves only about 21 Mbases missing..
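
Quick sanity check on those numbers (a rough calculation, taking the human reference as ~3.1 Gb):

  genome_size = 3.1e9    # approximate human reference size (assumed)
  total_bases = 32.9e9   # sequence generated, from the paper
  covered     = 0.993    # fraction of the reference covered by at least one read

  print(total_bases / genome_size)              # ~10.6x average coverage
  print((1 - covered) * genome_size / 1e6)      # ~21-22 Mb of the reference uncovered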

For those interested in the tech details

Four single-end and five paired-end DNA libraries were generated and sequenced using a GAII Illumina Genome Analyzer. The read lengths of the single-end libraries were 36, 42, 45 and 100 bp and those of the paired end were 36, 40, 76, and 80 bp, with the span sizes of the paired-end libraries ranging from 300 to 550 bp (± 35 bp). In total, 32.9 gigabases of sequence were generated (Table 1). Ninety-one percent of the reads mapped to a unique position in the reference genome (build 36.1) and in total 99.3% of the bases in the reference genome were covered by at least one read, resulting in an average 10.6-fold coverage of the genome.
...
At 11-fold genome coverage, approximately 99.3% of the reference genome was covered and more than 3 million SNPs were detected, of which 13% were novel and may include specific markers of Irish ancestry.

Bio-Hackers not the IT kind

http://www.nature.com/news/2010/101006/full/467650a.html

Amateur hobbyists are creating home-brew molecular-biology labs, but can they ferment a revolution?

Rob Carlson's path to becoming a biohacker began with a chance encounter on the train in 1996. Carlson, a physics PhD student at the time, was travelling to New York to find a journal article that wasn't available at his home institution, Princeton University in New Jersey. He found himself sitting next to an inquisitive elderly gentlemen. Carlson told him about his thesis research on the effects of physical forces on blood cells, and at the end of the journey, the stranger made him an offer. "You should come work for me," said the man, "I'm Dr Sydney Brenner." The name meant little to Carlson, who says he thought: "Yeah, OK. Whatever, 'Dr Sydney Brenner.'"

Cool!

12 Geneticists unzip their genomes in full public view

A GROUP of 12 genetics experts will expose their DNA to public view today to challenge the common view that such information is so private and sensitive that it should not be widely shared. 

The "DNA dozen" will publish full results of their own genetic tests, including implications for their health, in a controversial initiative to explain the significance of the human genome for medicine and society.
The Genomes Unzipped project aims to demystify the genetic code, showing what it can and cannot reveal about individuals' health and allaying fears about discrimination and privacy.
The participants - 11 British-based scientists and an American genetics lawyer - hope to encourage many more people to share details of their genomes with researchers. This would allow the creation of open-access DNA databases that any scientist could use, enabling a "wisdom of crowds" approach to research that will accelerate discoveries about genetics and health.

Discarding quality scores for small RNA analysis.

Going through CLC bio's small RNA analysis using Illumina data.
Got to this part:
Make sure the Discard quality scores and Discard read names checkboxes are checked. Information about quality scores and read names are not used
in this analysis anyway, so it will just take up disk space when importing the data.

Which got me thinking. The reads are short to begin with, and I would expect that more information is always better. But in some cases, I guess, having a second metric is confusing, when there is
1) sequencing error
2) a bona fide SNP
3) relatively low quality scores anyway (how would one weight the sequence quality in a fair way?)


I believe CLC bio uses the BWT (Burrows-Wheeler transform) to index and compress the genomes to be searched; I am curious how they differ from BWA and Bowtie though.
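
For anyone hazy on what the BWT actually does, here is the textbook toy construction (real indexers like BWA and Bowtie build it far more efficiently via suffix arrays and wrap it in an FM-index, but the output is the same idea):

  def bwt(text):
      # naive Burrows-Wheeler transform: sort all rotations, take the last column
      text = text + "$"   # unique end-of-string marker
      rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
      return "".join(rot[-1] for rot in rotations)

  print(bwt("ACGTACGT"))   # -> 'TT$AACCGG': identical characters get grouped,
                           #    which is what makes the transform compress well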

SRMA: tool for Improved variant discovery through local re-alignment of short-read next-generation sequencing data

Have a look at this tool: http://genomebiology.com/2010/11/10/R99/abstract
It is a realigner for NGS reads that doesn't use a lot of RAM. Not too sure how it compares to GATK's local realignment around indels, as that is not mentioned, but the authors used reads aligned with the popular BWA or BFAST as input (Bowtie was left out though).

Excerpted
 SRMA was able to improve the ultimate variant calling using a variety of measures on the simulated data from two different popular aligners (BWA and BFAST). These aligners were selected based on their sensitivity to insertions and deletions (BFAST and BWA), since a property of SRMA is that it produces a better consensus around indel positions. The initial alignments from BFAST allow local SRMA re-alignment using the original color sequence and qualities to be assessed as BFAST retains this color space information. This further reduces the bias towards calling the reference allele at SNP positions in ABI SOLiD data, and reduces the false discovery rate of new variants. Thus, local re-alignment is a powerful approach to improving genomic sequencing with next generation sequencing technologies.  The alignments to the reference genome were implicitly split into 1Mb regions and processed in parallel on a large computer cluster; the re-alignments from each region were then merged in a hierarchical fashion. This allows for the utilization of multi-core computers, with one re-alignment per core, as well as parallelization across a computer cluster or a cloud.  The average peak memory utilization per process was 876Mb (on a single-core), with a maximum peak memory utilization of 1.25GB. On average, each 1Mb region required approximately 2.58 minutes to complete, requiring approximately 86.17 hours total running time for the whole U87MG genome. SRMA also supports re-alignment within user-specified regions for efficiency, so that only regions of interest need to be re-aligned. This is particularly useful for exome-sequencing or targeted re-sequencing data.
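
The "split into 1 Mb regions, realign in parallel, merge" idea from the excerpt is easy to picture; a hypothetical sketch of generating the region strings (samtools-style chr:start-end, with made-up chromosome names and lengths):

  def regions(chrom_lengths, window=1000000):
      # yield region strings tiling each chromosome in 1 Mb windows
      for chrom, length in chrom_lengths.items():
          for start in range(0, length, window):
              end = min(start + window, length)
              yield "%s:%d-%d" % (chrom, start + 1, end)

  # each region could be realigned as an independent job (one per core or
  # cluster node) and the per-region results merged afterwards
  for region in regions({"chr21": 46944323, "chr22": 49691432}):
      print(region)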

Monday, 11 October 2010

What's your BRCA status? Personal Genomics testing.

Do-it-yourself genetic testing
   How to test your BRCA status and why we need to prepare for the personal genomics age.
Genome Biology 2010, 11:404

Interesting read covering issues on personal genomics. Did you know that “the BRCA gene patents, which are held by Myriad Genetics, cover all known cancer-causing mutations in addition to those that might be discovered in the future.” How did that one slip through the patent office?? Not that it really matters “Currently Myriad charges more than $3000 for its tests on the BRCA genes, while sequencing one's entire genome now costs less than $20,000. Furthermore, once an individual's genome has been sequenced, it becomes a resource that can be re-tested as new disease-causing mutations are discovered. “


“Regardless of how easy it might be to test for mutations, the restrictive nature of the BRCA gene patents means that anyone wishing to examine any mutation in BRCA1 or BRCA2 will have to obtain permission from the patent holder Myriad Genetics. This restriction applies even if testing your own genome. If you wanted to look at other genes, you would have to pay license fees for any of them that were protected by patents. In practice, although it may seem absurd, this means that before scanning your own genome sequence, you might be required by law to pay thousands of license fees to multiple patent holders. “


This is complete hogwash! (The concept that I have to pay genome-squatters (see cybersquatters) for parts of the human genome... I would much rather pay for real estate on the moon!)



related posts
US clinics quietly embrace whole-genome sequencing @ Nature News
Commentary on Personal Genomics testing

Genome sizes

Compendium of links for gauging genome sizes (useful for calculating genome coverage)

My favourite diagram for getting a relative feel for genome sizes
is from Molecular Biology of the Cell,
for which the image is reproduced here: Figure 1-38, Genome sizes compared

There's other sources like

Wikipedia http://en.wikipedia.org/wiki/Genome_size
also see Comparison of different genome sizes

DOGS - Database Of Genome Sizes last modified Tuesday 16th of May 2006


Google images search for "genome sizes" for other lucky finds.



Molecular Biology of the Cell, Fourth Edition

Installing SOLiD™ System de novo Accessory Tools 2.0 with Velvet and MUMmer

How to install on CentOS 5.4:

 wget http://solidsoftwaretools.com/gf/project/denovo/ #just to keep a record
 wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_0.7.55.tgz
 wget http://downloads.sourceforge.net/project/mummer/mummer/3.22/MUMmer3.22.tar.gz
 tar zxvf denovo2.tgz
 cp velvet_0.7.55.tgz denovo2 #you can use mv if you don’t mind downloading again
 cp MUMmer3.22.tar.gz denovo2
 cd denovo2
 tar zxvf velvet_0.7.55.tgz
 tar zxvf MUMmer3.22.tar.gz
 cd MUMmer3.22/src/kurtz #this was the part where I deviated from instructions
 gmake mummer #Might be redundant but running gmake at root dir gave no binary
 gmake |tee gmake-install.log
Next step:
download the example data to run through the pipeline
http://solidsoftwaretools.com/gf/project/ecoli50x50/
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.qual.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.qual.gz

Description
This is a 50X50 Mate pair library from DH10B produced by SOLiD™ system. The set includes .csfasta and .qual files for F3 and R3. The insert size of the library is 1300bp and it is about 600X coverage of the DH10B genome. The results from MP library in the de novo documents are generated from this dataset.



YMMV



pitfalls for SAET for de novo assembly

Spotted in manual for

SOLiD™ System de novo Accessory Tools 2.0


Usage of pre-assembly error correction: This is an optional tool which was
demonstrated to increase contigs length in de novo assembly by factor of 2 to 3. Do not use this tool if coverage is less than 20x. Overcorrection and under-correction are equally bad for de novo assembly; therefore use balanced number of local and global rounds of error correction. For example, the pipeline will use 1 global and 3 local rounds if reads are 25bp long, and 2 global and 5 local rounds if reads are 50bp long.


Is it just me? I would think it is trivial to implement the correction tool to correct only when the coverage is > 20x. Not sure why you would need human intervention.
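
What I have in mind is nothing fancier than a guard like this (hypothetical sketch with made-up names; for a true de novo project the genome size would itself be an estimate):

  def should_run_saet(total_read_bases, est_genome_size, min_coverage=20):
      # skip pre-assembly error correction below 20x, as the manual advises
      coverage = total_read_bases / float(est_genome_size)
      return coverage >= min_coverage

  # e.g. the DH10B example data set: ~600x coverage, so correction stays enabled
  print(should_run_saet(total_read_bases=2.8e9, est_genome_size=4.7e6))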

Friday, 8 October 2010

small RNA analysis for SOLiD data

SOLiD™ System Small RNA Analysis Pipeline Tool (RNA2MAP) is being released as "unsupported software" by Applied Biosystems.
see http://solidsoftwaretools.com/gf/project/rna2map/

It failed for me at simply producing the silly PBS scripts to run the analysis. I was advised to try running it on another Linux server, on the off chance it works by dumb luck :(
I found example scripts, but the documentation is brief. Not sure if it's worth the hassle to debug the collection of Perl scripts, or to manually edit the params in the PBS submission scripts, for a tool that is not commonly used.

How are you analysing your small RNA SOLiD reads? Any software to recommend?
http://seqanswers.com/wiki/MirTools is for 454 and Illumina;
Partek Genomics Suite is commercial.

The other two listed at the seqanswers wiki
don't seem to be for SOLiD either.

Re-Defining Storage for the Next Generation

Do have a go at this article citing David Dooling, assistant director of informatics at The Genome Center at Washington University, and a few others.
Looking ahead, as genome sequencing costs drop, there's going to be more data generated than ever. And the article rightly states that every postdoc with a different analysis method will have a copy of a canonical dataset. Personally, I think this is a call for tools and data to be moved to the cloud.
Getting the data up there in the first place is a choke point,
but using the cloud will most definitely force everyone to use only a single copy of shared data.
Google solved the problem of tackling large datasets with slow interconnects with the MapReduce paradigm.
There are tools available that make use of this already, but they are not popular yet. I still get weird stares when I tell people about Hadoop filesystems. Sigh. More education is needed!

In summary, my take on the matter would be to have a local Hadoop FS for storing data with redundancy for analysis, and to move a copy of the data to your favourite cloud for archival, sharing, and possibly data analysis as well (MapReduce is available on Amazon too).
Another issue is whether researchers are keeping data out of sentimentality or because there's a real scientific need.
I have kept copies of my PAGE gel scans from my BSc days archived in some place where the sun doesn't shine, but honestly, I can't foresee myself going back to the data. Part of the reason I kept them was that I spent a lot of time and effort to get them.
Storage, large datasets, and computational needs are not new problems for the world. They are new to biologists, however. I am afraid that because of miscommunication, a lot of researchers out there are going to rush to overspec their cluster and storage when the money could be better spent on sequencing. I am sure that would make some IT vendors very happy though, especially in this financial downturn for IT companies.

I don't know if I am missing anything though.. comments welcome!

Datanami, Woe be me