Tuesday, 30 November 2010

Why can't BioScope / mapreads write to BAM natively?

Spotted this small fact in the BioScope 1.3.1 release notes:

There is significant disk space required for converting ma to BAM
  when the option output.filter=none is used, which roughly needs
  2TB peak disk space for converting a 500 million reads ma file.
  Other options do not need such large peak disk space. The disk
  space required per node is smaller if more jobs are dispatched to
  more nodes.

I would love to see the calculation behind that 2 TB figure. I am glad that they moved to BAM in the BioScope workflow, but I am not sure why they keep the .ma file format when they are the only ones using it.
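
I don't have their calculation, but a back-of-envelope check is easy. This is only a sketch: it divides the quoted peak by the quoted read count, and it assumes decimal units (1 TB = 10^12 bytes), which the release notes do not specify.

```python
# Implied peak disk usage per read, using only the figures in the release notes.
# Assumption: decimal terabytes (1 TB = 10**12 bytes); the notes do not say.
PEAK_DISK_BYTES = 2 * 10**12   # "roughly needs 2TB peak disk space"
READS = 500 * 10**6            # "a 500 million reads ma file"

bytes_per_read = PEAK_DISK_BYTES / READS
print(f"{bytes_per_read:.0f} bytes of peak disk per read")
```

That works out to roughly 4 kB of peak disk per read.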

Card Trick Leads to New Bound on Data Compression - Technology Review

Here's a card trick to impress your friends. Give a deck of cards to a pal and ask him or her to cut the deck, draw six cards and list their colours. You then immediately name the cards that have been drawn.
Magic? Not quite. Instead, it's the next best thing: mathematics. The key is to arrange the deck in advance so that the sequence of the card colours follows a specific pattern called a binary De Bruijn cycle. A De Bruijn sequence is a set from an alphabet in which every possible subsequence appears exactly once.
So when a deck of cards meets this criteria, it uniquely defines any sequences of six consecutive cards. All you have to do to perform the trick is memorise the sequences.
Usually these kinds of tricks come about as the result of some new development in mathematical thinking. Today, Travis Gagie from the University of Chile in Santiago turns the tables. He says that this trick has led him to a new mathematical bound on data compression....

Neat!! I love how maths integrates with life.
I wonder how this will be used five years down the road.

The actual paper is here
Ref: arxiv.org/abs/1011.4609: Bounds from a Card Trick
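
To see why the trick works, here is a minimal sketch that builds a binary De Bruijn cycle and checks that every window of six colours occurs exactly once. It uses the standard Lyndon-word construction, not anything taken from the paper, and the function names are my own. Note that a full binary cycle of order 6 has 2^6 = 64 positions, more than a 52-card deck, so this illustrates the mathematics rather than a literal deck arrangement.

```python
def de_bruijn_binary(n):
    """Return a binary De Bruijn cycle of order n as a string of '0'/'1'."""
    k = 2  # alphabet size: two colours
    a = [0] * (k * n)
    cycle = []

    def db(t, p):
        # Standard recursive construction via concatenated Lyndon words.
        if t > n:
            if n % p == 0:
                cycle.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(str(bit) for bit in cycle)


def window_lookup(cycle, n):
    """Map each cyclic window of n colours to the position where it starts."""
    wrapped = cycle + cycle[:n - 1]
    return {wrapped[i:i + n]: i for i in range(len(cycle))}


ORDER = 6
cycle = de_bruijn_binary(ORDER)
lookup = window_lookup(cycle, ORDER)

# Every one of the 2**6 colour patterns occurs exactly once around the cycle.
assert len(cycle) == 2 ** ORDER
assert len(lookup) == 2 ** ORDER

# "Cut the deck" at an arbitrary point and read off six colours (0 = black, 1 = red).
cut = 37
drawn = (cycle + cycle)[cut:cut + ORDER]
print("colours", drawn, "-> start position", lookup[drawn])
```

Because each six-colour window is unique, the colours alone pin down where the deck was cut, and hence which cards were drawn.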

Tuesday, 16 November 2010

Uniqueome a uniquely ... omics word

Spotted this post on the Tree of Life blog

Another good paper, but bad omics word of the day: uniqueome

From "The Uniqueome: A mappability resource for short-tag sequencing" by Ryan Koehler, Hadar Issac, Nicole Cloonan, and Sean M. Grimmond. Bioinformatics (2010) doi: 10.1093/bioinformatics
Paper does look interesting though!
Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here we present the “uniqueome”, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data is available for human, mouse, fly, and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data.
Availability: Files, scripts, and supplementary data is available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/



Pending release of Contrail, Hadoop de novo assembler?

Jermdemo on Twitter

Just noticed the source code for Contrail, the first Hadoop based de-novo assembler, has been uploaded http://bit.ly/96pSbw 26 days ago

Oh the suspense!

Exome Sequencing hints at sporadic mutations as cause for mental retardation.

First spotted on GenomeWeb:
NEW YORK (GenomeWeb News) – De novo mutations that spring up in children but are absent in their parents are likely the culprit in many unexplained cases of mental retardation, according to a Dutch team.
Using exome sequencing in 10 parent-child trios, the researchers found nine non-synonymous mutations in children with mental retardation that were not found in their parents, including half a dozen mutations that appear to be pathogenic. The research, which appeared online yesterday in Nature Genetics, hints at an under-appreciated role for sporadic mutations in mental retardation — and underscores the notion that mental retardation can stem from changes in a wide variety of genes.

I think it's fascinating to find so many new mutations and changes in DNA that may affect one's quality of life simply by sequencing the coding regions (and not all of them, if I may add). This paper is fascinating as it raises the question of whether deleterious sporadic mutations might be the culprits behind a whole variety of diseases that carry a genetic risk.
Such an event is certainly more likely to occur in coding regions, but I do not doubt that for some diseases the non-coding regions (which play a regulatory role) might have the same effect. If a clear-cut mutation results in a dysfunctional protein and there is no redundancy in the system, the system will likely crash, whereas a change in expression levels might lead to a slightly wobbly system that just doesn't function as well.

While everyone else is waiting for Whole Genome Sequencing to drop in price, there are groups already publishing with exome data. I think in six months' time we will see more WGS papers coming out. It's an exciting time for genomics!

See the full paper below

A de novo paradigm for mental retardation Nature Genetics | Letter


Thursday, 11 November 2010

When a billion is not a billion


A more erudite colleague pointed this fact out to me when I mentioned that the human genome has 3 billion base pairs...

In British English, a billion used to be equivalent to a million million (i.e. 1,000,000,000,000), while in American English it has always equated to a thousand million (i.e. 1,000,000,000).

2011 Bioinformatics Conferences

Reposting here for the benefit of others.
Original author: Dleon, posted on Nov 7, 2010 11:01:06 PM

With November here, the launch of SOLiD™ BioScope™ software 1.3 is approaching. And as we prepare for the close of 2010, many of us are looking forward to the bioinformatics conferences to come in 2011. Ten years ago there were just a few bioinformatics and computational biology conferences around the world. My favorite was (and still is) ISMB. But now, there is a large variety of meetings related to bioinformatics all over the world -- something almost every month. To help you navigate the bioinformatics conference circuit, I've put together a short list for roughly the first half of 2011:

January 3-7, 2011
Pacific Symposium on Biocomputing (PSB) 2011
Big Island, Hawaii

February 26-28, 2011
2011 International Conference on Bioscience, Biochemistry and Bioinformatics

March 7-9, 2011
AMIA: Summit on Translational Bioinformatics
San Francisco, CA

March 14-18, 2011
Sequencing Data Analysis and Interpretation
San Diego, CA

March 28-31, 2011
Research in Computational Molecular Biology (RECOMB)
Vancouver, BC

April 12-14, 2011
BioIT World
Boston, MA

May 22-27, 2011
The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies
Venice, Italy

July 15-19, 2011
Intelligent Systems for Molecular Biology (ISMB) and 10th European Conference on Computational Biology (ECCB)
Vienna, Austria

I won't participate in all of these conferences, but I hope to attend a few -- both to see how people are responding to the challenge of managing massive amounts of next-gen sequencing data, and to learn about new approaches to integrating and analyzing this data.

If you have a preferred bioinformatics conference that's not listed above, please share it in the comments, and we’ll add it.

BioScope 1.3 is a whopping 6.6 GB!! Officially released for download

Downloading v1.3 now. Gosh, it is a whopping 6.6 GB download (270 MB for v1.21).
Not sure where the bloat comes from. I'm guessing it's example data; I hope the server doesn't crash under the load.
btw reason no. 5 for using Bioscope v 1.3 sounds quite flaky...

UPDATE: Argh. the md5sums match my download but I got this error
error [4462069.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

UPDATE2: Finally unzipped the 6.6 GB file on an XP box using 7-Zip (apparently Linux unzip is finicky with files > 4 GB).
Guess what's inside? Gzipped tarballs. Oh, what fun to transfer them back to a Linux box!
BioScope-1.3-9.tar.gz           (Regular, application/x-compressed-tar) size 217743781  mode 0744
BioScope-1.3.rBS130-51653_20101021190735.examples.tar.gz                (Regular, application/x-compressed-tar) size 4206422209 mode 0744
BS130-resources.tar.gz          (Regular, application/x-compressed-tar) size 2632156337 mode 0744
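
For what it's worth, the md5 check from the first update can be scripted. This is a minimal sketch with helper names of my own; it reads the file in chunks so that a multi-gigabyte download does not have to fit in memory. A matching checksum only shows that the transfer was intact; it says nothing about whether a given unzip build can read an archive larger than 4 GB.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's MD5 with a published checksum, ignoring case and whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Usage would look like matches_published("4462069.zip", checksum), where checksum is whatever value the download page lists.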

ABI has since updated the download to a more reasonable size:
208Mb Nov 25 04:24 bioscope1.3.1installer_4464106.tar.gz
md5 checksum is b688a8ae7b620d7b2dc7f68c6ca41783

Dear Valued Customer,

It is with great pleasure and excitement that I announce the release and immediate availability of BioScope v1.3

BioScope, the modular SOLiD™ data analysis bioinformatics tool, is designed specifically to optimize the accuracy of your SOLiD™ colorspace data.  In addition to streamlining the construction and maintenance of your SOLiD™ pipelines, BioScope provides a simple web interface allowing non command line users the power of running sophisticated NGS data analysis.

SOLiD™ BioScope provides workflow applications including:
  • Improved MaxMapper Mapping and Pairing
  • BFAST integration
  • Improved SAET Accuracy Enhancement
  • Resequencing Pipelines
    • SNP/diBayes
    • Inversion
    • CNV
    • Small Indel
    • Large Indel
  • Whole Transcriptome
  • Fusion Transcript and Splicing Detection
  • Target Resequencing
  • Support for ChIPSeq
  • Support for Methyl Miner
  • Annotation and Reporting
  • Improved BAM file compatibility
  • Improved BioScope™ Users Guide

Additional details can be found at the following blog:

Also attached is an in-depth article about our new Target Resequencing pipeline in BioScope™.

Please coordinate with your IT admin, bioinformatician, lab manager, and PI to have BioScope v1.3 installed at your site.

To get your free copy of SOLiDBioScope please go to:

Please ensure that you have an activated account on solidsoftwaretools.com before downloading.  If you have problems downloading, please contact Rupert.Yip@lifetech.com

If this is your first time installing BioScope, we strongly recommend working with the BioScope software installation team to ensure a proper installation and configuration of BioScope.  Please contact Rupert.Yip@lifetech.com to inquire about our free BioScope software installation services.

For information on BioScope training please contact your local bioinformatics FAS or go to http://learn.appliedbiosystems.com/solid

Tuesday, 9 November 2010

SOLiD™ BioScope™ Software v1.3 releasing soon

v1.3 is due for release soon! How do I know, other than the fact that you can register for v1.3 video tutorials (e.g. SOLiD™ Targeted ReSeq Data Analysis featuring BioScope 1.3, 1 hour)?
The clue comes from new documentation that is being uploaded on to solidsoftwaretools.com.

BioScope™ Software v1.3 adds/enhances support for following:
  •     Targeted Resequencing analysis (enrichment statistics and target filtering)
  •     BFAST integration
  •     Annotation, reporting and statistics generation
  •     Methylation analysis
  •     75 bp read length support
  •     Mapping and Pairing speed improvements

It also fixes a long list of bugs; I won't repeat all of them here.
But the important ones are:

  • Bug – Pairing: In BAM file, readPaired and firstOfPair/secondOfPair flags set incorrectly for reads with missing mates.
  • Bug – diBayes: Defunct java processes continue when bioscope exits
  • Bug – Mapping: When the last batch of the processing has the number of reads less than the value of the key mapping.np.per.node, the ma file contains duplicated entries.

Have fun playing with the new version when it's up!
Here are some important notes:

  It is advised that a user runs BioScope using the user’s own user
  account. Then if Control-C is used to interrupt bioscope.sh which
  spawns many other processes, user can use following OS commands
  to find the pid of the left-over processes, and clean them up.
  ps -efl | grep bioscope.sh | grep username
  ps -efl | grep java_app.sh | grep username
  ps -efl | grep map | grep username
  ps -efl | grep java | grep username
  ps -efl | grep mapreads | grep username
  ps -efl | grep pairing | grep username
  kill -9 PID

Oh, but I would use the broader patterns carefully (the grep java line in particular), as that one matches every process with java in its name, and you would end up killing all of them.

My suggestion to the team is to have a db table that keeps the PIDs of launched processes instead of depending on non-unique names. Ensembl's pipeline uses Perl with less overhead to track jobs, and it is much cleaner to clean up.
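
To make the suggestion concrete, here is a rough sketch of the idea in Python with SQLite. It is purely hypothetical: the table name, the function names and the use of SQLite are my own choices, not how BioScope or Ensembl actually do it. Each launched process gets a row, and clean-up kills exactly those PIDs rather than anything that happens to match a name.

```python
import os
import signal
import sqlite3
import subprocess
import sys


def open_registry(db_path=":memory:"):
    """Open (or create) a small table that records launched PIDs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS launched (pid INTEGER PRIMARY KEY, command TEXT)"
    )
    return conn


def launch(conn, argv):
    """Start a child process and record its PID before returning it."""
    proc = subprocess.Popen(argv)
    conn.execute(
        "INSERT OR REPLACE INTO launched (pid, command) VALUES (?, ?)",
        (proc.pid, " ".join(argv)),
    )
    conn.commit()
    return proc


def kill_recorded(conn):
    """Send SIGTERM to every recorded PID and clear the table."""
    killed = []
    for (pid,) in conn.execute("SELECT pid FROM launched").fetchall():
        try:
            os.kill(pid, signal.SIGTERM)
            killed.append(pid)
        except ProcessLookupError:
            pass  # the process had already exited
    conn.execute("DELETE FROM launched")
    conn.commit()
    return killed


if __name__ == "__main__":
    registry = open_registry()
    # Stand-in for a long-running pipeline step.
    child = launch(registry, [sys.executable, "-c", "import time; time.sleep(30)"])
    print("killed:", kill_recorded(registry))
    child.wait()
```

A real pipeline would also want to remove rows as jobs finish, since the OS can reuse PIDs over time.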

Monday, 8 November 2010

the exponential increase in 'novel' findings

This is hilarious!

Neil Saunders, blogger at What You're Doing is Rather Desperate, posted a photo to his Twitter account last week with the newspaper-headline-style caption: "Findings Increasingly Novel, Scientists Say," which he says is meant to be a "tongue-in-cheek" look at the use of the word "novel" in the titles of papers indexed in PubMed.

Read full article at genomeweb

and Saunders' blog post

Trimming adaptor seq in colorspace (SOLiD)

I needed to do some research on small RNA-seq using SOLiD.
I wasn't clear on the adaptor trimming procedure (it's dead easy in basespace FASTQ files, but oh well: SOLiD has directionality, and read lengths don't really matter for small RNA).

novoalign suggests cutadapt as a colorspace adaptor trimming tool.
I was going to script one in Python if it didn't exist.
Check their wiki page.

Sadly, on CentOS I will most probably get this:

If you get this error:
   File "./cutadapt", line 62
    print("# There are %7d sequences in this data set." % stats.n, file=outfile)
SyntaxError: invalid syntax
Then your Python is too old. At least Python 2.6 is needed for cutadapt.

I'll have to dig up how to run two versions of Python on a CentOS box.
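
For the record, this is roughly the script I had in mind before finding cutadapt. It is a toy sketch, not cutadapt's algorithm: it converts the adaptor to colorspace with the usual SOLiD two-base encoding (colour = XOR of the two-bit base codes A=0, C=1, G=2, T=3) and cuts the read at the first exact match, with no tolerance for errors. It assumes the read is a plain string of colours with the leading primer base already stripped, and the adaptor sequence in the example is made up.

```python
# Two-bit codes; the colour for a pair of adjacent bases is the XOR of their codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colorspace(bases):
    """Encode a nucleotide string as colour transitions (one fewer than bases)."""
    return "".join(
        str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(bases, bases[1:])
    )


def trim_adaptor(read_colors, adaptor_bases, min_match=8):
    """Cut read_colors at the first exact occurrence of the adaptor's colours."""
    probe = to_colorspace(adaptor_bases)[:min_match]
    if not probe:
        raise ValueError("adaptor must contain at least two bases")
    hit = read_colors.find(probe)
    if hit == -1:
        return read_colors  # no adaptor found; leave the read alone
    # The colour just before the hit spans the insert/adaptor junction, so drop it too.
    return read_colors[:max(hit - 1, 0)]


# Hypothetical insert and adaptor, for illustration only.
insert = "ACGTTGCA"
adaptor = "CTGAGTCCAGT"
read = to_colorspace(insert + adaptor)
print(read, "->", trim_adaptor(read, adaptor))
```

Real reads have colour errors, which is exactly why an error-tolerant tool is the better choice.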

At ASHG, Ion Torrent Drums Up Interest; Provides Preliminary Specs for PGM

WASHINGTON, DC – Ion Torrent revealed some preliminary specs for its Personal Genome Sequencer, due to be launched later this year, as the Life Technologies business unit presented the instrument to potential customers at its booth at the American Society for Human Genetics meeting this week. The speed of the instrument — a run takes approximately two hours, and several runs can be performed in a day — is what appears to be most attractive to potential customers, Maneesh Jain, Ion Torrent's vice president of marketing and business development, told In Sequence.
The first version of the PGM will sell for $49,500, plus a $16,500 server to analyze the data.
Initially, the machine will produce about 10 megabases of data per run, or about 100,000 reads of 100 base pairs each, using the so-called 314 chip, which has about 1.5 million wells and will cost $250. Reagent kits for template preparation, library preparation, and sequencing will cost another $250, bringing the total consumables cost per run to approximately $500.
In the first half of 2011, Ion Torrent plans to launch the 316 chip, with about 6 million wells, which will increase the output per run to 100 megabases and which will cost about twice as much as the 314. Additional chip upgrades will follow, with details to be revealed next year.
Sample prep, which Jain said takes about a day and can be done in batches of six to eight samples, requires an emulsion PCR protocol, which will be simplified over time. "We focused on the sequencing initially," he said, adding that the next step will be to optimize the sample prep. Life Technologies said previously that sample prep for the PGM would eventually be able to use the EZ Bead system, which was originally developed for the SOLiD system.
Read full article here
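
A quick sanity check on those 314 chip numbers, as a sketch that takes the quoted round figures at face value:

```python
# Output and consumables cost per run for the 314 chip, from the quoted figures.
READS_PER_RUN = 100_000      # "about 100,000 reads"
READ_LENGTH_BP = 100         # "of 100 base pairs each"
CHIP_COST_USD = 250          # 314 chip
REAGENT_COST_USD = 250       # template prep, library prep and sequencing kits

megabases_per_run = READS_PER_RUN * READ_LENGTH_BP / 1_000_000
cost_per_run = CHIP_COST_USD + REAGENT_COST_USD
cost_per_megabase = cost_per_run / megabases_per_run

print(f"{megabases_per_run:.0f} Mb per run, ${cost_per_run} per run, "
      f"${cost_per_megabase:.0f} per Mb")
```

So the figures are self-consistent: 10 megabases per run, at about $50 in consumables per megabase.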

Wednesday, 3 November 2010

Life Technologies Launches New SOLiD Sequencer to Drive Advances in Cancer Biology and Genetic Disease Research

It's official! The web is crawling with the news reports. Read their press release here 
My previous coverage on the preview launch is here
There's a discussion in the seqanswers forum on the new machine.

The Life Tech cmsXXX.pdfs with the useful specs are out too; you can google them or search on the website.
The specs 

Monday, 1 November 2010

NIH has 4 gene patents!

My voracious reading has led me to a blog post describing the recent events on gene patents.
Interesting snippets include:
NIH holding 4 gene patents (hmmm, I wonder which ones those are...)

For years, the U.S. Patent Office has taken the position that extracted genes, or “isolated DNA,” can be patented. And, in fact, it has issued thousands of patents on human genes, with perhaps one of every five human genes now under patent.  Patent rights to a gene, of course, give the owner the exclusive right to study, test and experiment on the gene to see how its natural characteristics work.
It has been more than 20 years since the Patent Office began approving patents for human genes in the form of “isolated” DNA.  Prior to that, the Office had issued patents for synthetic DNA, but then moved on to grant monopoly rights on the natural material when extracted directly from the body and not modified.  The Obama Administration, in the brief it filed late Friday in the Federal Circuit, is not challenging patents on synthetic DNA, or on the process of extracting DNA, but only on unmodified genes themselves.
The Patent Office’s long-running approach to genetic patents was challenged in a lawsuit filed in May 2009 by the American Civil Liberties Union and the Public Patent Foundation, contending that locking up genes in the monopoly rights of a patent would inhibit research by other scientists on diseases that might be flagged by the coding or mutations of the genes.   The lawsuit targeted both the Patent Office and Myriad Genetics, specifically because of patents that company was issued on human genes that have been labeled “BRCA1” and “BRCA2.”  Mutations of those two genes are associated with significantly higher risks of breast cancer and ovarian cancer. ...

"U.S. government is actually the co-owner of four of the seven patents that are involved in the case.  It has granted Myriad an exclusive license under those patents — contrary, it said, to NIH’s usual practice of not granting exclusive licenses under DNA patents for “diagnostic applications.”  In the past, NIH and other government agencies have sought and obtained patents for human genes in the form of “isolated genomic DNA,” according to the brief.
The brief did not say which claims under the four patents co-owned by NIH would be invalid under its theory of patentability."

Datanami, Woe be me