Kevin's GATTACA World: March 2010

Tuesday 23 March 2010

The First Galaxy Developer Conference 2010

The First Galaxy Developer Conference will be in May. I am quite excited and hope to attend if I can.

P.S. I totally love how they capitalise the most important parts of the conference: BEER and FOOD.

Friday 19 March 2010

AGBT 2010

Anthony Fejes' blog and his AGBT 2010 notes are full of useful info!

Other helpful links from the SEQanswers forum:
http://www.fiamh.info/category/agbt-2010/

Daniel MacArthur at GeneticFuture

Tuesday 16 March 2010

IonTorrent: Sequencing for the Masses

MassGenomics describes it as sequencing for the masses. It does seem to be an understated technology, with throughput and, more importantly, a price that suit small research labs.

"In the last session of AGBT 2010, new player Ion Torrent impressed the crowd with a low-cost sequencing platform that uses a silicon chip to sequence DNA. The principle, for once, is easy to understand: a silicon chip printed with millions of tiny semiconductor wells. In each well, when DNA polymerase adds a base to the template strand, a hydrogen ion is released. It hits a pH detector in the bottom of the well, and the pH change is recorded digitally. Multiple incorporations cause linear increases in the pH change, so homopolymers should be less of a problem than 454. It operates on native DNA, without any of the reagents, dyes, and other complicating aspects of other sequencing technologies."
http://www.massgenomics.org/2010/03/agbt-ion-torrent-semiconductor-sequencin.html
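
To make the "linear increase" idea concrete, here is a rough sketch (not Ion Torrent's actual base caller). It assumes each flow's signal has already been normalised so that one incorporated base reads about 1.0. The flow order TACG, the function name call_flows and the sample signals are all made up for illustration. Rounding each signal to the nearest whole number gives the homopolymer length for that flow.

```shell
# Hypothetical helper: turn per-flow signals into bases.
# $1    = flow order (assumed here, e.g. TACG)
# stdin = one normalised signal per line, ~1.0 per incorporated base (assumed)
call_flows() {
  awk -v order="$1" '{
    n = int($1 + 0.5)                                      # nearest whole number of bases
    base = substr(order, (NR - 1) % length(order) + 1, 1)  # base offered in this flow
    for (i = 0; i < n; i++) printf "%s", base              # emit the homopolymer run
  } END { print "" }'
}

# Example: printf '1.02\n0.05\n2.10\n0.97\n' | call_flows TACG
```

With those four made-up signals the third flow (C) reads roughly double, so it is called as CC and the whole read comes out as TCCG.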

BEDTools: a flexible suite of utilities for comparing genomic features

Bumper crop of NGS software in the current issue of Bioinformatics!

This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets.
Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

Bioinformatics 2010 26(6):841-842; doi:10.1093/bioinformatics/btq033
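
For a feel of what "comparing genomic features" means, here is a naive sketch in plain awk rather than BEDTools itself. It reports every pair of intervals from two BED files that overlap, treating coordinates as 0-based and half-open. The function name bed_overlap is made up, and it compares all against all, so it only makes sense for small files; BEDTools is the tool to use on real data.

```shell
# Hypothetical helper: print overlapping interval pairs from two BED files.
# $1 = first BED file (held in memory), $2 = second BED file (streamed)
bed_overlap() {
  awk 'BEGIN { OFS = "\t" }
       # first file: remember chromosome, start and end of every interval
       NR == FNR { chr[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; n = NR; next }
       # second file: two half-open intervals overlap when each starts before the other ends
       { for (i = 1; i <= n; i++)
           if (chr[i] == $1 && s[i] < $3 + 0 && $2 + 0 < e[i])
             print chr[i], s[i], e[i], $1, $2, $3 }' "$1" "$2"
}

# Example: bed_overlap a.bed b.bed
```

Each output line holds the interval from the first file followed by the interval from the second file that it overlaps.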

Filtering error from SOLiD output: a collection of Perl scripts

Abstract
Summary: Here, we report the development of a filtering framework designed for efficient identification of both polyclonal and independent errors within SOLiD sequence data. The filtering utilizes the quality values reported by SOLiD's primary analysis for the identification of the two different types of errors. The filtering framework facilitates the passage of high-quality data into a variety of functional genomics applications, including de novo assemblers and sequence matching programs for SNP calling, improving the output quality and reducing resources necessary for analysis.
Availability: This error analysis framework is written in Perl and runs on Mac OS and Linux/Unix systems. The filter, documentation and sample Excel files for quality analysis are available at http://hts.rutgers.edu/filter and are distributed as Open Source software under the GPLv3.0.

Biopieces are a collection of bioinformatics tools ..

The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task: modifying or adding records to the data stream, creating plots, or uploading data to databases and web services. The Biopieces are executed in a command line environment where the data stream is initialized by specific Biopieces which read data from files, databases, or web services, and output records to the data stream that is passed to downstream Biopieces until the data stream is terminated at the end of the analysis as outlined below:

read_data | calculate_something | write_results

The following example demonstrates how a Solexa deep sequencing experiment can be analyzed – including removal of adaptor sequence, determining the number of unique sequences, mapping to a specified genome, and uploading the data to the UCSC genome browser for further analysis:

read_solexa -i data.solexa |              #  Initialize data stream from a file.
remove_adaptor -a TCGTATGCC -m 2 |        #  Remove adaptor sequence allowing for 2 mismatches.
grab -e 'ADAPTOR_POS > -1' |              #  Get all entries where an adaptor sequence was found.
count_vals -k SEQ |                       #  Determine the occurrences of all sequences.
uniq_vals -k SEQ |                        #  Get all entries with a unique sequence.
merge_vals -k SEQ_NAME,SEQ_COUNT |        #  Append the sequence count to the sequence name.
vmatch_seq -g hg18 |                      #  Map the sequences to the Human genome using Vmatch.
upload_to_ucsc -d hg18 -t solexa_data -x  #  Upload the mapping results to the UCSC Genome Browser.

The advantage of the Biopieces is that a user can easily solve simple and complex tasks without having any programming experience. Moreover, since the data format used to pass data between Biopieces is text based, different developers can quickly create new Biopieces in their favorite programming language - and all the Biopieces will maintain compatibility.

Monday 15 March 2010

Good words on bad omics words: "A crisis in postgenomic nomenclature" from 2002

Good words on bad omics words: "A crisis in postgenomic nomenclature" from 2002

Thursday 11 March 2010

Ray-0.0.3: a NEW MPI-based parallel genome assembler

The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.

Try it, and give us your comments, bugs, suggestions, and concerns on our mailing list.
http://lists.sourceforge.net/lists/l...ssembler-users

Ray-0.0.3: a NEW MPI-based parallel genome assembler
http://sourceforge.net/mailarchive/f...ssembler-users

***
The Ray Project Team
http://denovoassembler.SourceForge.net/

Compiling BFAST and DNAA in CentOS 5.4

Update: finally got it to work.

#prereqs
GNU Autoconf version 2.59
GNU Automake version 1.9.6
GNU Libtool version 1.5.22
Also requires bzlib.h, found in libbz2-dev on Debian; on CentOS 5.4 it's in bzip2-devel-1.0.3-4.el5_2.x86_64.rpm

#bfast prereq
#download the bfast source, then:
tar zxvf bfast-0.6.3c.tar.gz
cd bfast-0.6.3c
sh autogen.sh
./configure
make

#samtools source
#download the samtools source, then:
tar jxvf samtools-0.1.7a.tar.bz2

#tview needs ncurses-devel; install it with yum (cached packages shown below)
/var/cache/yum/base/packages/ncurses-devel-5.5-24.20060715.x86_64.rpm
/var/cache/yum/base/packages/ncurses-devel-5.5-24.20060715.i386.rpm

cd samtools-0.1.7a
make

git clone git://dnaa.git.sourceforge.net/gitroot/dnaa/dnaa

#symlink the bfast and samtools source dirs into the dnaa dir and into its parent (.. relative to the dnaa dir)
cd /home/username/bin/source/dnaa/dnaa
ln -s /home/username/bin/source/bfast-0.6.3c/ bfast
ln -s /home/username/bin/source/samtools/samtools-0.1.7a samtools
cd ..
ln -s /home/username/bin/source/bfast-0.6.3c/ bfast
ln -s /home/username/bin/source/samtools/samtools-0.1.7a samtools

cd /home/username/bin/source/dnaa/dnaa
sh autogen.sh
./configure
make

Update: used checkinstall to create an rpm package so it's easier for me to uninstall and recompile updates.

With checkinstall 1.6.2 I had to symlink a library:

ln -s /usr/local/lib/installwatch.so /usr/local/lib64/installwatch.so

For bfast, the install method is now:

tar zxvf bfast-*.tar.gz
cd bfast-*
sh autogen.sh
./configure
make
sudo checkinstall
rpm -ivv bfast-0.6.4a-1.x86_64.rpm

Wednesday 3 March 2010

Mongodb or Couchdb for storing NGS reads?

Been chasing missing reads in my data set of 70 million short reads from the ABI SOLiD. Other than gremlins taking them, I have no idea why the code sometimes fails and sometimes works. NFS or network issues, perhaps? I'm not the sysadmin on the cluster, so I can't do much except audit my numbers each time.
I'm thinking ahead about how to speed up the process or make it more reliable, and I found Brad's blog post on his experience with document stores.
I'm going to follow up and do some testing with this when I have the time.
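
A minimal sketch of the kind of audit I mean, assuming the reads sit in csfasta files where every read has exactly one header line starting with ">". The function name audit_reads and the file names in the example are hypothetical.

```shell
# Hypothetical helper: compare the number of reads found with the number expected.
# $1 = expected read count, remaining args = csfasta files
audit_reads() {
  expected=$1
  shift
  total=0
  for f in "$@"; do
    # grep -c exits non-zero when it counts 0 lines, so ignore its status
    n=$(grep -c '^>' "$f" || true)
    total=$((total + ${n:-0}))
  done
  if [ "$total" -eq "$expected" ]; then
    echo "OK: $total reads"
  else
    echo "MISMATCH: expected $expected, found $total"
    return 1
  fi
}

# Example: audit_reads 70000000 chunk_*.csfasta
```

Running it after each step of a pipeline at least shows at which step the reads go missing, even if it doesn't say why.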

Tuesday 2 March 2010

Image files from NGS sequencers

Good points raised in this article about how keeping your image files can be very expensive!

Do you keep them?
Post your comments!

Tips for de novo bacterial genome assembly

Found this handy article for de novo bacterial genome assembly.

In summary:

Filter reads
Use the VelvetOptimiser script
Use Bowtie to map the reads back to the finished assembly to validate it
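
As one example of the "filter reads" step, here is a sketch that drops any read containing an N. That choice of filter is my assumption and not necessarily what the article recommends. It expects plain four-line FASTQ records on stdin, and the function name is made up.

```shell
# Hypothetical helper: keep only FASTQ records whose sequence has no N.
# stdin = four-line FASTQ records, stdout = the records that pass
filter_fastq_no_n() {
  awk 'NR % 4 == 1 { h = $0 }   # header line
       NR % 4 == 2 { s = $0 }   # sequence line
       NR % 4 == 3 { p = $0 }   # "+" separator line
       NR % 4 == 0 { if (s !~ /N/) print h "\n" s "\n" p "\n" $0 }'
}

# Example: filter_fastq_no_n < reads.fastq > reads.filtered.fastq
```

Note that this only looks for an upper-case N and does not keep read pairs in sync, so paired-end data would need more care.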