Kevin's GATTACA World: August 2010

Friday, 27 August 2010

METAREP is a new open source tool developed for high-performance comparative metagenomics

Found this blog post at JCVI:
Are you carrying out large-scale metagenomics analyses to identify differences among multiple sample sites? Are you looking for suitable analysis tools?
If you have not yet found the right analysis tool, you may be interested in the latest beta version of JCVI Metagenomics Reports (METAREP).
METAREP is a new open source tool developed for high-performance comparative metagenomics.
It provides a suite of web based tools to help scientists view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies.

What hardware do I get for NGS data analysis?

Another one that belongs in an NGS FAQ

Computer Hardware: CPU vs. Memory
http://seqanswers.com/forums/showthread.php?t=6564
In forum: Bioinformatics

Why not use BLAST for NGS reads?

This should be in a FAQ somewhere

Why do we use mapping programs instead of BLAST for mapping to a reference?
http://seqanswers.com/forums/showthread.php?t=6568
In forum: Bioinformatics

Wednesday, 25 August 2010

How a programmer views the human genome

Grabbed this fantastic description on the ultimate question in the galaxy.

The genome is the source of a program to build and run a human

But: the author is not available for comment

It’s 3GB in size

In a single line

Due to constant forking, there are about 7 billion different versions

It’s full of copy-and-paste and cruft

And it’s completely undocumented

Q: How do you debug it?

Brilliant

I agree!

I forgot there are 7 billion versions when actually only 1000 are publicly available for download! Lol

The author is Jim Stalker, the Senior Scientific Manager in charge of vertebrate resequencing informatics at the Sanger center. Grabbed off Todd Smith's post.

How to do BWA mapping in colorspace

Here's what I use for bwa alignment (without removing PCR duplicates).
You can replace the paths with your own and put the commands into a bash script for automation.
Comments or corrections welcome!

#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta

#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample

#aln using 4 threads
#-l 25        seed length
#-k 2         mismatches allowed in seed
#-n 10      total mismatches allowed

bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai

#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam

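#creates bam file from pre-generated .fai file
#(the .fai index can be generated with: samtools faidx /data/public/hg18.fasta)

samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam

#sorts bam file
#bash brace expansion turns {,.sorted} into two arguments: the input bam and the output prefix Sample.bwa.hg18.sam.bam.sorted

samtools sort Sample.bwa.hg18.sam.bam{,.sorted}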

#From a sorted BAM alignment, raw SNP and indel calls are acquired by:

samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup

#resultant output should be further filtered by:

/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup

#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
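
The awk '$6>=20' step in the pileup filter above simply keeps rows whose sixth column is at least 20 (in the consensus pileup format that column should be the SNP quality, as far as I can tell). A minimal sketch on made-up rows, where the helper name filter_col6 and the toy data are purely illustrative:

```shell
# Hypothetical helper: keep rows whose 6th whitespace-separated column
# is at least the given cutoff (default 20), like awk '$6>=20' above.
filter_col6() {
    awk -v min="${1:-20}" '$6 >= min'
}

# Toy rows, made up for illustration (not real variant calls).
# Only the first row has a 6th column >= 20, so only it is printed.
printf 'chr1\t100\tA\tG\t30\t45\t60\t12\nchr1\t200\tC\tT\t10\t15\t60\t8\n' | filter_col6 20
```

Raising or lowering the cutoff is just a matter of changing the argument, e.g. filter_col6 30.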

Do note the helpful comments below! Reposted here for clarity.

Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.

Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

Also:

1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ...". The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

3) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

4) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

5) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

6) View the alignments with samtools tview.
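
To make the "-F 4 -q 10" suggestion concrete, here is a rough awk stand-in that works on plain-text SAM. It is only a sketch of what the filter selects, not a replacement for samtools; the function name filter_sam and the toy records are made up. Unlike samtools view without -h, it also passes header lines through.

```shell
# Hypothetical stand-in for "samtools view -F 4 -q 10" on text SAM:
# drop records with FLAG bit 0x4 (unmapped) set or MAPQ below the cutoff.
filter_sam() {
    awk -F'\t' -v minq="${1:-10}" '
        /^@/ { print; next }              # header lines pass through
        int($2 / 4) % 2 == 1 { next }     # FLAG bit 0x4 set: unmapped read
        $5 >= minq { print }              # column 5 is MAPQ
    '
}

# Toy SAM, made up for illustration: r1 is kept, r2 is unmapped,
# r3 is mapped but has MAPQ 3, so both r2 and r3 are dropped.
printf '@SQ\tSN:chr1\tLN:1000\nr1\t0\tchr1\t10\t37\t5M\t*\t0\t0\tACGTA\tIIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\nr3\t16\tchr1\t50\t3\t5M\t*\t0\t0\tACGTA\tIIIII\n' | filter_sam 10
```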

Wednesday, 18 August 2010

A Programmer’s Discussion: Procedural vs. OO

Now that's something I didn't expect to pop up: people making a stand for procedural programming vs. OO (in the comments; the author of the post stays neutral, though).

After so many years, I think I have only managed to write a little OO code, although I am totally convinced that OO is the way to go. I attribute this to my poor programming skills, and to the fact that I often need an ad hoc script that just works and keeps development time down. It is rare that I have to reuse my code in a way that copy and paste can't solve quickly (versus remembering where I put a method and how to access it).

Hmmm perhaps I am not alone in this after all!

Playing with NFS & GlusterFS on Amazon cc1.4xlarge EC2 instance types

I wish I had time to do the kind of thing they do at BioTeam:
benchmarking the Amazon cc1.4xlarge EC2 instance.

These are the questions they aimed to answer:

We are asking very broad questions and testing assumptions along the lines of:

Does the hot new 10 Gigabit non-blocking networking fabric backing up the new instance types really mean that “legacy” compute farm and HPC cluster architectures which make heavy use of network filesharing are possible?
How does filesharing between nodes look and feel on the new network and instance types?
Are the speedy ephemeral disks on the new instance types suitable for bundling into NFS shares or aggregating into parallel or clustered distributed filesystems?
Can we use the replication features in GlusterFS to mitigate some of the risks of using ephemeral disk for storage?
Should the shared storage built from ephemeral disk be assigned to “/scratch” or other non-critical duties due to the risks involved? What can we do to mitigate the risks?
At what scale is NFS the easiest and most suitable sharing option? What are the best NFS server and client tuning parameters to use?
When using parallel or cluster filesystems like GlusterFS, what rough metrics can we use to figure out how many data servers to dedicate to a particular cluster size or workflow profile?

ZOMG Life Technologies to Acquire Ion Torrent

Six hours ago Life Tech announced this on their Facebook page. I think it's an interesting piece of news considering that Ion Torrent isn't actually selling loads of machines yet. It will be interesting to see how Ion Torrent fits in with SOLiD and their other sequencing-related machines.

Life Technologies Today we announced a definitive agreement to acquire Ion Torrent. Read the details here http://bit.ly/cIzfmq

Life Technologies Corporation (NASDAQ: LIFE), a provider of innovative life science solutions, today announced a definitive agreement to acquire Ion Torrent for $375 million in cash and stock.
The sellers are entitled to additional consideration of $350 million in cash and stock upon the achievement of certain technical and time-based milestones through 2012. Life Technologies' Board of Directors has approved an additional share repurchase program in order to repurchase its shares associated with the stock portion of the consideration. The impact on total share count is expected to be neutral.

UPDATE
more buzz on the net
http://www.marketwatch.com/story/life-technologies-announces-agreement-to-acquire-ion-torrent-2010-08-17?reflink=MW_news_stmp

Ion Torrent bought by Life Technologies : Genetic Future

Pathogens blog makes an analogy of the sequencing 'zoo'

Lifetech gobbles Ion Torrent: Omics Omics

Jermdemo's thoughts on the acquisition

Monday, 16 August 2010

ANNOVAR: an easy way to automate the variant reduction procedure, AKA exome sequencing or whole genome sequencing

The authors' description of the software is that
"ANNOVAR is an efficient software tool to utilize update-to-date information to functionally annotate genetic variants detected from diverse genomes."

Whilst I am not a good enough programmer to comment on how efficient the code is, in the "why reinvent the wheel" sense ANNOVAR is definitely an efficient way to get started on exome / whole genome resequencing for variant discovery.

Basically it is a collection of Perl scripts that can
1) Take in variant information from popular tools like samtools-pileup, Complete Genomics, GFF3-SOLID (?), etc.
2) Do Annotation that is

The other nice thing is that the download already comes with excellent examples, so you can get going fast.
There are also annotation datasets available for download from the developers.
I only see hg18 for now though.

This page is a nice summary for beginners doing exome / whole genome resequencing.

note:
UPDATE (2010Aug11): Users have reported a bug (double-counting of splicing) in the auto_annovar.pl program in the 2010Aug06 version. An updated version is provided here.

Tuesday, 10 August 2010

BWA sai files are useless.

If you know what "sai" means in a particular Chinese dialect, you would have guessed that BWA sai files are redundant. Well, it took a bit of googling for me to learn this from SEQanswers:

"No, sai is a fast changing format and does not guarantee backward compatibility at all. One should not keep sai files. I always delete them when I get SAM output."

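If you want to automate that cleanup, something along these lines should do. It is a minimal sketch that assumes each .sai sits next to a SAM file with the same basename (as in my Sample.bwa.hg18.sai / Sample.bwa.hg18.sam naming); the function name clean_sai is made up.

```shell
# Hypothetical helper: delete each .sai in a directory, but only once a
# non-empty SAM file with the same basename exists next to it.
clean_sai() {
    dir="${1:-.}"
    for sai in "$dir"/*.sai; do
        [ -e "$sai" ] || continue        # glob matched nothing
        sam="${sai%.sai}.sam"
        if [ -s "$sam" ]; then           # SAM exists and is not empty
            rm -- "$sai"
            echo "removed $sai"
        else
            echo "kept $sai (no SAM output yet)"
        fi
    done
}
```

Call it as clean_sai /path/to/alignments after bwa samse has finished. Note that it only checks that the SAM is non-empty, not that it is complete.
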
PyroNoise:Accurate determination of microbial diversity from 454 pyrosequencing data

Using 454 to do microbial ecology / metagenomics of environmental / soil samples?
Then I think you should take a look at this paper.

Quince, C., Lanzén, A., Curtis, T., Davenport, R., Hall, N., Head, I., Read, L., & Sloan, W. (2009). Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods, 6(9), 639-641. DOI: 10.1038/nmeth.1361

The Pathogens blog has a good summary post on it.

Tuesday, 3 August 2010

NIDDK Taps BGI for Human Genome Sequencing Project

Excerpted from GenomeWeb:

"The National Institute of Diabetes and Digestive and Kidney Diseases plans to award a project to sequence and analyze a single human genome to BGI-Hong Kong, according to a solicitation posted recently on the Federal Business Opportunities website.
NIDDK wants to have the genome of one Pima Indian sequenced “at a very high level of completeness,” according to the document. The institute will provide 5 micrograms of DNA and asked for 90-fold coverage on the Illumina platform “or its equivalent.” It expects to receive an assembly of the consensus sequence, as well as the detection and distribution of SNPs, indels, and structural variations.
According to the institute, BGI will sequence the sample on the Illumina technology and provide, within two-and-a-half months of sample delivery and quality control, 270 gigabases of sequence data, an assembly of the consensus sequence, and a comparison with other ethnic genomes. It will deliver the data — both raw sequence data and data in a finalized format — through its FTP site or via mail on a hard disk."
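
As a rough sanity check on those figures: coverage is total sequenced bases divided by genome size, so 270 gigabases over a genome of about 3 gigabases gives the requested 90-fold coverage. The 3 Gb genome size is my own round-number assumption (in line with the "3GB" quip earlier on this page), not something stated in the excerpt.

```shell
# Coverage = total sequenced bases / genome size.
total_gb=270    # gigabases of sequence data, from the excerpt
genome_gb=3     # assumed approximate human genome size in gigabases
coverage=$(( total_gb / genome_gb ))
echo "${coverage}x"
```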
