Thursday, 31 December 2009

use csplit to split fasta files

Got this off the net a looong while back. Sorry I can't attribute the source; please drop a comment if you know the original author. Naming the files using an increasing counter is a godsend if you want to batch qsub/PBS jobs.

# split fasta file into separate sequence files

if [ $# -ne 2 ]; then
  echo "Use: fsplit SEQFILE DESTDIR"
  echo "     Splits fasta file SEQFILE into separate files in DESTDIR folder"
  exit 1
fi

seqfile=$1
destdir=$2

mkdir -p "$destdir"
# names the fa files as sequence00 i.e. with padding
#csplit -f "$destdir"/sequence "$seqfile" "%^>%" "/^>/" "{*}" -s
# names the fa files as sequence0 i.e. without padding
csplit -n 1 -f "$destdir"/sequence "$seqfile" "%^>%" "/^>/" "{*}" -s
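To sanity-check the csplit invocation, here's a toy run (the file and directory names are made up for illustration):

```shell
# create a toy fasta with two records
printf '>seq1\nACGT\n>seq2\nGGCC\n' > toy.fa
mkdir -p toydir
# same csplit invocation as above: one output file per fasta record,
# with an unpadded counter appended to the prefix
csplit -n 1 -f toydir/sequence toy.fa "%^>%" "/^>/" "{*}" -s
ls toydir
```

With `-n 1` you get toydir/sequence0 and toydir/sequence1; the commented-out padded variant (sequence00, sequence01, ...) sorts more predictably when you loop over the files to qsub them.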

Thursday, 24 December 2009

De novo assembly with ABI SOLiD reads

Ran a trial run with sample data!

perl --run fragment --file fixed-saet/reads.csfasta
./velveth_de output_directory/ 21 -short /home/kev/bin/source/solid-denovo-acc-tools/output/ -strand_specific

./velvetg_de output_directory/ -read_trkg yes -amos_file yes
./ --afgfile Velvet_asm.afg --csfasta colorspace_input.csfasta --run fragment

denovoadp sample_input 200 > sample_output
Voila! Base space!
Doing more testing; will update this post with remarks later.

@ Victor,
you may be able to find denovoadp here, but it's a new version and I have yet to test it.
SOLiD™ System de Novo Assembly Tools 2.0: The tools in this project provide the ability to create de novo assemblies from SOLiD™ colorspace reads.

Wednesday, 23 December 2009

WebCARMA for metagenomic reads

Fascinating software! Will explore it in time.

BMC Bioinformatics. 2009 Dec 18;10(1):430. [Epub ahead of print]

WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads.

ABSTRACT: BACKGROUND: Metagenomics is a new field of research on natural microbial communities. High-throughput sequencing techniques like 454 or Solexa-Illumina promise new possibilities as they are able to produce huge amounts of data in much shorter time and with less efforts and costs than the traditional Sanger technique. But the data produced comes in even shorter reads (35-100 basepairs with Illumina, 100-500 basepairs with 454-sequencing). CARMA is a new software pipeline for the characterisation of species composition and the genetic potential of microbial samples using short, unassembled reads. RESULTS: In this paper, we introduce WebCARMA, a refined version of CARMA available as a web application for the taxonomic and functional classification of unassembled (ultra-)short reads from metagenomic communities. In addition, we have analysed the applicability of ultra-short reads in metagenomics. CONCLUSIONS: We show that unassembled reads as short as 35 bp can be used for the taxonomic classification of a metagenome. The web application is freely available at
PMID: 20021646 [PubMed - as supplied by publisher]

Tuesday, 22 December 2009

Simulated ABI Solid data sets

Finally found a link that describes how you can generate a test data set for ABI SOLiD runs! Done using SAMtools.
Now to put my spanking new cluster to the test.

Thursday, 17 December 2009

ABySS: A parallel assembler for short read sequence data

New version available; reinstalling on CentOS.

sudo yum install openmpi

 ./configure --prefix=/opt/ABySS --with-mpi=/usr/lib/openmpi CPPFLAGS=-I/usr/include  && make

sudo make install

Trying to install this for de novo transcriptome assembly

checking how to run the C++ preprocessor... /lib/cpp
configure: error: in `/home/k/bin/source/abyss-1.0.16':
configure: error: C++ preprocessor "/lib/cpp" fails sanity check
See `config.log' for more details.

|              Syntax error
configure:6628: /lib/cpp   conftest.cpp
cpp: error trying to exec 'cc1plus': execvp: No such file or directory
configure:6628: $? = 1
configure: failed program was:
| /* confdefs.h */

Had the above error

Then tried:
  yum install gcc-c++
Fingers crossed.

Part 2 of install woes

Well, the above worked, but I got this at the end of configure:

warning: ABySS should be compiled with Google sparsehash to
        reduce memory usage. It may be downloaded here:

Hmmm, you would think they would mention this in the first line of the README.
The two options I want:
If you wish to build the parallel assembler with MPI support,
MPI should be found in /usr/include and /usr/lib or its location
specified to configure:
./configure --with-mpi=/usr/lib/openmpi && make

ABySS should be built using Google sparsehash to reduce memory usage,
although it will build without. Google sparsehash should be found in
/usr/include or its location specified to configure:
./configure CPPFLAGS=-I/usr/local/include
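Putting those two options together with the prefix I used earlier, the configure line I'm after would presumably look like this (the MPI and sparsehash paths are from my box; adjust to wherever yours live):

```shell
./configure --prefix=/opt/ABySS \
            --with-mpi=/usr/lib/openmpi \
            CPPFLAGS=-I/usr/local/include \
  && make && sudo make install
```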

Wednesday, 16 December 2009

VM marketplace?

Wow, that's a term I never imagined would exist.


The Nimbus Marketplace

The Nimbus marketplace is like the virtual machine marketplaces that are becoming popular but is specifically for scientific applications and other VMs that are useful for grid computing.
This is a place to find VMs and we also host many of them directly on this webserver. Each is accompanied by a populated workspace metadata file for quick deployment on resources running the Workspace Service.

Must try EC2 with bioinformatics once

This seems like a good post to start looking at EC2 instances that can do bioinformatics.
For sure, I will explore Crossbow for my work soon!

Found this paper on Hadoop/BLAST. Interesting!

48 cores in a single Chip??

This is not a dream. I WOULD like to get my hands on one of these!
The SCC has 48 cores hooked together in network that mimics cloud computing on a chip level, and support highly parallel "scale-out" programming models. Intel Labs expects to build 100 or more experimental chips for use by dozens of industrial and academic research collaborators around the world with the goal of developing new software applications and programming models for future many-core processors.
For more information, see Exploring programming models with the Single-chip Cloud Computer research prototype

To Intel: Would you like my mailing address?

Petascale Tools and Genomic Evolution

Abstract from post:
Technological advances in high-throughput DNA sequencing have opened up the possibility of determining how living things are related by analyzing the ways in which their genes have been rearranged on chromosomes. However, inferring such evolutionary relationships from rearrangement events is computationally intensive on even the most advanced computing systems available today.

Research recently funded by the American Recovery and Reinvestment Act of 2009 aims to develop computational tools that will utilize next-generation petascale computers to understand genomic evolution. The four-year $1 million project, supported by the National Science Foundation's PetaApps program, was awarded to a team of universities that includes the Georgia Institute of Technology, the University of South Carolina, and The Pennsylvania State University. 

Author: Hmmm, "petascale tools" is a new term for me! But I don't really get the details of what the authors plan to do. AFAIK, computational problems are always framed around present hardware.
So if the biggest, baddest computer you have isn't enough for you, you basically have two options:
a) buy more/new computers
b) improve your algorithm

So are they developing tools that will be used on petascale computers which don't exist yet? Or are they developing algorithms for tools that will need petascale computers but can run on present computing power?

Ahhh, the vagaries of grant applications.

Monday, 14 December 2009

NFS Howto

Found this comprehensive article on NFS. Thumbs up!

ATI Catalyst™ 9.11 Driver on CentOS 5.4

Tried to install ATI Catalyst™ 9.11 Driver on CentOS 5.4 with the Integrated ATI Radeon HD 4200 graphics in the motherboard GA-MA785GT-UD3H (rev. 1.0).

1st try failed.
Looking at the error log at

Turns out that I needed to install the kernel headers before installing this driver.
Installed the header rpm for 2.6.18-164.6.1.el5-x86_64, and it works fine now.
Yet to benchmark it though; will try that soon.

256 MB  Radeon HD 4200

[15:06:33 ~]$ glxgears
13524 frames in 5.0 seconds = 2704.076 FPS
14208 frames in 5.0 seconds = 2841.537 FPS
18396 frames in 5.0 seconds = 3679.084 FPS
18753 frames in 5.0 seconds = 3750.495 FPS
15337 frames in 5.0 seconds = 3066.924 FPS
18235 frames in 5.0 seconds = 3646.986 FPS
18245 frames in 5.0 seconds = 3648.970 FPS

[15:07:58 ~]$ /usr/bin/fgl_glxgears
Using GLX_SGIX_pbuffer
2983 frames in 5.0 seconds = 596.600 FPS
4220 frames in 5.0 seconds = 844.000 FPS
4225 frames in 5.0 seconds = 845.000 FPS
4213 frames in 5.0 seconds = 842.600 FPS
5879 frames in 5.0 seconds = 1175.800 FPS
6707 frames in 5.0 seconds = 1341.400 FPS
6278 frames in 5.0 seconds = 1255.600 FPS
6580 frames in 5.0 seconds = 1316.000 FPS

Friday, 11 December 2009

No fuse-ntfs-3g on CentOS 5.4!!

[root@node00 ~]# yum install fuse fuse-ntfs-3g
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * addons:
 * base:
 * extras:
 * updates:
Setting up Install Process
Package fuse-2.7.4-8.el5.x86_64 already installed and latest version
No package fuse-ntfs-3g available.
Nothing to do

This is so sad... apparently ntfs support is broken from 5.2 onwards. How can this happen?

Thursday, 10 December 2009

A sign of things to come.

NGS really takes data sizes to new heights. Even downloading the sample data sets for small RNA analysis from the ABI website (Human Small RNA Data Set) takes this amount of time:

Downloaded: 4 files, 14G in 1d 4h 35m 3s (146 KB/s)

Well, in truth the files were 26 GB in total, but due to my network issues I had to retry the download.

No md5 checksums, so I will only know in a day or two whether the downloads went smoothly.
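Since ABI doesn't supply checksums, a workaround is to generate your own right after a download finishes and verify against them after any retry or copy. A minimal sketch, with a stand-in file in place of the real multi-GB downloads:

```shell
# stand-in for a freshly downloaded data file
printf 'csfasta data\n' > smallRNA_set1.csfasta

# record checksums once the download completes
md5sum smallRNA_set1.csfasta > downloads.md5

# later (after a retry, or on the cluster) verify integrity
md5sum -c downloads.md5
```

md5sum -c prints one OK/FAILED line per file and exits non-zero on a mismatch, so it slots easily into a script.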

Monday, 7 December 2009

Install VirtualBox on CentOS 5.4

Got the RPM for RedHat Enterprise Linux 5 (RHEL 5) from the site
After downloading, I got this error from the install
No precompiled module for this kernel found! ...
Compilation of kernel module FAILED! ...
Please consult
to find out why the kernel module does not compile.
Most probably the kernel sources are not found.
Install them and execute
   /etc/init.d/vboxdrv setup
as root.

Found this fix; trying it now.

Sunday, 6 December 2009

Booting Ubuntu off a USB thumb/hdd

Hands up, those who feel frustrated burning CDs of Linux distros, only to find out at boot that the burn went wrong and you have failed md5 checksums on the disc, thereby restarting your installation process!

Then there's other stuff, like having to reinstall all your favourite programs (gvim, BioPerl, etc.) because the live CD doesn't contain them.

Have always known that you can boot off a USB thumbdrive, so I found an old Seagate 10 GB HDD (don't laugh, I finally found a use for it!), plugged it into a USB casing, and
followed the instructions here for making a casper-rw for persistent file settings.

Downloaded the proprietary ATI drivers for my motherboard, and off I go experimenting!
Will update this post with my experience.
I hope CentOS behaves well in this manner! Not relishing the experience of having to burn another failed DVD.

UPDATE: Darn, the Gigabyte MA785GT-UD3H refuses to boot from either my USB HDD or my 2 GB thumbdrive!
The BIOS recognises both the HDD and the thumbdrive;
the former just hangs there,
and the latter gives a boot error with no specifics.
No idea why!

Tuesday, 1 December 2009

Cheaper Next Generation Sequencing?

The biggest barrier to NGS is always the cost. Roche has a nice idea to bring down cost by downsizing their machines. But I wonder how many people need an NGS machine that shortcuts on the volume of data.

Bioinformatics Bottleneck

One of the bigger concerns for small labs trying to do NGS is the bioinformatics, and it's a VERY valid concern.

I am very interested to see what bioinformatics will look like in the next few years, when NGS becomes cheap enough for small labs with no IT budget.

This is also an interesting article on scalable bioinformatics at geospiza

New Job new distro

Have started in a new job!
Basically I am doing Next Generation Sequencing bioinformatics.
Sounds like a mouthful, but I hope it goes well.

1st week was spent on sourcing a cheap cluster for analysis, but 'cheap cluster' is an oxymoron!

Playing around with CentOS now. So far, it's less than enjoyable compared to Ubuntu, especially the 7-CD or single-DVD download.
I can't understand why making people download so many RPMs would be a good thing for bandwidth or convenience.

I miss my Ubuntu box, but setting up an HPC cluster using Ubuntu might be tricky without tech support.

Any advice from those familiar with ABI SOLiD's offline cluster setup?
