Saturday, 31 March 2012

Recorded Webinar: Dissecting the Apache Hadoop Stack



Cloudera Essentials for
Apache Hadoop: Part Two

Sorry we missed you. 

You can still take a closer look inside the Apache Hadoop stack and gain insight into the anatomy of the Hadoop cluster. Learn about the Hadoop Distributed File System (HDFS) and MapReduce and how the technologies could fit into your own IT environment. 

Access the recorded version of Dissecting the Apache Hadoop Stack here.

Register for Part Three of Cloudera Essentials for Apache Hadoop

Cloudera Essentials for Apache Hadoop
Part Three | Solving Business Challenges with Apache Hadoop
April 18, 2012
11 am PT, 2 pm ET 

Now that you know the basics of how Apache Hadoop works, learn how the technology is used in the real world. Solving Business Challenges with Apache Hadoop explores ways to use Apache Hadoop to harness Big Data and solve business problems in ways never before imaginable. The webinar identifies common business challenges and shares real-world use cases for how to improve your business by analyzing your data and gaining insights and fresh solutions to these challenges.

In this webinar you will learn about:

  • Common problems that can be addressed and solved using Hadoop
  • Types of analytics performed with Hadoop
  • Where the data analyzed by Hadoop originates and the benefits of analyzing it
  • Real-world business use cases for Hadoop

Register here.

Need to catch up?

View the recording of Part One of the Cloudera Essentials for Apache Hadoop webinar series: The Motivation for Hadoop.

Friday, 30 March 2012

HPCwire: BGI Starts Bioinformatics and Computing Lab in Tianjin

Professor Jian Wang, President of BGI, said, "In the past, it took a year to conduct a project on the genomics association study of 500 human samples, but now with 'Tianhe', 3 hours is enough. We believe this will broaden the applications of Tianhe-1A in life science and greatly accelerate the development of frontier of science and technology."

Wednesday, 28 March 2012

BMC Bioinformatics | Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II.
BMC Bioinformatics. 2012 Mar 23;13(1):47. [Epub ahead of print]

Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II.




Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate for high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified.


Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches median error rates at 0.10 and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions median error rates within the 454 data (at 0.3 and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004 and 0.006%, respectively). In agreement with previous observations these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants.


We have demonstrated, using Segminator II, that it is possible to distinguish platform specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from

[PubMed - as supplied by publisher]

BMC Bioinformatics | Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer
BMC Bioinformatics. 2012 Mar 23;13(1):48. [Epub ahead of print]

Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer.




The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (de novo assemblers) are typically employed to process such data. However, these methods require large memory resources and computation time. Many basic biological questions could be answered targeting specific information in the reads, thus avoiding complete assembly.


We present Mapsembler, an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around it, either as a plain sequence or as a graph, showing contextual structure. We introduce new algorithms to retrieve approximate occurrences of a sequence from reads and construct an extension graph. Among other results presented in this paper, Mapsembler enabled to retrieve previously described human breast cancer candidate fusion genes, and to detect new ones not previously known.


Mapsembler is the first software that enables de novo discovery around a region of interest of repeats, SNPs, exon skipping, gene fusion, as well as other structural events, directly from raw sequencing reads. As indexing is localized, the memory footprint of Mapsembler is negligible. Mapsembler is released under the CeCILL license and can be freely downloaded from

[PubMed - as supplied by publisher]

Tuesday, 27 March 2012

[Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)

---------- Forwarded message ----------
From: "C. Titus Brown"
Date: Mar 25, 2012 12:48 PM
Subject: [Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)
To: "velvet-users"
Hi all,

last week I posted a preprint of a paper discussing a strategy for coverage normalization, data reduction, and error elimination:

we call this strategy 'digital normalization' and it can yield good to spectacular reductions in data size and memory usage for assembly.  In the paper we test it with Velvet, Oases, and Trinity on a variety of data sets.
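
As an aside to the forwarded message, here is a minimal sketch of the idea as I understand it: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. This is not the authors' implementation. The exact `Counter`, the function names, and the toy values of `k` and `cutoff` are assumptions for illustration; a real tool would need a far more memory-efficient counting structure.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far. An exact Counter stands in for a
    memory-efficient counting structure (an assumption of this sketch).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Five identical reads plus one unrelated read (toy data).
reads = ["ACGTTGCAAG"] * 5 + ["TTGGCCAATT"]
kept = digital_normalize(reads, k=4, cutoff=2)
print(len(reads), "reads ->", len(kept), "reads")
```

With these toy reads, only two of the five identical copies survive, while the unrelated read is kept, which is the data reduction the message describes.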


On the paper site,

I just posted a tutorial for running it on microbial genomes prior to Velvet assembly, and on the Trinity paper's yeast mRNAseq data set prior to Oases or Trinity assembly.

The tutorial uses an Amazon EC2 instance for reproducibility, but with a bit of hopefully obvious tweaking the commands should work on any Linux system.  Note, you'll need about 15 GB of RAM for the yeast Oases & Trinity assemblies.

Let me know if you have any questions (but please ask only on the relevant mailing list -- I'm sending this to the velvet, oases, and trinity lists).



Saturday, 24 March 2012

Flow cytometric chromosome sorting in plants: The next generation.
Methods. 2012 Mar 14. [Epub ahead of print]

Flow cytometric chromosome sorting in plants: The next generation.


Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, Sokolovská 6, CZ-77200 Olomouc, Czech Republic.


Genome analysis in many plant species is hampered by large genome size and by sequence redundancy due to the presence of repetitive DNA and polyploidy. One solution is to reduce the sample complexity by dissecting the genomes to single chromosomes. This can be realized by flow cytometric sorting, which enables purification of chromosomes in large numbers. Coupling the chromosome sorting technology with next generation sequencing provides a targeted and cost effective way to tackle complex genomes. The methods outlined in this article describe a procedure for preparation of chromosomal DNA suitable for next-generation sequencing.

ARIEL and AMELIA: Testing for an Accumulation of Rare Variants Using Next-Generation Sequencing Data.
Hum Hered. 2012 Mar 22;73(2):84-94. [Epub ahead of print]

ARIEL and AMELIA: Testing for an Accumulation of Rare Variants Using Next-Generation Sequencing Data.


Wellcome Trust Sanger Institute, Hinxton, UK.


Objectives: There is increasing evidence that rare variants play a role in some complex traits, but their analysis is not straightforward. Locus-based tests become necessary due to low power in rare variant single-point association analyses. In addition, variant quality scores are available for sequencing data, but are rarely taken into account. Here, we propose two locus-based methods that incorporate variant quality scores: a regression-based collapsing approach and an allele-matching method. Methods: Using simulated sequencing data we compare 4 locus-based tests of trait association under different scenarios of data quality. We test two collapsing-based approaches and two allele-matching-based approaches, taking into account variant quality scores and ignoring variant quality scores. We implement the collapsing and allele-matching approaches accounting for variant quality in the freely available ARIEL and AMELIA software. Results: The incorporation of variant quality scores in locus-based association tests has power advantages over weighting each variant equally. The allele-matching methods are robust to the presence of both protective and risk variants in a locus, while collapsing methods exhibit a dramatic loss of power in this scenario. Conclusions: The incorporation of variant quality scores should be a standard protocol when performing locus-based association analysis on sequencing data. The ARIEL and AMELIA software implement collapsing and allele-matching locus association analysis methods, respectively, that allow the incorporation of variant quality scores.

Copyright © 2012 S. Karger AG, Basel.

Thursday, 22 March 2012

Multi-threaded BAM compression and sorting

---------- Forwarded message ----------
From: Heng Li
Date: Thu, Mar 22, 2012 at 11:32 AM
Subject: [Samtools-help] Multi-threaded BAM compression and sorting

To convert a coordinate-sorted BAM back to FASTQ, the recommended way is to sort the BAM by name and then convert the name-sorted BAM to FASTQ. This is important because some mappers, such as BWA, assume the input is random. They may have some trouble if we directly convert a coordinate-sorted BAM with Picard's bam2fastq. As for novosort, it does not sort by name. As I need to do BAM=>FASTQ for some huge BAMs, I added multi-threading to "sort", "merge" and "view".

This is not a full parallelization in that not all the steps are parallelized. Thus the efficiency does not scale linearly with the number of threads. It is not recommended to use more than 8 threads. With 4 threads, sorting time is reduced to 40% according to a limited test. It may save you half a day if you have a huge BAM.
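
As a back-of-the-envelope aside (not part of the original message): if we assume the simple Amdahl's-law model, where a fraction p of the work parallelizes perfectly and the rest is serial, the "40% with 4 threads" figure can be turned into an estimate of p and extrapolated to other thread counts. The model and the function names below are assumptions; the real behaviour also depends on I/O and memory.

```python
def parallel_fraction(relative_time, threads):
    """Solve (1 - p) + p / threads = relative_time for p (Amdahl's law)."""
    return (1.0 - relative_time) / (1.0 - 1.0 / threads)


def relative_time(p, threads):
    """Run time relative to one thread when a fraction p is parallel."""
    return (1.0 - p) + p / threads


# 40% of the single-thread time with 4 threads, as reported in the message.
p = parallel_fraction(0.40, 4)
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} threads: {relative_time(p, n):.0%} of single-thread time")
```

Under this model about 80% of the work is parallel, so 8 threads would give roughly 30% of the single-thread time and 16 threads only about 25%, which fits the advice not to go beyond 8 threads.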

All my changes are naive and simple. It is possible to speed up sorting and compression further, but so far as I can see, this needs quite a lot of code restructuring and development time. For coordinate sorting, novosort scales much better with the number of threads (though I do not know if multi-threaded novosort is free to use beyond 15 days). Nils' multi-threaded bgzip should also do better on compression. These are sophisticated implementations. Mine is not.

The multi-threaded sort/merge/view is available at the "mt" branch:

The samtools/bgzf APIs stay the same except a few new functions to enable threading. In addition to multithreading, there are a few other improvements to sorting (some are based on Nils'):

1) @HD-SO tag is properly set (finally).

2) The compression level can be changed on the command line (-l).

3) Coordinate sorting considers strand as part of the key.

4) Improved alpha-numeric comparison between query names. The previous version was slower and did not work when there is a large integer.

5) Supporting "K/M/G" with option "-m". The maximum memory is estimated a little more accurately.

6) I kept claiming samtools sort was stable (i.e. the relative order of two records having the same coordinate is retained), but this was not true. The new sort is truly stable. This also means under the same compression level, sort always produces exactly the same output. For end users, stable sorting is largely irrelevant. This just makes me feel more comfortable, "in theory".
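
Items 4 and 6 can be illustrated with a small sketch. This is not samtools' C code: `natural_key` is a hypothetical helper showing one common way to compare query names so that embedded integers sort numerically (Python integers have no size limit, so large numbers are not a problem), and the second part relies on Python's built-in sort being stable.

```python
import re


def natural_key(name):
    """Split a name into alternating text and integer chunks.

    re.split with a capturing group always returns text chunks at even
    positions and digit runs at odd positions, so keys stay comparable.
    """
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]


names = ["read10", "read9", "read100000000000000000000", "read1"]
print(sorted(names, key=natural_key))

# Stable sort: records with equal coordinates keep their input order.
records = [("b", 100), ("a", 100), ("c", 50)]
by_pos = sorted(records, key=lambda rec: rec[1])
print(by_pos)
```

A plain string comparison would put "read10" before "read9"; the natural key puts it after.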



contributed by iceman (see comments)
relevant links to an alternative implementation:

CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing

CloVR Architecture
CloVR is a VM that executes on a desktop (or laptop) computer, providing the ability to run analysis pipelines on local resources (Figure 1). CloVR is invoked using one of two supported VM players, VMware [33] and VirtualBox [34]; at least one of which is freely available on all major desktop platforms: Windows, Unix/Linux, and Mac OS. On a local computer, CloVR utilizes local disk storage and compute resources, as supported by the VM player, including multi-core CPUs if available. To access data stored on the local computer, users can copy files into a "shared folder" that is accessible on both the VM and the local desktop and uses available hard drive space on the computer. Once inside the shared folder, CloVR can read this data for processing. Similarly, CloVR writes output data to this shared folder, making the pipeline output available on the desktop. This shared folder feature is supported by both VMware and VirtualBox.

Figure 1
Schematic of the automated pipelines provided in the CloVR virtual machine. The CloVR virtual machine includes pre-packaged automated pipelines for analyzing raw sequence data on both a local computer and cloud computing platform.

BMC Bioinformatics | Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community


A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.


Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.


CloudBioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.

only hg19 resources available for PLINK/SEQ

If you are following the tutorial at

you would have to first download the data

Downloading the data for the tutorial

We assume you have a working copy of PLINK/SEQ already installed. The data for this tutorial are in two archives you need to download:

Do note, however, that the resources link is broken for now, possibly pending changes. The download for version 0.0.8 of the program is also disabled.

you may download the hg19 version of the resources at

Name            Last modified      Size
locdb           19-Mar-2012 16:23  102M
locdb.ccds      17-Mar-2012 18:18  80M
locdb.gencode   17-Mar-2012 18:19  207M
refdb.g1k.gz    17-Mar-2012 18:21  1.0G
refdb.gz        17-Mar-2012 18:23  1.3G
seqdb           17-Mar-2012 18:54  832M

UPDATE: a better page describing the resources can be found here, although only hg19 is available.

Flash on 64 bit Ubuntu is 1 good reason to install Google Chrome

Adobe Flash is directly integrated with Google Chrome and enabled by default. Available updates for Adobe Flash are automatically included in Chrome system updates.

Wednesday, 21 March 2012

DeconSeq @ automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.


Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants presents a necessary step for all metagenomic projects.

DeconSeq is distributed under the GNU Public License (GPL). All its source codes are freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.

Web version


The interactive web interface of DeconSeq can be used to automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.

Necessary resources

  - Computer connected to the Internet

  - Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)

  - FASTA file with sequence data
  - FASTQ file (as alternative format to trim sequence and quality data)

Upload data to the DeconSeq web version

To upload a new dataset in FASTA or FASTQ format to DeconSeq, follow these steps:
  1. Go to
  2. Click on "Use DeconSeq" in the top menu on the right (the latest DeconSeq web version should load)
  3. Select your FASTA or FASTQ file
  4. Select the retain and remove (optional) database(s)
  5. Click "Submit"

Tuesday, 20 March 2012

gigasync - Tool that enables rsync to mirror enormous directory trees.

rsync has a couple of issues with mirroring large (> 100K) directory trees:

  - rsync's memory usage is directly proportional to the number of files in a tree. Large directories take a large amount of RAM.
  - rsync can recover from previous failures, but always determines the files to transfer up-front. If the connection fails before that determination can be made, no forward progress in the mirror can occur.

The solution? Chop up the workload by using perl to recurse the directory tree, building smallish lists of files to transfer with rsync. Most of the time these small lists of files transfer over fine, but if they fail, this script can look for that specific failure and retry that set a couple of times before giving up.
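
The approach can be sketched in a few lines. This is a Python illustration of the idea, not gigasync itself (which is a Perl script): the function names, batch size and retry count are assumptions, and the only rsync feature relied on is its `--files-from` option, which reads the list of paths to transfer from a file.

```python
import os
import subprocess
import tempfile


def file_batches(root, batch_size=1000):
    """Yield lists of file paths (relative to root), at most batch_size each."""
    batch = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            batch.append(os.path.relpath(os.path.join(dirpath, name), root))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def sync_batch(root, dest, batch, retries=3):
    """Transfer one batch with rsync --files-from, retrying on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".list", delete=False) as fh:
        fh.write("\n".join(batch) + "\n")
    try:
        for _attempt in range(retries):
            result = subprocess.run(
                ["rsync", "-a", f"--files-from={fh.name}", root, dest])
            if result.returncode == 0:
                return True
        return False
    finally:
        os.unlink(fh.name)  # always clean up the temporary list file


def mirror(root, dest, batch_size=1000):
    """Mirror root to dest batch by batch; return the batches that failed."""
    return [batch for batch in file_batches(root, batch_size)
            if not sync_batch(root, dest, batch)]
```

Each rsync run only ever sees one small list, so its memory use stays bounded and a dropped connection costs at most one batch. Calling something like `mirror("/data/tree", "host:/backup/tree")` (both paths hypothetical) would return the batches that still failed after the retries.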

download at 

Monday, 19 March 2012

WSEmboss has been deprecated for a year ...

I wasn't aware of this .. it was a godsend to have a web wrapper for those nifty tools for dealing with sequence data .. I guess everyone is using now ... I could barely remember the name when I was googling for EMBOSS to find a tool to clean up FASTA files .. Not sure what other tools I would miss


From Monday 28th February 2011 the WSEmboss service has been deprecated and the service will be retired during 2011. New development should use the Soaplab services or the tool specific services referenced below.


EMBOSS (European Molecular Biology Open Software Suite) is a free Open Source software analysis package specially developed for the needs of the molecular biology user community.
The software automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web. Extensive libraries are provided with the package, giving scientists a platform to develop and release software in the true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.
For more information see:

Saturday, 17 March 2012

Article: Broad's Heng Li Wins 2012 Benjamin Franklin Award - Bio-IT World

Broad's Heng Li Wins 2012 Benjamin Franklin Award - Bio-IT World


Using Excel for Bioinformatics Data: Five Issues, Five Solutions

Using Excel for Bioinformatics Data: Five Issues, Five Solutions

Nice tips !
I had been wondering about the Issue: Mistaken SYLK Files for the longest time! Forgot how I solved it eventually. Excel on Mac OS also appears to have different default behavior. In Windows I used to be able to open .csv files and a dialog box would ask me which field separator the file uses (:,;,|,tab,space)

Now I just run everything through sed to change the separator to commas so that it works.

 # Replace every tab with a comma; bash's $'...' quoting passes a real tab to sed
 sed $'s/\t/,/g' file > file.csv
 open file.csv
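
One caveat with the plain substitution: if a field already contains a comma, the resulting file gains an extra column. A hedged alternative is to let Python's standard `csv` module do the conversion, since it quotes such fields automatically. The function name and the sample data below are made up for illustration.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src, delimiter="\t"):
        writer.writerow(row)  # fields containing commas get quoted


# In-memory demo; with real files, pass objects opened with newline="".
demo = io.StringIO("gene\tdescription\nBRCA1\tbreast cancer 1, early onset\n")
out = io.StringIO()
tsv_to_csv(demo, out)
print(out.getvalue())
```

The description field comes out wrapped in double quotes, so Excel still sees two columns.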

Thursday, 15 March 2012

theBioBucket*: R-Function to Read Data from Google Docs Spreadsheets

This should be useful! Keeping for future reference

Understanding the “improved” in VIM - Super User Blog

Well, the title is actually misleading, as it doesn't explain the 'm' in vim

BUT it offers excellent insight into how to remember vim's 'macros'
I like the analogy of thinking of vim as a language ..
I definitely didn't know that grep came from the g/re/p command to print lines matching the regular expression!
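
To make the g/re/p reading concrete (globally, for lines matching a regular expression, print), here is a tiny Python equivalent. It is only an illustration: `g_re_p` is a made-up name, and Python's regex syntax is not identical to the editor's.

```python
import re


def g_re_p(pattern, lines):
    """Globally search lines for a regular expression and return matches."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


text = ["vim is a language", "emacs is an OS", "grep came from ed"]
for line in g_re_p(r"^g|vim", text):
    print(line)
```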

Wednesday, 14 March 2012

Detecting selective sweeps from pooled - PubMed Mobile

Abstract: Due to its cost effectiveness, next generation sequencing of pools of individuals (Pool-Seq) is becoming a popular strategy for characterizing variation in population samples. Since Pool-Seq provides genome-wide SNP frequency data, it is possible to use them for demographic inference and/or the identification of selective sweeps. Here, we introduce a statistical method that is designed to detect selective sweeps from pooled data by accounting for statistical challenges associated with Pool-Seq, namely sequencing errors and random sampling among chromosomes. This allows for an efficient use of the information: all base calls are included in the analysis, but the higher credibility of regions with higher coverage and base calls with better quality scores is accounted for.

The Pistoia Alliance Sequence Squeeze competition

Sorry this is a little late .. but better than never yeah?

The Pistoia Alliance Sequence Squeeze Competition

Amazon vouchers

The first 40 entries received will each receive a US$20 Amazon Web Services voucher.
(Only one voucher per person.)

The volume of next-generation sequencing data is a big problem. Data volumes are growing rapidly as sequencing technology improves. Individual runs are providing many more reads than before, and decreasing run times mean that more data can today be generated by a single machine in one day than a single machine could have produced in the whole of 2005.

Storing millions of reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate. There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.

The Pistoia Alliance, in the interests of promoting pre-competitive collaboration, is putting forward a prize fund of US$15,000 to the best novel open-source NGS compression algorithm submitted before the closing date of 15 March 2012.

Follow us on Twitter - @SeqSqueeze

LKVenia: Using OS X Terminal keys on a Macbook Pro

The default key settings on the Mac OS X (10.5) Terminal are pretty weird, especially on a Macbook Pro, and took me a while to figure them out and setup as I wanted.

Thanks to this post I finally found out how to get Shift+PgUp back! You have no idea how vexing it is to use the arrow keys to scroll man pages ..

On the Terminal preferences, go to the Settings item, and choose the Keyboard tab.
Find these keys and edit them (note that the \033 you can get by pressing the esc key):

End - send string to shell: \033[4~
Home - send string to shell: \033[1~
Page down - send string to shell: \033[6~
Page up - send string to shell: \033[5~
Shift page down - scroll to next page in buffer
Shift page up - scroll to previous page in buffer

Easyfig | Free software downloads at

Easyfig enables the creation of linear comparison figures showing BLAST matches between multiple genomic loci or prokaryote genomes. Easyfig has an easy-to-use graphical user interface and is able to launch BLAST searches interactively.

WITHDRAWN: Evaluation of next-generation sequenc... [J Hum Genet. 2011] - PubMed - NCBI

I wonder why ... 

J Hum Genet. 2011 Jun 16. doi: 10.1038/jhg.2011.62. Epub 2011 Jun 16.

WITHDRAWN: Evaluation of next-generation sequencing software in mapping and assembly.


Department of Biochemistry, Center for Reproduction, Development and Growth, The University of Hong Kong, Hong Kong.


Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages, when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields and further provided advices on selecting suitable tools for specific biological applications.

Journal of Human Genetics advance online publication, 16 June 2011; doi:10.1038/jhg.2011.62.

Datanami, Woe be me