Wednesday, 30 November 2011

[Bio-bwa-help] BWA unique mapping, multireads

removing BWA multiple mapped reads 

Notes (credit Li Heng) 

For single-end mapping, you can filter on the X0:i:1 and/or XT:A:U tags.

But for paired-end mapping, reads may be moved in the process of pairing. 
Most of those X? tags are computed for single-end mapping. That is also why some X? tags are missing in "sampe".
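
A minimal sketch of the single-end filter described above, assuming plain-text SAM records from "bwa samse" read line by line. The function names are hypothetical, and requiring both tags at once is an illustrative choice (the note says "and/or"). Per the caveat above, this is not meant for "sampe" output.

```python
def is_unique_single_end(sam_line: str) -> bool:
    """Return True if a SAM record carries both X0:i:1 and XT:A:U."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at column 12 (index 11) of a SAM record.
    tags = set(fields[11:])
    return "X0:i:1" in tags and "XT:A:U" in tags


def filter_unique(sam_lines):
    """Yield header lines unchanged and keep only uniquely mapped records."""
    for line in sam_lines:
        if line.startswith("@") or is_unique_single_end(line):
            yield line
```

Feeding it an open SAM file handle and writing the yielded lines to a new file gives a filtered SAM with the header preserved.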

The Human OligoGenome Resource: a database of oligonucleotide capture probes for resequencing target regions across the human genome

  1. Nucl. Acids Res. (2011) doi: 10.1093/nar/gkr973


    Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource. This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.

GenomeView: a next-generation genome browser

  1. Nucl. Acids Res. (2011) doi: 10.1093/nar/gkr995


    Due to ongoing advances in sequencing technologies, billions of nucleotide sequences are now produced on a daily basis. A major challenge is to visualize these data for further downstream analysis. To this end, we present GenomeView, a stand-alone genome browser specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short-read data, with dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of tens of aligned genomes, thousands of annotation features and millions of mapped short reads both as viewer and editor. GenomeView is freely available as an open source software package. 


eXframe: reusable framework for storage, analysis and visualization of genomics experiments


Genome-wide experiments are routinely conducted to measure gene expression, DNA-protein interactions and epigenetic status. Structured metadata for these experiments is imperative for a complete understanding of experimental conditions, to enable consistent data processing and to allow retrieval, comparison, and integration of experimental results. Even though several repositories have been developed for genomics data, only a few provide annotation of samples and assays using controlled vocabularies. Moreover, many of them are tailored for a single type of technology or measurement and do not support the integration of multiple data types.


We have developed eXframe - a reusable web-based framework for genomics experiments that provides 1) the ability to publish structured data compliant with accepted standards 2) support for multiple data types including microarrays and next generation sequencing 3) query, analysis and visualization integration tools (enabled by consistent processing of the raw data and annotation of samples) and is available as open-source software. We present two case studies where this software is currently being used to build repositories of genomics experiments - one contains data from hematopoietic stem cells and another from Parkinson's disease patients.


The web-based framework eXframe offers structured annotation of experiments as well as uniform processing and storage of molecular data from microarray and next generation sequencing platforms. The framework allows users to query and integrate information across species, technologies, measurement types and experimental conditions. Our framework is reusable and freely modifiable - other groups or institutions can deploy their own custom web-based repositories based on this software. It is interoperable with the most important data formats in this domain. We hope that other groups will not only use eXframe, but also contribute their own useful modifications.

Video: Non-Laser Capture Microscopy Approach for the Microdissection of Discrete Mouse Brain Regions for Total RNA Isolation and Downstream Next-Generation Sequencing and Gene Expression Profiling


As technological platforms, approaches such as next-generation sequencing, microarray, and qRT-PCR have great promise for expanding our understanding of the breadth of molecular regulation. Newer approaches such as high-resolution RNA sequencing (RNA-Seq) [1] provide new and expansive information about tissue- or state-specific expression such as relative transcript levels, alternative splicing, and microRNAs [2-4]. Prospects for employing the RNA-Seq method in comparative whole transcriptome profiling [5] within discrete tissues or between phenotypically distinct groups of individuals afford new avenues for elucidating molecular mechanisms involved in both normal and abnormal physiological states. Recently, whole transcriptome profiling has been performed on human brain tissue, identifying gene expression differences associated with disease progression [6]. However, the use of next-generation sequencing has yet to be more widely integrated into mammalian studies.

Gene expression studies in mouse models have reported distinct profiles within various brain nuclei using laser capture microscopy (LCM) for sample excision [7,8]. While LCM affords sample collection with single-cell and discrete brain region precision, the relatively low total RNA yields from the LCM approach can be prohibitive to RNA-Seq and other profiling approaches in mouse brain tissues and may require sub-optimal sample amplification steps. Here, a protocol is presented for microdissection and total RNA extraction from discrete mouse brain regions. Set-diameter tissue corers are used to isolate 13 tissues from 750-μm serial coronal sections of an individual mouse brain. Tissue micropunch samples are immediately frozen and archived. Total RNA is obtained from the samples using magnetic bead-enabled total RNA isolation technology. Resulting RNA samples have adequate yield and quality for use in downstream expression profiling. This microdissection strategy provides a viable option to existing sample collection strategies for obtaining total RNA from discrete brain regions, opening possibilities for new gene expression discoveries.

Monday, 28 November 2011

Ubuntu popularity in downward spiral, is Unity to blame? - TechSpot News

Well, I did a test run of 32-bit 11.10 in a virtual environment, giving it 2 GB RAM and a single core. Unity or something definitely slowed it down too much to be usable. Guess I will stick to 10.04 LTS for now till it's sorted out.

mRNA-Seq Illumina Libraries from 10 Nanograms Total RNA. [J Vis Exp. 2011] - PubMed - NCBI


Whole transcriptome sequencing by mRNA-Seq is now used extensively to perform global gene expression, mutation, allele-specific expression and other genome-wide analyses. mRNA-Seq even opens the gate for gene expression analysis of non-sequenced genomes. mRNA-Seq offers high sensitivity, a large dynamic range and allows measurement of transcript copy numbers in a sample. Illumina's genome analyzer performs sequencing of a large number (> 10^7) of relatively short sequence reads (< 150 bp). The "paired end" approach, wherein a single long read is sequenced at both its ends, allows for tracking alternate splice junctions, insertions and deletions, and is useful for de novo transcriptome assembly. One of the major challenges faced by researchers is a limited amount of starting material. For example, in experiments where cells are harvested by laser micro-dissection, available starting total RNA may measure in nanograms. Preparation of mRNA-Seq libraries from such samples has been described (1, 2) but involves significant PCR amplification that may introduce bias. Other RNA-Seq library construction procedures with minimal PCR amplification have been published (3, 4) but require microgram amounts of starting total RNA. Here we describe a protocol for the Illumina Genome Analyzer II platform for mRNA-Seq sequencing for library preparation that avoids significant PCR amplification and requires only 10 nanograms of total RNA. While this protocol has been described previously and validated for single-end sequencing (5), where it was shown to produce directional libraries without introducing significant amplification bias, here we validate it further for use as a paired end protocol. We selectively amplify polyadenylated messenger RNAs from starting total RNA using the T7 based Eberwine linear amplification method, coined "T7LA" (T7 linear amplification). The amplified poly-A mRNAs are fragmented, reverse transcribed and adapter ligated to produce the final sequencing library.
For both single read and paired end runs, sequences are mapped to the human transcriptome(6) and normalized so that data from multiple runs can be compared. We report the gene expression measurement in units of transcripts per million (TPM), which is a superior measure to RPKM when comparing samples(7).
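
The abstract does not spell out the formulas, so here is a small sketch using the commonly quoted definitions of RPKM and TPM (an assumption on my part, not taken from the paper). The function names and the toy counts are hypothetical. The point it illustrates: TPM values sum to one million in every sample, which is the usual argument for comparing TPM across samples.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_millions = sum(counts) / 1e6
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy data: read counts and transcript lengths (bp) for three transcripts.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1500]
tpm_values = tpm(counts, lengths_bp)
rpkm_values = rpkm(counts, lengths_bp)
```

Because RPKM divides by the total read count before accounting for transcript length, its per-sample total depends on the length distribution of what was expressed, whereas the TPM total is fixed by construction.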

Saturday, 26 November 2011

Confirmed [Bio-bwa-help] color-space support to be dropped in version 0.6

From: "Heng Li"
Date: 24 Nov 2011 11:38

Thanks for all the replies. I will disable the color-space support in the 0.6.x branch, but leave non-functional source code in the files (though this is not my style). In future, I may re-evaluate the necessity of supporting color-space alignment in the 0.6.x branch. People who use bwa for color-space alignment may continue to use 0.5.10. 0.5.10 is as accurate as 0.6.x. It may be slower but just a little bit.

Thank you all,


Friday, 25 November 2011

Convey Computer’s new Burrows-Wheeler Aligner (BWA) accelerates genome reference mapping by 15x

Convey Expands Bioinformatics Suite with A New Personality

Convey Computer's new Burrows-Wheeler Aligner (BWA) personality dramatically accelerates genome reference mapping by 15x, enabling researchers and clinicians to more rapidly and cost-effectively identify variants.

Not affiliated, just interesting.

Thursday, 24 November 2011

unix - How do you handle the "Too many files" problem when working in Bash? - Stack Overflow

I used to encounter this problem a lot when I thought it was a better idea to split FASTA files individually and keep a separate BLAST report for each file (Sanger sequencing era ...). I revisited this problem when a friend asked for help ...

In short

  find . -print0 | xargs -0 grep -H foo
  find ../path -exec grep foo '{}' '+'
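
Both commands avoid expanding a huge glob on the command line (the usual cause of the "Argument list too long" error) by letting find hand the file names over in batches. The same idea as a Python sketch, in case you want to post-process the hits; grep_tree is a hypothetical helper name, and it does plain substring matching rather than regular expressions.

```python
import os


def grep_tree(root, needle):
    """Yield (path, line_number, line) for every line containing needle."""
    # os.walk streams directory entries, so no giant argument list is built.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Skip files that cannot be opened (permissions, broken links).
                continue
```

Usage would be something like: for path, number, line in grep_tree(".", "foo"): print(path, number, line)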

Genome-wide profiling of novel and conserved Populus microRNAs involved in pathogen stress response by deep sequencing. [Planta. 2011] - PubMed - NCBI


MicroRNAs (miRNAs) are small RNAs, generally of 20-23 nt, that down-regulate target gene expression during development, differentiation, growth, and metabolism. In Populus, extensive studies of miRNAs involved in cold, heat, dehydration, salinity, and mechanical stresses have been performed; however, there are few reports profiling the miRNA expression patterns during pathogen stress. We obtained almost 38 million raw reads through Solexa sequencing of two libraries from Populus inoculated and uninoculated with canker disease pathogen. Sequence analyses identified 74 conserved miRNA sequences belonging to 37 miRNA families from 154 loci in the Populus genome and 27 novel miRNA sequences from 35 loci, including their complementary miRNA* strands. Intriguingly, the miRNA* of three conserved miRNAs were more abundant than their corresponding miRNAs. The overall expression levels of conserved miRNAs increased when subjected to pathogen stress, and expression levels of 33 miRNA sequences markedly changed. The expression trends determined by sequencing and by qRT-PCR were similar. Finally, nine target genes for three conserved miRNAs and 63 target genes for novel miRNAs were predicted using computational analysis, and their functions were annotated. Deep sequencing provides an opportunity to identify pathogen-regulated miRNAs in trees, which will help in understanding the regulatory mechanisms of plant defense responses during pathogen infection.

Wednesday, 23 November 2011

3 Big Data Tech Talks You Can’t Miss

From the LinkedIn blog

Always nice to see how similar problems are solved in other fields

1. Mining Billion-Node Graphs by Christos Faloutsos (Carnegie Mellon University)
What do graphs look like? How do they evolve over time? How do you handle a graph with a billion nodes? Chris presents a comprehensive list of static and temporal laws, grounded in recent observations on real graphs. He then presents tools for discovering anomalies and patterns in graphs. Finally, an overview of the PEGASUS system which is designed to handle billion-node graphs using Hadoop.

2. The Art and Science of Matching Items to Users by Deepak Agarwal (Yahoo! Research)
Algorithmically matching items to users in a given context is essential for the success and profitability of large scale recommender systems like content optimization, computational advertising, search, shopping, movie recommendation, and many more. In this talk, Deepak discusses some of the key technical challenges by focusing on a concrete application – content optimization on the Yahoo! front page. He also briefly discusses response prediction techniques for serving ads on the RightMedia Ad exchange.

3. Big Data in Real Time: Processing Data Streams at LinkedIn by Jay Kreps (LinkedIn)
My colleague, Jay Kreps, discusses the state of up-and-coming stream processing technologies and how they fit in the broader data infrastructure ecosystem — from live storage systems to Hadoop. He explores problems that are amenable to real-time stream processing, solutions that change and shape the way we think about data, and challenges and lessons that we have learned while building LinkedIn’s data infrastructure. A must-see presentation.

In addition to providing compelling speakers, Open Tech Talks offer attendees a low-pressure environment in which people with shared professional interests can reconnect with people they know, as well as make new connections. For those who cannot attend, we live-stream the talks and post the entire recordings on YouTube.

Tuesday, 22 November 2011

Intel Demos Teraflop-Crunching Monster Chip | News & Opinion

1 teraflop of double-precision floating point performance, 50+-core ... it would be interesting to see how GPU servers fare now that Intel is pushing this chip in HPC ... that said, I don't know any guys that use GPUs for bioinformatics work ... do you?

Fwd: NGS Field Guide – Overview | The Molecular Ecologist


Compare, contrast, decide!

NGS Field Guide – Overview

The tables presented in Glenn (2011) are split and updated in the following:

  • Table 1a-c.  "Grades" for common applications on various NGS instruments.  Other information from the original table 1 is relatively static.
  • Table 2a.  Run time, Millions of reads/run, Bases/read, and Yield/run for all common commercial NGS platforms.
  • Table 2b. Reagent costs/run, reagent costs/Mb, and minimum commercially available units for all common commercial NGS platforms.
  • Table 3a. List purchase price for all common commercial NGS platforms, ancillary equipment, and service contracts.
  • Table 3b. Computational resources required for all common commercial NGS platforms.
  • Table 3c. Errors and error rates for common commercial NGS platforms.
  • Table 4.  Advantages and Disadvantages for all common commercial NGS platforms.
Citation: Glenn, TC (2011) Field Guide to Next Generation DNA Sequencers.  Molecular Ecology Resources. doi: 10.1111/j.1755-0998.2011.03024.x
©2011 Blackwell Publishing Ltd

Monday, 21 November 2011

get info on motherboard / PC specs without opening your case Linux only ..

Found this nifty command that saves me work with a screwdriver! Useful also when you are using a text only or headless Linux box.

    sudo dmidecode | less

Example output:
Processor Information
    Socket Designation: Microprocessor
    Type: Central Processor
    Family: Pentium 4
    Manufacturer: Intel
    ID: 41 0F 00 00 FF FB EB BF
    Signature: Type 0, Family 15, Model 4, Stepping 1
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (Fast floating-point save and restore)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Hyper-threading technology)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Not Specified
    Voltage: 1.7 V
    External Clock: 800 MHz
    Max Speed: 4000 MHz
    Current Speed: 3000 MHz
    Status: Populated, Enabled
    Upgrade: ZIF Socket
    L1 Cache Handle: 0x0700
    L2 Cache Handle: 0x0701
    L3 Cache Handle: Not Provided
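
If you want those values in a script rather than in a pager, here is a rough sketch that turns one block of dmidecode text into a dictionary. It assumes the "Key: Value" layout shown in the output above; parse_dmi_block is a hypothetical helper, and the flag lines (FPU, VME, ...) are simply skipped because they have no "Key: Value" form.

```python
def parse_dmi_block(text):
    """Collect 'Key: Value' lines from one dmidecode block into a dict."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Flag lines and section titles have no ': ' separator, so skip them.
        if ": " in line:
            key, value = line.split(": ", 1)
            info[key] = value
    return info


sample = (
    "Processor Information\n"
    "    Family: Pentium 4\n"
    "    Manufacturer: Intel\n"
    "        FPU (Floating-point unit on-chip)\n"
    "    Max Speed: 4000 MHz\n"
    "    Current Speed: 3000 MHz\n"
)
cpu = parse_dmi_block(sample)
```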

Sunday, 20 November 2011

[Bio-bwa-help] color-space support may be dropped in version 0.6

I think that bwa being able to support color space is a boon for analyzing solid data on ram limited machines.

Or on galaxy, which has only color space support on their test server ...

If bwa drops solid support then the remaining opensource options off the top of my head are bfast and bowtie.

There's also novoalign and bioscope (non open source )

But the drop might also reflect a decrease in SOLiD data in the wild ...

Possibly due to Life Tech push for Ion Torrent ...

What are your views?

On 20 Nov 2011 03:29, "Heng Li" wrote:
> The color-space alignment is not working in 0.6.0. Perhaps it is not so hard to make it work again, but bwa may not work well with solid reads all the time. Actually I have never evaluated this myself. 0.5.10 should work solid data.
> Any objections? Do you think it is worth keeping the color-space support in bwa?
> Thanks,
> Heng

Friday, 18 November 2011

Fwd: What are my Cloud based NGS analysis options?

For Commercial offerings there are 

DNAnexus provides solutions for both DNA sequencing centers growing their next-gen capacity, and the researchers working with next-gen sequence data. Our web-based platform solves the data management and analysis challenges common to both with a single, unified system. We support sequencing operations and research organizations of virtually any size, with absolutely no upfront hardware investment needed.
They have secured $15 million in funding led by Google Ventures, so this is a company to watch.

only works with Illumina data 
Next-generation sequencing cloud computing for biologists. 
Combining industry leading NGS technology with easy-to-use bioinformatics, storage, and sharing.

Only works with SOLiD data 
SOLiD™ offers customers an alternative to buying and maintaining the expensive compute infrastructure typically required for Next Generation Sequencing (NGS) data analysis. Your NGS data and the tools necessary to analyze that data are available to you wherever you access the internet—be it in the lab or on the beach!

Galaxy CloudMan: delivering cloud compute clusters
Ok this is not strictly commercial, but you would have to pay for compute hours for your Amazon instances. This is an option I am eager to play around with. 

There's also Galaxy, the cloud-based NGS web GUI that's fast gaining traction. :-)

What are your views on the available solutions? 

BED file format - FAQ

UCSC has a very good description of the BED format.

Bedtools attempts to auto-detect the file formats used in each command.  Thus, as long as your files conform to the format definitions for each file type, you should be okay.  For example:

- BAM is zero-based, half-open.  SAM is 1-based, closed.
- BED is zero-based, half-open.

Zero-length intervals (start == end) in BED format are interpreted as insertions in the reference.

I can't confirm this, but off the top of my head I recall that a SNP is written in BED format as start = position - 1, end = position, which is consistent with the zero-based, half-open convention.

(source mostly from the BEDTools mailing list)
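
A small worked example of the coordinate conventions above. The helper names are hypothetical; the arithmetic follows directly from "zero-based, half-open" (BED) versus "1-based, closed" (SAM).

```python
def one_based_to_bed(start, end):
    """Convert a 1-based closed interval [start, end] to BED (0-based, half-open)."""
    return start - 1, end


def bed_to_one_based(bed_start, bed_end):
    """Convert a BED interval back to 1-based closed coordinates."""
    return bed_start + 1, bed_end


def snp_to_bed(position):
    """A SNP at 1-based position p covers exactly one base: BED (p - 1, p)."""
    return position - 1, position


def bed_length(bed_start, bed_end):
    """In half-open coordinates the length is simply end - start."""
    return bed_end - bed_start
```

So a SNP at 1-based position 100 becomes the BED interval (99, 100) with length 1, while a BED interval with start == end has length 0, which matches its reading as an insertion point.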

Ion Torrent PGM Mate-Paired Library Preparation

Life Technologies Demonstrated Protocol: Ion Mate-Paired Library Preparation

Ion Personal Genome Machine™ System


Publication Part Number 4472004

Rev. B Revision Date 14 October 2011


Revision B includes the following correction: in the required materials, Library Size Selection gels (part # 4443733) changed to E-Gel SizeSelect 2% agarose, part # G661002.

David Jenkins (EdgeBio) recently presented results from a run with average 10 kb inserts at their local Ion User Group Meeting. Check out their blog post.

VCFtools BEDtools compare, intersect, merge

A good computational biologist is only as good as the tools he uses (or maybe how good he is at Google), rofl, kidding ...
Life is always easier when you find the correct tool.
I also adopt the path of least resistance when trying to solve problems that are more common than I imagine.
There are always the good old Linux tools for comparing SNPs called from different programs / options:

   grep | sed | awk | cut | diff | comm

and if you are working with NGS data, you most probably already have samtools installed on your system and you might have used bcftools.
Did you know that there's also an (unrelated) set of tools called vcftools?

The VCFtools package is broadly split into two sections: 

Then there's  also the highly used BEDTools

which I highly recommend to keep as part of your tools collection. Check out the link below

Do watch out for this 'oversight' in vcftools as pointed out in seqanswers.
Overlap number discrepancy between VCFTools and BEDTools
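
For a quick sanity check without either toolkit, the "comm"-style comparison of two call sets can be sketched in a few lines. This is only an illustration: vcf_sites and compare_calls are hypothetical names, and matching on the exact (chrom, pos, ref, alt) tuple is just one possible definition of "overlap", which may not be the one a given tool uses.

```python
def vcf_sites(lines):
    """Return the set of (chrom, pos, ref, alt) keys from VCF text lines."""
    sites = set()
    for line in lines:
        # Skip meta-information, the column header and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def compare_calls(lines_a, lines_b):
    """Split two call sets into only-A, only-B and shared, like comm does."""
    a, b = vcf_sites(lines_a), vcf_sites(lines_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}
```

Passing two open VCF file handles gives the three sets; their sizes are the numbers worth comparing against what vcftools or BEDTools report.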


Whole genome resequencing of Black Angus and Holstein cattle for SNP and CNV discovery using SOLID [BMC Genomics. 2011] - PubMed - NCBI




One of the goals of livestock genomics research is to identify the genetic differences responsible for variation in phenotypic traits, particularly those of economic importance. Characterizing the genetic variation in livestock species is an important step towards linking genes or genomic regions with phenotypes. The completion of the bovine genome sequence and recent advances in DNA sequencing technology allow for in-depth characterization of the genetic variations present in cattle. Here we describe the whole-genome resequencing of two Bos taurus bulls from distinct breeds for the purpose of identifying and annotating novel forms of genetic variation in cattle.


The genomes of a Black Angus bull and a Holstein bull were sequenced to 22-fold and 19-fold coverage, respectively, using the ABI SOLiD system. Comparisons of the sequences with the Btau4.0 reference assembly yielded 7 million single nucleotide polymorphisms (SNPs), 24% of which were identified in both animals. Of the total SNPs found in Holstein, Black Angus, and in both animals, 81%, 81%, and 75% respectively are novel. In-depth annotations of the data identified more than 16 thousand distinct non-synonymous SNPs (85% novel) between the two datasets. Alignments between the SNP-altered proteins and orthologues from numerous species indicate that many of the SNPs alter well-conserved amino acids. Several SNPs predicted to create or remove stop codons were also found. A comparison between the sequencing SNPs and genotyping results from the BovineHD high-density genotyping chip indicates a detection rate of 91% for homozygous SNPs and 81% for heterozygous SNPs. The false positive rate is estimated to be about 2% for both the Black Angus and Holstein SNP sets, based on follow-up genotyping of 422 and 427 SNPs, respectively. Comparisons of read depth between the two bulls along the reference assembly identified 790 putative copy-number variations (CNVs). Ten randomly selected CNVs, five genic and five non-genic, were successfully validated using quantitative real-time PCR. The CNVs are enriched for immune system genes and include genes that may contribute to lactation capacity. The majority of the CNVs (69%) were detected as regions with higher abundance in the Holstein bull.


Substantial genetic differences exist between the Black Angus and Holstein animals sequenced in this work and the Hereford reference sequence, and some of this variation is predicted to affect evolutionarily conserved amino acids or gene copy number. The deeply annotated SNPs and CNVs identified in this resequencing study can serve as useful genetic tools, and as candidates in searches for phenotype-altering DNA differences.

Thursday, 17 November 2011

Feature based classifiers for somatic mutation detection in tumour-normal paired sequencing data. [Bioinformatics. 2011] - PubMed - NCBI



The study of cancer genomes now routinely involves using next generation sequencing technology (NGS) to profile tumors for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge.


We present the comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine, and logistic regression) we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on this data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigourous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts illuminating important directions for future study.


Software called MutationSeq and datasets are available from

snpEff New version: 2.0.4.rc1 (2011-11-15). Release Candidate 1


Take a look at all the new features added
  • Database download command, e.g. "java -jar snpEff.jar download GRCH37.64"
  • RefSeq annotations support added.
  • Rogue transcript filter: By default SnpEff filters out some suspicious transcripts from annotations databases. This should improve false positive rates.
  • Amino acid changes in HGVS style (VCF output)
  • SnpSift: Added 'intIdx', looks for intervals using indexing and memory mapped I/O on the VCF file. Works really fast! Designed to extract a small number of intervals from huge VCF files.
  • Optimized parsing for VCF files with large number of samples (genotypes).
  • Option to suppress summary calculation ('-noStats'), can speed up processing considerably in some cases.
  • Option '-onlyCoding' is set to 'auto' to reduce number of false positives (see next).
  • Option '-onlyCoding' can be assigned a value: if the value is 'true', report only 'protein_coding' transcripts as protein-coding changes. If 'false', report all transcripts as if they were coding. Default: 'auto', i.e. if any transcripts are marked as 'protein_coding' then set it to 'true'; if no transcripts are marked as 'protein_coding' then set it to 'false'.
  • Added BED output format. This is useful to annotate the output of a ChIP-Seq experiment (e.g. after performing peak calling with MACS, you want to know where the peaks hit).

Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems |Genome Biology |


The generation and analysis of high-throughput sequencing data is becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95-150 bases.


We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strand separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range.


The errors and biases we report have implications on the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms.

Getting Genetics Done: Guide to RNA-seq Analysis in Galaxy

Came across this new blog which also highlighted the very useful RNA-seq tutorial which can be done entirely in Galaxy (comes with sample data)

I have also previously highlighted this tutorial in

FAQ - Howto do RNA-seq Bioinformatics analysis on Galaxy

Nice ... looks like the community is opening up to the potential of cloud computing for bioinformatics and publishing their workflows to create community reviewed standards for data analysis.

Using Plant DNA Barcodes to Estimate Species Richness in Poorly Known Floras

Costion, Ford et al., PLoS One
Investigators at Australia's University of Adelaide show that "plant DNA barcodes can accurately estimate species richness in poorly known floras." In a case study, the Adelaide team demonstrates the "potential of plant DNA barcodes for the rapid estimation of species richness in taxonomically poorly known areas or cryptic populations revealing a powerful new tool for rapid biodiversity assessment." Overall, the team says it shows that "although DNA barcodes fail to discriminate all species of plants, new perspectives and methods on biodiversity value and quantification may overshadow some of these shortcomings by applying barcode data in new ways."

Tuesday, 15 November 2011

Howto setup VPN , Juniper Network Connect in NUS, on Ubuntu 10.04 Lucid Lynx 64 bit! amd64

#to install sun-java6

sudo add-apt-repository "deb lucid partner"
sudo aptitude update
sudo aptitude install sun-java6-plugin sun-java6-jdk sun-java6-jre ia32-sun-java6-bin fastjar

You will need to download the juniper client, the easiest way is to try to login to the webvpn address via a browser
for NUS it is

Answer 'yes' to all the questions
#Read here for more details to solve the Juniper connect

note: in 10.04 edition the su / sudo problem seems to be solved. But it seems like I still needed to use the junipernc script. 

When you run the junipernc script for the first time, it will ask you for
details, which you only need to enter once.

USER="CCEV747" <- example id from the NUS guide

when asked to "Enter your PIN + SecurID Code"
enter your password (it won't be saved)

You should be connected :)

There's another similar guide @

Notes: I encountered an error which required me to install fastjar

"This program requires the program 'jar'.
It is often found in the Java JDK package.
Use your package manager to install it."

Also there's a silly image that blocks the entry of iPad version of the Juniper app. ITCare is waiting for the vendor to resolve the issue. 

Postdoctoral position in bioinformatics of high throughput DNA sequencing | Careers | GenomeWeb

interesting trends

We are looking for a post doctoral associate in Computational Biology to join the "Learning from Human Genetics" initiative within the Program of Medical and Population Genetics at the Broad Institute. Our goal is to develop and apply methods to understand the function of genes/genetic loci identified in genome-wide association studies for Crohn's disease (CD) and Type 1 diabetes (T1D).

Postdoctoral Position in Bioinformatics of High-Throughput DNA Sequencing, Beijing Institute of Genomics, Chinese Academy of Sciences
The primary focus of the research will be on detecting and explaining the aberrant splicing events induced by common and de novo genetic variants. The research project will involve routine processing of high volumes of next generation sequencing data. Further data analysis based on probabilistic models will assess possible disease associations for discovered variants.

• Ability to program in Perl or Python, with knowledge of R/S-plus/MATLAB and SQL essential
• Familiarity with Unix systems
• Be experienced in programming in Java, C/C++, Python, Perl or similar languages and be familiar with Linux OS

Dr. Andrew Weil: Why Data Smog May Be Making You Depressed | TIME Ideas

We  live in the Information Age. But I've never heard — nor would any sane person suggest — that we live in the Useful Information Age. The modern downpour of data is largely worthless distraction, and the sheer amount is drowning us. Of all of the ways in which the contemporary environment is mismatched with our genes and harms our emotional health, I believe the revolution in information delivery is the one most responsible for epidemic depression.

Why do people reject science? Here’s why (Science Alert)

Are there ways in which such gaps between scientific knowledge and public acceptance can be bridged?

Potentially, yes.

There is much evidence that the framing of information facilitates its acceptance when it no longer threatens people's worldview. Hierarchical-individualistic (HI) individuals are more likely to accept climate science when the proposed solution involves nuclear power than when it involves emission cuts.

Similarly, the messenger matters. HPV vaccination is more likely to be found acceptable by HI individuals if arguments in its favour are presented by someone clearly identified as hierarchical-individualistic.

Monday, 14 November 2011

RT @thinkgenome: How to apply de Bruijn graphs to genome assembly : Nature ...: A mathematical concept known as a de Bruijn graph...

Warning! Math ahead ...

Finally! An article that explains de Bruijn graphs to biologists

Saturday, 12 November 2011

Use of low-coverage, large-insert, short-read data ... [PLoS One. 2011] - PubMed - NCBI


Next-generation genomic technology has both greatly accelerated the pace of genome research as well as increased our reliance on draft genome sequences. While groups such as the Genomics Standards Consortium have made strong efforts to promote genome standards there is still a general lack of uniformity among published draft genomes, leading to challenges for downstream comparative analyses. This lack of uniformity is a particular problem when using standard draft genomes that frequently have large numbers of low-quality sequencing tracts. Here we present a proposal for an "enhanced-quality draft" genome that identifies at least 95% of the coding sequences, thereby effectively providing a full accounting of the genic component of the genome. Enhanced-quality draft genomes are easily attainable through a combination of small- and large-insert next-generation, paired-end sequencing. We illustrate the generation of an enhanced-quality draft genome by re-sequencing the plant pathogenic bacterium Pseudomonas syringae pv. phaseolicola 1448A (Pph 1448A), which has a published, closed genome sequence of 5.93 Mbp. We use a combination of Illumina paired-end and mate-pair sequencing, and surprisingly find that de novo assemblies with 100x paired-end coverage and mate-pair sequencing with as low as 2-5x coverage are substantially better than assemblies based on higher coverage. The rapid and low-cost generation of large numbers of enhanced-quality draft genome sequences will be of particular value for microbial diagnostics and biosecurity, which rely on precise discrimination of potentially dangerous clones from closely related benign strains.

Datanami, Woe be me