Thursday, 31 May 2012

Getting Genetics Done: Monday Links: 23andMe, RStudio, PacBio+Galaxy, Data Science One-Liners, Post-Linkage RFA, SSH

A handful of links for those getting genetics done for the first time and also for the seasoned veteran.

Data Science Hand Tools: This piece includes nice demonstrations of using grep, awk, cut, colrm, sort, uniq, wc, find, xargs, and other simple and flexible UNIX one-liners to do some serious data munging.
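
For anyone more at home in Python than in the shell, the familiar `cut -f1 | sort | uniq -c | sort -rn` pattern can be sketched roughly as below. This is only an illustration and is not taken from the linked piece; the function name `count_column` and the tab-separated sample rows are made up.

```python
from collections import Counter

def count_column(lines, column=0, sep="\t"):
    """Count the values in one column, most frequent first.

    Roughly what `cut | sort | uniq -c | sort -rn` does in the shell.
    """
    values = (line.rstrip("\n").split(sep)[column] for line in lines if line.strip())
    return Counter(values).most_common()

# Made-up example rows: chromosome, position
rows = ["chr1\t100", "chr2\t200", "chr1\t300"]
for value, n in count_column(rows):
    print(n, value)
```

For a real file, pass the open file handle instead of the list, since the function only needs an iterable of lines.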

And finally, simple instructions for setting up password-free SSH login.

Wednesday, 30 May 2012

Sichuan Agricultural University and LC Sciences Uncover the Epigenetics of Obesity - Houston Chronicle


Published 09:06 a.m., Tuesday, May 29, 2012

Press Release

In a new study published online in Nature Communications, researchers from Sichuan Agricultural University and LC Sciences report the miRNAome in porcine adipose and muscle tissues. The report provides a valuable epigenomic source for obesity prediction and prevention and furthers the development of pig as a model organism for human obesity research.

Hangzhou, China (PRWEB) May 29, 2012


Scientists now know that the genetic code alone isn't responsible for adult phenotype or even the offspring of these adults. Epigenetics refers to changes in gene expression affecting phenotype that don't involve changes to the DNA nucleotide sequence itself, and yet are heritable. DNA methylation, histone modification and microRNA (miRNA) expression are examples of epigenetic mechanisms that have recently been identified as important regulators of gene expression in many biological systems.

Obesity is a huge problem worldwide. Recently, the World Health Organization reported that obesity levels doubled in every region of the world between 1980 and 2008, spurring rates of non-communicable diseases such as diabetes and cancer that now account for almost two out of three deaths globally. It has become evident that epigenetic factors, such as DNA methylation and miRNA expression, have essential roles in obesity development.

Now, a team led by researchers at the Institute of Animal Genetics and Breeding, Sichuan Agricultural University, China, has used a pig model to investigate the systematic association between epigenetic regulators and obesity. Pigs are an excellent model system for studying obesity because their physiology is similar to ours, including metabolic features, cardiovascular systems, and proportional organ sizes. The researchers generated a genome-wide DNA methylation map as well as miRNA expression and gene expression maps for adipose and muscle tissues from three pig breeds living within comparable environments but displaying distinct fat levels.

Genome-wide identification and expression analysis of heat-responsive and novel microRNAs in Populus tomentosa.


1. Gene. 2012 May 24. [Epub ahead of print]


Chen L, Ren Y, Zhang Y, Xu J, Sun F, Zhang Z, Wang Y.


Plant microRNAs have a vital role in various abiotic stress responses by regulating gene expression. Heat stress is one of the most severe abiotic stresses, and affects plant growth and development, even leading to death. To identify heat-responsive miRNAs at the genome-wide level in Populus, Solexa sequencing was employed to sequence two libraries from Populus tomentosa, treated and untreated by heat stress. Sequence analysis identified 134 conserved miRNAs belonging to 30 miRNA families, and 16 novel miRNAs belonging to 14 families. Among these miRNAs, 52 miRNAs from 15 families were responsive to heat stress and most of them were down-regulated. qRT-PCR analysis confirmed that the conserved and novel miRNAs were expressed in P. tomentosa, and revealed similar expression trends to the Solexa sequencing results obtained under heat stress. One hundred and nine targets of the novel miRNAs were predicted. This study opens up a new avenue for understanding the regulatory mechanisms of miRNAs involvement in the heat stress response of trees.
Copyright © 2012 Elsevier B.V. All rights reserved.
PMID: 22634103 [PubMed - as supplied by publisher]

Tuesday, 29 May 2012

TaxMan: a server to trim rRNA reference databases and inspect taxonomic coverage.


1. Nucleic Acids Res. 2012 May 22. [Epub ahead of print]


Brandt BW, Bonder MJ, Huse SM, Zaura E.


Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam (ACTA), University of Amsterdam and VU University Amsterdam, Amsterdam, The Netherlands, Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands and Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA, USA.


Amplicon sequencing of the hypervariable regions of the small subunit ribosomal RNA gene is a widely accepted method for identifying the members of complex bacterial communities. Several rRNA gene sequence reference databases can be used to assign taxonomic names to the sequencing reads using BLAST, USEARCH, GAST or the RDP classifier. Next-generation sequencing methods produce ample reads, but they are short, currently ∼100-450 nt (depending on the technology), as compared to the full rRNA gene of ∼1550 nt. It is important, therefore, to select the right rRNA gene region for sequencing. The primers should amplify the species of interest and the hypervariable regions should differentiate their taxonomy. Here, we introduce TaxMan: a web-based tool that trims reference sequences based on user-selected primer pairs and returns an assessment of the primer specificity by taxa. It allows interactive plotting of taxa, both amplified and missed in silico by the primers used. Additionally, using the trimmed sequences improves the speed of sequence matching algorithms. The smaller database greatly improves run times (up to 98%) and memory usage, not only of similarity searching (BLAST), but also of chimera checking (UCHIME) and of clustering the reads (UCLUST). TaxMan is available at

Free Article
PMID: 22618877 [PubMed - as supplied by publisher]
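
To make the idea of trimming a reference by a primer pair concrete, here is a minimal sketch in Python. It is not TaxMan's implementation: it assumes exact primer matches only (no degenerate bases or mismatches), it excludes the primers themselves from the trimmed region (an arbitrary choice), and the function names and toy sequences are invented for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def trim_to_amplicon(reference, forward_primer, reverse_primer):
    """Return the region between the primers, or None if either primer is missing.

    Assumption: exact matches only, no degenerate bases, no mismatches.
    """
    reference = reference.upper()
    start = reference.find(forward_primer.upper())
    if start == -1:
        return None
    start += len(forward_primer)
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement on the reference strand, downstream of the forward primer.
    end = reference.find(reverse_complement(reverse_primer), start)
    if end == -1:
        return None
    return reference[start:end]

# Toy sequences, not real rRNA primers
ref = "GGGACGTTTTTTCCCAAAGGG"
print(trim_to_amplicon(ref, "ACG", "TTTGGG"))
```

A reference for which the function returns None would count as "missed in silico" by that primer pair in this simplified picture.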

VarioWatch: providing large-scale and comprehensive annotations on human genomic variants in the next generation sequencing era.

1. Nucleic Acids Res. 2012 May 22. [Epub ahead of print]


Cheng YC, Hsiao FC, Yeh EC, Lin WJ, Tang CY, Tseng HC, Wu HT, Liu CK, Chen CC, Chen YT, Yao A.


National Center for Genome Medicine and Institute of Biomedical Sciences, Academia Sinica, Taiwan 11529, R.O.C.


VarioWatch has been vastly improved since its former publication GenoWatch in the 2008 Web Server Issue. It is now at least 10 000-times faster in annotating a variant. Drastic speed increase, through complete re-design of its working mechanism, makes VarioWatch capable of annotating millions of human genomic variants generated from next generation sequencing in minutes, if not seconds. While using MegaQuery of VarioWatch to quickly annotate variants, users can apply various filters to retrieve a subgroup of variants according to the risk levels, interested regions, etc. that satisfy users' requirements. In addition to performance leap, many new features have also been added, such as annotation on novel variants, functional analyses on splice sites and in/dels, detailed variant information in tabulated form, plus a risk level decision tree regarding the analyzed variant. Up to 1000 target variants can be visualized with our carefully designed Genome View, Gene View, Transcript View and Variation View. Two commonly used reference versions, NCBI build 36.3 and NCBI build 37.2, are supported. VarioWatch is unique in its ability to annotate comprehensively and efficiently millions of variants online, immediately delivering the results in real time, plus visualizes up to 1000 annotated variants.
Free Article
PMID: 22618869 [PubMed - as supplied by publisher]

Haploinsufficiency of CELF4 at 18q12.2 is associated with developmental and behavioral disorders, seizures, eye manifestations, and obesity.


1. Eur J Hum Genet. 2012 May 23. doi: 10.1038/ejhg.2012.92. [Epub ahead of print]


Halgren C, Bache I, Bak M, Myatt MW, Anderson CM, Brøndum-Nielsen K, Tommerup N.


Department of Cellular and Molecular Medicine, Wilhelm Johannsen Centre for Functional Genome Research, University of Copenhagen, Faculty of Health Sciences, Copenhagen, Denmark.


Only 20 patients with deletions of 18q12.2 have been reported in the literature and the associated phenotype includes borderline intellectual disability, behavioral problems, seizures, obesity, and eye manifestations. Here, we report a male patient with a de novo translocation involving chromosomes 12 and 18, with borderline IQ, developmental and behavioral disorders, myopia, obesity, and febrile seizures in childhood. We characterized the rearrangement with Affymetrix SNP 6.0 Array analysis and next-generation mate pair sequencing and found truncation of CELF4 at 18q12.2. This second report of a patient with a neurodevelopmental phenotype and a translocation involving CELF4 supports that CELF4 is responsible for the phenotype associated with deletion of 18q12.2. Our study illustrates the utility of high-resolution genome-wide techniques in identifying neurodevelopmental and neurobehavioral genes, and it adds to the growing evidence, including a transgenic mouse model, that CELF4 is important for human brain development. European Journal of Human Genetics advance online publication, 23 May 2012; doi:10.1038/ejhg.2012.92.
PMID: 22617346 [PubMed - as supplied by publisher]

pypeFLOW is light weight and reusable make / flow data process library written in Python.

What is pypeFLOW

pypeFLOW is a lightweight and reusable make/flow data processing library written in Python.

Most bioinformatics analyses, and data analyses in general, involve many steps: combining data files, converting files between formats, and calculating statistics with a variety of tools. Ian Holmes has a great summary of, and opinions about, bioinformatics workflows. Interestingly, such an analysis workflow closely resembles building software without an IDE. Using a makefile to manage a bioinformatics analysis workflow is a great way to produce reproducible and reusable analysis procedures. Combined with a proper version control tool, it lets one manage a divergent set of data and tools over the life of a project, especially when there are complicated dependencies between the data, the tools, and the customized code for the analysis tasks.

However, using make and a makefile implies that every data analysis step is performed by a command-line tool. If you have customized analysis tasks, you have to write scripts and turn them into command-line tools. In my personal experience, it is convenient to bypass that burden and combine those quick and simple steps in a single script. The only caveat is that if an analyst does not save the results of intermediate steps, he or she has to repeat the computation for every step from the beginning, which wastes a lot of computation cycles and personal time. The solution is simple: just as in the traditional software build process, one has to track the dependencies, analyze them, and reprocess only the parts needed to bring the final results up to date.
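
The dependency-tracking idea described above can be sketched in a few lines of plain Python. To be clear, this is not pypeFLOW's API; it is a generic make-style timestamp check, and the names `is_stale` and `build` are invented for illustration.

```python
import os

def is_stale(target, dependencies):
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

def build(target, dependencies, action):
    """Run action() only when target is out of date; return whether it ran."""
    if is_stale(target, dependencies):
        action()
        return True
    return False
```

Each analysis step then becomes a `build(output_file, input_files, step_function)` call, and rerunning the whole script only repeats the steps whose inputs have changed since the output was written. Timestamp comparison is the same heuristic make uses, so it inherits the same limits (for example, it cannot notice that the code of a step changed).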

How Not To Be A Bioinformatician Source Code for Biology and Medicine 2012, 7:3 doi:10.1186/1751-0473-7-3


Although published material exists about the skills required for a successful bioinformatics career, strangely enough no work to date has addressed the matter of how to excel at not being a bioinformatician. A set of basic guidelines and a code of conduct is hereby presented to re-address that imbalance for fellow-practitioners whose aim is not to succeed in their chosen bioinformatics field. By scrupulously following these guidelines one can be sure to regress at a highly satisfactory rate.

"Be unreachable and isolated. Configure your contact email to either bounce back or permanently set it to vacation. Miss key meetings or seminars where other colleagues may be presenting their seminal results and never, ever make any attempt at remembering their names or where they work. Reinvent the wheel. Do not keep up with the literature on current methods of research if you possibly can."

Was this even necessary to include in the paper?

BPS: Men with brown eyes are perceived as more dominant, but it's not because their eyes are brown

White men with brown eyes are perceived to be more dominant than their blue-eyed counterparts. However, a blue-eyed man looking to make himself appear more dominant would be wasting his time investing in brown-coloured contact lenses. A new study by Karel Kleisner and colleagues at Charles University in the Czech Republic has found that brown iris colour seems to co-occur with some other aspect of facial appearance that triggers in others the perception of dominance. 

Sixty-two student participants, half of them female, rated the dominance and/or attractiveness of the photographed faces of forty men and forty women. All models were Caucasian, and all of them were holding a neutral expression. Men with brown eyes were rated consistently as more dominant than blue-eyed men. No such effect of eye-colour was found for the photos of women. Eye colour also bore no association to the attractiveness ratings. 

Next the researchers used Photoshop to give the brown-eyed men blue eyes and the blue-eyed men brown eyes. The photos were then rated by a new batch of participants. The intriguing finding here was that the dominance ratings were left largely unaffected by the eye colour manipulation. The men who really had brown eyes, but thanks to Photoshop appeared with blue eyes, still tended to be rated as more dominant. 

10 Python one liners to impress your friends « /code/blog

Monday, 28 May 2012

A visual dictionary of R graphs with code and thumbnails!
This is a must-see resource for anyone with the question "I wanna plot this in R .. how do I ..."

I planned to do this for Circos one day (when deadlines are less impending!) 

SCORE-Seq: Score-Type Tests for Detecting Disease Associations With Rare Variants in Sequencing Studies

SCORE-Seq is a command-line program which implements the methods of Lin and Tang (2011) for detecting disease associations with rare variants in sequencing studies. The mutation information is aggregated across multiple variant sites of a gene through a weighted linear combination and then related to disease phenotypes through appropriate regression models. The weights can be constant or dependent on allele frequencies and phenotypes. The association testing is based on score-type statistics. The allele-frequency threshold can be fixed or variable. Statistical significance can be assessed by using asymptotic normal approximation or resampling. A detailed description of the methods is given in Lin and Tang (2011). The current release covers binary and continuous traits with arbitrary covariates under case-control and cross-sectional sampling. The newest version was released on May 21, 2012 with some new features. We are working intensely to improve the capabilities of SCORE-Seq, so please check back frequently for updates.
General information
SCORE-Seq is a command-line program written in the C language to implement the methods of Lin and Tang (2011) for detecting disease associations with rare variants in sequencing studies. In the software, various tests are conducted for each gene. There are options for the minor allele frequency (MAF) upper bound, the call rate (CR) lower bound and the minor allele count (MAC) lower bound. A variant is deleted if its MAF is greater than the MAF upper bound or its CR is lower than the CR lower bound. A gene is excluded from the analysis if the MAFs of all its variants are greater than the MAF upper bound, the CRs of all its variants are less than the CR lower bound or its MAC is less than the MAC lower bound. By default, the MAF upper bound is 0.05, the CR lower bound is 0 and the MAC lower bound is 1. The MAFs may be determined internally (i.e., calculated from the genotype file) or externally (i.e., input in the mapping file). Under the additive genetic model (default), the test statistics are based on one or several sets of genetic scores that are calculated by a weighted sum of mutation counts for each subject. A set of genetic scores corresponds to a specifically defined weight function. A description of the genetic score and weight function for each test is given in the OPTIONS section below. A fixed-threshold test only involves one set of genetic scores in the test statistic, while a variable threshold test involves multiple sets of genetic scores. We perform three fixed-threshold tests (T1, T5 and Fp) plus one variable-threshold test (VT test). T1 and T5 pertain to the MAF thresholds of 1% and 5%, respectively. The user may request any threshold less than 5% (e.g. 3% or 0.5%) by setting the MAF upper bound to the desired threshold. Asymptotic p-values are provided by default while resampling p-values can be generated by using the option -resample. 
The software also outputs the p-value of the EREC test (for detecting variants with opposite effects) if resampling is turned on. In addition, the T1, T5 and VT tests under the dominant genetic model can be obtained by using the option -dominant. In that case, all the tests based on the additive genetic model are suppressed. Besides the rare variant analysis described above, the user can conduct single variant analysis for common SNPs by using the option -com. To suppress the rare variant analysis, use the option -noRare.
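
As a rough illustration of the variant filtering and the "weighted sum of mutation counts" described above, here is a small Python sketch. It is not SCORE-Seq code (which is written in C) and computes no test statistic; the function names, the data layout, and the default of constant weights are assumptions made for the example. Only the default bounds (MAF upper bound 0.05, CR lower bound 0) come from the description.

```python
def keep_variant(maf, call_rate, maf_upper=0.05, cr_lower=0.0):
    """A variant is dropped if its MAF exceeds the upper bound
    or its call rate falls below the lower bound."""
    return maf <= maf_upper and call_rate >= cr_lower

def genetic_scores(genotypes, mafs, call_rates, weights=None,
                   maf_upper=0.05, cr_lower=0.0):
    """Weighted sum of mutation counts per subject over the retained variants.

    genotypes: one list per subject of minor-allele counts (0, 1 or 2),
               with one entry per variant in the gene.
    weights:   one weight per variant; constant weights of 1 if omitted
               (an assumption for this sketch).
    """
    n_variants = len(mafs)
    if weights is None:
        weights = [1.0] * n_variants
    kept = [j for j in range(n_variants)
            if keep_variant(mafs[j], call_rates[j], maf_upper, cr_lower)]
    return [sum(weights[j] * subject[j] for j in kept) for subject in genotypes]

# Made-up data: two subjects, three variants in one gene
genotypes = [[0, 1, 2], [1, 0, 0]]
mafs = [0.01, 0.03, 0.20]
call_rates = [1.0, 1.0, 1.0]
print(genetic_scores(genotypes, mafs, call_rates))
```

Lowering `maf_upper` to 0.01 mimics moving from a 5% to a 1% threshold: fewer variants survive the filter, so each subject's score is summed over a smaller set.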

PROGRAM: seqtk for sampling, trimming, fastq2fasta, subsequence, reverse complement and more

Following the discussion on subsampling sequence from fasta/fastq, I think perhaps it is time to more openly advertise my in-house tool: seqtk. Currently, seqtk supports quality based trimming with the phred algorithm, converting fastq to fasta, reverse complementing sequences, extracting or masking subsequences in regions given in a BED/name list file, and more. I have just added a subsampling module to sample exactly n sequences or a fraction of sequences.
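
For readers curious how "sample exactly n sequences" can be done in one pass without loading the whole file, reservoir sampling is one standard technique. The post does not say how seqtk implements it, so treat the sketch below as a generic illustration in Python rather than a description of seqtk; the function names are made up, and the parser assumes well-formed four-line FASTQ records.

```python
import random

def read_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")

def sample_records(records, n, seed=None):
    """Reservoir-sample exactly n records (or all of them if fewer exist)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            # Pick a slot in 0..i inclusive; keep the new record only if
            # the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir
```

Using the same seed for the two files of a paired-end run keeps mates together, because the same record positions get selected in both files.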

Seqtk supports both fasta and fastq input files, which can be optionally gzip compressed. Each module is perhaps the most efficient among tools of the same functionality. For example, I know fastq-to-fasta is 10X faster than another converter, while being more flexible.

Seqtk is implemented in a single .c file and two header files and only depends on zlib. The source code is freely available here (MIT license):


Sunday, 27 May 2012

The Three Sexy Skills of Data Geeks « Dataspora

Hal Varian, Google's Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:
"The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that's going to be a hugely important skill."


Haha, sorry, I can't resist embedding this YouTube video.
It doesn't get more sexy than this! Well, it didn't back in the day, apparently. A true classic in every sense of the word!

Fwd: A new approach for detecting low-level mutations in next-generation sequence data.

1. Genome Biol. 2012 May 23;13(5):R34. [Epub ahead of print]


Li M, Stoneking M.


ABSTRACT: We propose a new method that incorporates population re-sequencing data, distribution of reads, and strand bias in detecting low-level mutations. The method can accurately identify low-level mutations down to a level of 2.3%, with an average coverage of 500x, and with a false discovery rate of less than 1%. In addition, we also discuss other problems in detecting low-level mutations, including chimeric reads and sample cross-contamination, and provide possible solutions to them.

PMID: 22621726 [PubMed - as supplied by publisher]

[Velvet-users] Velvet 1.2.06 no need for interleaving of paired end reads

From: Daniel Zerbino
Date: 24 May 2012 10:18
Subject: [Velvet-users] Velvet 1.2.06

Dear Velvet users,

Torsten Seeman and David Powell from Monash University have been
cleaning up Velvet code, available as usual on github or

They cleaned up the parsing code, added some unit tests, but especially
added a feature which many people have clamored for a long time: the
interleaving of paired-end files is no longer necessary. By default,
Velvet's behavior stays the same but with the '-separate' flag, you can
now provide pairs of files, as in:

velveth Assem 31 -shortPaired -fasta -separate left.fa right.fa

Many thanks to Torsten and David for their work,

Best regards,

Velvet-users mailing list
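
For anyone stuck on an older Velvet release without the '-separate' flag, interleaving the two files yourself looks roughly like the Python sketch below. It is my own illustration, not part of Velvet; the function names are invented, and it assumes both files list the reads in the same order.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def interleave(left_lines, right_lines):
    """Yield FASTA lines alternating each left read with its right mate.

    Assumes the two inputs are in the same order; zip() stops silently
    at the shorter input, so check the read counts beforehand.
    """
    for (lh, ls), (rh, rs) in zip(read_fasta(left_lines), read_fasta(right_lines)):
        yield lh
        yield ls
        yield rh
        yield rs
```

With real files, open left.fa and right.fa, pass the two handles to `interleave`, and write each yielded line plus a newline to the output file.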

Saturday, 26 May 2012

What can long reads tell us about centromere evolution?

Webinar by Simon Chan, UC Davis, on how Pacific Biosciences sequencing revealed higher-order repeats in centromeres that would have stayed hidden if you were restricted to the much shorter Sanger reads.

If there were a way to enrich the 'un-sequence-able' portions of the human genome, sequencing by PacBio would perhaps finally yield the COMPLETE human genome.

It would be interesting to see how PacBio fares once Oxford Nanopore Technologies has its product out in the wild.

Thursday, 24 May 2012

Have you heard of ReadCube? PDF / journal organizer

Spotted this while downloading a PDF from Nat Genet.
Looks interesting, though I use Foxit for PDF annotation.

Packed with features to make your life easier
  • Let ReadCube organize your article collection
  • Import article PDFs from your computer.
  • Your articles immediately become full-text searchable so you can find what you want.
  • ReadCube will automatically identify the author, title, and journal citation information of every article.

Find new papers fast
  • Instantly view article abstracts alongside PubMed and Google Scholar™ search results.
  • Clickable references take you straight to the articles referenced in the paper you are reading.
  • Immediately view articles that cite the paper you are reading, or those that are related to it.

Download articles with a single click
  • Download new articles with a single click from PubMed, Google Scholar™, or publisher web sites.
  • ReadCube integrates with your university or institution login so you can download articles straight into your library.

Personalized article recommendations
  • Get daily article recommendations based on your research interests and the contents of your library – so you need never worry about missing that important paper again.

Read and annotate articles
  • Create in-line comments and directly highlight key phrases. Highlight directly on articles.
  • ReadCube saves your annotations so you can keep track of important notes you’ve made.

Cite and create references
  • Citations for every article in your library are automatically found, so you can easily import them straight into EndNote™ or your favorite citation software.

Differential confounding of rare and common variants in spatially structured populations.


1. Nat Genet. 2012 Feb 5;44(3):243-6. doi: 10.1038/ng.1074.


Mathieson I, McVean G.


Wellcome Trust Centre for Human Genetics, University of Oxford, UK.


Well-powered genome-wide association studies, now made possible through advances in technology and large-scale collaborative projects, promise to characterize the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it remains unknown whether methods developed to control stratification are equally effective for rare variants. Here, we demonstrate that rare variants can show a stratification that is systematically different from, and typically stronger than, common variants, and this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. Furthermore, we show that populations can display spatial structure in rare variants, even when Wright's fixation index F(ST) is low, but that allele frequency-dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.

PMCID: PMC3303124 [Available on 2012/8/5]
PMID: 22306651 [PubMed - indexed for MEDLINE]

Wednesday, 23 May 2012

An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People.


1. Science. 2012 May 17. [Epub ahead of print]


Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall MD, Nangle K, Wang J, Abecasis G, Cardon LR, Zöllner S, Whittaker JC, Chissoe SL, Novembre J, Mooser V.


Quantitative Sciences, GlaxoSmithKline, RTP, NC, USA; Upper Merion, PA, USA; and Stevenage, UK.


Rare genetic variants contribute to complex disease risk; however, the abundance of rare variants in human populations remains unknown. We explored this spectrum of variation by sequencing 202 genes encoding drug targets in 14,002 individuals. We find rare variants are abundant (one every 17 bases) and geographically localized, such that even with large sample sizes, rare variant catalogs will be largely incomplete. We used the observed patterns of variation to estimate population growth parameters, the proportion of variants in a given frequency class that are putatively deleterious, and mutation rates for each gene. Overall, we conclude that, due to rapid population growth and weak purifying selection, human populations harbor an abundance of rare variants, many of which are deleterious and have relevance to understanding disease risk.
PMID: 22604722 [PubMed - as supplied by publisher]

Cyber-T web server: differential analysis of high-throughput data.

Nucleic Acids Res. 2012 May 16. [Epub ahead of print]



Department of Computer Science and Institute for Genomics and Bioinformatics, University of California, Irvine. Irvine, CA 92697, USA.


The Bayesian regularization method for high-throughput differential analysis, described in Baldi and Long (A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001: 17: 509-519) and implemented in the Cyber-T web server, is one of the most widely validated. Cyber-T implements a t-test using a Bayesian framework to compute a regularized variance of the measurements associated with each probe under each condition. This regularized estimate is derived by flexibly combining the empirical measurements with a prior, or background, derived from pooling measurements associated with probes in the same neighborhood. This approach flexibly addresses problems associated with low replication levels and technology biases, not only for DNA microarrays, but also for other technologies, such as protein arrays, quantitative mass spectrometry and next-generation sequencing (RNA-seq). Here we present an update to the Cyber-T web server, incorporating several useful new additions and improvements. Several preprocessing data normalization options including logarithmic and Variance Stabilizing Normalization (VSN) transforms are included. To augment two-sample t-tests, a one-way analysis of variance is implemented. Several methods for multiple tests correction, including standard frequentist methods and a probabilistic mixture model treatment, are available. Diagnostic plots allow visual assessment of the results. The web server provides comprehensive documentation and example data sets. The Cyber-T web server, with R source code and data sets, is publicly available at
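
To give a feel for what "regularized variance" means here, the Python sketch below blends a probe's empirical variance with a background variance and plugs the result into an ordinary two-sample t statistic. The blending formula is the one I recall from Baldi and Long (2001), so check it against the paper before relying on it; the function names are mine, and how Cyber-T actually estimates the background variance and chooses the pseudo-count `nu0` is not shown.

```python
import math

def regularized_variance(sample_var, n, background_var, nu0):
    """Blend the empirical variance with a background (prior) variance.

    nu0 acts as a pseudo-count: larger values pull the estimate toward
    background_var. Formula as recalled from Baldi and Long (2001):
        (nu0 * background_var + (n - 1) * sample_var) / (nu0 + n - 2)
    """
    return (nu0 * background_var + (n - 1) * sample_var) / (nu0 + n - 2)

def regularized_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic computed from (already regularized) variances."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

# Made-up numbers: 3 replicates per condition, noisy empirical variance of 1.0,
# background variance of 2.0 pooled from neighbouring probes, pseudo-count 10
v = regularized_variance(sample_var=1.0, n=3, background_var=2.0, nu0=10)
print(v, regularized_t(5.0, v, 3, 3.0, v, 3))
```

With only three replicates the empirical variance carries little weight, so the estimate lands on the background value; with many replicates the `(n - 1) * sample_var` term dominates instead.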


Tuesday, 22 May 2012

annoyance at 'case-insensitiveness' of Mac terminal shell

OK, there might have been days when I wished for case-insensitive filenames as in DOS/Windows, but I have since learnt the error of my ways and embraced how case sensitivity gives me another way to organise files by filename.
BUT all this is gone in MacOS :/
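
If you want to check what you are dealing with before relying on case to tell files apart, a quick probe like the Python sketch below works on any OS. As far as I know the behaviour comes from the default Mac filesystem rather than from the shell itself, which is why the probe tests a directory. The function name is made up.

```python
import os
import tempfile

def is_case_sensitive(directory=None):
    """Return True if the filesystem holding `directory` distinguishes case.

    With directory=None the system's default temp location is probed.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        probe = os.path.join(tmp, "CaseProbe")
        with open(probe, "w"):
            pass
        # On a case-insensitive filesystem the lowercase name
        # resolves to the file we just created.
        return not os.path.exists(os.path.join(tmp, "caseprobe"))

print(is_case_sensitive())
```

Pass the directory you actually care about (for example your project folder), since different volumes on the same machine can behave differently.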

data visualisation: Getting matplotlib on MacOS X

Decided I should probably play around with matplotlib finally since I found a manhattan plot python script that required it.

But omg, I thought installing stuff on Ubuntu was troublesome; Macs actually take the trouble up a notch.

Gonna leave it undone for now .. but here is what I tried, in chronological order of discovery:

python eggs = FAILED (not sure why the script insists that it doesn't have write permissions to create files)

next up was trying this helpful post
Installing matplotlib in Lion

oh okay I need homebrew

(oh wow I didn't know ruby is installed by default)
but hit another snag as per below ..
continue another day ..

Press enter to continue
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/. /usr/local/bin /usr/local/lib
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/. /usr/local/bin /usr/local/lib
==> Downloading and Installing Homebrew...
==> Installation successful!
You should run `brew doctor' *before* you install anything.
Now type: brew help
------------------------------------------------------------------------------------------------------------------------------------ 03:07:56
k@k:~$ brew doctor

Error: You have no /usr/bin/cc.
This means you probably can't build *anything*. You need to install the Command
Line Tools for Xcode. You can either download this from
or install them from inside Xcode's Download preferences. Homebrew does not
require all of Xcode! You only need the Command Line Tools package!
Error: Git could not be found in your PATH.
Homebrew uses Git for several internal functions, and some formulae use Git
checkouts instead of stable tarballs. You may want to install Git:
  brew install git
Error: Your compilers are different from the standard versions for your Xcode.
If you have Xcode 4.3 or newer, you should install the Command Line Tools for
Xcode from within Xcode's Download preferences.
Otherwise, you should reinstall Xcode.
Error: Your Xcode is configured with an invalid path.
You should change it to the correct path. Please note that there is no correct
path at this time if you have *only* installed the Command Line Tools for Xcode.
If your Xcode is pre-4.3 or you installed the whole of Xcode 4.3 then one of
these is (probably) what you want:

    sudo xcode-select -switch /Developer
    sudo xcode-select -switch /Applications/


Saturday, 19 May 2012

[Denovoassembler-users] Ray v2.0.0-rc7 is available online !

---------- Forwarded message ----------
From: Sébastien Boisvert
Date: Thu, May 17, 2012 at 11:02 PM
Subject: [Denovoassembler-users] Ray v2.0.0-rc7 is available online !

Hello !

I am proud to announce the immediate availability of the Ray assembler
version 2.0.0 release candidate 7, code name "Dark Astrocyte of Knowledge".

This version ships with RayPlatform v1.0.2, code name "Timely Gate of

Link for download:

Changes in Ray

* The CMakeList file was updated.
* GC content for contigs are dumped in XML files.
* New option -one-color-per-file for graph coloring.
* Optimized file system input/output operations.
* Network testing is more verbose.
* Fixed an integer overflow bug in the scaffolder.
* New guide in Documentation/ for software message routing.
* Fixed an integer overflow bug in the profiler.
* Fixed a synchronization bug in the coloring algorithm.
* Increased the sensitivity of the biological profiling algorithms.
* Disabled the plugin for neighbourhoods.
* New plugin to compute gene ontology profiles.
* Added various missing code headers.
* Simplified the plugin creation process.
* Fixed some divisions per 0.
* Fixed a synchronization bug for gene ontology.
* Added simple profile files for sequence abundance, taxonomy profiles
and gene ontology profiles.
* A bug that caused k-mers with >= 65536 coverage to have less coverage
was fixed.
        ---> This was a long-standing bug that caused some issues.
* Added some datatypes.

Changes in RayPlatform

* Command line arguments can be obtained.
* Simplified the plugin creation process.
* Fixed two divisions per 0.
* Added some datatypes.


Denovoassembler-users mailing list
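
A side note on the k-mer coverage fix in the changelog above: 65536 is 2^16, so my guess (an assumption on my part, the announcement does not say this) is that a 16-bit counter was involved. The toy sketch below only illustrates how a wrapping 16-bit counter under-reports high counts compared with one that clamps at its maximum. None of these names come from Ray, and it is not meant to describe how Ray was actually fixed.

```python
MAX_UINT16 = 0xFFFF  # 65535, the largest value an unsigned 16-bit counter can hold


def wrapping_add(count, increment=1):
    """Mimic an unsigned 16-bit counter that silently wraps around."""
    return (count + increment) & MAX_UINT16


def saturating_add(count, increment=1):
    """Clamp at the maximum value instead of wrapping."""
    return min(count + increment, MAX_UINT16)


def count_kmer(observations, add):
    """Count `observations` sightings of one k-mer using the given add function."""
    count = 0
    for _ in range(observations):
        count = add(count)
    return count


if __name__ == "__main__":
    print(count_kmer(65536, wrapping_add))    # 0
    print(count_kmer(65536, saturating_add))  # 65535
```

Either way the counter cannot represent 65536 itself; the difference is whether a very abundant k-mer ends up looking rare (wrapping) or merely capped (saturating).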

Friday, 18 May 2012

[BioRuby] New biogems for IonTorrent, pileup files, pfam and hmmer

LOL bio-gag ... 

Maybe we can term the gag error Ion-Gag
maybe this will be a new twitter hashtag that will catch on! 


On Fri, May 18, 2012 at 7:40 AM, Ben Woodcroft wrote:
> Hi guys,
> Here's some blatant advertising for some code I've recently written in
> biogem form.
> bio-gag: "gag error" is the term I've coined to describe an error that
> various people have observed on certain sequencing kits with IonTorrent,
> though it has not previously been characterised very well that I know of
> (we noticed that the errors seemed to occur at GAG positions in the reads
> that were supposed to be GAAG). This biogem tries to find and fix these
> errors. It isn't benchmarked for accuracy but worked well enough for my
> lab's own purposes. Actually to be honest we've only used an older version
> of the software on real data and the logic has changed a little since given some
> recent evidence we have, but I thought I'd push it out with the latest and
> greatest error model.
> bio-pileup_iterator: To find gag errors bio-gag iterates through pileup
> files looking for particular patterns e.g. strand bias of insertions. This
> gem can be used to iterate through pileup files one position (one line) at
> a time, building up the sequence of each read as it goes, recording their
> direction etc. Probably not the fastest piece of code in the world, sorry.
> I'm not sure whether this should/can be incorporated into bio-samtools? It
> adds functionality - there's no duplication (I don't think).
> bio-hmmer_model: This is a parser of HMM files e.g. from PFAM according to
> the hmmer v3 manual.
> bio-hmmer3_report: Parsing of HMMER3 result files. Currently only handles
> tabular format files - the guts of this were written by Christian - see
> yesterday's thread for details. I'm hoping to add regular (non-tabular)
> format parsing in the near future, but no promises.
> I'm sure there is bugs and deficiencies - apologies in advance.
> Enjoy,
> ben
> _______________________________________________
> BioRuby Project -
> BioRuby mailing list
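
To get a feel for what "iterating through pileup files looking for strand bias of insertions" involves, here is a rough Python sketch. It is not the bio-pileup_iterator API and not bio-gag's error model; the function names are made up for illustration. It assumes the standard samtools pileup layout (chromosome, position, reference base, depth, read bases, qualities, tab-separated), where an insertion is written as `+<length><sequence>` and the sequence is upper case for forward-strand reads and lower case for reverse-strand reads.

```python
import re

# Matches the "+3" or "-2" prefix that announces an indel in the read-bases column.
INDEL = re.compile(r"([+-])(\d+)")


def parse_pileup_line(line):
    """Split one pileup line into (chrom, pos, ref, depth, bases)."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, int(depth), bases


def insertion_strand_counts(bases):
    """Count insertions seen on the forward and reverse strand at one position."""
    forward = reverse = 0
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip the read-start marker and its mapping-quality character
            continue
        if ch in "+-":
            match = INDEL.match(bases, i)
            if match is None:  # malformed column, step over it
                i += 1
                continue
            length = int(match.group(2))
            seq_start = match.end()
            seq = bases[seq_start:seq_start + length]
            if ch == "+":
                if seq.isupper():
                    forward += 1
                else:
                    reverse += 1
            i = seq_start + length  # jump past the indel sequence
            continue
        i += 1
    return forward, reverse


if __name__ == "__main__":
    example = "chr1\t100\tG\t4\t.+1A.+1A,,\tIIII"
    chrom, pos, ref, depth, bases = parse_pileup_line(example)
    print(chrom, pos, insertion_strand_counts(bases))
```

A position where insertions pile up on one strand only, as in the example line, is the kind of pattern the email describes; what threshold counts as "biased" is left open here.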



Thursday, 17 May 2012

Ion Torrent complains to Nature Biotech about bias in Loman paper - SEQanswers


Nothing gets ppl riled up like comparing sequencing technologies ...
Honestly I think Life Tech has done better with the PGM than with SOLiD.
I wish they would do away with their 'closed in' community and have their FAS answer questions on SEQanswers instead. I am sure they would reach more people and save more by not having to run their own webservers.

This is a couple days old but figured it would be interesting to some of you.

Nick Loman's original paper comparing desktop sequencers (which we've discussed here before) has been the subject of controversy. LifeTech apparently didn't like how the study was presented, and wrote a letter to Nature about the study.


PLoS: Gene Mapping via Bulked Segregant RNA-Seq (BSR-Seq)

Read the open-access, full-text article here:

Gene Mapping via Bulked Segregant RNA-Seq (BSR-Seq)


Bulked segregant analysis (BSA) is an efficient method to rapidly and efficiently map genes responsible for mutant phenotypes. BSA requires access to quantitative genetic markers that are polymorphic in the mapping population. We have developed a modification of BSA (BSR-Seq) that makes use of RNA-Seq reads to efficiently map genes even in populations for which no polymorphic markers have been previously identified. Because of the digital nature of next-generation sequencing (NGS) data, it is possible to conduct de novo SNP discovery and quantitatively genotype BSA samples by analyzing the same RNA-Seq data using an empirical Bayesian approach. In addition, analysis of the RNA-Seq data provides information on the effects of the mutant on global patterns of gene expression at no extra cost. In combination these results greatly simplify gene cloning experiments. To demonstrate the utility of this strategy BSR-Seq was used to clone the glossy3 (gl3) gene of maize. Mutants of the glossy loci exhibit altered accumulation of epicuticular waxes on juvenile leaves. By subjecting the reference allele of gl3 to BSR-Seq, we were able to map the gl3 locus to an ~2 Mb interval. The single gene located in the ~2 Mb mapping interval whose expression was down-regulated in the mutant pool was subsequently demonstrated to be the gl3 gene via the analysis of multiple independent transposon induced mutant alleles. The gl3 gene encodes a putative myb transcription factor, which directly or indirectly affects the expression of a number of genes involved in the biosynthesis of very-long-chain fatty acids.

Copy number variation detection and genotyping from exome sequence data.
Genome Res. 2012 May 14. [Epub ahead of print]



University of Washington;


While exome sequencing is readily amenable to single-nucleotide variant discovery, the sparse and non-uniform nature of the exome capture reaction has hindered exome-based detection and characterization of genic copy number variation. We developed a novel method using singular value decomposition (SVD) normalization to discover rare genic copy number variants (CNVs) as well as genotype copy number polymorphic (CNP) loci with high sensitivity and specificity from exome sequencing data. We estimate the precision of our algorithm using 122 trios (366 exomes) and show that this method can be used to reliably predict (94% overall precision) both de novo and inherited rare CNVs involving three or more consecutive exons. We demonstrate that exome-based genotyping of CNPs strongly correlates with whole-genome data (median r² = 0.91), especially for loci with fewer than eight copies, and can estimate the absolute copy number of multi-allelic genes with high accuracy (78% call level). The resulting user-friendly computational pipeline, CoNIFER (copy number inference from exome reads), can reliably be used to discover disruptive genic CNVs missed by standard approaches and should have broad application in human genetic studies of disease.

[PubMed - as supplied by publisher]

Tackling formalin-fixed, paraffin-embedded tumor tissue with next-generation sequencing.
Cancer Discov. 2012 Jan;2(1):23-4.



Departments of Pathology and Molecular and Medical Genetics, and Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon.


Most tumor samples available for clinical genotyping are formalin-fixed and paraffin-embedded (FFPE), but there has been relatively little published on the suitability of such samples for next-generation sequencing approaches. A new study by Wagle and colleagues shows that a combination of hybridization-capture and deep sequencing yields high-quality data from FFPE specimens. Cancer Discovery; 2(1); 23-4. ©2012 AACR.

[PubMed - in process]

Datanami, Woe be me