Saturday, 30 June 2012

Recording Available: Illumina IGN Webinar Series: Webinar Two - Expanding your Current WGS Knowledge

Ok, NOT too sure if I can post the link here to share; naturally I think the Illumina marketing team would like to have your email address to spam you with related info.
BUT I think I am doing them a favour by sharing this link on my blog, so .. hahah, let me know if I should take it down ..

Illumina Webinar Series
Illumina IGN Webinar Series: Webinar Two - Expanding your Current WGS Knowledge
Thank you for registering for our recent webinar!

The webinar recording is now available for viewing. We appreciate your interest in the Illumina Webinar Series and hope you will join us for future events.

To have an Illumina representative contact you about our products and services, please complete our request form and someone will be in contact with you shortly.

View the Webinar

Datanami: Virtualizing the Mighty Elephant

VMware releases a new project that allows Hadoop to be deployed in a virtual environment.

Thursday, 28 June 2012

FAQ: What is genome build 'hg_g1k_v37' ?

I have wondered about the 'g1k' bit before as well ..
Here's an explanation lifted from the galaxy-user list:

---------- Forwarded message ----------
From: Jennifer Jackson
Date: 27 June 2012 23:50
Subject: Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

Hello Lilach,

The genome build 'hg_g1k_v37' is build "b37" in the GATK documentation. Hg19 is also included (as a distinct build). I encourage you to examine these if you are interested in crossing over between genomes or identifying other projects that have data based on the same genome build:

" GATK resource bundle: A collection of standard files for working with human resequencing data with the GATK.

The standard reference sequence we use in the GATK is the b37 edition from the Human Genome Reference Consortium. All of the key GATK data files are available against this reference sequence. Additionally, we used to use UCSC-style naming (chr1, not 1) for build hg18, and provide lifted-over files from b37 to hg18 for those still using those files.

b37 resources: the standard data set
* Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
<more, please follow link for details ...>

hg19 resources: lifted over from b37
* Includes the UCSC-style hg19 reference along with all lifted over VCF files."
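For anyone juggling both builds, the naming difference itself is mostly mechanical. A rough Python sketch of the rename (function names are mine, just for illustration):

```python
def b37_to_ucsc(chrom):
    """Map a b37-style name ('1', ..., 'X', 'MT') to UCSC style ('chr1', 'chrM')."""
    return "chrM" if chrom == "MT" else "chr" + chrom

def ucsc_to_b37(chrom):
    """Map a UCSC-style name back to b37 style."""
    if chrom == "chrM":
        return "MT"
    return chrom[3:] if chrom.startswith("chr") else chrom

# Naming is only part of the story: hg19 and b37 also differ in the
# mitochondrial sequence itself, so the lifted-over files mentioned
# above are needed for coordinates, not just a rename.
```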

Hopefully this helps,

Galaxy team

Wednesday, 27 June 2012

Finding Waldo, a flag on the moon and multiple choice tests, with R - Freakonometrics

Ok, given my interest in photography, I have a side interest in computational methods for image analysis. Happily, the code didn't find Wally (finding him is enough of a challenge for me that I would rate it as a suitable Turing test).
But at the end of the page it does offer a solution for implementing your own OCR method for grading MCQ exam papers, which is fantastic!

And one can even find Mathematica code online. But most of those algorithms are based on the idea that we look for similarities with Waldo's face, as described in problem 3 on that webpage. You can find papers on that problem, e.g. Friendly & Kwan (2009) (based on statistical techniques, but Waldo is really a pretext to discuss other issues), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

Collapsing Methods for DNA-Sequence Analysis — SNP & Variation Suite v7.6.5 Documentation

Traditional association techniques used in GWAS studies are not really suitable for sequence analysis, because they do not have the power to detect the significance of rare variants individually, nor do they provide tools for measuring their compound effect, referred to as rare variant burden. To do this, it is necessary to "collapse" several variants into a single covariate based on regions such as genes.
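To make the collapsing idea concrete, here's a minimal Python sketch of CMC-style collapsing (my own toy function, not SVS code): within a region such as a gene, each individual gets a single 0/1 covariate indicating whether they carry any rare variant.

```python
def cmc_collapse(genotypes, maf, maf_threshold=0.01):
    """CMC-style collapsing: each individual is scored 1 if they carry at
    least one rare variant in the region, else 0. `genotypes` is one list
    of minor-allele counts per individual; `maf` gives each site's minor
    allele frequency."""
    rare = [i for i, f in enumerate(maf) if f < maf_threshold]
    return [1 if any(g[i] > 0 for i in rare) else 0 for g in genotypes]

# Three individuals, three sites; sites 2 and 3 are rare (MAF < 1%)
carriers = cmc_collapse(
    [[0, 1, 0], [0, 0, 0], [2, 0, 1]],
    maf=[0.20, 0.005, 0.008],
)   # -> [1, 0, 1]
```

The collapsed 0/1 covariate can then go into an ordinary regression or chi-square test, which is exactly where the rare-variant burden regains power.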

Interesting to note that SVS has already added rare variant analysis to their toolset. I believe this was a collaboration with Baylor. Wondering whether new methods might be added, or whether researchers will use CMC & KBAC more now that they are available in a commercial package ..

Statistical Tests for Detecting Rare Variants Using Variance-Stabilising Transformations.


1. Ann Hum Genet. 2012 Jun 25. doi: 10.1111/j.1469-1809.2012.00718.x. [Epub ahead of print]

Statistical Tests for Detecting Rare Variants Using Variance-Stabilising Transformations.

Wang K, Fingert JH.


Department of Biostatistics, College of Public Health, The University of Iowa, Iowa City, IA, USA Department of Ophthalmology and Visual Sciences, Carver College of Medicine, The University of Iowa, IA, USA.


Next generation sequencing holds great promise for detecting rare variants underlying complex human traits. Due to their extremely low allele frequencies, the normality approximation for a proportion no longer works well. The Fisher's exact method appears to be suitable but it is conservative. We investigate the utility of various variance-stabilising transformations in single marker association analysis on rare variants. Unlike a proportion itself, the variance of the transformed proportions no longer depends on the proportion, making application of such transformations to rare variant association analysis extremely appealing. Simulation studies demonstrate that tests based on such transformations are more powerful than the Fisher's exact test while controlling for type I error rate. Based on theoretical considerations and results from simulation studies, we recommend the test based on the Anscombe transformation over tests with other transformations.
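To illustrate the idea (my own sketch of the general approach, not necessarily the authors' exact statistic): the Anscombe transform of a binomial count has variance of roughly 1/(4n) regardless of the underlying proportion, so a simple z test can be built on transformed allele counts even when frequencies are tiny.

```python
import math

def anscombe(x, n):
    """Anscombe variance-stabilising transform of a binomial count x out of n."""
    return math.asin(math.sqrt((x + 3.0 / 8.0) / (n + 3.0 / 4.0)))

def anscombe_z(x_case, n_case, x_ctrl, n_ctrl):
    """Two-sample z statistic on transformed minor-allele counts.
    After transformation the variance is ~1/(4n), independent of the
    underlying proportion -- the property that makes this appealing
    for rare variants."""
    diff = anscombe(x_case, n_case) - anscombe(x_ctrl, n_ctrl)
    se = math.sqrt(1.0 / (4.0 * n_case) + 1.0 / (4.0 * n_ctrl))
    return diff / se

# e.g. 5 minor alleles in 2000 case chromosomes vs 1 in 2000 controls
z = anscombe_z(5, 2000, 1, 2000)
```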

© 2012 The Authors Annals of Human Genetics © 2012 Blackwell Publishing Ltd/University College London.

PMID: 22724536 [PubMed - as supplied by publisher]

Error-correcting properties of the SOLiD Exact Call Chemistry.


1. BMC Bioinformatics. 2012 Jun 22;13(1):145. [Epub ahead of print]

Error-correcting properties of the SOLiD Exact Call Chemistry.

Massingham T, Goldman N.




The Exact Call Chemistry for the SOLiD Next-Generation Sequencing platform augments the two-base-encoding chemistry with an additional round of ligation, using an alternative set of probes, that allows some mistakes made when reading the first set of probes to be corrected. Additionally, the Exact Call Chemistry allows reads produced by the platform to be decoded directly into nucleotide sequence rather than its two-base 'color' encoding.


We apply the theory of linear codes to analyse the new chemistry, showing the types of sequencing mistakes it can correct and identifying those where the presence of an error can only be detected. For isolated mistakes that cannot be unambiguously corrected, we show that the type of substitution can be determined, and its location can be narrowed down to two or three positions, leading to a significant reduction in the number of plausible alternative reads.


The Exact Call Chemistry increases the accuracy of the SOLiD platform, enabling many potential miscalls to be prevented. However, single miscalls in the color sequence can produce complex but localised patterns of error in the decoded nucleotide sequence. Analysis of similar codes shows that some exist that, if implemented in alternative chemistries, should have superior performance.
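For the curious, two-base encoding is equivalent to an XOR over 2-bit base codes, which makes the decoding step, and the fragility that the Exact Call Chemistry addresses, easy to see in a few lines of Python (a toy sketch, not ABI's actual decoder):

```python
# Base <-> 2-bit code; the color between two adjacent bases is the XOR
# of their codes, so decoding is a running XOR from the known primer base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = {v: k for k, v in CODE.items()}

def decode_colorspace(primer, colors):
    """Decode a color-space read, e.g. primer 'T' plus colors [3, 0, 1].
    Each base call depends on the previous one, so a single wrong color
    corrupts every downstream base -- the error mode ECC guards against."""
    prev = CODE[primer]
    seq = []
    for c in colors:
        prev ^= c
        seq.append(BASE[prev])
    return "".join(seq)

decode_colorspace("T", [3, 0, 1])   # -> "AAC"
decode_colorspace("T", [2, 0, 1])   # one flipped color -> "CCA"
```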

PMID: 22726842 [PubMed - as supplied by publisher]

Sunday, 24 June 2012

Ray 2.0.0 codenamed "Dark Astrocyte of Knowledge" is available for download.


Ray 2.0.0 codenamed "Dark Astrocyte of Knowledge" is available for download.
This version ships with RayPlatform 1.0.3 codenamed "Gray Pylon of Wisdom".

Not much changed since v2.0.0-rc8.

Ray 2.0.0 can do de novo assembly of metagenomes and also taxonomic profiling
with k-mers.

To get Ray v2.0.0:

Also, there is a new section on the website for
frequently asked questions.

MyGenome for iPad on the iTunes App Store

MyGenome empowers you to explore a real human genome, find out about possible health implications, and view reports about important genetic variations. The MyGenome app provides a simple, intuitive and educational interface for genome exploration and learning.

The MyGenome app features:
•Actual genome of Illumina CEO, Jay Flatley (donated for educational purposes)
•Genome Map, Health Cards and Reports to explore the wealth of information that can be obtained through accessing the genome
•Video journey into the genome

Key Features:

Genome Map
•Tour the landscape of chromosomes and see how genetic variants in different locations translate into health impacts or biological traits.
•View individual genes, their locations, and biological impacts
•Visualize where and how genome sequences differ from the "reference" human genome
•Learn how much we understand about the variation in the human genome and how much more we have to learn

Health Cards
•Explore disease risks, genetically determined conditions and predispositions, and carrier traits
•Discover how different genetic variants contribute to health risks and can be passed on to children
•Find out how changes in the genome affect drug response

•Investigate the possible health impacts of genetic variants for > 200 conditions!
•See reports that illustrate how genetic information will likely be delivered in the future and used by medical professionals.

Soon, you and your physician will be able to sequence, download and explore your own genome. To learn more about individual genome sequencing or to find out about upcoming MyGenome app store releases, please visit

Saturday, 23 June 2012

Chief Data Scientist at EMC in Singapore - Job | LinkedIn

Interesting! I never knew Greenplum existed ..
A little sad/pointless though if the software is free for the single-node edition ...

Job Description


  • Partner directly with APJ regional leadership and regional field, Greenplum Data Science leadership, and customers/prospects to establish a robust vision for the build-out of APJ's Data Science team.
  • While managing existing team members, lead the recruiting and onboarding of a larger APJ regional Data Science team that addresses vertical and analytical knowledge requirements.
  • Drive evangelization and education of Data Science services to Greenplum's APJ sales force, in particular educating the field on how to communicate the vision and value of advanced analytics, how to qualify interested prospects, and how to propose Data Science services.
  • While working with customers and prospects, leverage significant experience directly working with data to define analytics use-cases that address customer requirements for value generation, and partner with Data Scientists to execute on these.
  • Advise customers and prospects on technology and tool selection to best meet their emerging analytics requirements and to best drive value-generation on existing and future data.
  • Lead relationship development and technology evaluation for new prospective regional analytics-centric partnerships.
  • Work directly with customers to educate them on Greenplum's technologies, analytical use-cases, pros/cons of emerging tools, etc.
  • Assist in customer engagement management, requirements definition, project scoping, timeline management, and results documentation to ensure professional relationship management with regional customers.
  • Travel, as needed, to meet with customers (roughly 40-50%).

Desired Skills & Experience

  • 5-10 years of experience and a proven passion for generating insights from data, with a strong familiarity with the higher-level trends in data growth, open-source platforms, and public data sets.
  • A proven track record of building the function of data science, analytics as a service, or teams of data miners / machine-learning practitioners
  • 5 to 10 years of experience in managing small to mid-sized teams, preferably in the services functions.
  • Significant experience evangelizing the value of data analytics to broad audiences.
  • At least 3 years of work in a related role within the APJ region, showing strong understanding of country-level industry players, vertical market trends, and status of data utilization.
  • Strong knowledge of statistical methods generally, and particularly in the areas of modeling and business analytics
  • Experience working with a variety of statistical languages and packages, including R, S-Plus, SAS and Matlab, and/or Mahout
  • Experience working with relational databases and/or distributed computing platforms, and their query interfaces, such as SQL, MapReduce, PIG, and Hive.
  • Preferably, experience working hands-on with large-scale data sets
  • Familiarity with additional programming languages, including Python, Java, and C/C++.
  • Experience leveraging visualization software and techniques (including Tableau), and business intelligence (BI) software, such as Microstrategy, Cognos, Pentaho, etc.
  • Technical knowledge of distributed computing platforms, and common data process flows from data instrumentation & generation, to ETL, to the data warehouse itself.
  • Advanced degree (PhD or Masters) in an analytical or technical field (e.g. applied mathematics, statistics, physics, computer science, operations research)
  • A strong business-orientation, able to select the appropriate complex quantitative methodologies in response to specific business goals
  • A team player, who is excited and motivated by hard technical challenges
  • Results-driven, self-motivated, self-starter
  • Excellent written, verbal, and presentation skills in at least 1 key language relevant for APJ in addition to English
  • Ability to travel as-needed to meet with customers, throughout the region.

Greenplum is setting the pace in the Big Data Analytics space. We are growing rapidly and providing solutions to major companies in the industry.

Company Description

EMC provides the technologies and tools that can help you release the power of your information. We can help you design, build, and manage flexible, scalable, and secure information infrastructures. And with these infrastructures, you'll be able to intelligently and efficiently store, protect, and manage your information so that it can be made accessible, searchable, shareable, and, ultimately, actionable. We believe that information is a business's most important asset. Ideas—and the people who come up with them—are the only real differentiator. Our promise is to help you take that differentiator as far as possible. We will deliver on this promise by helping organizations of all sizes manage more information more effectively than ever before. We will provide solutions that meet and exceed your most demanding business and IT challenges. We will bring your information to life. DISCUSS all things EMC, right here on LinkedIn! This page maintained by @kemipa

Friday, 22 June 2012

Why You Should Care About Segmental Duplications | Our 2 SNPs…(R)

Excellent blog post showing how segmental duplications can skew your CNV analysis & SNP calling. The latter was something I wasn't aware of ....

Excerpted ...

Alert followers of this blog may recall a cautionary statement I made previously about working with Illumina CNV data — that males and females sometimes have different baseline signal intensity levels (this was more of a GenomeStudio software issue than a hardware problem). To find out if this issue affects the Omni2.5, I ran a simple t-test to compare the Log-R Ratio (LR) intensity values between males and females across the genome. The results are shown in the Manhattan Plot
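The per-marker test in the excerpt is straightforward to reproduce. Here's a pure-Python Welch t statistic on toy Log-R Ratio data (my own sketch, not the blog author's code):

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples; at a given marker,
    a large |t| flags a sex-dependent intensity baseline (or a marker
    sitting in a segmental duplication with sex-biased copy number)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Toy marker whose male Log-R Ratio baseline is shifted down
males = [-0.30, -0.25, -0.35, -0.28, -0.32]
females = [0.01, -0.02, 0.03, 0.00, -0.01]
t = welch_t(males, females)   # strongly negative
```

Run genome-wide, the resulting statistics are what get plotted in the Manhattan plot the excerpt mentions.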


Thursday, 21 June 2012

BMC Genomics| Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort

BMC Genomics 2012, 13:241 doi:10.1186/1471-2164-13-241

Published: 15 June 2012

Abstract (provisional)


Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets.


Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500 K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550 K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs.


Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits.
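To get a feel for the Gaussian Mixture Model idea (a generic textbook EM sketch, not the authors' method): copy-number states show up as separate components in the intensity distribution, and EM recovers their weights, means and variances.

```python
import math
import random
import statistics

def gmm_em_1d(x, means, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture; `means` are initial centres."""
    k = len(means)
    w = [1.0 / k] * k
    var = [statistics.variance(x)] * k
    mu = list(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for xi in x:
            dens = [w[j] / math.sqrt(2 * math.pi * var[j])
                    * math.exp(-(xi - mu[j]) ** 2 / (2 * var[j]))
                    for j in range(k)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate mixture weights, means and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(x)
            mu[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var[j] = max(sum(r[j] * (xi - mu[j]) ** 2
                             for r, xi in zip(resp, x)) / nj, 1e-4)
    return w, mu, var

# Toy intensities: 200 diploid samples near 0, 50 deletions near -0.5
random.seed(1)
lrr = [random.gauss(0.0, 0.1) for _ in range(200)] + \
      [random.gauss(-0.5, 0.1) for _ in range(50)]
w, mu, var = gmm_em_1d(lrr, means=[0.1, -0.6])
```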

introduction to R slides for a 5 day course at King's College London, 2011

Introduction to R
slides for a 5 day course at King's College London, 2011

saw this @

Tuesday, 19 June 2012

Two-Stage Extreme Phenotype Sequencing Design for Discovering and Testing Common and Rare Genetic Variants: Efficiency and Power.


1. Hum Hered. 2012 Jun 7;73(3):139-147. [Epub ahead of print]

Two-Stage Extreme Phenotype Sequencing Design for Discovering and Testing Common and Rare Genetic Variants: Efficiency and Power.

Kang G, Lin D, Hakonarson H, Chen J.


Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pa., USA.


Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated to be a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants of a large number of individuals are genotyped to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score test for gene-based association analyses evaluating the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7-99th quantile were included in stage I, the two-stage design could achieve the same power as or even higher than the one-stage design if the rare causal variants had large effect sizes. In such design, fewer than half of the total SNPs including more than half of the causal SNPs were discovered, which included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable to identify multiple rare variants having small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies.
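The stage-I selection step is easy to picture in code. A toy sketch (my own, for illustration) that picks the phenotype tails for sequencing:

```python
import random

def select_extremes(traits, tail=0.01):
    """Return indices of individuals in the lower and upper `tail` fraction
    of the phenotype distribution -- the stage-I discovery sample."""
    ranked = sorted(range(len(traits)), key=lambda i: traits[i])
    k = max(1, int(len(traits) * tail))
    return ranked[:k], ranked[-k:]

# Toy cohort of 1000 individuals with a normally distributed trait
random.seed(0)
traits = [random.gauss(0, 1) for _ in range(1000)]
low, high = select_extremes(traits, tail=0.01)  # 10 lowest + 10 highest
```

Variants discovered in these 20 genomes would then be genotyped in the full cohort for the stage-II association test.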

Copyright © 2012 S. Karger AG, Basel.

PMID: 22678112 [PubMed - as supplied by publisher]

Fwd: [Velvet-users] Velvet 1.2.07

Dear Velvet users,

Velvet 1.2.07 is now available on github or at

In it:
- David Powell added file format option '-fmtAuto' to auto-detect compression (using gunzip/bunzip2) and format (only FastA or FastQ for now).
- Yasubumi Sakakibara and Tsuyoshi Hachiya updated MetaVelvet
- I silenced a bug in unit testing spotted by Nathan Weeks
- A compilation bug was corrected.
- I corrected a memory compilation bug reported by @thakki



Velvet-users mailing list

Monday, 18 June 2012

Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis.


1. PLoS One. 2012;7(6):e38538. Epub 2012 Jun 4.

Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis.

Germain M, Saut N, Oudot-Mellakh T, Letenneur L, Dupuy AM, Bertrand M, Alessi MC, Lambert JC, Zelenika D, Emmerich J, Tiret L, Cambien F, Lathrop M, Amouyel P, Morange PE, Trégouët DA.


INSERM UMR_S 937, ICAN Institute, Université Pierre et Marie Curie, Paris, France.


By applying an imputation strategy based on the 1000 Genomes project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses that had been conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus where twelve SNPs exhibited association p-values lower than 2.23 10(-11) and the use of independent case-control samples demonstrated that the culprit variant was a rare variant located ∼1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and even not well imputed in the original GWAS samples. This variant was in fact the rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings.

PMCID: PMC3366937
PMID: 22675575 [PubMed - in process]

Sunday, 17 June 2012

Two Adaptive Weighting Methods to Test for Rare Variant Associations in Family-Based Designs - Fang - 2012 - Genetic Epidemiology - Wiley Online Library


  • family-based design;
  • rare variants;
  • adaptive weights;
  • quantitative traits

Although next-generation DNA sequencing technologies have made rare variant association studies feasible and affordable, the development of powerful statistical methods for rare variant association studies is still under way. Most of the existing methods for rare variant association studies compare the number of rare mutations in a group of rare variants (in a gene or a pathway) between cases and controls. However, these methods assume that all causal variants are risk to diseases. Recently, several methods that are robust to the direction and magnitude of effects of causal variants have been proposed. However, they are applicable to unrelated individuals only, whereas family data have been shown to improve power to detect rare variants. In this article, we propose two adaptive weighting methods for rare variant association studies based on family data for quantitative traits. Using extensive simulation studies, we evaluate and compare our proposed methods with two methods based on the weights proposed by Madsen and Browning. Our results show that both proposed methods are robust to population stratification, robust to the direction and magnitude of the effects of causal variants, and more powerful than the methods using weights suggested by Madsen and Browning, especially when both risk and protective variants are present. Genet. Epidemiol. 36:499-507, 2012. © 2012 Wiley Periodicals, Inc.
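For reference, the Madsen-Browning weighting the authors compare against upweights variants that are rare among unaffected individuals. A sketch, to the best of my understanding of the 2009 paper:

```python
import math

def madsen_browning_weight(minor_count_unaff, n_unaff):
    """Weight for one variant, larger for variants rare among unaffected
    individuals. q is a shrunk minor-allele-frequency estimate from the
    n_unaff unaffected genotypes."""
    q = (minor_count_unaff + 1.0) / (2.0 * n_unaff + 2.0)
    return 1.0 / math.sqrt(n_unaff * q * (1.0 - q))

# A near-singleton gets a far larger weight than a common variant
w_rare = madsen_browning_weight(1, 500)     # MAF ~ 0.2%
w_common = madsen_browning_weight(200, 500) # MAF ~ 20%
```

Because the weights are always positive, a weighted-sum score implicitly assumes all rare variants push in the same direction, which is exactly the limitation the adaptive-weighting methods above try to remove.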

Biases and Errors on Allele Frequency Estimation and Disease Association Tests of Next-Generation Sequencing of Pooled Samples - Chen - 2012 - Genetic Epidemiology - Wiley Online Library

Next-generation sequencing is widely used to study complex diseases because of its ability to identify both common and rare variants without prior single nucleotide polymorphism (SNP) information. Pooled sequencing of implicated target regions can lower costs and allow more samples to be analyzed, thus improving statistical power for disease-associated variant detection. Several methods for disease association tests of pooled data and for optimal pooling designs have been developed under certain assumptions of the pooling process, for example, equal/unequal contributions to the pool, sequencing depth variation, and error rate. However, these simplified assumptions may not portray the many factors affecting pooled sequencing data quality, such as PCR amplification during target capture and sequencing, reference allele preferential bias, and others. As a result, the properties of the observed data may differ substantially from those expected under the simplified assumptions. Here, we use real datasets from targeted sequencing of pooled samples, together with microarray SNP genotypes of the same subjects, to identify and quantify factors (biases and errors) affecting the observed sequencing data. Through simulations, we find that these factors have a significant impact on the accuracy of allele frequency estimation and the power of association tests. Furthermore, we develop a workflow protocol to incorporate these factors in data analysis to reduce the potential biases and errors in pooled sequencing data and to gain better estimation of allele frequencies. The workflow, Psafe, is available at
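A toy illustration of why error rates bias pooled frequency estimates (a deliberately simplified model, nothing to do with Psafe's internals): if each read is miscalled with symmetric probability e, the observed alt fraction is p*(1-e) + (1-p)*e, which can be inverted:

```python
def pooled_allele_freq(alt_reads, total_reads, error_rate=0.0):
    """Pooled allele-frequency estimate with a simple symmetric error
    correction: observed alt fraction = p*(1-e) + (1-p)*e, solved for p
    and clamped to [0, 1]."""
    p_obs = alt_reads / total_reads
    if error_rate > 0.0:
        p = (p_obs - error_rate) / (1.0 - 2.0 * error_rate)
        return min(max(p, 0.0), 1.0)
    return p_obs
```

Even this crude model shows the problem for rare variants: with a 1% error rate, a true frequency of 0.5% is indistinguishable from pure noise, which is why the paper's empirically calibrated corrections matter.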

Detection of identity by descent using next-generation whole genome sequencing data


Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, including identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease causing loci that were previously undetectable with sparser genetic data.


Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influences the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or higher with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.


We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.

Thursday, 14 June 2012

Fwd: Family-based association studies for next-generation sequencing.


1. Am J Hum Genet. 2012 Jun 8;90(6):1028-45.

Family-based association studies for next-generation sequencing.

Zhu Y, Xiong M.


Human Genetics Center and Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA.


An individual's disease risk is determined by the compounded action of both common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also suffer from remarkable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure and a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based functional principal-component analysis (FPCA) with or without smoothing, a generalized T(2), combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through intensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has the correct type I error rates and much more power to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect from other population-based or family-based association analysis methods. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA has a much smaller p value than other statistics.

Copyright © 2012 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

PMID: 22682329 [PubMed - in process]

SEQuel: improving the accuracy of genome assemblies.

Bioinformatics. 2012 Jun 15;28(12):i188-i196.
SEQuel: improving the accuracy of genome assemblies.
Ronen R, Boucher C, Chitsaz H, Pevzner P.

Bioinformatics Graduate Program, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093 and Department of Computer Science, Wayne State University, Detroit, MI 48202, USA.


Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.


SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels, and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.


SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at


PMID: 22689760 [PubMed - in process]
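The positional de Bruijn graph mentioned in the abstract can be sketched in a few lines. This is a toy illustration of the idea, not SEQuel's actual implementation: nodes are (k-mer, position-bucket) pairs, so the same k-mer occurring at distant genomic positions is kept as distinct nodes instead of being collapsed.

```python
from collections import defaultdict

def positional_debruijn(reads, k, bucket=5):
    """Toy positional de Bruijn graph (an illustration, not SEQuel's code):
    nodes are (k-mer, position bucket) pairs, edges connect consecutive
    k-mers within a read, so a repeated k-mer at distant positions stays
    as distinct nodes."""
    graph = defaultdict(set)
    for start, read in reads:  # start = approximate mapping position
        for i in range(len(read) - k):
            u = (read[i:i + k], (start + i) // bucket)
            v = (read[i + 1:i + 1 + k], (start + i + 1) // bucket)
            graph[u].add(v)
    return graph

# Two reads that share "ACGT": their k-mers merge only when their
# approximate positions agree.
g = positional_debruijn([(0, "ACGTACGT"), (4, "ACGTTTTT")], k=3)
```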

Wednesday, 13 June 2012

Bioinformatician "Data Finish & Quality Control" - SEQanswers

Another in my series on the requirements for a bioinformatics guy ... my comments in red 

We are seeking a bioinformatician, computational biologist or software developer (I think they are really saying they want a guy who is all of the above, or at least one of them but they will train you to do the other roles :) ) with knowledge of biology (oops, yes, you need to be a biologist as well :) ) to work within the Production Bioinformatics Team. The 2 main axes of activity will be the development of integrated tools for tracking information at all stages of the sequencing process and the automated management, analysis and quality control of large amounts of sequence data. The successful candidate will report to the Production Bioinformatics Team Leader. 

- Master's degree in Bioinformatics or related discipline with at least 1 year experience, or Bachelor's degree with 3+ years of practical experience (I really do feel that a Master's doesn't make a person a better bioinformatician, but I am assuming you want someone with 3 years of experience, 2 of which can come from doing an MSc, and that you do not wish to hire a PhD)
- Very able in Unix, Linux and Windows operating systems and networks (are you discounting my Mac skills?)
- In depth experience with Unix shell scripts and Python 
- Excellent teamwork, organizational and communicative skills 
- Good spoken and written English 

- Experience in genomic data analysis 
- Proficiency in SQL. (I suppose this is for the LIMS)
- Knowledge of Perl, R and/or C/C++ (ah, C++ is optional! I am glad my C++ skills slowly atrophied to non-existence)
- Experience with batch processing on a cluster 
- Experience with next generation sequencing data 
- Experience in a high-tech organization of several interacting, specialized teams 

- Ongoing development of the quality control pipeline on a computer cluster so as to assure automated continuous operation and integration with LIMS (Lab Information Management System) 
- Maintenance of an automated system for determining sequence quality based on signal quality, alignment to reference genomes and other measurements 
- Management, analysis and quality control of large amounts of sequence data 
- Efficient testing and integration into the production pipeline of new software as it becomes available 
- Development of new software and IT solutions for the fast-changing field of DNA sequencing 
- Communication and troubleshooting with sequencing and bioinformatics groups 

No offense meant by my tongue-in-cheek comments! But I do find an increasing demand for biologists who know programming/scripting, or computational people who know biology .. is it really a hard mix to find? Or is it a communication problem? I admit I straddle both roles without excelling in either .. I do feel that an excellent communicator who could bridge the gap would actually accomplish more than a search for a superman who handles both roles, doesn't have a PhD to further a career in research, and doesn't mind 'research' pay .. 

k my 2 cents

Tuesday, 12 June 2012

Elements of Bioinformatics

The folks over at Eagle Genomics have put up a rather creative way to navigate the maze of bioinformatics software: a periodic table, with the 'elements' grouped according to their purpose. Creative, eh? It's pretty too!

Sunday, 10 June 2012

[Biopython] EU-codefest

---------- Forwarded message ----------
From: "Peter Cock"
Date: Jun 10, 2012 6:25 PM
Subject: [Biopython] EU-codefest

Dear Biopythoneers,

Some of you might like to attend an Open-Bio Hackathon in Italy this
summer - 19 and 20 July 2012, in Lodi.

This is about a week after BOSC and the pre-BOSC CodeFest in California


---------- Forwarded message ----------
From: *Pjotr Prins*
Date: Saturday, June 9, 2012
Subject: EU-codefest

Hi Chris and Peter,

Would you mind sending a reminder of the EU-codefest to your lists?

Registration form is up:

Three main topics will be worked on during the CodeFest:

  NGS and high performance parsers for OpenBio projects.
  RDF and semantic web for bioinformatics.
  Bioinformatics pipelines definition, execution and distribution.

other tracks are welcome!

Biopython mailing list  -

An Explanation of Velvet Parameter exp_cov | Homologus

Appropriate choice of the 'exp_cov' (expected coverage) parameter in Velvet is very important to get an assembly right. In the following figure, we show data from a calculation on a set of reads taken from a 3Kb region of a genome, and reassembling them with varying exp_cov parameters. X-axis in the chart shows the exp_cov and y-axis shows the size of the largest scaffold assembled by Velvet.
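If you want to reproduce this kind of sweep yourself, the bookkeeping step is extracting the largest contig length from each run's contigs.fa. A minimal sketch, assuming the standard Velvet header format `>NODE_i_length_L_cov_C` (note Velvet reports these lengths in k-mers, not nucleotides):

```python
import re

def longest_contig(fasta_text):
    """Return the largest contig length from a Velvet contigs.fa, reading
    lengths from headers like '>NODE_1_length_2273_cov_24.5'. (Header
    format assumed from standard Velvet output; lengths are in k-mers.)"""
    lengths = [int(m.group(1))
               for m in re.finditer(r">NODE_\d+_length_(\d+)_cov", fasta_text)]
    return max(lengths) if lengths else 0

fa = ">NODE_1_length_2273_cov_24.5\nACGT\n>NODE_2_length_812_cov_19.1\nACGT\n"
print(longest_contig(fa))  # 2273
```

Run velvetg once per exp_cov value, call this on each contigs.fa, and plot the result against exp_cov to recreate the figure described above.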

Friday, 8 June 2012

Lampreys delete 20% of their genome

Smith says he faced a puzzling problem with the lamprey genome, though. Some DNA sequence he had produced from lamprey sperm cells simply wasn't lining up with the lamprey genome assembled by Sanger. Some bits aligned partially, and then veered off into unmatched DNA.  Other bits were completely without a match. "That turned out to be a red herring in a sense," he says. The sequence wasn't lining up because up to about half a billion basepairs of DNA found in the reproductive cells of lampreys is deleted from all other adult cells.

Much of Smith's work has since been trying to figure out both why and how the lamprey seems to make about 20% of its genome disappear during the development of all but its gametes.

Best Things in Life are Free: cost of NGS analysis & open source

Slides from BioIT World Asia 2012
pre-conference short course

Thursday, 7 June 2012

Released VAGUE 1.0 - a JVM-based GUI front-end for Velvet

VAGUE is a GUI for Velvet. It is written in JRuby but compiled to Java
bytecode and will run on Mac and Linux. You need to have the latest Velvet
binaries (>= 1.2.06) as David has made improvements to Velvet to make VAGUE
simpler to use. You can optionally install the tool which I announced
recently on this list.

You can look at screenshots and download it from here:


--Dr Torsten Seemann
--Scientific Director : Victorian Bioinformatics Consortium, Monash
University, AUSTRALIA
--Senior Researcher : VLSCI Life Sciences Computation Centre,
Parkville, AUSTRALIA
Velvet-users mailing list

Seagate GoFlex Desk Thunderbolt Adapter Review | - Storage Reviews

With technology like Aspera and multipart S3 upload, I wonder how many people are still using portable HDDs to transfer data. My lab, coincidentally, has raw sequencing data archived on GoFlex HDDs. Getting data out fast is a pain when you have it split across ten drives.

It would be cool to have this adaptor to speed things up though!
But these being SATA drives, as mentioned, I think you can only reach USB 3 speeds. So even if I were to plug a disk into a Mac via Thunderbolt, compared with wired LAN the speed-up might be just twofold.

For the price, I wonder if getting a small NAS with USB ports to back up the entire portable HDD, then using cron to pull the data to central storage, might be better.

The cool thing about the adaptor is that it really is just a SATA-to-Thunderbolt adaptor. If you have a GoFlex drive, you would know what I mean.

It would be cool if someone made a SATA RAID mirror with a Thunderbolt output adaptor! Then you won't have to use SSDs to achieve higher speeds.

Sent from my iPad

Genomic Dark Matter: The reliability of short read mapping illustrated by the Genome Mappability Score

Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state-of-the-art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and don't directly measure the problematic repeats across the genome. Here we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position, and thus measures the overall composition of the genome itself.

Results: We have developed the Genome Mappability Analyzer (GMA) to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly, and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the dark matter of the genome, including of known clinically relevant variations in these regions.

Availability: The source code and profiles of several model organisms are available at
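The core idea, stripped of the weighting, is per-position read uniqueness. A toy sketch (this is not the paper's GMS formulation, which weights by sequencing error and quality; here a position simply scores 1/occurrences of the read starting there, so repeats drag the score down):

```python
from collections import Counter

def mappability(genome, read_len):
    """Toy per-position uniqueness score in the spirit of the GMS (not the
    paper's weighted probability): 1.0 if the read starting at a position
    occurs exactly once in the genome, else 1/occurrences."""
    reads = [genome[i:i + read_len]
             for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return [1.0 / counts[r] for r in reads]

# "ACGT" occurs twice, so the first and last positions are ambiguous
print(mappability("ACGTACGT", 4))  # [0.5, 1.0, 1.0, 1.0, 0.5]
```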

Wednesday, 6 June 2012

14TB of liver cancer genome data from this available in GigaDB

YT here, plugging the recent ACRG collaboration with BGI; 14TB of liver cancer genome data from it is available in GigaDB.

Genomic DNA was purified for at least 30-fold coverage paired-end (PE) sequencing, and PE reads were mapped on human reference genome (UCSC build hg19) and HBV (NC_003977).  Two sequencing libraries with different insert size were constructed for each genomic DNA sample (200bp and 800bp).  Paired end, 90bp read length sequencing was performed in the HiSeq 2000 sequencer according to the manufacturer's instructions.  Raw gene expression profiling data of these human HCC samples have been deposited to GEO with the accession number GSE25097.

Raw data




May 31, 2012: Data released.

In accordance with our terms of use, please cite this dataset as:

Kan, Z; Zheng, H; Liu, X; Li, S; Barber, TD; Gong, Z; Gao, H; Hao, K; Willard, MD; Xu, J; Hauptschein, R; Rejto, PA; Fernandez, J; Wang, G; Zhang, Q; Wang, B; Chen, R; Wang, J; Lee, NP; Lee, WH; Ariyaratne, PN; Tennakoon, C; Mulawadi, FH; Wong, KF; Liu, AM; Chan, KL; Hu, Y; Chou, WC; Buser, C; Zhou, W; Lin, Z; Peng, Z; Yi, K; Chen, S; Li, L; Fan, X; Yang, J; Ye, R; Ju, J; Wang, K; Estrella, H; Deng, S; Wulur, IH; Liu, J; Ehsani, ME; Zhang, C; Loboda, A; Sung, WK; Aggarwal, A; Poon, RT; Fan, ST; Wang, J; Hardwick, J; Reinhard, C; Dai, H; Li, Y; Luk, JM; Mao, M; the Asian Cancer Research Group (2012): Hepatocellular carcinoma genomic data from the Asia Cancer Research Group. GigaScience.


Related manuscript available at:



Accession codes associated with this data:


MacBookPro - Debian Wiki

You have to respect a man when he installs Debian over Mac OS ... 

Fwd: [Velvet-users] - choose a good k-value for your genome automatically

From: Torsten Seemann

Hi all,

I have written a simple script to choose (or list) good k-values for
YOUR data with YOUR genome.

It needs two things:
(1) the target genome size (can supply a number eg. 4.8M) or a fasta
file of a close reference
(2) your read files (fasta/fastq  and uncompressed/bzip2/gzip should work)

Example uses might be:

# For manual examination
% --size 3.8M reads.fastq  morereads.fa.gz morereads.fq.bz2 paired.fa
K       #Kmers  Kmer-Cov
91      34649310        34.6
93      27719448        27.7
95      20789586        20.8

# For automated scripts
% --genome Ecoli.fna --best reads.fastq  morereads.fa.gz

You can download it from here:

If it is deemed to work well, then we will aim to:
1. incorporate it as "velvetk" in the Velvet distribution
2. rewrite in "C" if needed
3. add a new "auto" option instead of a fixed k-value in velveth.

--Dr Torsten Seemann
--Scientific Director : Victorian Bioinformatics Consortium, Monash
University, AUSTRALIA
--Senior Researcher : VLSCI Life Sciences Computation Centre,
Parkville, AUSTRALIA
Velvet-users mailing list
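The arithmetic behind a tool like this is the standard Velvet relation between read coverage and k-mer coverage, Ck = C·(L−k+1)/L: each read of length L contributes L−k+1 k-mers, so k-mer coverage falls as k grows. A minimal sketch of that calculation (an assumption about the approach, not velvetk's actual code):

```python
def kmer_coverage(read_lengths, k, genome_size):
    """Estimate k-mer coverage for a given k: total k-mers in the reads
    divided by the target genome size. This is the standard Velvet
    relation Ck = C*(L-k+1)/L, not necessarily velvetk's exact code."""
    total_kmers = sum(L - k + 1 for L in read_lengths if L >= k)
    return total_kmers / genome_size

# 1M reads of 100 bp against a 3.8 Mbp genome, k=91
print(round(kmer_coverage([100] * 1_000_000, 91, 3.8e6), 1))  # 2.6
```

Listing this for a range of k values is essentially the K / #Kmers / Kmer-Cov table shown in the manual-examination example above.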

Datanami, Woe be me