Wednesday, 31 October 2012

Article: De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing

Sent via Flipboard

Sent from my phone

Rare variant discovery and calling by sequencing pooled samples with overlaps. Wang W, Yin X, Pyon YS, Hayes M, Li J.

1. Bioinformatics. 2012 Oct 27. [Epub ahead of print]

Rare variant discovery and calling by sequencing pooled samples with overlaps.

Wang W, Yin X, Pyon YS, Hayes M, Li J.


Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA.



For many complex traits/diseases, it is believed that rare variants account for some of the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers, which is important for downstream applications of association analysis of rare variants. However, existing algorithms for analyzing sequence data from overlapping pools are limited.


We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation and variant sample decoding. The framework can be utilized in combination with any design matrix. We have investigated its performance based on two different overlapping designs, and have compared it with three state-of-the-art methods, by simulating targeted sequencing and by pooling real sequence data. Results on both datasets show that our algorithm has made significant improvements over existing ones. In conclusion, successful discovery of rare variants and identification of variant carriers using overlapping pool strategies critically depends on many steps, from generation of design matrixes to decoding algorithms. The proposed framework in combination with the design matrixes generated based on the Chinese remainder theorem achieves best overall results.
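
The Chinese remainder theorem design the authors favour is easy to illustrate: each sample joins one pool per modulus, so any two samples are separated by at least one pool. The sketch below is my own minimal illustration of the idea, not the VIP implementation; the moduli and matrix layout are arbitrary choices.

```python
from math import gcd, prod

def crt_design(n_samples, moduli):
    """Build a 0/1 pooling design matrix (pools x samples) where, for each
    modulus m, sample i is assigned to pool (i mod m).  If the moduli are
    pairwise coprime and their product is >= n_samples, the Chinese
    remainder theorem guarantees every sample has a unique pool signature,
    which is what makes decoding of variant carriers possible."""
    assert all(gcd(a, b) == 1 for i, a in enumerate(moduli)
               for b in moduli[i + 1:]), "moduli must be pairwise coprime"
    assert prod(moduli) >= n_samples, "product of moduli must cover all samples"
    rows = []
    for m in moduli:
        for r in range(m):
            rows.append([1 if i % m == r else 0 for i in range(n_samples)])
    return rows

# 30 samples sequenced in 5 + 7 = 12 pools instead of 30 individual lanes
M = crt_design(30, [5, 7])
```

With moduli [5, 7], 30 samples are covered by only 12 pools, and each sample's pair of pool memberships is unique.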


Source code of the program, termed VIP for Variant Identification by Pooling, is available at


PMID: 23104896 [PubMed - as supplied by publisher]
Rare-variant association methods : Nature Genetics : Nature Publishing Group

Several methods for aggregate rare-variant association testing have recently been reported, including collapsing or weighting methods and gene- or region-based association tests. Although it is possible to estimate the average genetic effect for a group of rare variants from aggregate tests, there are potential biases, including winner's curse, selection procedures and differences between populations. Suzanne Leal and Dajiang Liu now report a new method to correct for bias in estimating the average genetic effect of a group of rare variants jointly analyzed for association, consisting of a resampling-based approach and a bootstrap-sample-split algorithm (Am. J. Hum. Genet. 91, 585–596, 2012). They compare methods for estimating the average genetic effect and variance across a range of models in simulations, finding that the estimated variance is always less than the true locus-specific genetic variance, due to the inclusion of non-causal variants as well as causal variants with heterogeneous effects. The authors report the application of the new method to a resequencing data set of 4 genes in 1,045 individuals from the Dallas Heart Study, testing rare-variant associations with metabolic quantitative traits. The authors demonstrate the efficient estimation of average genetic effects in joint analysis of rare variants and note that estimated variance should be considered as a lower bound for the locus-specific variance.
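
For readers new to aggregate tests, the basic collapsing idea can be sketched in a few lines. This is a generic CAST-style illustration of collapsing rare variants into a carrier indicator, not the bias-corrected estimator of Liu and Leal; the function name and MAF threshold are my own.

```python
def collapsing_test(genotypes, is_case, maf_threshold=0.01):
    """CAST-style collapsing test sketch.  Input: individuals x variants
    genotype matrix (0/1/2 minor-allele counts) and case/control labels.
    Rare variants (MAF < threshold) are collapsed into a single carrier
    indicator per individual, and carrier frequency is compared between
    cases and controls with a 1-df chi-square statistic."""
    n = len(genotypes)
    n_var = len(genotypes[0])
    rare = [j for j in range(n_var)
            if sum(g[j] for g in genotypes) / (2 * n) < maf_threshold]
    carrier = [any(g[j] > 0 for j in rare) for g in genotypes]
    # 2x2 table: a/b = case carrier/non-carrier, c/d = control carrier/non-carrier
    a = sum(1 for c, s in zip(carrier, is_case) if c and s)
    b = sum(1 for c, s in zip(carrier, is_case) if not c and s)
    c = sum(1 for c, s in zip(carrier, is_case) if c and not s)
    d = sum(1 for c, s in zip(carrier, is_case) if not c and not s)
    total = a + b + c + d
    expected = [(a + b) * (a + c) / total, (a + b) * (b + d) / total,
                (c + d) * (a + c) / total, (c + d) * (b + d) / total]
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip([a, b, c, d], expected) if e > 0)
    return chi2, (a, b, c, d)
```

The winner's-curse and selection biases the highlight describes arise exactly because the rare-variant set fed to such a test is itself chosen by the data.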

Also featured in Nature Genetics is another paper from deCODE. Now that sequencing is getting cheaper, research focus should shift to rare variants. I think the next wave in genomics will be using longer-read platforms to look at structural variants, which have greater potential to disrupt the normal working of the genetic machinery. There is probably enough robustness/redundancy built into the human genome to prevent a single variant, or a handful of rare variants, from having catastrophic consequences for human health on a global scale. Disrupting entire regulatory mechanisms through structural variation, however, seems like a surefire way to create the maddening variety of disease states that eludes physicians attempting to cure everybody in the same way.

Nat Genet. 2012 Oct 28. doi: 10.1038/ng.2437. [Epub ahead of print]

A study based on whole-genome sequencing yields a rare variant at 8q24 associated with prostate cancer.


deCODE genetics, Reykjavik, Iceland.


In Western countries, prostate cancer is the most prevalent cancer of men and one of the leading causes of cancer-related death in men. Several genome-wide association studies have yielded numerous common variants conferring risk of prostate cancer. Here, we analyzed 32.5 million variants discovered by whole-genome sequencing of 1,795 Icelanders. We identified a new low-frequency variant at 8q24 associated with prostate cancer in European populations, rs188140481[A] (odds ratio (OR) = 2.90; P_combined = 6.2 × 10^-34), with an average risk allele frequency in controls of 0.54%. This variant is only very weakly correlated (r^2 ≤ 0.06) with previously reported risk variants at 8q24, and its association remains significant after adjustment for all known risk-associated variants. Carriers of rs188140481[A] were diagnosed with prostate cancer 1.26 years younger than non-carriers (P = 0.0059). We also report results for a previously described HOXB13 variant (rs138213197[T]), confirming it as a prostate cancer risk variant in populations from across Europe.
[PubMed - as supplied by publisher]

Dell has sold 1M webscale servers in five years

I truly believe that when genome analysis becomes routine, it will be easier to run the analysis on public clouds than to support in-house servers that don't scale with demand.
One strong reason is that the data needs to be shared across hospitals so that you are not locked into one institution. It doesn't make sense, cost-wise, for different hospitals to each duplicate your data on their own servers.
Also, the data really belongs to the patient, so perhaps the patient should be the one paying to archive the DNA sequencing.

Alternatively, service providers may wish to analyse your data and offer to store it for you in a sort of barter trade (your DNA/disease status for my mega-analysis, and in return I store it for you for 'free').

Monday, 29 October 2012

CLC bio's new read mapping algorithm beats popular open source alternatives in benchmarks - CLC bio

Comments please .. 

Read mapper beats BWA, Bowtie2, and SMALT

We just released a white paper on our new read mapping algorithm, which is included in the current releases of CLC Genomics Workbench 5.5 and CLC Genomics Server 4.5. 

It covers four human datasets and benchmarks for performance and accuracy against the open source read mappers BWA, Bowtie2, and SMALT.

› Read an executive brief and download the white paper 

Sunday, 28 October 2012

Article: Ubuntu lands on Nexus 7 slates with Canonical's one-click installer

Drools ... Must get one of these babies ... 

Ubuntu lands on Nexus 7 slates with Canonical's one-click installer


Saturday, 27 October 2012

Fwd: [GATK-Forum] You earned the First Comment badge.

LOL ...

---------- Forwarded message ----------
From: GATK-Forum <>

You earned the First Comment badge.

Commenting is the best way to get involved. Jump in the fray! +2 points

Follow the link below to check it out:

Have a great day!

Adding Genomic Annotations Using SnpEff and VariantAnnotator - GATK-Forum

I am surprised to find this thread recommending the use of SnpEff 2.0.5 only (and only the GRCh37.64 database) with GATK.
SnpEff is now at version 3, so I wonder whether this recommendation still holds.

One of my biggest reasons to use SnpEff is its indel annotation, and because it runs on Java and the annotation process is pretty fast, it's very amenable to running off my MacBook Pro (or your personal desktop).

Fwd: [R-bloggers] R 2.15.2 now available (and 7 more aRticles)

As promised, the source distribution for R 2.15.2 is now available for download from the master CRAN repository. (Binary distributions for Windows, MacOS and Linux will be available from the CRAN mirror network in the coming days.) This latest point-update — codenamed "Trick or Treat" — improves the performance of the R engine and adds a few minor but useful features. Detailed changes can be found in the NEWS file, but highlights of the improvements include: 

  • New statistical analysis method: Multistratum MANOVA
  • Hyman's method of constructing monotonic interpolation splines is now available.
  • Improved support for Polish language users
  • Functions in the parallel package (such as parLapply) will make use of a default cluster if one is specified.
  • Improved performance and reduced memory usage for some commonly-used functions including array, rep, tabulate and hist
  • Increased memory available for data on 64-bit systems (increased to 32Gb from 16Gb)
  • Several minor bugfixes
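
For the curious, the kind of monotonic interpolation that Hyman's method provides can be illustrated with the closely related Fritsch-Carlson scheme. This is my own Python sketch of the general technique, not R's implementation of `splinefun(method = "hyman")`.

```python
from bisect import bisect_right

def monotone_cubic(xs, ys):
    """Fritsch-Carlson monotone cubic Hermite interpolation: tangents are
    taken from neighbouring secant slopes, then shrunk where necessary so
    the interpolant never overshoots between knots (no spurious wiggles,
    the same goal as Hyman filtering)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    d = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]  # secant slopes
    m = ([d[0]]
         + [0.0 if d[i - 1] * d[i] < 0 else (d[i - 1] + d[i]) / 2
            for i in range(1, n - 1)]
         + [d[-1]])
    for i in range(n - 1):
        if d[i] == 0:                 # flat segment: clamp both tangents
            m[i] = m[i + 1] = 0.0
        else:
            a, b = m[i] / d[i], m[i + 1] / d[i]
            s = a * a + b * b
            if s > 9:                 # sufficient condition for monotonicity
                t = 3 / s ** 0.5
                m[i], m[i + 1] = t * a * d[i], t * b * d[i]

    def f(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 2)
        t = (x - xs[i]) / h[i]
        return ((1 + 2 * t) * (1 - t) ** 2 * ys[i]
                + t * (1 - t) ** 2 * h[i] * m[i]
                + t * t * (3 - 2 * t) * ys[i + 1]
                + t * t * (t - 1) * h[i] * m[i + 1])
    return f
```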

There is likely to be at least one further update to the 2.15.x series: a round-up of any further changes will probably be released as 2.15.3 shortly before R 2.16.0 is released, most likely around March 2013.

r-announce mailing list: R 2.15.2 is released

Tuesday, 23 October 2012

RepeatSeq for accurate genotyping of microsatellite repeats - SEQanswers

Published in NAR today, RepeatSeq is a tool for genotyping tandem repeats from sequencing data. RepeatSeq is stable, very easy to install and use, with very accurate results compared to existing tools and even in Sanger validation of our genotype calls.

Open Article from NAR:

GitHub Download:

Saturday, 20 October 2012

Ensembl release 69 is out!

Subject: [ensembl-announce] Ensembl release 69 is out!

The latest Ensembl update (e!69) has been released :

Here are some highlights:

Human : Human dbSNP has been updated to release 137 and Human somatic variants from COSMIC have been updated to release 60. We now provide the 1000 genomes phase 1 data for the structural variation set, as well as DNA methylation data and ENCODE data.

New species : Ferret (Mustela putorius furo) and Platyfish (Xiphophorus maculatus).

Other news: We are happy to introduce the first inclusion of HAVANA manual curation for Ensembl genes in Pig, the release of the new scrollable region views and a new species home page redesign.

A complete list of the changes in release 69 can be found at

For the latest news on the ensembl project visit our blog at

Best Regards,
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton
Cambridge - CB10 1SD - UK

Thursday, 18 October 2012

"Editable Plots in R" Article: Creating SVG Plots from R

Do check out

It might be useful for those who produce a lot of plots, then manually tweak the R code and re-plot with the data again.
Instead, you could produce an SVG file and edit it in an open source vector graphics editor like Inkscape.

Source: R-bloggers Creating SVG Plots from R

Fwd: One petabyte of data loading into HDFS with in 10 min.

Alright, the title is misleading, but it's amusing to follow this thread on the Hadoop mailing list.
Surprisingly, a lot of the replies have been very helpful (even though they did ask if this question was part of a homework assignment, CHUCKLES).

The original question and further elaboration below.
Hahah, comments PLEASE!

Hi Users,
Please clarify the below questions.
1. With in 10 minutes one petabyte of data load into HDFS/HIVE , how many slave (Data Nodes) machines required.
2. With in 10 minutes one petabyte of data load into HDFS/HIVE, what is the configuration setup for cloud computing.
Please suggest and help me on this.
---------- Forwarded message ----------
From: p K <p>
Date: 10 September 2012 15:40
Subject: Re: One petabyte of data loading into HDFS with in 10 min.
To: user

Hi Users,
Thanks for the response.

We have loaded 100GB data loaded into HDFS, time taken 1hr.with below configuration.

Each Node (1 machine master, 2 machines  are slave)

1.    500 GB hard disk.

2.    4Gb RAM

3.    3 quad code CPUs.

4.    Speed 1333 MHz


Now, we are planning to load 1 petabyte of data (single file)  into Hadoop HDFS and Hive table within 10-20 minutes. For this we need a clarification below.

1. what are the system configuration setup required for all the 3 machine's ?.

2. Hard disk size.

3. RAM size.

4. Mother board

5. Network cable

6. How much Gbps  Infiniband required.

 For the same setup we need cloud computing environment too?

Please suggest and help me on this.



Tuesday, 16 October 2012

Genotype calling and haplotyping in parent-offspring trios

Genotype Calling and Haplotyping in Parent-Offspring Trios
@gabecasis  @genomeresearch


Emerging sequencing technologies allow common and rare variants to be systematically assayed across the human genome in many individuals. In order to improve variant detection and genotype calling, raw sequence data are typically examined across many individuals. Here, we describe a method for genotype calling in settings where sequence data are available for unrelated individuals and parent-offspring trios and show that modeling trio information can greatly increase the accuracy of inferred genotypes and haplotypes, especially on low to modest depth sequencing data. Our method considers both linkage disequilibrium (LD) patterns and the constraints imposed by family structure when assigning individual genotypes and haplotypes. Using simulations, we show trios provide higher genotype calling accuracy across the frequency spectrum, both overall and at hard-to-call heterozygous sites. In addition, trios provide greatly improved phasing accuracy - improving the accuracy of downstream analyses (such as genotype imputation) that rely on phased haplotypes. To further evaluate our approach, we analyzed data on the first 508 individuals sequenced by the SardiNIA sequencing project. Our results show that our method reduces the genotyping error rate by 50% compared to analysis using existing methods that ignore family structure. We anticipate our method will facilitate genotype calling and haplotype inference for many ongoing sequencing projects.
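
The core trick, constraining joint genotypes by Mendelian transmission, can be sketched as follows. This is a toy illustration under simplified assumptions (a Hardy-Weinberg prior, a single mutation-rate parameter, one site at a time), not the authors' method, which additionally models LD patterns across sites.

```python
from itertools import product

def trio_call(lk_father, lk_mother, lk_child, maf=0.01, mu=1e-8):
    """Sketch of trio-aware genotype calling: combine per-individual
    genotype likelihoods P(reads | g), g in {0,1,2} minor-allele copies,
    with a Hardy-Weinberg population prior and Mendelian transmission
    probabilities, then pick the jointly most probable trio configuration.
    A tiny mutation rate mu keeps de novo configurations possible but
    strongly penalized."""
    p = maf
    prior = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p * p}

    def transmit(parent_g):
        # probability that a parent with genotype g transmits the minor allele
        return {0: mu, 1: 0.5, 2: 1 - mu}[parent_g]

    def mendel(gf, gm, gc):
        tf, tm = transmit(gf), transmit(gm)
        return {0: (1 - tf) * (1 - tm),
                1: tf * (1 - tm) + (1 - tf) * tm,
                2: tf * tm}[gc]

    best, best_post = None, -1.0
    for gf, gm, gc in product(range(3), repeat=3):
        post = (prior[gf] * lk_father[gf] *
                prior[gm] * lk_mother[gm] *
                mendel(gf, gm, gc) * lk_child[gc])
        if post > best_post:
            best, best_post = (gf, gm, gc), post
    return best
```

Note how a child whose reads are ambiguous between hom-ref and het gets pulled toward the Mendelian-consistent call when both parents are confidently hom-ref; that is the accuracy gain the abstract reports, in miniature.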

Monday, 15 October 2012

LiftOver - Genome Analysis Wiki

If you ever find yourself needing to convert genome positions from b36 to b37 (or do similar things), you might want to hop over to the wiki page at

It gives detailed instructions on how to achieve this for Merlin/PLINK format files, includes a few scripts, shows you three different methods of doing it, and lists various reasons why liftOver can fail.

A likelihood-based framework for variant calling and de novo mutation detection in families.

PLoS Genet. 2012 Oct;8(10):e1002944. doi: 10.1371/journal.pgen.1002944. Epub 2012 Oct 4.

A likelihood-based framework for variant calling and de novo mutation detection in families.


Center for Human Genetics Research, Department of Physiology and Biophysics, Vanderbilt University, Nashville, Tennessee, United States of America.


Family samples, which can be enriched for rare causal variants by focusing on families with multiple extreme individuals and which facilitate detection of de novo mutation events, provide an attractive resource for next-generation sequencing studies. Here, we describe, implement, and evaluate a likelihood-based framework for analysis of next generation sequence data in family samples. Our framework is able to identify variant sites accurately and to assign individual genotypes, and can handle de novo mutation events, increasing the sensitivity and specificity of variant calling and de novo mutation detection. Through simulations we show explicit modeling of family relationships is especially useful for analyses of low-frequency variants and that genotype accuracy increases with the number of individuals sequenced per family. Compared with the standard approach of ignoring relatedness, our methods identify and accurately genotype more variants, and have high specificity for detecting de novo mutation events. The improvement in accuracy using our methods over the standard approach is particularly pronounced for low-frequency variants. Furthermore the family-aware calling framework dramatically reduces Mendelian inconsistencies and is beneficial for family-based analysis. We hope our framework and software will facilitate continuing efforts to identify genetic factors underlying human diseases.
[PubMed - in process] 

CD-HIT: accelerated for clustering the next generation sequencing data.

1. Bioinformatics. 2012 Oct 11. [Epub ahead of print]

CD-HIT: accelerated for clustering the next generation sequencing data.

Fu L, Niu B, Zhu Z, Wu S, Li W.


Center for Research in Biological Systems, University of California San Diego, La Jolla, California, USA.


SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. AVAILABILITY: CONTACT: SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
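
The greedy incremental strategy CD-HIT is built on is easy to sketch: sort sequences longest-first, and let each one either join the first cluster whose representative it matches at the identity threshold, or found a new cluster. The identity function below is a naive stand-in for CD-HIT's short-word-filtered banded alignment, so treat this as a conceptual outline only.

```python
def greedy_cluster(seqs, identity=0.9):
    """Greedy incremental clustering in the spirit of CD-HIT: each
    sequence, in order of decreasing length, joins the first cluster
    whose representative it matches at >= the identity threshold,
    otherwise it founds a new cluster (and becomes its representative)."""
    def ident(a, b):
        # fraction of matching positions over the shorter sequence,
        # compared head-on; a crude stand-in for a real alignment identity
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / min(len(a), len(b))

    clusters = []  # list of (representative, members)
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if ident(rep, s) >= identity:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

CD-HIT's parallelization speeds up exactly the expensive inner loop here, the comparison of each new sequence against existing representatives.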

PMID: 23060610 [PubMed - as supplied by publisher]

[Pub] CLEVER: Clique-Enumerating Variant Finder

1. Bioinformatics. 2012 Oct 11. [Epub ahead of print]

CLEVER: Clique-Enumerating Variant Finder.

Marschall T, Costa I, Canzar S, Bauer M, Klau GW, Schliep A, Schönhuth A.


Centrum Wiskunde & Informatica, Amsterdam, the Netherlands, Federal University of Pernambuco, Recife, Brazil, Illumina, Cambridge, UK, Rutgers, The State University of New Jersey, Piscataway, NJ, USA.



Next-generation sequencing techniques have facilitated large scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals.


Here we present a novel internal segment size based approach, which organizes all, including concordant, reads into a read alignment graph where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance in particular for indels of length 20-100nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular for insert size based approaches. In this size range we outperform even split read aligners. We achieve competitive results also on biological data where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners.


CLEVER is open source (GPL) and available from

PMID: 23060616 [PubMed - as supplied by publisher]

Saturday, 6 October 2012

Thursday, 4 October 2012

CummeRbund 2.0.0 released

---------- Forwarded message ----------
From: Loyal Goff

CummeRbund 2.0.0 release 10/3/2012

Following the release of Bioconductor 2.11, CummeRbund 2.0.0 is now available and is a recommended update for all CummeRbund users.  

Version 2.0.0 adds a host of new options/features/bugfixes and is the first stable release version to fully support Cuffdiff2.  

More information, details, vignettes, and downloads can be found at  As always, thank you for your support.

Stable public release (Bioconductor 2.11)

- 'annotation' and "annotation<-" generics were moved to BiocGenerics 0.3.2. Now using appropriate generic function, but requiring BiocGenerics >= 0.3.2

- Added replicates argument to csDistHeat to view distances between individual replicate samples.
- Appropriately distinguish now between 'annotation' (external attributes) and features (gene-level sub-features).
- csHeatmap now has 'method' argument to pass function for any dissimilarity metric you desire. You must pass a function that returns a 'dist' object applied to rows of a matrix. Default is still JS-distance.

New Features:
- Added diffTable() method to return a table of differential results broken out by pairwise comparison. (more human-readable)
- Added sigMatrix() method to CuffSet objects to draw heatmap showing number of significant genes by pairwise comparison at a given FDR.
- A call to fpkm() now emits calculated (model-derived) standard deviation field as well.
- Can now pass a GTF file as argument to readCufflinks() to integrate transcript model information into database backend
* Added requirement for rtracklayer and GenomicFeatures packages.
* You must also indicate which genome build the .gtf was created against by using the 'genome' argument to readCufflinks.
- Integration with Gviz:
* CuffGene objects now have a makeGeneRegionTrack() argument to create a GeneRegionTrack() from transcript model information
* Can also make GRanges object
* ONLY WORKS IF YOU READ .gtf FILE IN WITH readCufflinks()
- Added csScatterMatrix() and csVolcanoMatrix() method to CuffData objects.
- Added fpkmSCVPlot() as a CuffData method to visualize replicate-level coefficient of variation across fpkm range per condition.
- Added PCAplot() and MDSplot() for dimensionality reduction visualizations (Principle components, and multi-dimensional scaling respectively)
- Added csDistHeat() to create a heatmap of JS-distances between conditions.  
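
The JS distance that csDistHeat and findSimilar rely on is the square root of the Jensen-Shannon divergence between expression profiles normalized to probability vectors. A minimal sketch, assuming base-2 logarithms (the log base here is my assumption, not necessarily cummeRbund's):

```python
from math import log

def normalize(v):
    """Scale a non-negative expression profile to a probability vector."""
    s = sum(v)
    return [x / s for x in v]

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs) between
    two probability vectors; symmetric, and bounded in [0, 1]."""
    def kl(a, b):
        return sum(x * log(x / y, 2) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)) ** 0.5
```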

- Fixed diffData 'features' argument so that it now does what it's supposed to do.
- added DB() with signature(object="CuffSet") to NAMESPACE

- Once again, there have been modifications to the underlying database schema so you will have to re-run readCufflinks(rebuild=T) to re-analyze existing datasets.
- Importing 'defaults' from plyr instead of requiring entire package (keeps namespace cleaner).
- Set pseudocount=0.0 as default for csDensity() and csScatter() methods (This prevents a visual bias for genes with FPKM <1 and ggplot2 handles removing true zero values).

- Fixed bug in replicate table that did not apply make.db.names to match samples table.
- Fixed bug for missing values in *.count_tracking files.
- Now correctly applying make.db.names to *.read_group_tracking files.
- Now correctly allows for empty *.count_tracking and *.read_group_tracking files

- This represents a major set of improvements and feature additions to cummeRbund.
- cummeRbund now incorporates additional information emitted from cuffdiff 2.0 including:
- run parameters and information.
- sample-level information such as mass and scaling factors.
- individual replicate fpkms and associated statistics for all features.
- raw and normalized count tables and associated statistics all features.

New Features:
- Please see updated vignette for overview of new features.
- New dispersionPlot() to visualize model fit (mean count vs dispersion) at all feature levels.
- New runInfo() method returns cuffdiff run parameters.
- New replicates() method returns a data.frame of replicate-level parameters and information.
- getGene() and getGenes() can now take a list of any tracking_id or gene_short_name (not just gene_ids) to retrieve
a gene or geneset
- Added getFeatures() method to retrieve a CuffFeatureSet independent of gene-level attributes.  This is ideal for looking at sets of features
outside of the context of all other gene-related information (i.e. facilitates feature-level analysis)
- Replicate-level fpkm data now available.
- Condition-level raw and normalized count data now available.
- repFpkm(), repFpkmMatrix, count(), and countMatrix are new accessor methods to CuffData, CuffFeatureSet, and CuffFeature objects.
- All relevant plots now have a logical 'replicates' argument (default = F) that when set to TRUE will expose replicate FPKM values in appropriate ways.
- MAPlot() now has 'useCount' argument to draw MA plots using count data as opposed to fpkm estimates.

- Changed default csHeatmap colorscheme to the much more pleasing 'lightyellow' to 'darkred' through 'orange'.
- SQLite journaling is no longer disabled by default (The benefits outweigh the moderate reduction in load times).

- Numerous random bug fixes to improve consistency and improve performance for large datasets.

-Fixed bug in CuffFeatureSet::expressionBarplot to make compatible with ggplot2 v0.9.

New Features:
- Added 'distThresh' argument to findSimilar.  This allows you to retrieve all similar genes within a given JS distance as specified by distThresh.
- Added 'returnGeneSet' argument to findSimilar.  [default = T] If true, findSimilar returns a CuffGeneSet of genes matching criteria (default). If false, a rank-ordered data frame of JS distance values is returned.
- findSimilar can now take a 'sampleIdList' argument. This should be a vector of sample names across which the distance between genes should be evaluated.  This should be a subset of the output of samples(genes(cuff)).
- Added requirement for 'fastcluster' package.  There is very little footprint, and it makes a significant improvement in speed for the clustering analyses.



Loyal A. Goff, Ph.D
NSF Postdoctoral Fellow
Computer Science and Artificial Intelligence Laboratory - MIT 
Stem Cell and Regenerative Biology - Harvard

Paired-End Sequencing of Long-Range DNA Fragments for De Novo Assembly of Large, Complex Mammalian Genomes by Direct Intra-Molecule Ligation.

1. PLoS One. 2012;7(9):e46211. Epub 2012 Sep 27.

Paired-End Sequencing of Long-Range DNA Fragments for De Novo Assembly of Large, Complex Mammalian Genomes by Direct Intra-Molecule Ligation.

Asan, Geng C, Chen Y, Wu K, Cai Q, Wang Y, Lang Y, Cao H, Yang H, Wang J, Zhang X.


BGI-Shenzhen, Shenzhen, Guangdong, China.



The relatively short read lengths from next generation sequencing (NGS) technologies still pose a challenge for de novo assembly of complex mammal genomes. One important solution is to use paired-end (PE) sequence information experimentally obtained from long-range DNA fragments (>1 kb). Here, we characterize and extend a long-range PE library construction method based on direct intra-molecule ligation (or molecular linker-free circularization) for NGS.


We found that the method performs stably for PE sequencing of 2- to 5-kb DNA fragments, and can be extended to 10-20 kb (and even, in extremes, up to ∼35 kb). We also characterized the impact of low quality input DNA on the method, and developed a whole-genome amplification (WGA) based protocol using limited input DNA (<1 µg). Using this PE dataset, we accurately assembled the YanHuang (YH) genome, the first sequenced Asian genome, into a scaffold N50 size of >2 Mb, which is over 100 times greater than the initial size produced with only small insert PE reads (17 kb). In addition, we mapped two 7- to 8-kb insertions in the YH genome using the larger insert sizes of the long-range PE data.


In conclusion, we demonstrate here the effectiveness of this long-range PE sequencing method and its use for the de novo assembly of a large, complex genome using NGS short reads.
PMID: 23029438 [PubMed - as supplied by publisher]
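
Since the abstract leans on scaffold N50, here is the standard definition in runnable form; a generic sketch of the metric, nothing specific to this paper's pipeline.

```python
def n50(lengths):
    """N50: the largest length L such that contigs/scaffolds of length
    >= L together cover at least half of the total assembly span."""
    total = sum(lengths)
    running = 0
    for L in sorted(lengths, reverse=True):
        running += L
        if running * 2 >= total:
            return L
```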

Datanami, Woe be me