Monday, 30 April 2012

MetaCV: a composition-based algorithm to classify metagenomic reads and function - sample simulated data available for download

We have developed a tool, MetaCV, to classify very short metagenomic reads into specific taxonomic and functional groups.
Compared with previous methods, it has some advantages:
1, Faster and more accurate than previous methods: even ~500 Gb of MetaHIT data takes only 2 days to classify and characterize.
2, Besides taxonomic information, it also provides functional enrichment of the identified microorganisms.
3, Easy to use: the tool is available for download, along with some sample data for simulation.

Sunday, 29 April 2012

Cloud storage: a pricing and feature guide for consumers
Storage is always a top concern for NGS data analysis. For quick and dirty file sharing you can't beat these commercial storage providers; almost everyone now has a Dropbox account that you can share your files with. Ars Technica has a nice article that summarises the supported platforms and the cost if you wish to upgrade your account.

Tech lust

Should be a nifty machine to use on the road ... especially when I actually really need a thin client now .. although .. larger screens are always helpful .. 

Thoughtful design. It goes beyond beauty. Bonded Corning® Gorilla® Glass, machined aluminium and carbon fibre are incorporated to enhance the performance of this Ultrabook™.
Turns on instantly. XPS 13 Ultrabook™ boots in as little as 8 seconds and resumes from sleep mode in just 1 second (4 seconds from deep sleep). A solid-state drive and Intel® Rapid Start Technology make it possible.


XPS 13 Ultrabook™

2nd Generation Intel® Core™ i7 processor
Windows® 7
Memory 4GB
256GB Solid State Drive

Game that teaches you to use vim

VIM Adventures | via Dougal Campbell's geek ramblings

Study of a US cohort supports the role of ZNF644 and high-grade myopia susceptibility.
Mol Vis. 2012;18:937-44. Epub 2012 Apr 12.




Myopia, or nearsightedness, is highly prevalent in Asian countries and is considered a serious public health issue globally. High-grade myopia can predispose individuals to myopic maculopathy, premature cataracts, retinal detachment, and glaucoma. A recent study implicated zinc finger protein 644 isoform 1 (ZNF644) variants with non-syndromic high-grade myopia in a Chinese-Asian population. Herein we focused on investigating the role for ZNF644 variants in high-grade myopia in a United States (US) cohort.


DNA from a case cohort of 131 subject participants diagnosed with high-grade myopia was screened for ZNF644 variants. Spherical refractive error of ≤ -6.00 diopters (D) in at least one eye was defined as affected. All coding, intron/exon boundaries were screened using Sanger sequencing. Single nucleotide allele frequencies were determined by screening 672 ethnically matched controls.


Sequencing analysis did not detect previously reported mutations. However, our analysis identified 2 novel single nucleotide variants (c.725C>T, c.821A>T) in 2 high-grade myopia individuals, one Caucasian and one African American, respectively. These variants were not found in normal controls. A rare variant - dbSNP132 (rs12117237→c.2119A>G) - with a minor allele frequency of 0.2% was present in 6 additional cases, but was also present in 5 controls.


Our study has identified two novel variants in ZNF644 associated with high-grade myopia in a US cohort. Our results suggest that ZNF644 may play a role in myopia development.

[PubMed - in process]

Tweet by Bio-IT World on Twitter

Bio-IT World (@bioitworld) on Twitter:
"Dag: Last month sustained 700 MB/sec for 7 HOURS pushing data into AWS. Enough to handle a genome core facility (60 genomes/day). #BioIT12."
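
A rough sanity check of the numbers in that tweet, as a small Python sketch. Two assumptions that the tweet does not spell out: "MB" is taken as decimal megabytes (10^6 bytes), and the "60 genomes/day" figure is read as the same rate held for a full 24 hours.

```python
# Back-of-the-envelope check of the figures quoted in the tweet.
# Assumption: "MB" means decimal megabytes (10**6 bytes).
MB = 10**6
TB = 10**12

rate_bytes_per_s = 700 * MB      # sustained rate quoted in the tweet
duration_s = 7 * 3600            # 7 hours, in seconds

# Data moved during the 7-hour run.
total_tb = rate_bytes_per_s * duration_s / TB

# Assumption: the same rate held for 24 hours and shared by 60 genomes.
per_day_tb = rate_bytes_per_s * 86_400 / TB
per_genome_tb = per_day_tb / 60

print(f"7-hour transfer: {total_tb:.2f} TB")
print(f"Full day at that rate: {per_day_tb:.2f} TB ({per_genome_tb:.2f} TB per genome)")
```

Under those assumptions the budget works out to roughly 1 TB of transfer per genome per day, which seems plausible for raw sequencing output plus intermediate files.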

Article: Bioinformatics Final Presentation by YoungHoon Gim on Prezi

Bioinformatics Final Presentation by YoungHoon Gim on Prezi


Article: RNA-SeQC: RNA-seq metrics for quality control and process optimization

RNA-SeQC: RNA-seq metrics for quality control and process optimization


Article: 20 free R tutorials (and one reference card) | (R news & tutorials)

20 free R tutorials (and one reference card) | (R news & tutorials)


Friday, 27 April 2012

Rapid identification of high-confidence taxonomic assignments for metagenomic data
Nucleic Acids Res. 2012 Apr 24. [Epub ahead of print]



Faculty of Computer Science, Dalhousie University, 6050 University Avenue, PO BOX 15000, Halifax, NS B3H 4R2, Canada.


Determining the taxonomic lineage of DNA sequences is an important step in metagenomic analysis. Short DNA fragments from next-generation sequencing projects and microbes that lack close relatives in reference sequenced genome databases pose significant problems to taxonomic attribution methods. Our new classification algorithm, RITA (Rapid Identification of Taxonomic Assignments), uses the agreement between composition and homology to accurately classify sequences as short as 50 nt in length by assigning them to different classification groups with varying degrees of confidence. RITA is much faster than the hybrid PhymmBL approach when comparable homology search algorithms are used, and achieves slightly better accuracy than PhymmBL on an artificial metagenome. RITA can also incorporate prior knowledge about taxonomic distributions to increase the accuracy of assignments in data sets with varying degrees of taxonomic novelty, and classified sequences with higher precision than the current best rank-flexible classifier. The accuracy on short reads can be increased by exploiting paired-end information, if available, which we demonstrate on a recently published bovine rumen data set. Finally, we develop a variant of RITA that incorporates accelerated homology search techniques, and generate predictions on a set of human gut metagenomes that were previously assigned to different 'enterotypes'. RITA is freely available in Web server and standalone versions.

[PubMed - as supplied by publisher] 

Friday, 20 April 2012

Google Lat Long: Balloon and kite imagery in Google Earth

Google Lat Long: Balloon and kite imagery in Google Earth: Here at Google we publish a lot of imagery, most of which comes from the satellite and aerial imagery providers with whom we partner. Last w...

Cool stuff! Haha, maybe one day Google will provide a $100 kit to let you start doing your own genome sequencing and explore your unique genome ..

Wednesday, 18 April 2012

Experimental Design webinar: Learning From Our GWAS Mistakes - SEQanswers

Sponsored by Golden Helix, but it seems like an interesting talk. Pity about the timezone, though. Hope there's a recorded version.

Experimental Design webinar: Learning From Our GWAS Mistakes


Why is bad or non-existent experimental design so prevalent in our field?

In the April 2012 issue of Biostatistics, Dr. Christophe Lambert co-authored an invited editorial that examines this very question with genome-wide association studies (GWAS) as context. Though GWAS is an easy target, echoes of the same mistakes go on in every corner of biological research, including gene expression and next-gen sequencing!

In this webcast, Dr. Lambert expounds upon the deep systemic problems that plague our field, stepping back to look at them from a broader paradigmatic perspective. He discusses real examples of poorly designed experiments, showing what should not be done, and concludes with practical advice on how to avoid common and costly mistakes in your own research.

By attending this webcast you will:

Gain a greater appreciation of the scientific method
Understand the importance of experimental design
Learn how to better design and analyze your next experiment
Realize how we may be fooling ourselves into thinking current NGS bioinformatic filtering has located causative variants

Additional topics include:

Why experimental design is so often ignored or botched
Hypothesis generation, falsification, ceteris paribus and the challenge of applying the scientific method to aggregated complex systems
Moving research from correlation to causation
How applying the scientific method and learning through our mistakes is in conflict with the implicit metrics of academia and perhaps human nature

Join us and you're sure to walk away with actionable information that will help you approach your research with new perspectives.

Cost effective Next-Generation Phylogeography: A Targeted Approach for Multilocus Sequencing of Non-Model Organisms

Time and Cost
It took four researchers approximately six months of dedicated lab time each to complete the initial effort of sequencing and genotyping six different nDNA loci (three per species). In the end however, only four of these six loci were sequenced completely due to complications. By comparison, all lab work, including primer development and optimization, for the entire 454 sequencing phase of the project was completed by one researcher in approximately six months of lab time (Table 5). There was some overlap of primer development between project phases and these time differences do not take into account the differences in sequence processing time. Regardless, our best estimate is that 454 sequencing as outlined here is three to four times more time efficient (in terms of cost and manpower) than traditional Sanger-based methods for sequencing multiple nDNA markers.
Table 5. Approximate lab time needed to complete each sequencing objective.
The major costs of the targeted 454 method are the full plate of sequencing (which includes library quality testing, quantification, and emPCR) and the 400 primers (2 labeled primers each for 20 individuals across 2 species and 5 loci). This puts the total cost at $24,560 to sequence approximately 16 populations or 3200 individual nDNA loci. At $4.00 per individual sequence, the cost to Sanger sequence 3200 loci in the forward and reverse directions is approximately $25,725, already above the 454 price point. Additionally, if any cloning becomes necessary, these cost savings quickly increase (Table 6). Overall, in species with moderate to high levels of genetic diversity and heterozygosity, 454 sequencing will be a much more cost-effective way to sequence multiple nDNA intron loci.
Table 6. Costs of sequencing 3200 loci.
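
To make the cost comparison in that excerpt easier to follow, here is a quick Python sketch that redoes the arithmetic from the quoted figures only. One caveat: forward plus reverse reads at $4.00 each come to 3200 x 2 x $4.00 = $25,600, slightly under the quoted $25,725. The excerpt does not say what the remaining $125 covers, so the sketch just reports the gap rather than guessing.

```python
# Figures quoted in the excerpt (all in USD unless noted).
total_454_cost = 24_560        # full 454 plate plus primers
n_loci = 3200                  # individual nDNA loci sequenced
sanger_price_per_read = 4.00   # price per individual Sanger sequence
sanger_quoted_total = 25_725   # Sanger total as quoted in the excerpt

# Primer count: 2 labeled primers x 20 individuals x 2 species x 5 loci.
n_primers = 2 * 20 * 2 * 5

# Per-locus cost of the 454 approach.
cost_per_locus_454 = total_454_cost / n_loci

# Sanger cost counting only forward + reverse reads.
# Assumption: "$4.00 per individual sequence" means one read in one direction.
sanger_reads_only = n_loci * 2 * sanger_price_per_read
unexplained_gap = sanger_quoted_total - sanger_reads_only

print(f"Primers needed: {n_primers}")
print(f"454 cost per locus: ${cost_per_locus_454:.2f}")
print(f"Sanger, reads only: ${sanger_reads_only:,.2f}")
print(f"Gap to quoted Sanger total: ${unexplained_gap:,.2f}")
```

Either way the two totals are close, so the stronger argument for 454 in the excerpt is really the lab time and the extra cost of cloning, not the sequencing bill alone.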

SEQanswers - perl script to filter errors in fastq files?

Perl script to do some sanity checks on FASTQ files.
Good for checking whether they were corrupted during FTP transfer.

credit: Simon Andrews

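    #!/usr/bin/perl
    use warnings;
    use strict;

    # Reads FASTQ records (4 lines each) from STDIN or from files named on
    # the command line, warns about malformed records on STDERR, and prints
    # only the records that pass every check.
    while (<>) {

      # Line 1 of a record must be an @-prefixed identifier
      unless (/^\@/) {
        warn "$_ should have had an \@ at the start and it didn't\n";
        next;
      }

      my $id1  = $_;
      my $seq  = <>;
      my $id2  = <>;
      my $qual = <>;

      # Stop cleanly if the input ends partway through a record
      unless (defined $seq && defined $id2 && defined $qual) {
        warn "Truncated record at end of input: $id1";
        last;
      }

      # Heuristic checks: a valid quality line can start with @ or +,
      # so these two tests may occasionally reject a good record
      if ($seq =~/^[@+]/) {
        warn "Sequence '$seq' looked like an id";
        next;
      }
      if ($qual =~/^[@+]/) {
        warn "Quality '$qual' looked like an id";
        next;
      }
      if ($id2 !~ /^\+/) {
        warn "Midline '$id2' didn't start with a +";
        next;
      }

      # A long run of base letters in the quality line suggests shifted lines
      if ($qual =~ /[GATCN]{20,}/) {
        warn "Quality '$qual' looked like sequence";
        next;
      }

      if (length($seq) != length($qual)) {
        warn "Seq $seq and Qual $qual weren't the same length";
        next;
      }

      print $id1,$seq,$id2,$qual;
    }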

R Function for Stratified Sampling « Adam On Analytics

got this in my email .. without trying to sound snobbish or whatever, I realise that 'extremely large' datasets can be a very relative descriptor .. (30k data rows here are considered extremely large)

Feedback on R function for stratified sampling of extremely large datasets, with many groups to sample from.

Hey guys,

I am constantly trying to improve my R code. I ran into an issue today where I had to draw random samples from several groups, with equal size from each group, from an extremely large dataset. I tried several functions I found online and I received errors about R memory issues. Thus, I had to write my own function. I got it to work, but I'd appreciate any feedback on how to improve my code. I attached it via the link below.
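
The linked R function isn't reproduced here. As a rough illustration of the same task (equal-size random samples from each group, without holding the whole dataset in memory), here is a Python sketch using per-group reservoir sampling. That technique is my substitution and not necessarily what the post's author did; the function name `stratified_sample` and the `(group, value)` row format are made up for the example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, k, seed=None):
    """Draw up to k rows uniformly at random from each group in one pass.

    rows: any iterable of records (can be a generator over a large file)
    key:  function mapping a record to its group label
    k:    sample size per group
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)   # group label -> current sample
    seen = defaultdict(int)          # group label -> rows seen so far
    for row in rows:
        g = key(row)
        seen[g] += 1
        if len(reservoirs[g]) < k:
            # Reservoir not full yet: keep the row unconditionally.
            reservoirs[g].append(row)
        else:
            # Reservoir sampling (Algorithm R): the new row replaces a
            # random slot with probability k / seen[g].
            j = rng.randrange(seen[g])
            if j < k:
                reservoirs[g][j] = row
    return dict(reservoirs)

# Toy usage: 30,000 rows spread over 3 groups, 5 rows sampled from each.
data = ((i % 3, i) for i in range(30_000))
sample = stratified_sample(data, key=lambda r: r[0], k=5, seed=1)
for group, picked in sorted(sample.items()):
    print(group, picked)
```

Memory use depends on the number of groups times `k`, not on the number of rows, which is the property you want when the full table won't fit in RAM.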

Fun Editing R Graphs in Inkscape - Data and Analysis with R, every day at work
Last week, I read a chapter out of Visualize This by Nathan Yau.  I was, of course, delighted to see that he was championing the use of R.  One really cool thing that I learned from his book, and was very surprised about, was that you can export an R graph in PDF form and then easily edit individual elements of the graph in Adobe Illustrator. 

what? and spend oodles of $ with Illustrator just to make R graph look more spiffy in a publication? (read undisguised plug for open source) try Inkscape!

Disclaimer: there are all sorts of good reasons to use Illustrator, especially if you do design work, but open-source projects make for an easier entry into vector graphics editing on a tight budget.

A lock of hair and the HiSeq 2000 system identify a human migration wave that took more than 3,000 generations and 10,000 years to complete.

Sequencing Uncovers a 9,000 Mile Walkabout


Archaeological evidence dates the Aboriginal presence in Australia to ~50,000 years before the present (BP), making them one of the earliest known populations of modern humans outside of Africa. Recognized as Australia's founding population, scientists theorized that the ancestors of today's Aborigines arrived on the continent from a single-wave migration out of Africa into Europe and Asia. However, recent whole-genome studies date the Europe/Asia split to have occurred between 17,000 and 43,000 BP, more than 10,000 years after the earliest Aborigine archeological evidence. Could the founding Australian Aborigine population be the result of an earlier migration wave? The answer was found by sequencing a century-old lock of hair with the HiSeq 2000 system.

Long Forgotten Sample Proves Valuable

Morten Rasmussen, Ph.D., postdoctoral fellow in Dr. Eske Willerslev's lab at the University of Copenhagen, stumbled upon the ancient hair sample during a visit to the University of Cambridge. The team had recently experienced success in sequencing DNA from the hair of a Saqqaq individual found in the Greenland ice, uncovering an unknown migration of Old World humans to the New World Arctic. During a discussion of that research, the Cambridge scientists mentioned they had additional ancient hair samples in their archives, including several 100-year-old Aborigine hair segments. "We were intrigued, since the samples were just old enough that we could assume they were likely from Aborigines of unmixed origin," said Dr. Rasmussen.

Tuesday, 17 April 2012

OT: Uncorking the muse: alcohol intoxication facilitates creative problem solving.

Conscious Cogn. 2012 Mar;21(1):487-93. Epub 2012 Jan 30.



Department of Psychology, University of Illinois at Chicago, 1007 W. Harrison St. MC 285, Chicago, IL 60647, USA.


That alcohol provides a benefit to creative processes has long been assumed by popular culture, but to date has not been tested. The current experiment tested the effects of moderate alcohol intoxication on a common creative problem solving task, the Remote Associates Test (RAT). Individuals were brought to a blood alcohol content of approximately .075, and, after reaching peak intoxication, completed a battery of RAT items. Intoxicated individuals solved more RAT items, in less time, and were more likely to perceive their solutions as the result of a sudden insight. Results are interpreted from an attentional control perspective.

[PubMed - in process]

Saturday, 14 April 2012

Postnatal development- and age-related changes in DNA-methylation patterns in the human genome.

Nucleic Acids Res. 2012 Apr 11. [Epub ahead of print]



Program in Genomics of Differentiation, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA, Institute of Biology, National Centre for Scientific Research 'Demokritos', Agia Paraskevi, 153 10, Attikis, Greece, Perinatology Research Branch and Molecular Genomics Laboratory, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA.


Alterations in DNA methylation have been reported to occur during development and aging; however, much remains to be learned regarding post-natal and age-associated epigenome dynamics, and few if any investigations have compared human methylome patterns on a whole genome basis in cells from newborns and adults. The aim of this study was to reveal genomic regions with distinct structure and sequence characteristics that render them subject to dynamic post-natal developmental remodeling or age-related dysregulation of epigenome structure. DNA samples derived from peripheral blood monocytes and in vitro differentiated dendritic cells were analyzed by methylated DNA Immunoprecipitation (MeDIP) or, for selected loci, bisulfite modification, followed by next generation sequencing. Regions of interest that emerged from the analysis included tandem or interspersed-tandem gene sequence repeats (PCDHG, FAM90A, HRNR, ECEL1P2), and genes with strong homology to other family members elsewhere in the genome (FZD1, FZD7 and FGF17). Our results raise the possibility that selected gene sequences with highly homologous copies may serve to facilitate, perhaps even provide a clock-like function for, developmental and age-related epigenome remodeling. If so, this would represent a fundamental feature of genome architecture in higher eukaryotic organisms.
[PubMed - as supplied by publisher] 

Datanami, Woe be me