Tuesday, 26 February 2013

LOL! hilarious! Human Genome Proj vs F35

Source: http://www.facebook.com/IFeakingLoveScience 

Monday, 25 February 2013

:)) 5 seconds shutdown time Ubuntu !

27 seconds boot time for Ubuntu :)

Michael Schatz: Assembling Crop Genomes With SMS

PDF of the presentation given on Feb 22, 2013 at AGBT, Marco Island, FL


If you need an intro:

"In a talk during the evening session, Mike Schatz, an assistant professor at Cold Spring Harbor Laboratory, spoke about “Assembling Crop Genomes with Single Molecule Sequencing.” Crops are important to sequence — 15 crops represent 90% of the world’s food, Schatz said — but are notoriously difficult to study because of their large genome size, high repeat content, and higher ploidy. Along with Sergey Koren and Adam Phillippy, he has built a pipeline to create hybrid genome assemblies using PacBio long reads combined with shorter-read sequence — either CCS reads from PacBio or data from another sequencing platform. In an example he offered of a rice strain, an attempted genome assembly using just Illumina reads yielded an N50 contig of 16Kb, but adding PacBio long reads to that boosted the N50 contig to 25Kb. Ultimately, Schatz said, he expects that as PacBio's readlength improves, this kind of approach could routinely generate megabase-size contigs or even pull plant chromosomes into single contigs.

For more information on Mike Schatz’s work using SMRT Sequencing, check out this case study describing an automated pipeline for genome finishing with PacBio long reads."

source: http://blog.pacificbiosciences.com/2013/02/notes-from-agbt-long-read-sequence-data.html
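
For anyone who has not met the N50 statistic quoted above: it is the contig length L such that contigs of length L or longer hold at least half of the assembled bases. A minimal sketch of the calculation follows. The function name and the contig lengths are made up for illustration and are not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths in bp (illustration only, not data from the talk)
example = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000, 10_000]
print(n50(example))
```

A few long contigs can dominate the number, which is why adding long reads to a short-read assembly tends to move N50 so visibly.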

He includes a snippet of code to answer this question from Twitter:
'What's the longest single contig from a de Bruijn assembler without PE or a jumping library?'

$ perl -e 'print ">random\n"; @D=split //,"ACGT"; for (1...100000000){print $D[int(rand(4))];} print "\n"' | fold > random.fa
$ wgsim -r 0 -e 0 -N 50000000 -1 100 -2 1 \
    random.fa random.reads.fq /dev/null
$ SOAPdenovo-63mer all -s random.cfg -K 63 -o random.63
$ getlengths random.63.contig
1 99999990
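
Why does a random genome come back as one contig? In a uniformly random sequence almost every 62-mer is unique, so the de Bruijn graph built from error-free reads has no branches and can be walked end to end. Here is a toy sketch of that idea in Python, at a much smaller scale (5 kb, k=21). It is not how SOAPdenovo works internally, and all the function names are my own.

```python
import random


def random_genome(n, seed=0):
    """Uniform random ACGT string, like the Perl one-liner but much shorter."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


def kmer_set(seq, k):
    """All distinct k-mers: what error-free, full-coverage reads would supply."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def single_contig(kmers, k):
    """Spell the contig if the de Bruijn graph is one unbranched path, else None."""
    succ, has_pred = {}, set()
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        if left in succ or right in has_pred:
            return None  # a (k-1)-mer repeats, so the graph branches
        succ[left] = right
        has_pred.add(right)
    starts = [node for node in succ if node not in has_pred]
    if len(starts) != 1:
        return None  # no unique start, e.g. the graph is a cycle
    node, out = starts[0], [starts[0]]
    while node in succ:
        node = succ[node]
        out.append(node[-1])
    contig = "".join(out)
    # every k-mer must be used, otherwise part of the graph is disconnected
    return contig if len(contig) == len(kmers) + k - 1 else None


genome = random_genome(5000, seed=1)
print(single_contig(kmer_set(genome, 21), 21) == genome)
print(single_contig(kmer_set(genome + genome, 21), 21))
```

The second call duplicates the genome to mimic a large repeat; the path is no longer a simple line, so the sketch gives up, which is the situation real crop genomes put assemblers in.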

[Bio-bwa-help] A new alignment algorithm merged into master (beta phase)

---------- Forwarded message ----------
From: Heng Li
Date: Mon, Feb 25, 2013 at 1:34 PM
Subject: [Bio-bwa-help] A new alignment algorithm merged into master (beta phase)

Bwa-mem (mem for maximal exact match) is a new algorithm that essentially seeds alignments with the fastmap/fermi-exact algorithm and then extends seeds with SW. It combines some key features from both bwa-backtrack, the first algorithm, and bwa-sw and aims to replace them for high-quality 100bp-100kbp sequences. I made this move because bwa-backtrack will fail to deliver satisfactory results for 150bp+ reads (which a few centers have observed), while bwa-sw is relatively slow without achieving the accuracy that I think is possible given longer reads. I would recommend the current bwa-backtrack users to keep an eye on the progress of bwa-mem. You will have to change the mapper when hiseq reads reach 150bp+.

At present, bwa-mem has the basic elements of a typical aligner. On a couple of simulated data sets, it shows a similar speed to bowtie2 and bwa-backtrack, twice as fast as bwa-sw, and is more accurate. It can also achieve the same accuracy as bowtie2/bwa-sw at halved computing time. There are, though, still a few important things on the TODO list: fine tuning the algorithm for better performance; testing on more data sets; testing for BAC-sized long sequences; bug fixes. As I have merged bwa-mem to the master branch, I need your feedbacks to push it forward. This is also a good time to request features in bwa-mem when I am actively working on it.

Thank you,


PS: I plan not to add BAM support to bwa-mem, but because bwa-mem reads fastq only once, this is actually less a concern. You can:

samtools bam2fq reads.bam | bwa mem -p ref.fa -

to map paired-end reads. Here '-p' indicates the inputs are interleaved fasta/q. You can also put bam optional tags (e.g. barcodes) in the fasta/q comment. With "-C", mem/bwasw will copy these tags to the final SAM output. Bwa-mem also supports more advanced piping such as:

bwa mem ref.fa '<bzip2 -dc read1.bz2' '<bzip2 -dc read2.bz2'

which is equivalent to

bwa mem ref.fa <(bzip2 -dc read1.bz2) <(bzip2 -dc read2.bz2)

but without the bash support. The former is still working when you launch bwa in tcsh or outside a shell.
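
If your paired reads sit in two separate FASTQ files and you want the interleaved stream that '-p' expects, the idea is simply to alternate records: mate 1, mate 2, mate 1, mate 2. A rough Python sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines). The function names are hypothetical and nothing here is part of bwa itself.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield 4-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(lines1, lines2):
    """Alternate mate 1 and mate 2 records from two line iterables."""
    for r1, r2 in zip_longest(fastq_records(lines1), fastq_records(lines2)):
        if r1 is None or r2 is None:
            raise ValueError("read1 and read2 have different numbers of records")
        yield from r1
        yield from r2


# Tiny made-up example: one read pair
read1 = ["@pair1/1\n", "ACGT\n", "+\n", "IIII\n"]
read2 = ["@pair1/2\n", "TTGA\n", "+\n", "IIII\n"]
print("".join(interleave(read1, read2)), end="")
```

With real files you would pass two open file handles instead of lists and write the result to stdout, which can then be piped into the aligner.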


C. Jimmy Lin talks about crowdfunded genomics research

Have you heard of http://raregenomics.org/ ? They are a 'non-profit organization that gives families afflicted by rare genetic disorders access to genome sequencing and expert analysis'. One can't help but notice the similarity to BGI-Shenzhen, which is the 'first citizen-managed, non-profit research institute in China'. Perhaps this is the start of a trend for research that benefits citizens directly. Instead of waiting for government funding, which complicates and slows down the research (see grant cycle), you can get funding directly from the ones who benefit. The tip of the iceberg is when publicly funded research is unavailable for public access without paying subscription fees.

Direct-to-consumer treatments and research for rare genetic disorders make sense if you think about it. Getting funding for diabetes is likely easier and makes more economic sense, since the dollars benefit a larger populace.
If you have a rare genetic disorder that might be limited to your family or a few families, you are kinda out of luck.
Your options are:

  1. do the research yourself
  2. convince your physician to take it up as a research project
  3. offer your patient sample for others to research (not the best example, but TED Fellow Salvatore Iaconesi open-sourced his brain cancer for a cure or art)
  4. well, now you can go to raregenomics

Citizen science is the latest trending topic/tag. From ecology projects like http://iseahorse.org/ and http://www.birds.cornell.edu/citsci/ to sequencing projects like uBiome to 'Bioinformatics as a game' like the Phylo website, it will be interesting to see how this plays out for research and science education as a trend. IMHO, you can only get rare genetic disorder research started as a non-profit now, since sequencing isn't cheap enough yet (see 'A $1000 genome by 2013?').

Related Links
'la cura, the cure' http://artisopensource.net/cure/

'Interview with C. Jimmy Lin'

Friday, 22 February 2013

A Simple Method for Obtaining Original Data from Published Graphs and Plots

Was thinking of how to extract data points for infant age and weight distribution from a printed graph, and I landed at this old paper: http://www.ajronline.org/content/174/5/1241.full . It pointed me to NIH Image, which reminded me of some old software I used for lab practicals as an undergrad. And upon reaching the NIH Image site: indeed! ImageJ is an 'update' of sorts to the NIH Image software.

Experimentation continues... 

How To Add Expiration Date To Shared Google Drive Folders http://www.hongkiat.com/blog/expiration-date-google-drive-folders/

Might be useful!


Thursday, 21 February 2013

Rare variant detection using family-based sequencing analysis



Next-generation sequencing is revolutionizing genomic analysis, but this analysis can be compromised by high rates of missing true variants. To develop a robust statistical method capable of identifying variants that would otherwise not be called, we conducted sequence data simulations and both whole-genome and targeted sequencing data analysis of 28 families. Our method (Family-Based Sequencing Program, FamSeq) integrates Mendelian transmission information and raw sequencing reads. Sequence analysis using FamSeq reduced the number of false negative variants by 14–33% as assessed by HapMap sample genotype confirmation. In a large family affected with Wilms tumor, 84% of variants uniquely identified by FamSeq were confirmed by Sanger sequencing. In children with early-onset neurodevelopmental disorders from 26 families, de novo variant calls in disease candidate genes were corrected by FamSeq as Mendelian variants, and the number of uniquely identified variants in affected individuals increased proportionally as additional family members were included in the analysis. To gain insight into maximizing variant detection, we studied factors impacting actual improvements of family-based calling, including pedigree structure, allele frequency (common vs. rare variants), prior settings of minor allele frequency, sequence signal-to-noise ratio, and coverage depth (20× to >200×). These data will help guide the design, analysis, and interpretation of family-based sequencing studies to improve the ability to identify new disease-associated genes.

Tuesday, 19 February 2013

[R-bloggers] 10 R packages every data scientist should know about (and 6 more aRticles)

Do you have genomics-related R packages that you feel should make it into a 'Top 10 R packages for Bioinformatics'?

The yhat blog lists 10 R packages they wish they'd known about earlier. Drew Conway calls them "10 reasons to always start your analysis in R". They're all very useful R packages that every data scientist should be aware of. They are:

  1. sqldf (for selecting from data frames using SQL)
  2. forecast (for easy forecasting of time series)
  3. plyr (data aggregation)
  4. stringr (string manipulation)
  5. Database connection packages RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite
  6. lubridate (time and date manipulation)
  7. ggplot2 (data visualization)
  8. qcc (statistical quality control and QC charts)
  9. reshape2 (data restructuring)
  10. randomForest (random forest predictive models)

You can find links to all of these packages and tips on how to use them at the link below.

yhat blog: 10 R packages I wish I knew about earlier

Thursday, 14 February 2013

Comparison of Sequencing Platforms for Single Nucleotide Variant Calls in a Human Sample

Saw figure 2 from this paper from Stephan Schuster in talks wayyy back and his point about using different platforms/chemistry to reduce bias was always in the back of my head when handling single platform data.

Great work getting this finally published. 

His criteria for variant calling should also be a good starting reference point.

"We used SAMtools version 0.1.16 to call the variants in the Illumina reads. We required a minimum coverage of 4, a maximum coverage of 60 and a minimum quality of 20 for the SNPs and indels that were found to be on the autosomes. We reduced the maximum coverage requirement to 45 for the sex chromosomes and increased it to 10,000 for the mitochondrial DNA. Only homozygous SNP and indels calls were kept from the sex chromosomes and mtDNA."
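
To make those criteria concrete, here is how I would sketch them as a filter in Python. The thresholds come straight from the quoted text. Two things are my own assumptions: the chromosome names (UCSC-style chrX/chrY/chrM) and that the minimum coverage and minimum quality apply to every chromosome class. This is not the authors' code.

```python
# Thresholds taken from the quoted methods text.
MIN_DEPTH = 4
MIN_QUAL = 20
MAX_DEPTH = {"autosome": 60, "sex": 45, "mito": 10_000}

# Assumption: UCSC-style chromosome names; adjust for your reference.
SEX_CHROMS = {"chrX", "chrY"}
MITO_CHROMS = {"chrM"}


def chrom_class(chrom):
    """Classify a chromosome name as 'sex', 'mito' or 'autosome'."""
    if chrom in SEX_CHROMS:
        return "sex"
    if chrom in MITO_CHROMS:
        return "mito"
    return "autosome"


def keep_call(chrom, depth, qual, homozygous):
    """Return True if a SNP/indel call passes the quoted criteria."""
    cls = chrom_class(chrom)
    if depth < MIN_DEPTH or depth > MAX_DEPTH[cls]:
        return False
    if qual < MIN_QUAL:
        return False
    # Only homozygous calls are kept on sex chromosomes and mtDNA.
    if cls != "autosome" and not homozygous:
        return False
    return True


# Made-up calls: (chromosome, depth, quality, homozygous?)
for call in [("chr1", 30, 50, False), ("chrX", 50, 50, True), ("chrM", 5000, 50, True)]:
    print(call, keep_call(*call))
```

The per-class maximum depth is the interesting bit: it throws out pile-ups on the nuclear chromosomes while still allowing the very deep coverage you expect on mtDNA.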


PLoS One. 2013;8(2):e55089. doi: 10.1371/journal.pone.0055089. Epub 2013 Feb 6.

Comparison of sequencing platforms for single nucleotide variant calls in a human sample.


Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America.


Next-generation sequencings platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high-speed and large cost savings. We compare sequencing platforms from Roche/454(GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regards to coverage, and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry; a method that is independent of DNA sequencing. We establish several causes why variants remained unreported, specific to each platform. We report the indel called using the three sequencing technologies and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when high level of accuracy is required.


Wednesday, 6 February 2013

Article: FedEx's file-transfer capacity versus the Internet

Ever considered FedEx's maximum bandwidth? Too lazy to do the math? It's in here.

Though I wonder whether having all the data split across HDDs really counts, since you can't possibly access all the data at once.

FedEx's file-transfer capacity versus the Internet


Article: Source Code for Biology and Medicine | Abstract | CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals

Source Code for Biology and Medicine | Abstract | CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals


Fwd: Save 15% on a New MiSeq System – Limited-Time Offer

Ask for your discount!


Begin forwarded message:

From: "Illumina" <Community@illumina.com>
Date: 6 February, 2013 5:35:59 AM GMT+08:00
Subject: Save 15% on a New MiSeq System – Limited-Time Offer
Reply-To: Community@illumina.com

Bundle and Save with MiSeq


Propel Your Research Forward With MiSeq

Dear Valued Researcher,

If you are ready to experience the power of MiSeq, the most accurate and easiest-to-use benchtop sequencer—why not save a little money on the way?

Now through March 31, 2013, purchase at least 15 MiSeq Sequencing Reagent Kits at list price and receive 15% off your MiSeq System.

Next-generation sequencing doesn't get any easier than this. Take advantage of this offer and get started today.



*Offer not valid with other discounts and promotions. Eligible kits are MiSeq Reagent Kits v2 (50-, 300- and 500-cycles), catalog numbers: MS-102-2001, MS-102-2002, and MS-102-2003.



Handling R packages Feb 2013 issue Linux Journal

The kind folks at http://www.linuxjournal.com/ have provided me with the Feb 2013 issue. Can't tell you how much Linux I have picked up from there, with its easy prose and graphical howtos. The Feb 2013 issue focuses on the theme of system administration. Definitely useful things inside for the starting bioinformatician who wishes to dabble with working directly off a *nix machine :)

Other topics in the February 2013 issue include:
  • Manage Your Virtual Deployment with ConVirt
  • Use Fabric for Sysadmin Tasks on Remote Machines
  • Spin up Linux VMs on Azure
  • Make Your Android Device Play with Your Linux Box
  • Create a Colocated Server with Raspberry Pi

You can check out a preview of the contents here

February 2013 Issue of Linux Journal: System Administration

Tuesday, 5 February 2013

Fwd: [BioRuby] Genomer: a ruby project to simplify genome finishing

Looks intriguing! Anyone else using it?

---------- Forwarded message ----------
From: Michael Barton <mail AT michaelbarton.me.uk>
Date: Sun, Feb 3, 2013 at 5:34 AM
Subject: [BioRuby] Genomer: a ruby project to simplify genome finishing
To: BioRuby Mailing List

Hi Everyone,

I've been working a sequencing microbial genomes during my current post doc
position. I've combined many of the ruby scripts I was using into a single tool
called "genomer" which might be of interest to other bioinformaticians working
in the same area.

I used this tool to simplify the smaller, mundane tasks associated with a
genome. For instance moving contigs and associated annotations around,
generating the required files to submit to GenBank, and generating summaries of
the genome scaffold.

I created a small screencast for anyone who is interested in finding out more:

I wrote this tool to satisfy my own needs and use genomer extensively for the
microbial genome projects in our lab. My GNU Makefile (http://bit.ly/WMKlCZ)
from a P. fluorescens project illustrates how genomer combined with GenBank's
tbl2asn can be used to build the all files required for genome submission.

Hopefully genomer may be useful to other bioinformaticians and simplify the
steps required to finish and submit a genome.


Michael Barton

BioRuby Project - http://www.bioruby.org/

I have created an additional screencast to follow up. This provides a more
concrete example of using genomer to generate the files required to submit a
genome project to GenBank.




Datanami, Woe be me