Tuesday 31 January 2012

How do you share your results from genome studies?

RT @emblebi: A new @sangerinstitute project is exploring how researchers share results from genome studies. Take the questionnaire: http://bit.ly/xw8Y0k

[Samtools-help] Picard release 1.61


Picard release 1.61
30 January 2012

- PicardException used to extend SAMException, which extends RuntimeException.  Now PicardException extends RuntimeException directly.  If you have code that catches SAMException, you may want to add a catch clause for PicardException if you use the net.sf.picard classes.  If you only use classes in net.sf.samtools, you should never see PicardException thrown.

- IlluminaDataProviderFactory.java: Ensure position data type is in set of data types in ctors rather than in makeDataProvider(), in order to avoid ConcurrentModificationException.

-Alec


Saturday 28 January 2012

[galaxy-user] January 27, 2012 Galaxy Distribution & News Brief



---------- Forwarded message ----------
From: Jennifer Jackson
Date: Saturday, 28 January 2012
Subject: [galaxy-user] January 27, 2012 Galaxy Distribution & News Brief



January 27, 2012 Galaxy Distribution & News Brief


Complete News Brief
* http://wiki.g2.bx.psu.edu/DevNewsBriefs/2012_01_27


Highlights:

* Important metadata and Python 2.5 support corrections
* SAMtools upgraded to version 0.1.18; mpileup added.
* Dynamic filtering, easy color options, and quicker indexing enhance Trackster
* Set up your Galaxy instance to run cluster jobs as the real user, not the Galaxy owner
* Improvements to metadata handling and searching in the Tool Shed
* Improved solutions for schema access, jobs management, & workflow imports and inputs.
* New datatypes (Eland, XML), multiple tool enhancements, and bug fixes.


Get Galaxy!
* http://getgalaxy.org

new:     % hg clone http://www.bx.psu.edu/hg/galaxy galaxy-dist
upgrade: % hg pull -u -r 26920e20157f


Read the release announcement and see the prior release history
* http://wiki.g2.bx.psu.edu/DevNewsBriefs/


Need help with a local instance?

Search with our custom google tools!
* http://wiki.g2.bx.psu.edu/Mailing%20Lists#Searching

And consider subscribing to the galaxy-dev mailing list!
* http://wiki.g2.bx.psu.edu/Mailing%20Lists#Subscribing_and_Unsubscribing




--
Jennifer Jackson
Galaxy Team

http://usegalaxy.org
http://galaxyproject.org
http://galaxyproject.org/wiki/Support




Changes to Google Privacy Policy and Terms of Service

hmmm unsure if this is for better or worse

---------- Forwarded message ----------
From: Google
Date: Saturday, 28 January 2012
Subject: Changes to Google Privacy Policy and Terms of Service



Dear Google user,

We're getting rid of over 60 different privacy policies across Google and replacing them with one that's a lot shorter and easier to read. Our new policy covers multiple products and features, reflecting our desire to create one beautifully simple and intuitive experience across Google.

We believe that this stuff matters so please take a few minutes to read our updated Privacy Policy and Terms of Service at http://www.google.com/policies. These changes will take effect on 1 March, 2012.

One policy, one Google experience
________________________________
Easy to work across Google

Our new policy reflects a single product experience that does what you need, when you want it to. Whether you're reading an email that reminds you to schedule a family get-together or finding a favourite video that you want to share, we want to ensure that you can move across Gmail, Calendar, Search, YouTube or whatever your life calls for, with ease.

Tailored for you

If you're signed in to Google, we can do things like suggest search queries – or tailor your search results – based on the interests that you've expressed in Google+, Gmail and YouTube. We'll better understand which version of Pink or Jaguar you're searching for and get you those results faster.

Easy to share and collaborate

When you post or create a document online, you often want others to see and contribute. By remembering the contact information of the people you want to share with, we make it easy for you to share in any Google product or service with minimal clicks and errors.

________________________________
Protecting your privacy hasn't changed

Our goal is to provide you with as much transparency and choice as possible through products like Google Dashboard and Ad Preferences Manager, alongside other tools. Our privacy principles remain unchanged. And we'll never sell your personal information or share it without your permission (other than rare circumstances like valid legal requests).

Have questions?
We have answers.

Visit our FAQ at http://www.google.com/policies/faq to read more about the changes. (We reckoned our users might have a question or twenty-two.)

________________________________
Notice of Change

1 March, 2012 is when the new Privacy Policy and Terms will come into effect. If you choose to keep using Google once the change occurs, you will be doing so under the new Privacy Policy and Terms of Service.

Please do not reply to this email. Mail sent to this address cannot be answered. Also, never enter your Google Account password after following a link in an email or chat to an untrusted site. Instead, go directly to the site, such as mail.google.com or www.google.com/accounts. Google will never email you to ask for your password or other sensitive information.

Friday 27 January 2012

Manuel's Personal Exome Now Publicly Released | Manuel Corpas' Blog

After 5 months of having performed the sequencing of my personal exome, I now make it available to the community for public use. I release it under a CC BY-SA 3.0 license, giving you permission to use this data in any way, as long as it provides attribution to the source and it is shared under a similar license.

http://manuelcorpas.com/2012/01/23/my-personal-exome-now-publicly-released/

Thursday 26 January 2012

Roche in $5.7 billion bid for Illumina

Wow ... Things are going to be interesting now ....

ZURICH/LONDON (Reuters) - Swiss drugmaker Roche Holding AG (ROG.VX) has offered $5.7 billion in cash in a hostile bid to take over Illumina Inc (ILMN.O), and investors are already betting that the U.S. gene sequencing company will command a significantly higher price.

http://mobile.reuters.com/article/innovationNews/idUSTRE80O0FR20120125?irpc=932

Cheers
Kevin
Sent from an Android

Tuesday 24 January 2012

Why we need the Assemblathon - The Assemblathon


Excerpted:
If you want the best genome assembly possible, you may have to accept some trade-offs. The assemblers that may perform well in one area may not perform as well in other areas. Everybody wants to be told 'genome assembler X will give you the best assembly' but at the moment it doesn't seem fair to make such bold assertions. What we did find out from Assemblathon 1 was that a number of genome assemblers performed admirably across many, but not all, of the different metrics. For example, the assembler that did the best job at increasing coverage (the amount of the known genome present in the assembly), ranked 9th when considering the number of substitution errors present in the assembly. Conversely, the assembler that did the best job at minimizing substitution errors ranked 8th in terms of coverage. You pays your money and you takes your choice. We should mention, however, that these two assemblers (SOAPdenovo and SGA), along with ALLPATHS were consistently ranked among the best assemblers for the vast majority of metrics and were the three best overall assemblers in Assemblathon 1.  

Sunday 22 January 2012

Free Science, One Paper at a Time | Wired Science | Wired.com


Part of the answer, strangely, is the very thing at the center of science: the paper. Once science's main conduit, the paper has become its choke point.

It's not just that the paper is slow, though that is a huge problem. A researcher who submits a paper to a traditional journal right now, for instance, won't see the published piece for about a year. She must wait while the paper gets passed around among editors, then goes through rounds of peer review by experts in her field, who might and often do object not just to her methods or data but to her findings and interpretations. Finally, she must wait while it moves through an editing, layout, and publishing pipeline that itself might run anywhere from 2 to 12 weeks.

Yet the paper is not simply slow; it's heavy. Even as increasingly data-rich science has outgrown the paper's ability to deliver and describe all that science has to offer — its deep databases, its often elaborate methods — we've loaded it up needlessly with reputational weight and vital functions other than carrying data.

The paper is meant to be a conduit for the real content and currency of the science: the ideas, methods, data, and findings of the people who do science. But the tremendous publishing and commercial infrastructure built around the academic paper over the last half-century has concentrated so many functions and so much value in the journal that the paper itself, rather than the information in it, has become science's main currency. It is the paper you must buy; the paper you must publish; the paper you must cite; the paper on which not just citations but tenure, reputation, status, and even school rankings are built.

Saturday 21 January 2012

SGA uses less memory for de novo assembly

excerpted from Genomeweb
According to a Genome Research paper describing the method, the secret to the String Graph Assembler's reduced memory footprint is that it uses compressed data structures to "exploit the redundancy" in sequence reads and to "substantially lower the amount of memory required to perform de novo assembly."

SGA relies on an algorithm its developers published in 2010 that constructs an assembly string graph from a so-called “full-text minute-space” index, or FM-index, which enables searching over a compressed representation of a text. Unlike other short-read assemblers that rely on the de Bruijn graph model, which breaks reads up into k-mers, the string graph model “keeps all reads intact and creates a graph from overlaps between reads,” the Sanger team wrote.  
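
Out of curiosity about how searching "a compressed representation of a text" works, here is a toy sketch of my own (not from the paper — a real FM-index stores compressed rank/occurrence tables rather than scanning a raw BWT string) of backward search over a Burrows-Wheeler transform:

  # Toy sketch: exact-match counting on a BWT string, the core trick
  # behind an FM-index. Naive on purpose; every Occ() query is O(n).
  def bwt(text):
      text += "$"  # unique terminator, lexicographically smallest
      rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
      return "".join(rot[-1] for rot in rotations)

  def backward_search(bw, pattern):
      first_col = sorted(bw)
      C = dict((c, first_col.index(c)) for c in set(bw))  # chars < c
      lo, hi = 0, len(bw)
      for c in reversed(pattern):
          if c not in C:
              return 0
          lo = C[c] + bw[:lo].count(c)  # Occ(c, lo)
          hi = C[c] + bw[:hi].count(c)  # Occ(c, hi)
          if lo >= hi:
              return 0
      return hi - lo  # number of occurrences of the pattern

  print(backward_search(bwt("GATTACAGATTA"), "ATTA"))  # prints 2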

This approach makes even more sense now that sequencing instruments like the Pacific Biosciences RS are generating longer reads.
In the paper, Durbin and co-developer Jared Simpson report that SGA successfully assembled 1.2 billion human genome sequence reads using 54 GB of memory. This was compared with SOAPdenovo, which required 118 GB for the same task.

Results from comparisons of SGA with Velvet, ABySS, and SOAPdenovo using a C. elegans dataset showed that its assembled contigs covered 95.9 percent of the reference genome while the other three programs covered 94.5 percent, 95.6 percent, and 94.8 percent respectively.
Furthermore, SGA required only 4.5 gigabytes of memory to assemble the C. elegans dataset compared to 14.1 GB, 23 GB, and 38.8 GB required for ABySS, Velvet, and SOAPdenovo respectively.

SGA is slower than its counterparts, however. SGA took 1,427 CPU hours to complete a human genome assembly, while SOAPdenovo required 479 CPU hours.
"We explicitly trade off a bit longer CPU time for lower memory usage," Simpson, a doctoral student in Durbin's lab, told BioInform. "We feel that fits better into most of the clusters that are available right now."
On the other hand, SGA is parallelizable, so its most compute-intensive activities — error-correcting reads and building the FM-index of corrected reads — can be distributed across a compute cluster to reduce run time, the researchers explain in the paper.


He explained that while de Bruijn assemblers like Velvet require a separate processing step after completing the genome assembly, SGA doesn't, and it potentially avoids "some of the errors or incomplete analysis" that can occur in the extra processing step.
This reduced error risk plus its lower memory requirement ensures that tools like SGA have a "future," Durbin said.
For their next steps, Durbin and Simpson are adapting SGA to work with longer read data from the Roche 454 sequencer and the Life Technologies Ion Torrent Personal Genome Machine. They are also exploring ways of discovering variants using the program.
The approach could also be used to analyze metagenomic data, the researchers said in the paper.



 Read the full article here http://www.genomeweb.com/informatics/sanger-teams-de-novo-assembler-adopts-compressed-approach-reduce-memory-footprin


1,427 CPU hours is a lot more than 479 CPU hours (~3x), but when you can parallelize the work it's definitely a worthwhile tradeoff, especially since one is more likely to have access to many low-memory nodes than to a single high-memory machine — which would, in any case, be hogged by a 479-CPU-hour job. I wonder if this might encourage investigators to revisit existing data with de novo assembly. It would also be great if the NGS data were made public; then other groups could do the de novo assembly comparison for them.

Friday 20 January 2012

What's the difference between an accession number (AC) and the entry name (ID)?


RT @pride_ebi: .@uniprot 101: What's the difference between an accession number (AC) and the entry name (ID)? http://www.uniprot.org/faq/6

Read it online: http://twitter.com/emblebi/status/154584407729119233

Cancer Commons is a non-profit open science initiative dedicated to improving outcomes for today's cancer patients.

This should be interesting to watch out for ... 

http://cancercommons.org/about/


Thursday 19 January 2012

NoSQL on SSD! Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications - All Things Distributed

http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html

Wow, NoSQL DBs on SSDs .... can only imagine how fast that might be ....

Maybe I should sell off my cluster lol ...

Bionimbus Cloud - Complete Genomics Chooses the Bionimbus as Mirror Site for CGI 60 Genomes Release

http://www.bionimbus.org/

Complete Genomics Chooses the Bionimbus as Mirror Site for CGI 60 Genomes Release

Complete Genomics Inc. has chosen the Bionimbus Community Cloud as a mirror site for their 60 Genomes dataset.

The 60 Genomes dataset can be found here, as part of the public data that Bionimbus makes available to researchers. With the Bionimbus Community Cloud, the data is available both via the commodity Internet and via high-performance research networks, such as the National LambdaRail and Internet2.

The genomes in the dataset have on average more than 55x mapped read coverage, and the sequencing of these 60 genomes generated more than 12.2 terabases (Tb) of total mapped reads. This dataset will complement other publicly available whole genome data sets, such as the 1000 Genomes Project's recent publication of six high-coverage and 179 low-coverage human genomes. Forty of the sixty genomes are available now and the remainder will be available at the end of March.

The 60 genomes included in this dataset were drawn from two resources housed at the Coriell Institute for Medical Research: the National Institute of General Medical Sciences (NIGMS) Human Genetic Repository and the NHGRI Sample Repository for Human Genetic Research. Included in the release is a 17-member, three-generation CEPH pedigree from the NIGMS Repository and ethnically diverse samples from the NHGRI Repository that represent nine different populations. The samples selected are unrelated, with the exception of the three-generation CEPH pedigree, a Yoruba trio and a Puerto Rican trio. The majority of these samples have been previously analyzed as part of the International HapMap Project or 1000 Genomes Project.

Bionimbus version 1.7 Released

We have just made a beta release of version 1.7 of Bionimbus. If you would like to host and operate your own Bionimbus cloud then you should consider this release. We expect to release version 1.8 in March/April, which will provide several additional features, including improved project management and the ability to edit an experiment's metadata.

Bionimbus Virtual Machine released on Amazon EC2

A virtual machine image with common peak-calling pipelines was made available on the Amazon Web Services Elastic Compute Cloud (EC2). Upon boot, it fetches pipeline library data, providing everything needed for processing users' data.

Amazon EC2 ID: ami-aead58c7

Startup command: ec2-run-instances -n 1 -t m1.large ami-aead58c7

Upon connecting to your instance, wait for the /READY-PIPELINE-DATA file to appear before commencing pipelines. This file signifies that the pipeline data libraries were installed successfully on your instance.

For more information see the Bionimbus Machine Images section of the Using Bionimbus page.

Bionimbus 1.6.0-0 web server software release

Download URL: bionimbus-1.6.0-0.tar.bz2

Installation Instructions: bionimbus-1.6.0-0-INSTALL.txt

modENCODE Fly Data Added to BSPS

The modENCODE Fly data produced by the White Lab is now available in the Bionimbus Simple Persistent Storage (BSPS) in the directory /glusterfs/fly.

All the data in BSPS is accessible to any virtual machine launched in the Bionimbus Elastic Compute Cloud (BEC2).

The Fly data produced by the White Lab can also be browsed, accessed and downloaded in bulk from Cistrack.

If you would like data added to BSPS, please send an email to support at bionimbus.org.

Bionimbus Workspace

The Bionimbus Workspace (BWS) is a storage space that we have set up for those in the modENCODE fly/worm joint analysis group who would like to exchange data but do not want to use BEC2 and its associated storage. The Bionimbus Workspace (BWS) is accessed via ftp.

Here is a link to a tutorial about how to use BWS.

BWS is synced daily and on demand to the Bionimbus Simple Persistent Storage Space (BSPS), which is one of the storage services that is available to all the Bionimbus virtual machines that are run in the Bionimbus Elastic Compute Cloud (BEC2). In other words, the data that is moved by ftp to the BWS can be analyzed within the BEC2 using any of the Bionimbus supported machine images.

Please note that data in BSPS is not synced back to BWS. On the other hand, any user can manually write data to BWS assuming he or she has write permission to the target directory.

To set up an account, please send email to support at bionimbus.org.

Yahoo! Donates Equipment to Bionimbus

Yahoo! announced today that they will be donating a 2,000 processor core system to the Open Cloud Consortium (OCC) for use by the OCC Open Cloud Testbed and the OCC Open Science Data Cloud.

Two of the donated racks will be used by Bionimbus, which is part of the OCC Open Science Data Cloud.


DASH Associates Shared Haplotypes

http://www.cs.columbia.edu/~gusev/dash/

Genomewide association has been a powerful tool for detecting common disease variants. However, this approach has been underpowered in identifying variation that is poorly represented on commercial SNP arrays, being too rare or population-specific. Recent multipoint methods including SNP tagging and imputation boost the power of detecting and localizing the true causal variant, leveraging common haplotypes in a densely typed panel of reference samples. However, they are limited by the need to obtain a robust population-specific reference panel with sampling deep enough to observe a rare variant of interest. We set out to overcome these challenges by using long stretches of genomic sharing that are identical by descent (IBD). We use such evident sharing between pairs and small subsets of individuals to recover the underlying shared haplotypes that have been co-inherited by these individuals.

We have created a software tool, DASH (DASH Associates Shared Haplotypes), that builds upon pairwise IBD shared segments to infer clusters of IBD individuals. Briefly, for each locus, DASH constructs a graph with links based on IBD at that locus, and uses an iterative min-cut approach to identify clusters. These are densely connected components, each sharing a haplotype. As DASH slides the local window along the genome, links representing new shared segments are added and old ones expire; these changes cause the resultant connected components to grow and shrink. We code the corresponding haplotypes as genetic markers and use them for association testing.
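
The sliding-window clustering idea is neat enough that I tried to sketch it. A toy reconstruction of my own below — not DASH's actual code: I rebuild the graph at every window rather than adding/expiring links incrementally, use plain connected components in place of DASH's iterative min-cut refinement, and the IBD segments are made up:

  # Toy sketch: sliding-window clustering of pairwise IBD segments.
  # Each segment is (individual_a, individual_b, start, end); a window
  # gets a link for every segment that fully spans it.
  import networkx as nx

  segments = [  # hypothetical pairwise IBD segments
      ("A", "B", 0, 50), ("B", "C", 10, 60), ("D", "E", 30, 80),
  ]

  def clusters_per_window(segments, window_starts, window_size=10):
      for start in window_starts:
          end = start + window_size
          g = nx.Graph()
          g.add_edges_from((a, b) for a, b, s, e in segments
                           if s <= start and e >= end)
          yield start, [sorted(c) for c in nx.connected_components(g)]

  for start, clusters in clusters_per_window(segments, range(0, 80, 10)):
      print(start, clusters)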

Everyone needs more memory ....

Traceback (most recent call last):
  File "summary-stats.py", line 50, in ?
    depths.append(depth)
MemoryError
Command exited with non-zero status 1
real 38810.60
user 38788.69
sys 13.94


Sigh ... will have to think about how to solve this with efficient programming instead of just throwing computing power at it and praying it works ...
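
One way out is to never hold the whole list in memory: stream the depths and keep only running aggregates. A quick sketch of what I have in mind (the script name and the depth-in-last-column input format are assumptions about my own files):

  # stream_stats.py (hypothetical name) -- streaming summary statistics
  # over per-position read depths, using O(1) memory instead of a
  # 3-billion-element list. Mean/variance via Welford's online algorithm.
  import sys

  n = 0
  mean = 0.0
  m2 = 0.0  # running sum of squared deviations from the mean
  lo = hi = None

  for line in sys.stdin:
      depth = int(line.split()[-1])  # assumes depth is the last column
      n += 1
      delta = depth - mean
      mean += delta / n
      m2 += delta * (depth - mean)
      if lo is None or depth < lo:
          lo = depth
      if hi is None or depth > hi:
          hi = depth

  if n > 1:
      print("n=%d mean=%.3f var=%.3f min=%d max=%d"
            % (n, mean, m2 / (n - 1), lo, hi))

Something like "zcat bam.cov.csv.gz | python stream_stats.py" should then chug along in constant memory.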

Wikipedia's blackout ...


Imagine a World Without Free Knowledge
For over a decade, we have spent millions of hours building the largest encyclopedia in human history. Right now, the U.S. Congress is considering legislation that could fatally damage the free and open Internet. For 24 hours, to raise awareness, we are blacking out Wikipedia. Learn more.

Has it affected you?

You can circumvent it by checking out the instructions at
http://lifehacker.com/5876833/how-to-take-wikipedia-offline-so-you-can-keep-using-it-during-tomorrows-anti+sopa-blackout

But when you do, still spend a couple of minutes thinking about the SOPA bill.
http://lifehacker.com/5860205/all-about-sopa-the-bill-thats-going-to-cripple-your-internet

Wednesday 18 January 2012

VPA: an R tool for analyzing sequencing variants with user-specified frequency pattern

http://www.biomedcentral.com/1756-0500/5/31/abstract

BMC Res Notes. 2012 Jan 14;5(1):31. [Epub ahead of print]
VPA: an R tool for analyzing sequencing variants with user-specified frequency pattern.

Abstract

BACKGROUND:

The massive numbers of genetic variants generated by next-generation sequencing systems demand the development of effective computational tools for variant prioritization.

FINDINGS:

VPA (Variant Pattern Analyzer) is an R tool for prioritizing variants with a specified frequency pattern from multiple study subjects in next-generation sequencing studies. The tool starts from individual files of variant and sequence calls and extracts variants with a user-specified frequency pattern across the study subjects of interest. Several position-level quality criteria can be incorporated into the variant extraction. It can be used in studies with a matched pair design as well as studies with multiple groups of subjects.

CONCLUSIONS:

VPA can be used as an automatic pipeline to prioritize variants for further functional exploration and hypothesis generation. The package is implemented in the R language and is freely available from http://vpa.r-forge.r-project.org.


BarraCUDA - a fast short read sequence aligner using graphics processing units

http://www.biomedcentral.com/1756-0500/5/27/abstract
BMC Res Notes. 2012 Jan 13;5(1):27. [Epub ahead of print]
BarraCUDA - a fast short read sequence aligner using graphics processing units.

Abstract

BACKGROUND:

With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General-purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence.

FINDINGS:

Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to the GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput.

CONCLUSIONS:

BarraCUDA is designed to take advantage of the parallelism of GPUs to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline so that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net.


FastQ Screen - a screening application for high throughput sequence data


Tuesday 17 January 2012

Big data in R: Error: negative length vectors are not allowed

It is not immediately apparent from this error message that I ran out of memory with R


Error: negative length vectors are not allowed
Execution halted

Basically I loaded read depth information per genomic position (3 billion data points) into R, hoping it would work out. After googling around, it turns out that:

  1. R works on objects held entirely in RAM
  2. The maximum length of a vector is 2^31-1 elements

Not sure if there's a solution out there ... still pretty much a noob in R ..

These didn't seem to offer help on my simple problem. 
http://www.r-bloggers.com/why-we-need-to-deal-with-big-data-in-r/
http://www.revolutionanalytics.com/products/enterprise-big-data.php
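
One workaround I might try instead (a sketch in Python, since that's where my data munging lives; the depth-in-last-column format is an assumption): collapse the 3 billion values into a depth histogram in a single streaming pass, then hand the tiny histogram to R for quantiles and plotting.

  # Sketch: reduce per-position depths to a depth -> count histogram so
  # R never needs a vector anywhere near the 2^31-1 length limit.
  import sys

  hist = {}
  for line in sys.stdin:
      depth = int(line.split()[-1])  # assumes depth is the last column
      hist[depth] = hist.get(depth, 0) + 1  # dict.get works even on 2.4

  for depth in sorted(hist):
      print("%d\t%d" % (depth, hist[depth]))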

VIM on Macs :)

I concur!
One gripe I have is that gnome-vim isn't installed by default, which is a slight annoyance: I am able to do X windows tunneling to benefit from high bandwidth and GUI convenience, but I have to beg the sysadmin to install it. BUT gnome-vim has been terribly laggy locally since I upgraded, so I am not sure the GUI is worth trading away speed for.

I am trying to think of a workflow where I can edit locally (preferably in a Dropbox folder, which syncs and backs up work in progress) and, when done, have it uploaded (and executed) remotely.

hmmm need time to google for this .. I am sure it can be done!


I've been using MacVIM as my editor of choice for a couple of years now, yet in many ways I still feel like a beginner. Every day I am learning more and more about my editor, but it takes a conscious effort to become proficient with an editor like Vim. Here's why I make that effort.

The downside to something like Vim, and other highly configurable editors, is that it does require an investment of time to see the real benefits of it. I have several friends that desire to learn Vim but aren't willing to make the investment to switch from something like TextMate. Thankfully there are quite a few resources out there to help you get up to speed quickly.

  • PeepCode Screencasts - Their offerings of Smash into Vim and Smash into Vim 2 are great videos to help you get started with Vim. I learned some fundamental things about Vim in these screencasts that I wasn't aware of previously. I also find them real gems to visit again and again. Well worth the money.
  • VimCasts - Free short videos highlighting features of Vim. VimCasts is produced by Drew Neil. These are high quality professionally done trainings. A highly recommended addition to your podcast reader.
  • Vim Scripts - Part of the Vim site that is devoted to third-party plugins to expand the capabilities of Vim. It's well worth your time to find plugins that make things easier. For instance, I have a plugin that highlights errors in my Python code as I type, such as finding unused imports or making sure my code is PEP8 compliant. I have a plugin that makes commenting painless.
  • Justin Lilly's Vim Screencasts - My good friend Justin Lilly has a number of great screencasts on Vim. Additionally, his post titled Vim: My New IDE is an excellent introduction to some of the plugins available on Vim.

One other thing that will get you up to speed on Vim is to start with someone else's Vim configuration. Mine is available on GitHub. I caution you not to adopt a complex Vim configuration until you have the basics down. The main reason is that some configurations alter basic built-in behavior. For instance, in my configuration I disable navigation using the arrow keys. If you're not aware of this, it could impact your understanding of what are Vim defaults and what things are modifications.


Python Ecosystem - An Introduction » mirnazim.org

http://mirnazim.org/writings/python-ecosystem-introduction/

Python Ecosystem - An Introduction

When developers shift from PHP, Ruby or any other platform to Python, the very first road block they face (most often) is a lack of an overall understanding of the Python ecosystem. Developers often yearn for a tutorial or resource that explains how to accomplish most tasks in a more or less standard way.

What follows is an extract from the internal wiki at my workplace, which documents the basics of the Python ecosystem for web application development for our interns, trainees and experienced developers who shift to Python from other platforms.

This is not a complete resource. My target is to make it a work in perpetual progress. Hopefully, over time, this will develop into an exhaustive tutorial.

Intended Audience

This is not about teaching Python - the programming language. This tutorial will not magically transform you into a Python ninja. I am assuming that you already know the basics of Python. If you don't, then stop right now. Go read Zed Shaw's brilliant free book Learn Python The Hard Way first and then come back.

I am assuming you are working on Linux (preferably Ubuntu/Debian) or a Linux-like operating system. Why? Because that is what I know best. I have not done any serious programming related work on MS Windows or Mac OS X, other than testing for cross-browser compatibility. Check out the following tutorials on how to install Python on other platforms:


Google Apps Developer Blog: Optimizing bandwidth usage with gzip compression

I use on-the-fly gzip compression whenever I can in my R, Python, and bash scripts ... little did I know this has crept into mobile apps as well.


Google Apps Developer Blog: Optimizing bandwidth usage with gzip compression: All developers agree that saving bandwidth is a critical factor for the success of a mobile application. Less data usage means faster respon...

Sunday 15 January 2012

Illumina Introduces the HiSeq 2500 | Business Wire

Illumina also announced the following performance enhancements to the MiSeq personal sequencer:

Threefold Increase in Throughput – capable of generating up to 7 Gb per run, expanding the number of applications and increasing sample throughput.

Longer and More Reads – a new 500-cycle reagent kit supports 2 x 250 bp runs, generating over 15 million clusters per run and enabling more accurate small-genome assembly and small RNA sequencing projects.

http://www.businesswire.com/news/home/20120110006665/en/Illumina-Introduces-HiSeq-2500

Life Technologies Introduces the Benchtop Ion Proton™ Sequencer; Designed to Decode a Human Genome in One Day for $1,000 | Life Technologies

The Ion Proton™ Sequencer and Ion Reporter analysis software are designed to analyze a single genome in one day on a stand-alone server — eliminating the informatics bottleneck and the high-capital IT investment associated with optical-based sequencers. The optical-based sequencers require costly IT infrastructure to analyze the large volume of data generated by running batches of six or more genomes at once. That approach drastically slows analysis, which can take weeks to complete, and creates the bottleneck in the process.

http://www.lifetechnologies.com/us/en/home/about-us/news-gallery/press-releases/2012/life-techologies-itroduces-the-bechtop-io-proto.html

Thursday 12 January 2012

true - do nothing, successfully

While I can totally understand the need for such a program, reading its succinct description made me chuckle:

NAME
       true - do nothing, successfully

SYNOPSIS
       true [ignored command line arguments]
       true OPTION

DESCRIPTION
       Exit with a status code indicating success.

       --help display this help and exit

       --version
              output version information and exit

       NOTE: your shell may have its own version of true, which usually supersedes the version described here. Please refer to your shell's documentation for details about the options it supports.

AUTHOR
       Written by Jim Meyering.

Cortex assembler paper


"De novo assembly and genotyping of variants using colored de Bruijn graphs", Iqbal, Caccamo, Turner, Flicek, McVean
Nature Genetics, (doi:10.1038/ng.1028)

This link will work for a bit
http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.1028.html

You may be interested in some of the following things we cover

 - low and predictable memory use
 - simultaneous assembly of multiple samples, and variant calling done directly (without assembling a consensus first) (e.g. you could assemble over 2000 S. aureus in 32Gb of RAM or 10 humans in 256Gb of RAM)
 - a mathematical model extending the Lander-Waterman statistics to include information on repeat content, allowing you to make choices of kmer-size depending on what you want to achieve
 - validation using fully sequenced fosmids
 - comparison of Cortex variant calls with 1000genomes pilot calls
 - showing you can make good variant calls without using a reference if you sequence multiple samples from a population (we did this with chimps)
 - a proof-of-concept of HLA-typing at HLA-B using whole genome (not pull-down) data

Wednesday 11 January 2012

Scientists, Share Secrets or Lose Funding: Stodden and Arbesman - Bloomberg

This article expresses my views in a much more eloquent way

The Journal of Irreproducible Results, a science-humor magazine, is, sadly, no longer the only publication that can lay claim to its title. More and more published scientific studies are difficult or impossible to repeat.
It’s not that the experiments themselves are so flawed they can’t be redone to the same effect -- though this happens more than scientists would like. It’s that the data upon which the work is based, as well as the methods employed, are too often not published, leaving the science hidden.

Too Little Transparency

Consider, for example, a recent notorious incident in biomedical science. In 2006, researchers at Duke University seemed to have discovered relationships between lung cancer patients’ personal genetic signatures and their responsiveness to certain drugs. The scientists published their results in respected journals (the New England Journal of Medicine and Nature Medicine), but only part of the genetic signature data used in the studies was publicly available, and the computer codes used to generate the findings were never revealed. This is unfortunately typical for scientific publications.
The Duke research was considered such a breakthrough that other scientists quickly became interested in replicating it, but because so much information was unavailable, it took three years for them to uncover and publicize a number of very serious errors in the published reports. Eventually, those reports were retracted, and clinical trials based on the flawed results were canceled.
In response to this incident, the Institute of Medicine convened a committee to review what data should appropriately be revealed from genomics research that leads to clinical trials. This committee is due to release its report early this year.
Unfortunately, the research community rarely addresses the problem of reproducibility so directly. Inadequate sharing is common to all scientific domains that use computers in their research today (most of science), and it hampers transparency.
By making the underlying data and computer code conveniently available, scientists could open a new era of innovation and growth. In October, the White House released a memorandum titled “Accelerating Technology Transfer and Commercialization of Federal Research in Support of High-Growth Businesses,” which outlines ways for federal funding agencies to improve the rate of technology transfer from government-financed laboratories to the private business sector.


As Jon Claerbout, a professor emeritus of geophysics at Stanford University, has noted, scientific publication isn’t scholarship itself, but only the advertising of scholarship. The actual work -- the steps needed to reproduce the scientific finding -- must be shared.  


read the full article at http://www.bloomberg.com/news/2012-01-10/scientists-share-secrets-or-lose-funding-stodden-and-arbesman.html

Tuesday 10 January 2012

Ion Torrent Retrospective – 2011 « Edge Bio – Views From the Edge

http://www.edgebio.com/blog/?p=842

Cost

Six months ago a run at EdgeBio on a 314 chip cost $2,500 for ~25 Mb of sequence. That's a buck for every 10,000 bases. Now we charge $2,350 for a 316 chip (the 314 is discounted to $1,550) and generate on average 250 Mb. That's a buck every 106,000 bases. So, for the same price, we have doubled the assembly metrics in the de novo assemblies above. All done in less than 7-10 business days.

Now/Future

We have recently been validating the long read chemistry further, have done our first 2 Ampliseq Cancer panel runs, are gearing up to validate the mate pair protocol, and are piloting the 318 Chips.  Look for a few blog posts over the coming weeks about the mate pair data, our custom SnpEff plug-in,  and our progress with capture and 318 chips (maybe Exomes you say???)

Kopimism: the world's newest religion explained - opinion - 06 January 2012 - New Scientist; Open access and Ecological Society of America

http://www.newscientist.com/article/dn21334-kopimism-the-worlds-newest-religion-explained.html

:D

Why is information, and sharing it, so important to you?
Information is the building block of everything around me and everything I believe in. Copying it is a way of multiplying the value of information.

What's your stance on illegal file-sharing?
I think that the copyright laws are very problematic, and at least need to be rewritten, but I would suggest getting rid of most of them.

So all file-sharing should be legal?
Absolutely.

Are you just trying to make a point, or is this religion for real?
We've had this faith for several years.



I would love to hear their stance on open access in science ..

YHGTBFKM: Ecological Society of America letter regarding #OpenAccess is disturbing

Granted, when I go into the exciting bits of the human genomics research I do at gatherings, my well-heeled friends often give me a glassy-eyed look and acknowledge that what I do is interesting while showing no particular interest in the details ..
I still strongly feel that scientific information is best served as a free-for-all buffet.
If a majority of key scientific information becomes closed access, I think there is a risk that science as practiced globally might end up looking like the clichéd science projects at high school science fairs.
I personally feel that what open access journals, and efforts like journals of negative results, are working towards is a reduction in the duplication of effort in adding to the sum of human knowledge.
Reading about the plausible reasons why politicians might be backing a bill to shut down the NIH's open access policy is saddening, to say the least. Quite honestly, the cost of paying for publication is negligible compared to the sum paid for publicly funded research. There isn't a strong reason I can see to make publicly funded research privileged information, unless it touches on defence issues, in which case it shouldn't be published in a scientific journal at all.

For more on this see


Monday 9 January 2012

8.3. collections — High-performance container datatypes — Python v2.7.2 documentation

http://docs.python.org/library/collections.html

So many goodies .. but unavailable on my servers, which run stock Python 2.4.3 :(

I was particularly interested in using defaultdict, a dict subclass that calls a factory function to supply missing values.

New in version 2.5.
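
For reference, this is roughly the difference — the grouping example is made up, and the get/setdefault versions are the fallback that works on a stock 2.4 box:

  # defaultdict (Python >= 2.5) vs. plain-dict fallbacks for Python 2.4
  from collections import defaultdict

  counts = defaultdict(int)  # missing keys spring into existence as 0
  for word in "the quick brown the".split():
      counts[word] += 1

  counts24 = {}  # the same tally on Python 2.4, via dict.get
  for word in "the quick brown the".split():
      counts24[word] = counts24.get(word, 0) + 1

  grouped = {}  # grouping values by key: setdefault supplies the list
  for key, val in [("a", 1), ("a", 2), ("b", 3)]:
      grouped.setdefault(key, []).append(val)

  print(grouped)  # {'a': [1, 2], 'b': [3]}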

TopHat 1.4.0


TopHat 1.4.0 release 1/5/2012

Version 1.4.0 includes the following new features and fixes:

- when a set of known transcripts is provided (-G/--GTF option), TopHat now takes the approach of mapping the reads to the transcriptome first, with only the unmapped reads being further aligned to the whole genome and going through the novel junction discovery process as before. This new approach was implemented by Harold Pimentel.
- new command line options have been added for the new mapping-to-transcriptome approach; please check their documentation, which includes important notes about the new --transcriptome-index option for efficient use of this approach
- the unmapped reads are now reported in the output directory as unmapped_left.fq.z (and unmapped_right.fq.z for paired reads)
- the --initial-read-mismatches value now also applies to final alignments resulting from joining segment mappings
- we adjusted the selection of hits to be reported in the case of multi-mapped segments, reads and read pairs
- enhancements in junction discovery for the segment-search method in the case of paired-end reads
- the reported running time now includes days
- fixed the non-deterministic behavior that could cause some differences in the output of repeated TopHat runs
- fixed a regression bug that prevented the use of SOLiD reads with certain lengths of quality values

Sunday 8 January 2012

PLoS ONE: Identification of Sequence Variants in Genetic Disease-Causing Genes Using Targeted Next-Generation Sequencing

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0029500

Background

Identification of gene variants plays an important role in research on and diagnosis of genetic diseases. A combination of enrichment of targeted genes and next-generation sequencing (targeted DNA-HiSeq) results in both high efficiency and low cost for targeted sequencing of genes of interest.

Methodology/Principal Findings

To identify mutations associated with genetic diseases, we designed an array-based gene chip to capture all of the exons of 193 genes involved in 103 genetic diseases. To evaluate this technology, we selected 7 samples from seven patients with six different genetic diseases resulting from six disease-causing genes and 100 samples from normal human adults as controls. The data obtained showed that on average, 99.14% of 3,382 exons with more than 30-fold coverage were successfully detected using Targeted DNA-HiSeq technology, and we found six known variants in four disease-causing genes and two novel mutations in two other disease-causing genes (the STS gene for XLI and the FBN1 gene for MFS) as well as one exon deletion mutation in the DMD gene. These results were confirmed in their entirety using either the Sanger sequencing method or real-time PCR.

Conclusions/Significance

Targeted DNA-HiSeq combines next-generation sequencing with the capture of sequences from a relevant subset of high-interest genes. This method was tested by capturing sequences from a DNA library through hybridization to oligonucleotide probes specific for genetic-disorder-related genes and was found to show high selectivity, improve the detection of mutations, enable the discovery of novel variants, and provide additional indel data. Thus, targeted DNA-HiSeq can be used to analyze the gene variant profiles of monogenic diseases with high sensitivity, fidelity, throughput and speed.


PLoS ONE: A Viral Discovery Methodology for Clinical Biopsy Samples Utilising Massively Parallel Next Generation Sequencing

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028879


Abstract

Here we describe a virus discovery protocol for a range of different virus genera, that can be applied to biopsy-sized tissue samples. Our viral enrichment procedure, validated using canine and human liver samples, significantly improves viral read copy number and increases the length of viral contigs that can be generated by de novo assembly. This in turn enables the Illumina next generation sequencing (NGS) platform to be used as an effective tool for viral discovery from tissue samples.


Thursday 5 January 2012

how to split BED file according to chromsome - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8115

Gosh, I was thinking it would be easy to code this in Python or Perl, and after 30 minutes of messing around it hit me that csplit might be an easier way to do it .. but after googling, it turns out that awk is probably the best solution .. [credit: quinlana]

  awk '{print $0 >> $1".bed"}' example.bed

less typing more fun
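
For the record, the Python version I abandoned needn't have been much longer. A rough sketch (the script name is made up; track/header lines aren't handled, and like awk's >> it appends, so remove old per-chromosome files before rerunning):

  # split_bed.py (hypothetical name): split a BED file into one file
  # per chromosome (column 1), keeping one output handle per chromosome.
  import sys

  handles = {}
  for line in open(sys.argv[1]):  # e.g. python split_bed.py example.bed
      chrom = line.split("\t", 1)[0]
      if chrom not in handles:
          handles[chrom] = open(chrom + ".bed", "a")  # append, like >>
      handles[chrom].write(line)

  for fh in handles.values():
      fh.close()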

I keep forgetting this and other handy shortcuts ...

 zcat is identical to gunzip -c

but anyway I am surprised at the timings for the two commands below — adding the gzip recompression step costs almost no extra wall-clock time, because gzip runs in a separate process, in parallel with sed (note the higher user time in the first run)

on the fly uncompress, sed, compress
----------------------------------------------------------------------- 12:27:21
$ time gunzip -c  bam.cov.csv.gz.chr20.gz |sed 's/ //g' |gzip -c > bam.cov.csv.gz.chr20

real 1m39.638s
user 2m41.098s
sys 0m6.524s

on the fly uncompress, sed
----------------------------------------------------------------------- 13:15:34
$ time gunzip -c  bam.cov.csv.gz.chr20.gz |sed 's/ //g' > bam.cov.csv.gz.chr20


real 1m39.865s
user 1m45.999s
sys 0m5.755s

FAQ Biological replicates with cuffdiff, cummeRbund - SEQanswers

Read this discussion at seqanswers for dealing with biological replicates with cuffdiff
http://seqanswers.com/forums/showthread.php?t=16528

Sunday 1 January 2012

ChIP-seq exercise tutorial @ Galaxy

http://main.g2.bx.psu.edu/u/james/p/exercise-chip-seq

For this exercise we will use a ChIP-seq dataset for CTCF in the murine G1E_ER4 cell line. This dataset has been reduced to (mostly) contain only reads aligning to chr19:

PoPoolation2: Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq) Open Access

http://bioinformatics.oxfordjournals.org/content/early/2011/10/23/bioinformatics.btr589.short?rss=1

Summary: Sequencing pooled DNA samples (Pool-Seq) is the most cost-effective approach for the genome-wide comparison of population samples. Here, we introduce PoPoolation2, the first software tool specifically designed for the comparison of populations with Pool-Seq data. PoPoolation2 implements a range of commonly used measures of differentiation (FST, Fisher's exact test and Cochran-Mantel-Haenszel test) that can be applied on different scales (windows, genes, exons, SNPs). The result may be visualized with the widely used Integrated Genomics Viewer.

Availability and implementation: PoPoolation2 is implemented in Perl and R. It is freely available on http://code.google.com/p/popoolation2/


The SEQanswers wiki: a wiki database of tools for high-throughput sequencing analysis Open Access

http://nar.oxfordjournals.org/content/early/2011/11/15/nar.gkr1058.full

Abstract

Recent advances in sequencing technology have created unprecedented opportunities for biological research. However, the increasing throughput of these technologies has created many challenges for data management and analysis. As the demand for sophisticated analyses increases, the development time of software and algorithms is outpacing the speed of traditional publication. As technologies continue to be developed, methods change rapidly, making publications less relevant for users. The SEQanswers wiki (SEQwiki) is a wiki database that is actively edited and updated by the members of the SEQanswers community (http://SEQanswers.com/). The wiki provides an extensive catalogue of tools, technologies and tutorials for high-throughput sequencing (HTS), including information about HTS service providers. It has been implemented in MediaWiki with the Semantic MediaWiki and Semantic Forms extensions to collect structured data, providing powerful navigation and reporting features. Within 2 years, the community has created pages for over 500 tools, with approximately 400 literature references and 600 web links. This collaborative effort has made SEQwiki the most comprehensive database of HTS tools anywhere on the web. The wiki includes task-focused mini-reviews of commonly used tools, and a growing collection of more than 100 HTS service providers. SEQwiki is available at: http://wiki.SEQanswers.com/.


Datanami, Woe be me