RT @emblebi: A new @sangerinstitute project is exploring how researchers share results from genome studies. Take the questionnaire: http://bit.ly/xw8Y0k
Tuesday, 31 January 2012
[Samtools-help] Picard release 1.61
Picard release 1.61
30 January 2012
- PicardException used to extend SAMException, which extends RuntimeException. Now PicardException extends RuntimeException directly. If you have code that catches SAMException, you may want to add a catch clause for PicardException if you use the net.sf.picard classes. If you only use classes in net.sf.samtools, you should never see PicardException thrown.
- IlluminaDataProviderFactory.java: Ensure position data type is in set of data types in ctors rather than in makeDataProvider(), in order to avoid ConcurrentModificationException.
-Alec
------------------------------------------------------------------------------
_______________________________________________
Samtools-help mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-help
Saturday, 28 January 2012
[galaxy-user] January 27, 2012 Galaxy Distribution & News Brief
---------- Forwarded message ----------
From: Jennifer Jackson
Date: Saturday, 28 January 2012
Subject: [galaxy-user] January 27, 2012 Galaxy Distribution & News Brief
January 27, 2012 Galaxy Distribution & News Brief
Complete News Brief
* http://wiki.g2.bx.psu.edu/DevNewsBriefs/2012_01_27
Highlights:
* Important metadata and Python 2.5 support corrections
* SAMtools upgraded to version 0.1.18. Mpileup added.
* Dynamic filtering, easy color options, and quicker
indexing enhance Trackster
* Set up your Galaxy instance to run cluster jobs as
the real user, not the Galaxy owner
* Improvements to metadata handling and searching in
the Tool Shed
* Improved solutions for schema access, jobs management,
& workflow imports and inputs.
* New datatypes (Eland, XML), multiple tool enhancements,
and bug fixes.
Get Galaxy!
* http://getgalaxy.org
new: % hg clone http://www.bx.psu.edu/hg/galaxy galaxy-dist
upgrade: % hg pull -u -r 26920e20157f
Read the release announcement and see the prior release history
* http://wiki.g2.bx.psu.edu/DevNewsBriefs/
Need help with a local instance?
Search with our custom google tools!
* http://wiki.g2.bx.psu.edu/Mailing%20Lists#Searching
And consider subscribing to the galaxy-dev mailing list!
* http://wiki.g2.bx.psu.edu/Mailing%20Lists#Subscribing_and_Unsubscribing
--
Jennifer Jackson
Galaxy Team
http://usegalaxy.org
http://galaxyproject.org
http://galaxyproject.org/wiki/Support
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org. Please keep all replies on the list by
using "reply all" in your mail client. For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists,
please use the interface at:
http://lists.bx.psu.edu/
Changes to Google Privacy Policy and Terms of Service
---------- Forwarded message ----------
From: Google
Date: Saturday, 28 January 2012
Subject: Changes to Google Privacy Policy and Terms of Service
Dear Google user,
We're getting rid of over 60 different privacy policies across Google and replacing them with one that's a lot shorter and easier to read. Our new policy covers multiple products and features, reflecting our desire to create one beautifully simple and intuitive experience across Google.
We believe that this stuff matters so please take a few minutes to read our updated Privacy Policy and Terms of Service at http://www.google.com/policies. These changes will take effect on 1 March, 2012.
One policy, one Google experience
________________________________
Easy to work across Google
Our new policy reflects a single product experience that does what you need, when you want it to. Whether you're reading an email that reminds you to schedule a family get-together or finding a favourite video that you want to share, we want to ensure that you can move across Gmail, Calendar, Search, YouTube or whatever your life calls for, with ease.
Tailored for you
If you're signed in to Google, we can do things like suggest search queries – or tailor your search results – based on the interests that you've expressed in Google+, Gmail and YouTube. We'll better understand which version of Pink or Jaguar you're searching for and get you those results faster.
Easy to share and collaborate
When you post or create a document online, you often want others to see and contribute. By remembering the contact information of the people you want to share with, we make it easy for you to share in any Google product or service with minimal clicks and errors.
________________________________
Protecting your privacy hasn't changed
Our goal is to provide you with as much transparency and choice as possible through products like Google Dashboard and Ad Preferences Manager, alongside other tools. Our privacy principles remain unchanged. And we'll never sell your personal information or share it without your permission (other than rare circumstances like valid legal requests).
Have questions?
We have answers.
Visit our FAQ at http://www.google.com/policies/faq to read more about the changes. (We reckoned our users might have a question or twenty-two.)
________________________________
Notice of Change
1 March, 2012 is when the new Privacy Policy and Terms will come into effect. If you choose to keep using Google once the change occurs, you will be doing so under the new Privacy Policy and Terms of Service.
Please do not reply to this email. Mail sent to this address cannot be answered. Also, never enter your Google Account password after following a link in an email or chat to an untrusted site. Instead, go directly to the site, such as mail.google.com or www.google.com/accounts. Google will never email you to ask for your password or other sensitive information.
Friday, 27 January 2012
Manuel's Personal Exome Now Publicly Released | Manuel Corpas' Blog
Five months after the sequencing of my personal exome was performed, I now make it available to the community for public use. I release it under a CC BY-SA 3.0 license, giving you permission to use this data in any way, as long as you attribute the source and share it under a similar license.
http://manuelcorpas.com/2012/01/23/my-personal-exome-now-publicly-released/
Thursday, 26 January 2012
Roche in $5.7 billion bid for Illumina
Wow ... Things are going to be interesting now ....
ZURICH/LONDON (Reuters) - Swiss drugmaker Roche Holding AG (ROG.VX) has offered $5.7 billion in cash in a hostile bid to take over Illumina Inc (ILMN.O), and investors are already betting that the U.S. gene sequencing company will command a significantly higher price.
http://mobile.reuters.com/article/innovationNews/idUSTRE80O0FR20120125?irpc=932
Cheers
Kevin
Sent from an Android
Tuesday, 24 January 2012
Why we need the Assemblathon - The Assemblathon
excerpted
Sunday, 22 January 2012
Free Science, One Paper at a Time | Wired Science | Wired.com
Part of the answer, strangely, is the very thing at the center of science: the paper. Once science's main conduit, the paper has become its choke point.
It's not just that the paper is slow, though that is a huge problem. A researcher who submits a paper to a traditional journal right now, for instance, won't see the published piece for about a year. She must wait while the paper gets passed around among editors, then goes through rounds of peer review by experts in her field, who might and often do object not just to her methods or data but to her findings and interpretations. Finally, she must wait while it moves through an editing, layout, and publishing pipeline that itself might run anywhere from 2 to 12 weeks.
Yet the paper is not simply slow; it's heavy. Even as increasingly data-rich science has outgrown the paper's ability to deliver and describe all that science has to offer — its deep databases, its often elaborate methods — we've loaded it up needlessly with reputational weight and vital functions other than carrying data.
The paper is meant to be a conduit for the real content and currency of the science: the ideas, methods, data, and findings of the people who do science. But the tremendous publishing and commercial infrastructure built around the academic paper over the last half-century has concentrated so many functions and so much value in the journal that the paper itself, rather than the information in it, has become science's main currency. It is the paper you must buy; the paper you must publish; the paper you must cite; the paper on which not just citations but tenure, reputation, status, and even school rankings are built.
Saturday, 21 January 2012
SGA uses less memory for de novo assembly
Read the full article here http://www.genomeweb.com/informatics/sanger-teams-de-novo-assembler-adopts-compressed-approach-reduce-memory-footprin
1427 CPU hours is a lot more than 479 CPU hours (~3x), but of course when you can parallelize it, it's definitely a worthwhile tradeoff, especially since it's more likely that one has a lot of lower-memory clusters than one single cluster with a lot of memory, which is likely to be hogged by a 479 CPU hour job. I wonder if this might encourage investigators to take a fresh look at existing data by de novo assembly. Of course, it would be great if the NGS data were made public; then other groups could actually do the de novo assembly comparison for them.
Friday, 20 January 2012
What's the difference between an accession number (AC) and the entry name (ID)?
Read it online: http://twitter.com/emblebi/status/154584407729119233
Cancer Commons is a non-profit open science initiative dedicated to improving outcomes for today's cancer patients.
Thursday, 19 January 2012
Nosql on SSD ! Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications - All Things Distributed
http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
wow nosql dbs on SSDs .... Can only imagine how fast that might be ....
Maybe I should sell off my cluster lol ...
Bionimbus Cloud - Complete Genomics Chooses the Bionimbus as Mirror Site for CGI 60 Genomes Release
Complete Genomics Inc. has chosen the Bionimbus Community Cloud as a mirror site for their 60 Genomes dataset.
The 60 Genomes dataset can be found here, as part of the public data that Bionimbus makes available to researchers. With the Bionimbus Community Cloud, the data is available via both the commodity Internet, as well as via high performance research networks, such as the National LambdaRail and Internet2.
The genomes in the dataset have on average more than 55x mapped read coverage, and the sequencing of these 60 genomes generated more than 12.2 terabases (Tb) of total mapped reads. This dataset will complement other publicly available whole genome data sets, such as the 1000 Genomes Project's recent publication of six high-coverage and 179 low-coverage human genomes. Forty of the sixty genomes are available now and the remainder will be available at the end of March.
The 60 genomes included in this dataset were drawn from two resources housed at the Coriell Institute for Medical Research: the National Institute of General Medical Sciences (NIGMS) Human Genetic Repository and the NHGRI Sample Repository for Human Genetic Research. Included in the release is a 17-member, three-generation CEPH pedigree from the NIGMS Repository and ethnically diverse samples from the NHGRI Repository that represent nine different populations. The samples selected are unrelated, with the exception of the three-generation CEPH pedigree, a Yoruba trio and a Puerto Rican trio. The majority of these samples have been previously analyzed as part of the International HapMap Project or 1000 Genomes Project.
Bionimbus version 1.7 Released
We have just made a beta release of version 1.7 of Bionimbus. If you would like to host and operate your own Bionimbus cloud then you should consider this release. We expect to release version 1.8 in March/April, which will provide several additional features, including improved project management and the ability to edit an experiment's metadata.
Bionimbus Virtual Machine released on Amazon EC2
A virtual machine image with common peak calling pipelines was made available on Amazon Web Services Elastic Compute Cloud (EC2). Upon boot, it fetches pipeline library data, providing everything needed for processing users' data.
Amazon EC2 ID: ami-aead58c7
Startup command: ec2-run-instances -n 1 -t m1.large ami-aead58c7
Upon connecting to your instance, wait for /READY-PIPELINE-DATA file to appear before commencing pipelines. This file signifies that pipeline data libraries installed successfully on your instance.
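That wait can easily be scripted. Here is a minimal Python sketch; only the /READY-PIPELINE-DATA path comes from the announcement, while the function name, polling interval, and retry cap are my own:

```python
import os
import time

def wait_for_pipeline_data(marker="/READY-PIPELINE-DATA",
                           poll_seconds=30, max_tries=None):
    """Block until the readiness marker file appears on the instance.

    Returns True once the file exists, or False if max_tries polls pass
    without it appearing (max_tries=None waits indefinitely).
    """
    tries = 0
    while not os.path.exists(marker):
        tries += 1
        if max_tries is not None and tries >= max_tries:
            return False
        time.sleep(poll_seconds)
    return True
```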
For more information see the Bionimbus Machine Images section of the Using Bionimbus page.
Bionimbus 1.6.0-0 web server software release
Download URL: bionimbus-1.6.0-0.tar.bz2
Installation Instructions: bionimbus-1.6.0-0-INSTALL.txt
modENCODE Fly Data Added to BSPS
The modENCODE Fly data produced by the White Lab is now available in the Bionimbus Simple Persistent Storage (BSPS) in the directory /glusterfs/fly.
All the data in BSPS is accessible to any virtual machine launched in the Bionimbus Elastic Compute Cloud (BEC2).
The Fly data produced by the White Lab can also be browsed, accessed and downloaded in bulk from Cistrack.
If you would like data added to BSPS, please send an email to support at bionimbus.org.
Bionimbus Workspace
The Bionimbus Workspace (BWS) is a storage space that we have set up for those in the modENCODE fly/worm joint analysis group who would like to exchange data but do not want to use BEC2 and its associated storage. BWS is accessed via FTP.
Here is a link to a tutorial about how to use BWS.
BWS is synced daily and on demand to the Bionimbus Simple Persistent Storage Space (BSPS), which is one of the storage services that is available to all the Bionimbus virtual machines that are run in the Bionimbus Elastic Compute Cloud (BEC2). In other words, the data that is moved by ftp to the BWS can be analyzed within the BEC2 using any of the Bionimbus supported machine images.
Please note that data in BSPS is not synced back to BWS. On the other hand, any user can manually write data to BWS assuming he or she has write permission to the target directory.
To set up an account, please send email to support at bionimbus.org.
Yahoo! Donates Equipment to Bionimbus
Yahoo! announced today that they will be donating a 2,000 processor core system to the Open Cloud Consortium (OCC) for use by the OCC Open Cloud Testbed and the OCC Open Science Data Cloud.
Two of the donated racks will be used by Bionimbus, which is part of the OCC Open Science Data Cloud.
DASH Associates Shared Haplotypes
Genomewide association has been a powerful tool for detecting common disease variants. However, this approach has been underpowered in identifying variation that is poorly represented on commercial SNP arrays, being too rare or population-specific. Recent multipoint methods including SNP tagging and imputation boost the power of detecting and localizing the true causal variant, leveraging common haplotypes in a densely typed panel of reference samples. However, they are limited by the need to obtain a robust population-specific reference panel with sampling deep enough to observe a rare variant of interest. We set out to overcome these challenges by using long stretches of genomic sharing that are identical by descent (IBD). We use such evident sharing between pairs and small subsets of individuals to recover the underlying shared haplotypes that have been co-inherited by these individuals.
We have created a software tool, DASH (DASH Associates Shared Haplotypes), that builds upon pairwise IBD shared segments to infer clusters of IBD individuals. Briefly, for each locus, DASH constructs a graph with links based on IBD at that locus, and uses an iterative min-cut approach to identify clusters. These are densely connected components, each sharing a haplotype. As DASH slides the local window along the genome, links representing new shared segments are added and old ones expire; these changes cause the resultant connected components to grow and shrink. We code the corresponding haplotypes as genetic markers and use them for association testing.
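As a rough illustration of the clustering step described above, here is a toy Python sketch. To be clear, this is only my own simplification: real DASH refines clusters with an iterative min-cut and slides a window along the genome, whereas here plain connected components stand in for all of that.

```python
from collections import defaultdict

def ibd_clusters(ibd_pairs):
    """Group individuals into clusters from pairwise IBD links at one locus,
    using simple connected components in place of DASH's min-cut step."""
    graph = defaultdict(set)
    for a, b in ibd_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        # Depth-first traversal to collect one connected component
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Each cluster would then correspond to a putative shared haplotype that can be coded as a marker for association testing.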
Everyone needs more memory ....
Wikipedia's blackout ...
Has it affected you?
You can circumvent it by checking out the instructions at
http://lifehacker.com/5876833/how-to-take-wikipedia-offline-so-you-can-keep-using-it-during-tomorrows-anti+sopa-blackout
But when you do, still spend a couple of minutes thinking about the SOPA bill.
http://lifehacker.com/5860205/all-about-sopa-the-bill-thats-going-to-cripple-your-internet
Wednesday, 18 January 2012
VPA: an R tool for analyzing sequencing variants with user-specified frequency pattern
Abstract
BACKGROUND:
The massive amounts of genetic variants generated by next-generation sequencing systems demand the development of effective computational tools for variant prioritization.
FINDINGS:
VPA (Variant Pattern Analyzer) is an R tool for prioritizing variants with a specified frequency pattern from multiple study subjects in a next-generation sequencing study. The tool starts from individual files of variant and sequence calls and extracts variants with the user-specified frequency pattern across the study subjects of interest. Several position-level quality criteria can be incorporated into the variant extraction. It can be used in studies with a matched-pair design as well as studies with multiple groups of subjects.
CONCLUSIONS:
VPA can be used as an automatic pipeline to prioritize variants for further functional exploration and hypothesis generation. The package is implemented in the R language and is freely available from http://vpa.r-forge.r-project.org.
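To make the frequency-pattern idea concrete, here is a toy sketch in Python (my own illustration, not VPA's actual R interface): keep variants seen in at least a given fraction of one group of subjects and at most a given fraction of another.

```python
def filter_by_pattern(variant_carriers, cases, controls,
                      min_case_freq=1.0, max_control_freq=0.0):
    """Keep variants matching a frequency pattern across two subject groups.

    variant_carriers maps a variant ID to the set of subjects carrying it.
    Defaults keep variants present in all cases and absent from all controls.
    """
    kept = []
    for variant, carriers in variant_carriers.items():
        case_freq = sum(s in carriers for s in cases) / len(cases)
        control_freq = sum(s in carriers for s in controls) / len(controls)
        if case_freq >= min_case_freq and control_freq <= max_control_freq:
            kept.append(variant)
    return kept
```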
BarraCUDA - a fast short read sequence aligner using graphics processing units
Abstract
BACKGROUND:
With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU), extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence.
FINDINGS:
Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computational-intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput.
CONCLUSIONS:
BarraCUDA is designed to take advantage of the parallelism of GPUs to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline so that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net.
FastQ Screen: a screening application for high-throughput sequence data
- README
- Release Notes Please read these before using the program.
- FastQ Screen v0.2.1
Tuesday, 17 January 2012
Big data in R: Error: negative length vectors are not allowed
Error: negative length vectors are not allowed
Execution halted
- R works on objects held entirely in RAM
- The maximum length of a vector is 2^31 - 1 elements, so an allocation that overflows this limit produces the negative-length error above
VIM on Macs :)
One gripe that I have is that gnome-vim isn't installed by default, which is a slight annoyance: I am able to do X Windows tunneling to benefit from high bandwidth and GUI convenience, but I need to beg the sysadmin to install it. BUT gnome-vim is terribly laggy locally after I upgraded, so I'm not sure if I want to trade speed for the GUI.
I'm trying to think of a workflow where I can edit locally (preferably in a Dropbox folder, which syncs and backs up work in progress) and, when done, have it uploaded (and executed).
Hmmm, need time to google for this .. I am sure it can be done!
The downside to something like Vim, and other highly configurable editors, is that it does require an investment of time to see the real benefits of it. I have several friends that desire to learn Vim but aren't willing to make the investment to switch from something like TextMate. Thankfully there are quite a few resources out there to help you get up to speed quickly.
- PeepCode Screencasts - Their offerings of Smash into Vim and Smash into Vim 2 are great videos to help you get started with Vim. I learned some fundamental things about Vim in these screencasts that I wasn't aware of previously. I also find them real gems to visit again and again. Well worth the money.
- VimCasts - Free short videos highlighting features of Vim. VimCasts is produced by Drew Neil. These are high quality professionally done trainings. A highly recommended addition to your podcast reader.
- Vim Scripts - Part of the Vim site that is devoted to third-party plugins to expand the capabilities of Vim. It's well worth your time to find plugins that make things easier. For instance, I have a plugin that highlights errors in my Python code as I type, such as finding unused imports or making sure my code is PEP8 compliant. I have a plugin that makes commenting painless.
- Justin Lilly's Vim Screencasts - My good friend Justin Lilly has a number of great screencasts on Vim. Additionally, his post titled Vim: My New IDE is an excellent introduction to some of the plugins available on Vim.
One other thing that will get you up to speed on Vim is to start with someone else's Vim configuration. Mine is available on GitHub. I caution you not to adopt a complex Vim configuration until you have the basics down. The main reason is that some configurations alter basic built-in behavior. For instance, in my configuration I disable navigation using the arrow keys. If you're not aware of this, it could muddy your understanding of which behaviors are Vim defaults and which are modifications.
Python Ecosystem - An Introduction » mirnazim.org
Python Ecosystem - An Introduction
When developers shift from PHP, Ruby or any other platform to Python, the very first road block they face (most often) is a lack of an overall understanding of the Python ecosystem. Developers often yearn for a tutorial or resource that explains how to accomplish most tasks in a more or less standard way.
What follows is an extract from the internal wiki at my workplace, which documents the basics of the Python ecosystem for web application development for our interns, trainees and experienced developers who shift to Python from other platforms.
This is not a complete resource. My target is to make it a work in perpetual progress. Hopefully, over time, this will develop into an exhaustive tutorial.
Intended Audience
This is not about teaching Python - the programming language. This tutorial will not magically transform you into a Python ninja. I am assuming that you already know the basics of Python. If you don't, then stop right now. Go read Zed Shaw's brilliant free book Learn Python The Hard Way first and then come back.
I am assuming you are working on Linux (preferably Ubuntu/Debian) or a Linux-like operating system. Why? Because that is what I know best. I have not done any serious programming related work on MS Windows or Mac OS X, other than testing for cross-browser compatibility. Check out the following tutorials on how to install Python on other platforms:
Google Apps Developer Blog: Optimizing bandwidth usage with gzip compression
I use on-the-fly gzip compression whenever I can in my R, Python and bash scripts ... little did I know this has crept onto mobile apps as well.
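In Python, for instance, on-the-fly compression is a one-module affair (a minimal sketch; the file name and record are invented):

```python
import gzip

# Write a toy tab-delimited record straight to a gzipped file...
with gzip.open("variants.txt.gz", "wt") as out:
    out.write("chr1\t12345\tA\tG\n")

# ...and stream it back without ever materialising an uncompressed copy.
with gzip.open("variants.txt.gz", "rt") as infile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
```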
Sunday, 15 January 2012
Illumina Introduces the HiSeq 2500 | Business Wire
Illumina also announced the following performance enhancements to the MiSeq personal sequencer:
Threefold Increase in Throughput – capable of generating up to 7 Gb per run, expanding the number of applications and increasing sample throughput.
Longer and More Reads – a new 500-cycle reagent kit supports 2 x 250 bp runs, generating over 15 million clusters per run and enabling more accurate small-genome assembly and small RNA sequencing projects.
http://www.businesswire.com/news/home/20120110006665/en/Illumina-Introduces-HiSeq-2500
Life Technologies Introduces the Benchtop Ion Proton™ Sequencer; Designed to Decode a Human Genome in One Day for $1,000 | Life Technologies
The Ion Proton™ Sequencer and Ion Reporter analysis software are designed to analyze a single genome in one day on a stand-alone server, eliminating the informatics bottleneck and the high-capital IT investment associated with optical-based sequencers. Optical-based sequencers require costly IT infrastructure to analyze the large volume of data generated by running batches of six or more genomes at once. That approach drastically slows analysis, which can take weeks to complete, and creates the bottleneck in the process.
Thursday, 12 January 2012
true - do nothing, successfully
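The man-page one-liner above says it all; as a trivial illustration from Python (on any Unix-like system):

```python
import subprocess

# `true` does nothing and exits successfully, i.e. with status 0
status = subprocess.call("true")
```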
Cortex assembler paper
"De novo assembly and genotyping of variants using colored de Bruijn graphs", Iqbal, Caccamo, Turner, Flicek, McVean
Nature Genetics, (doi:10.1038/ng.1028)
This link will work for a bit
http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.1028.html
You may be interested in some of the following things we cover
- low and predictable memory use
- simultaneous assembly of multiple samples, and variant calling done directly (without assembling a consensus first)
(e.g. you could assemble over 2000 S. aureus genomes in 32Gb of RAM, or 10 humans in 256Gb of RAM).
- a mathematical model extending the Lander-Waterman statistics to include information on repeat content,
allowing you to make choices of kmer-size depending on what you want to achieve
- validation using fully sequenced fosmids
- comparison of Cortex variant calls with 1000genomes pilot calls
- showing you can make good variant calls without using a reference if you sequence multiple samples from a population (we did this with chimps)
- a proof-of-concept of HLA-typing at HLA-B using whole genome (not pull-down) data
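The core intuition behind a coloured de Bruijn graph can be shown with a toy k-mer comparison. This is my own sketch, nothing like the full Cortex algorithm: k-mers private to one sample's "colour" span a potential variant bubble.

```python
def kmers(seq, k):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

reference = "ACGTACGTGATT"
sample    = "ACGTACCTGATT"   # single substitution vs. the reference
k = 4

# k-mers present only in the sample's colour span the variant site
private = kmers(sample, k) - kmers(reference, k)
```

With k = 4, the one substitution yields four sample-private k-mers, the seeds of the bubble a variant caller would walk.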
Wednesday, 11 January 2012
Scientists, Share Secrets or Lose Funding: Stodden and Arbesman - Bloomberg
The Journal of Irreproducible Results, a science-humor magazine, is, sadly, no longer the only publication that can lay claim to its title. More and more published scientific studies are difficult or impossible to repeat.
It’s not that the experiments themselves are so flawed they can’t be redone to the same effect -- though this happens more than scientists would like. It’s that the data upon which the work is based, as well as the methods employed, are too often not published, leaving the science hidden.
Too Little Transparency
Consider, for example, a recent notorious incident in biomedical science. In 2006, researchers at Duke University seemed to have discovered relationships between lung cancer patients' personal genetic signatures and their responsiveness to certain drugs. The scientists published their results in respected journals (the New England Journal of Medicine and Nature Medicine), but only part of the genetic signature data used in the studies was publicly available, and the computer codes used to generate the findings were never revealed. This is unfortunately typical for scientific publications.
The Duke research was considered such a breakthrough that other scientists quickly became interested in replicating it, but because so much information was unavailable, it took three years for them to uncover and publicize a number of very serious errors in the published reports. Eventually, those reports were retracted, and clinical trials based on the flawed results were canceled.
In response to this incident, the Institute of Medicine convened a committee to review what data should appropriately be revealed from genomics research that leads to clinical trials. This committee is due to release its report early this year.
Unfortunately, the research community rarely addresses the problem of reproducibility so directly. Inadequate sharing is common to all scientific domains that use computers in their research today (most of science), and it hampers transparency.
By making the underlying data and computer code conveniently available, scientists could open a new era of innovation and growth. In October, the White House released a memorandum titled “Accelerating Technology Transfer and Commercialization of Federal Research in Support of High-Growth Businesses,” which outlines ways for federal funding agencies to improve the rate of technology transfer from government-financed laboratories to the private business sector.
As Jon Claerbout, a professor emeritus of geophysics at Stanford University, has noted, scientific publication isn’t scholarship itself, but only the advertising of scholarship. The actual work -- the steps needed to reproduce the scientific finding -- must be shared.
read the full article at http://www.bloomberg.com/news/2012-01-10/scientists-share-secrets-or-lose-funding-stodden-and-arbesman.html
Tuesday, 10 January 2012
Ion Torrent Retrospective – 2011 « Edge Bio – Views From the Edge
Cost
6 months ago a run at EdgeBio of a 314 chip cost $2500 for ~25 Mb of sequence. That's a buck for every 10,000 bases. Now we charge $2350 for a 316 chip (the 314 is discounted to $1550) and generate on average 250 Mb. That's a buck every 106,000 bases. So, for the same price, we have doubled the assembly metrics in the de novo assemblies above. All done in less than 7-10 business days.
Now/Future
We have recently been validating the long read chemistry further, have done our first 2 Ampliseq Cancer panel runs, are gearing up to validate the mate pair protocol, and are piloting the 318 Chips. Look for a few blog posts over the coming weeks about the mate pair data, our custom SnpEff plug-in, and our progress with capture and 318 chips (maybe Exomes you say???)
Kopimism: the world's newest religion explained - opinion - 06 January 2012 - New Scientist; Open access and Ecological Society of America
:D
Why is information, and sharing it, so important to you?
Information is the building block of everything around me and everything I believe in. Copying it is a way of multiplying the value of information.
What's your stance on illegal file-sharing?
I think that the copyright laws are very problematic, and at least need to be rewritten, but I would suggest getting rid of most of them.
So all file-sharing should be legal?
Absolutely.
Are you just trying to make a point, or is this religion for real?
We've had this faith for several years.
I would love to hear their stance on open access in science ..
YHGTBFKM: Ecological Society of America letter regarding #OpenAccess is disturbing
Granted, when I go into the exciting bits of the human genomics research that I do at gatherings, my well-heeled friends often give me a glazed look and acknowledge that what I do is interesting, but show no particular interest in it. I still strongly feel that scientific information is best served as a free-for-all buffet.
If a majority of key scientific information becomes closed access, I think there is a risk that science as practiced globally might end up looking like the cliched science projects at high school science fairs.
I personally feel that what open-access journals, and efforts like journals of negative results, are working towards is a reduction in the duplication of effort in adding to the sum of human knowledge.
Reading about the plausible reasons why politicians might be backing a bill to shut down the NIH's open access policy is saddening, to say the least. Quite honestly, the cost of paying for publication is negligible compared to the sum paid for publicly funded research. There is no strong reason that I can see to make publicly funded research privileged information, unless it touches on defence issues, in which case it shouldn't be published in a scientific journal at all.
For more on this see
- Elsevier-funded NY Congresswoman Carolyn Maloney Wants to Deny Americans Access to Taxpayer Funded Research from my brother Michael Eisen http://www.michaeleisen.org/blog/?p=807
- Why Is Open-Internet Champion Darrell Issa Supporting an Attack on Open Science? from Rebecca Rosen at the Atlantic http://www.theatlantic.com/technology/archive/2012/01/why-is-open-internet-champion-darrell-issa-supporting-an-attack-on-open-science/250929/
- New bill to block open access to publicly-funded research from Peter Suber https://plus.google.com/u/0/109377556796183035206/posts/QYAH1jSJG6L
- Scholarly Societies: It's time to abandon the AAP over The Research Works Act from John Dupuis http://scienceblogs.com/confessions/2012/01/scholarly_societies_its_time_t.php
Monday, 9 January 2012
8.3. collections — High-performance container datatypes — Python v2.7.2 documentation
defaultdict — dict subclass that calls a factory function to supply missing values (new in version 2.5)
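As a quick illustration of the entry above: a defaultdict calls its factory function to create a value the first time a missing key is accessed, so there is no need to pre-check membership. A minimal sketch:

```python
from collections import defaultdict

# Count reads per chromosome without checking for missing keys:
# the int() factory supplies 0 for any chromosome seen for the first time.
hits = defaultdict(int)
for chrom in ["chr1", "chr2", "chr1", "chr1"]:
    hits[chrom] += 1

print(dict(hits))  # → {'chr1': 3, 'chr2': 1}
```

A plain dict would raise KeyError on the first `hits[chrom] += 1` for each new chromosome.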
TopHat 1.4.0
TopHat 1.4.0 release 1/5/2012
Version 1.4.0 includes the following new features and fixes:
- when a set of known transcripts is provided (-G/--GTF option), TopHat now maps the reads to the transcriptome first; only the unmapped reads are then aligned to the whole genome and go through the novel junction discovery process as before. This new approach was implemented by Harold Pimentel.
- new command line options have been added for the new mapping-to-transcriptome approach; please check their documentation, which includes important notes about the new --transcriptome-index option for efficient use of this approach
- the unmapped reads are now reported in the output directory as unmapped_left.fq.z (and unmapped_right.fq.z for paired reads)
- the --initial-read-mismatches value now also applies to final alignments resulting from joining segment mappings
- we adjusted the selection of hits to be reported in the case of multi-mapped segments, reads and read pairs
- enhancements in junction discovery for the segment-search method in the case of paired-end reads
- the reported running time now includes days
- fixed the non-deterministic behavior that could cause some differences in the output of repeated TopHat runs
- fixed a regression bug that prevented the use of SOLiD reads with certain lengths of quality values
Sunday, 8 January 2012
PLoS ONE: Identification of Sequence Variants in Genetic Disease-Causing Genes Using Targeted Next-Generation Sequencing
Background
Identification of gene variants plays an important role in research on and diagnosis of genetic diseases. A combination of enrichment of targeted genes and next-generation sequencing (targeted DNA-HiSeq) results in both high efficiency and low cost for targeted sequencing of genes of interest.
Methodology/Principal Findings
To identify mutations associated with genetic diseases, we designed an array-based gene chip to capture all of the exons of 193 genes involved in 103 genetic diseases. To evaluate this technology, we selected seven samples from seven patients with six different genetic diseases resulting from six disease-causing genes and 100 samples from normal human adults as controls. The data obtained showed that on average, 99.14% of 3,382 exons with more than 30-fold coverage were successfully detected using Targeted DNA-HiSeq technology, and we found six known variants in four disease-causing genes and two novel mutations in two other disease-causing genes (the STS gene for XLI and the FBN1 gene for MFS) as well as one exon deletion mutation in the DMD gene. These results were confirmed in their entirety using either the Sanger sequencing method or real-time PCR.
Conclusions/Significance
Targeted DNA-HiSeq combines next-generation sequencing with the capture of sequences from a relevant subset of high-interest genes. This method was tested by capturing sequences from a DNA library through hybridization to oligonucleotide probes specific for genetic disorder-related genes and was found to show high selectivity, improve the detection of mutations, enable the discovery of novel variants, and provide additional indel data. Thus, targeted DNA-HiSeq can be used to analyze the gene variant profiles of monogenic diseases with high sensitivity, fidelity, throughput and speed.
PLoS ONE: A Viral Discovery Methodology for Clinical Biopsy Samples Utilising Massively Parallel Next Generation Sequencing
Abstract
Here we describe a virus discovery protocol for a range of different virus genera that can be applied to biopsy-sized tissue samples. Our viral enrichment procedure, validated using canine and human liver samples, significantly improves viral read copy number and increases the length of viral contigs that can be generated by de novo assembly. This in turn enables the Illumina next generation sequencing (NGS) platform to be used as an effective tool for viral discovery from tissue samples.
Thursday, 5 January 2012
how to split BED file according to chromosome - SEQanswers
awk '{print $0 >> $1".bed"}' example.bed
less typing more fun
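One caveat with the one-liner above: it appends (>>), so rerunning it doubles every per-chromosome file, and it can hit the per-process open-file limit on inputs with many contigs. A minimal sketch of a safer variant (the tiny example.bed generated here is hypothetical, just for illustration):

```shell
# Hypothetical three-line input for illustration.
printf 'chr2\t1\t2\nchr1\t5\t6\nchr1\t1\t2\n' > example.bed

# Sorting groups each chromosome into one contiguous block, so every output
# file is opened (and truncated) exactly once; close() on each chromosome
# change keeps at most one output file open at a time.
sort -k1,1 example.bed | awk '
    $1 != prev { if (prev != "") close(prev ".bed"); prev = $1 }
    { print > ($1 ".bed") }'
```

Truncating with > instead of appending with >> is what makes the command safe to rerun.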
zcat is identical to gunzip -c
but anyway I am surprised at the time difference between the two pipelines below
on the fly uncompress, sed, compress
----------------------------------------------------------------------- 12:27:21
$ time gunzip -c bam.cov.csv.gz.chr20.gz |sed 's/ //g' |gzip -c > bam.cov.csv.gz.chr20
real 1m39.638s
user 2m41.098s
sys 0m6.524s
on the fly uncompress, sed
----------------------------------------------------------------------- 13:15:34
$ time gunzip -c bam.cov.csv.gz.chr20.gz |sed 's/ //g' > bam.cov.csv.gz.chr20
real 1m39.865s
user 1m45.999s
sys 0m5.755s
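The near-identical real times above are less surprising than they look: the stages of a shell pipeline run concurrently, so the extra gzip stage adds user (CPU) time but overlaps with the decompression and sed stages in wall-clock time. A toy demonstration:

```shell
# Two 1-second stages in a pipeline run side by side, so the whole
# pipeline takes about 1 second of wall time, not 2.
time ( sleep 1 | sleep 1 )
```

The same effect explains why the first pipeline's user time (2m41s) exceeds its real time (1m39s): the gunzip, sed, and gzip processes were burning CPU simultaneously.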
FAQ Biological replicates with cuffdiff, cummeRbund - SEQanswers
http://seqanswers.com/forums/showthread.php?t=16528
Sunday, 1 January 2012
ChIP seq exercise Tutorial @ Galaxy
For this exercise we will use a ChIP-seq dataset for CTCF in the murine G1E_ER4 cell line. This dataset has been reduced to (mostly) contain only reads aligning to chr19:
PoPoolation2: Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq) Open Access
Summary: Sequencing pooled DNA samples (Pool-Seq) is the most cost-effective approach for the genome-wide comparison of population samples. Here, we introduce PoPoolation2, the first software tool specifically designed for the comparison of populations with Pool-Seq data. PoPoolation2 implements a range of commonly used measures of differentiation (FST, Fisher's exact test and Cochran-Mantel-Haenszel test) that can be applied on different scales (windows, genes, exons, SNPs). The result may be visualized with the widely used Integrative Genomics Viewer.
Availability and implementation: PoPoolation2 is implemented in Perl and R. It is freely available on http://code.google.com/p/popoolation2/
The SEQanswers wiki: a wiki database of tools for high-throughput sequencing analysis Open Access
Abstract
Recent advances in sequencing technology have created unprecedented opportunities for biological research. However, the increasing throughput of these technologies has created many challenges for data management and analysis. As the demand for sophisticated analyses increases, the development time of software and algorithms is outpacing the speed of traditional publication. As technologies continue to be developed, methods change rapidly, making publications less relevant for users. The SEQanswers wiki (SEQwiki) is a wiki database that is actively edited and updated by the members of the SEQanswers community (http://SEQanswers.com/). The wiki provides an extensive catalogue of tools, technologies and tutorials for high-throughput sequencing (HTS), including information about HTS service providers. It has been implemented in MediaWiki with the Semantic MediaWiki and Semantic Forms extensions to collect structured data, providing powerful navigation and reporting features. Within 2 years, the community has created pages for over 500 tools, with approximately 400 literature references and 600 web links. This collaborative effort has made SEQwiki the most comprehensive database of HTS tools anywhere on the web. The wiki includes task-focused mini-reviews of commonly used tools, and a growing collection of more than 100 HTS service providers. SEQwiki is available at: http://wiki.SEQanswers.com/.