Tuesday 31 July 2012

Picard release 1.74


http://picard.sourceforge.net/
30 July 2012
- Added a new "ProgressLogger" class that facilitates more useful and
standard progress logging for any program that iterates through a stream
of SAMRecords.  Adapted most command line programs to use it.
- Add support for targetedPcrMetrics and collected common HsMetrics and
TargetedPcrMetrics behavior into TargetMetricsCollector
- New program CollectTargetedPcrMetrics
- MultiHitAlignedReadIterator.java: Handle case where an alignment
record has no cigar elements that consume both the read and the
reference (e.g. the read is all soft-clipped)

Monday 30 July 2012

bioawk- AWK for gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.


Alerted to this on biostars.org

https://github.com/ialbert/bioawk/blob/master/README.bio.rst

About bioawk

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.
Bioawk adds a new -c fmt option that specifies the input format. The behavior of bioawk will vary depending on the value of fmt.
For the formats that bioawk recognizes, specially named variables are created. For example, for the supported sequence formats the $name, $seq and, if applicable, $qual variable names may be used to access the name, sequence and quality string of the sequence record in each iteration. Here is an example of iterating over a fastq file to print the sequences:
    awk -c fastq '{ print $seq }' test.fq  
For known interval formats the columns can be accessed via the variables called $start, $end, $chrom (etc.). For example, to print the feature length of a file in BED format one could write:
    awk -c bed '{ print $end - $start }' test.bed  
One important change (and innovation) over the original awk is that bioawk will treat sequences that may span multiple lines as a single record. The parsing, implemented in C, may be several orders of magnitude faster than similar code programmed in interpreted languages: Perl, Python, Ruby.
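To make the "single record" idea concrete, here is a minimal Python sketch of joining a sequence that is wrapped over several lines into one record. It is not bioawk's C parser; the function name `read_fasta` and the toy input are made up for illustration.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs, joining sequences wrapped over several lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Toy input: seq1 is wrapped over two lines but comes back as one record.
example = io.StringIO(">seq1 demo\nACGT\nACGT\n>seq2\nGG\n")
records = list(read_fasta(example))
```

Each element of `records` is one whole sequence, which is roughly what $name and $seq expose per iteration in bioawk.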
When the format mode is header or hdr, bioawk parses named columns. It automatically adds variables whose names are taken from the first line and values from the column index. Special characters are converted to an underscore.
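The header mode can be pictured with a small Python sketch: column names from the first line become variable names that hold the (1-based) column index. Which characters count as "special" is an assumption here (anything other than letters, digits and underscore), and `header_to_indices` is a hypothetical helper, not part of bioawk.

```python
import re

def header_to_indices(header_line, sep="\t"):
    """Map sanitised column names from the header line to 1-based column indices."""
    names = header_line.rstrip("\n").split(sep)
    # Assumption: every character outside [0-9A-Za-z_] becomes an underscore.
    return {re.sub(r"[^0-9A-Za-z_]", "_", name): i
            for i, name in enumerate(names, start=1)}

cols = header_to_indices("gene id\tp-value\tchrom\n")
row = "BRCA2\t0.01\tchr13".split("\t")
# Look a field up by name, the way $p_value would in header mode.
p_value = row[cols["p_value"] - 1]
```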
Bioawk also adds a few built-in functions including, as of now, and(), or(), xor(), and others (see comprehensive list below).
Detailed help is maintained in the bioawk manual page, to access it type:
    man ./awk.1  

Usage Examples

  1. Extract unmapped reads without header:
        awk -c sam 'and($flag,4)' aln.sam.gz  
  2. Extract mapped reads with header:
        awk -c sam -H '!and($flag,4)'
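For readers who do not have bioawk at hand, the two filters above boil down to testing bit 0x4 (read unmapped) of the SAM FLAG column. A rough Python equivalent, with hypothetical function names and toy records, might look like this:

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"

def is_unmapped(sam_line):
    """Return True if the FLAG column (2nd field) has the 0x4 bit set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & UNMAPPED)

def split_reads(lines):
    """Separate header lines, mapped reads and unmapped reads."""
    header, mapped, unmapped = [], [], []
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        elif is_unmapped(line):
            unmapped.append(line)
        else:
            mapped.append(line)
    return header, mapped, unmapped

# Toy records: flags 4 and 77 both include the 0x4 bit, flag 0 does not.
toy = [
    "@HD\tVN:1.0",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
header, mapped, unmapped = split_reads(toy)
```

Example 1 corresponds to printing `unmapped` only; example 2 corresponds to printing `header` followed by `mapped`.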

Saturday 28 July 2012

Speed up Chromium downloading in Ubuntu with PyAxelWS Accelerator

PyAxelWS Download Accelerator is a Chromium browser plugin for Linux. It is a clone of pyaxel that offers accelerated downloads, persistent reconnection, resumable downloads, download speed limiting, and download progress indication. To make the accelerator work you need a Python script running as the server and a Chromium extension running as the client.
http://ubuntuguide.net/speed-up-chromium-downloading-in-ubuntu-with-pyaxelws-accelerator

Check out the Chrome client

    The original pyaxel is a CLI-based download accelerator in Python that works seamlessly in networks that are behind proxy servers and for protocols like http and ftp. Features present in pyaxel 0.1 are accelerated downloads, persistent reconnection, resumable downloads, download speed limiting, and download progress indication.

    Pyaxelws is a clone of pyaxel that introduces new features such as HTML5 Websocket server implementation, a Javascript library that provides an interface to the server, as well as a client application designed for the Chrome web browser.

Reduced Bams, neatest thing out of GATK 2.0

http://www.broadinstitute.org/gatk/guide/workflows

Reducing BAMs to minimize file sizes and improve calling performance

ReduceReads is a novel (perhaps even breakthrough?) GATK2 data compression algorithm. The purpose of ReduceReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.
You can find more information about reduced reads in some of our presentations in the archive.
ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't qualify as an archive data compression format.
Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReduceReads.
for each sample
    sample.reduced.bam <- ReduceReads(sample.bam)

Thursday 26 July 2012

Interactive Plotting with Manipulate / Advanced Topics / Knowledge Base - RStudio Support

Oh gosh, I can't believe how much time I wasted hardcoding values, then tweaking them by editing the numbers and rerunning the scripts.


RStudio includes a manipulate package that enables the addition of interactive capabilities to standard R plots. This is accomplished by binding plot inputs to custom controls rather than static hard-coded values.
http://support.rstudio.org/help/kb/advanced/interactive-plotting-with-manipulate

Basic Usage

The manipulate function accepts a plotting expression and a set of controls (e.g. slider, picker, or checkbox) which are used to dynamically change values within the expression. When a value is changed using its corresponding control the expression is automatically re-executed and the plot is redrawn.

For example, to create a plot that enables manipulation of a parameter using a slider control you could use syntax like this:

    library(manipulate)
    manipulate(plot(1:x), x = slider(1, 100))

After this code is executed the plot is drawn using an initial value of 1 for x. A manipulator panel is also opened adjacent to the plot which contains a slider control used to change the value of x from 1 to 100.

Slider Control

The slider control enables manipulation of plot variables along a numeric range. For example:

    manipulate(
      plot(cars, xlim=c(0,x.max)),
      x.max=slider(15,25))

Results in this plot and manipulator (screenshot not reproduced here).

Slider controls also support custom labels and step increments.

Wednesday 25 July 2012

GeneTalk: an expert exchange platform for assessing rare sequence variants in personal genomes.




1. Bioinformatics. 2012 Jul 23. [Epub ahead of print]

GeneTalk: an expert exchange platform for assessing rare sequence variants in personal genomes.

Kamphans T, Krawitz PM.

Source

GeneTalk, Finckensteinallee 84, 12205 Berlin, Germany.

Abstract

Summary Next-generation sequencing (NGS) has become a powerful tool in personalized medicine. Exomes or even whole genomes of patients suffering from rare diseases are screened for sequence variants. After filtering out common polymorphisms, the assessment and interpretation of detected personal variants in the clinical context is an often time consuming effort. We have developed GeneTalk, a web-based platform that serves as an expert exchange network for the assessment of personal and potentially disease relevant sequence variants. GeneTalk assists a clinical geneticist who is searching for information about specific sequence variants and connects this user to other users with expertise for the same sequence variant. AVAILABILITY: GeneTalk is available at www.gene-talk.de. Users can login without registering in a demo account. CONTACT: peter.krawitz@gene-talk.de.

PMID: 22826540 [PubMed - as supplied by publisher]

howto SKAT R library

The R package is simple to install though you will need a recent version of R (compile as local user if you have no admin rights)

download from 

~/Downloads$ R CMD INSTALL SKAT_0.76.tgz 
* installing to library '/Library/Frameworks/R.framework/Versions/2.14/Resources/library'
* installing *binary* package 'SKAT' ...

* DONE (SKAT) 

Tuesday 24 July 2012

Win an iPad by identifying the causal variant, anyone?

TAKE THE CHALLENGE NOW:

You will be provided with an Analysis Case that includes existing Complete Genomics whole human genome sequencing data and phenotypic information.  Use Ingenuity Variant Analysis to determine the causal variant.  When you decide which variant is most likely causing the symptoms in the Case, submit your answer via the Feedback button within the application.  All correct entries will be entered into a drawing for a chance to win an Apple iPad!
http://pages.ingenuity.com/CaseofMonthMay2012_Landingpage2.html

I am not eligible :( 

ELIGIBILITY:  Participation in the Challenge is open only to those 21 years and older legal residents of the 50 United States (excluding Puerto Rico and Rhode Island) and District of Columbia, and Canada (excluding Quebec) at the date of entry, who complete registration and submit an entry.

Saturday 21 July 2012

MolBioLib: A C++11 Framework for Rapid - PubMed Mobile

Abstract MOTIVATION: We developed MolBioLib to address the need for adaptable next-generation sequencing analysis tools. The result is a compact, portable, and extensively tested C++11 software framework and set of applications tailored to the demands of next-generation sequencing data and applicable to many other applications. MolBioLib is designed to work with common file formats and data types used both in genomic analysis and general data analysis. A central relational-database-like Table class is a flexible and powerful object to intuitively represent and work with a wide variety of tabular datasets, ranging from alignment data to annotations. MolBioLib has been used to identify causative SNPs in whole genome sequencing, detect balanced chromosomal rearrangements, and compute enrichment of mRNAs on microtubules, typically requiring applications of under 200 lines of code.

http://www.ncbi.nlm.nih.gov/m/pubmed/22815363/

Sequencing the genome of an entire population | ScienceNordic

http://sciencenordic.com/sequencing-genome-entire-population
The FarGen project is preparing to sequence the genetic material of the entire 50,000 population of the Faroe Islands, and could become a model for personalised medicine throughout the world. "We will not only be creating a genetic biobank but a completely new health system," says program director Bogi Eliasen. ScienceNordic

Galaxy July 20, 2012 Distribution & News Brief



Complete News Brief
http://wiki.g2.bx.psu.edu/DevNewsBriefs/2012_07_20

Highlights:
http://wiki.g2.bx.psu.edu/News/Jul202012%20Distribution%20News%20Brief
  • Freebayes has moved from the Galaxy distribution to the Galaxy's Main Tool Shed

  • EMBOSS version 5.0.0 tool dependencies in the emboss_5 repository of the Galaxy Main Tool Shed updated to include information for automatically installing.

  • Tool Shed now also supports specifying the third party tool dependencies to be automatically installed in new repositories

  • Admin Genome Indexing is now in BETA. Download, index, and track progress right from the admin UI!

  • Improved Error Handling that captures EXIT codes, STDOUT, and STDERR from tools in XML. Be sure to read full details.

  • TopHat2/Bowtie2 latest support includes option to 'report discordant pairs', updated tests, and more preset options.

  • Trackster new parameter space visualization. Includes BRAND NEW Features!! More details coming soon, but give a test drive now.


http://getgalaxy.org

new:     % hg clone http://www.bx.psu.edu/hg/galaxy galaxy-dist
upgrade: % hg pull -u -r ec29ce8e27a1

Thanks for using Galaxy!
We hope to see everyone in Chicago @ GCC2012!!


The Galaxy Team

___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

Crossbow 1.2.0 released


Version 1.2.0 - July 20, 2012
   * Added support for Hadoop version 0.20.205.
   * Dropped support for Hadoop versions prior to 0.20.
   * Updated default Hadoop version for EMR jobs to 0.20.205.
   * Updated Bowtie version used to 0.12.8.
   * Fixed issues with streaming jar version parsing
   * Fixed documentation bugs regarding --sra-toolkit option, which is
     superseded by the --fastq-dump option.

http://bowtie-bio.sourceforge.net/crossbow

Thanks,
Ben

------------------------------------------------------------------------------

_______________________________________________
Bowtie-bio-announce mailing list
https://lists.sourceforge.net/lists/listinfo/bowtie-bio-announce

Friday 20 July 2012

GATK 2.0 On July 23rd, 2012


GATK 2.0

On July 23rd, 2012, the Genome Sequencing and Analysis (GSA) team will release a beta of GATK 2.0. GATK 2.0 includes all of the original GATK 1.x tools as well as many newer and more advanced tools for error modeling, data compression, and variant calling:
  • Base quality score recalibration (BQSR) v2, an upgrade to BQSR that generates a base substitution, insertion, and deletion error model.
  • ReduceReads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
  • The HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
  • Powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.
    Mixed open/closed source model
Source: http://gatk.vanillaforums.com/discussion/17/gatk-2-0-announcement

is there an "average" chromosome or a good abridged chromosome?

was reading the tweet conversation of
 

And I've often wondered about the practice of using the 1st 10Mb of Chr1 as a test alignment target for sequencing runs on the SOLiD machine and how informative that might be.

Since someone probably has summary statistics for individual chromosomes, I wonder: is there a particular chromosome that's representative of the whole genome, or perhaps a chimera representing the rest of the chromosomes, that would be good to run through various sequencing platforms to validate results or compare sequencing platform error profiles ...

just a random thought

I've always thought chr20 is the most gentlemanly. 1,6,9 : crazy hetreochromatin. 19 - a zoo of zinc fingers. [1/2]

Stay away from the acrocentrics (13,14,15,21,22), and chr17 has >> duplications. X and Y obviously odd.


Better than chrY, that filthy degenerate. RT : Chromosome six, what is WRONG with you???

Myrna 1.2.0 released


Version 1.2.0 - July 19, 2012
   * Added support for Hadoop version 0.20.205.
   * Dropped support for Hadoop versions prior to 0.20.
   * Updated default Hadoop version for EMR jobs to 0.20.205.
   * Updated Bowtie version used to 0.12.8.
   * Updated R version used to 2.14.2.
   * Updated jar files to use Ensembl v67 (used to be v61).  In the
     process, fixed an issue whereby $MYRNA_HOME/reftools scripts
     would die due to unexpected new format of Ensembl database schema.
   * Fixed issues with streaming jar version parsing
   * Fixed documentation bugs regarding --sra-toolkit option, which is
     superseded by the --fastq-dump option.
   * Removed some diagnostic counters because Hadoop began to enforce
     an upper limit on the number of counters allowed per job.  For
     instance, per-label summary statistics are no longer printed in
     the Normalize step.

Thanks,
Ben


Thursday 19 July 2012

Efficiency and power as a function of sequence coverage, SNP array density, and imputation.


PMID: 22807667 [PubMed - in process]; PMCID: PMC3395607 (Free PMC Article)

1. PLoS Comput Biol. 2012 Jul;8(7):e1002604. Epub 2012 Jul 12.

Efficiency and power as a function of sequence coverage, SNP array density, and imputation.

Flannick J, Korn JM, Fontanillas P, Grant GB, Banks E, Depristo MA, Altshuler D.

Source

Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.

Abstract

High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms ([Formula: see text]), when low coverage sequence reads are added to dense genome-wide SNP arrays - the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.
PMID: 22807667 [PubMed - in process]

GRCh38 in the summer of 2013! Genome Reference Consortium

Are you prepared? ..... 


We are planning to update the human reference assembly to GRCh38 in the summer of 2013. If you have questions or concerns about this, let us know.

See our blog for more information on why we think this is important.
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

FAQs » Mobile Element Insertion Detection

Mobile Element Insertion (MEI) Detection

http://www.completegenomics.com/FAQs/MEI-Detection/

[galaxy-user] Galaxy intro webinar

---------- Forwarded message ----------
From: wlathe


Greetings all!

We will be giving a 1 hour 15 minute Intro to Galaxy webinar tomorrow (Thursday, July 19th) at 11am PDT, 2pm EDT. Registration is open and free. Registrants will also have access to a recording, slides and exercises. The webinar will go through the basics of using Galaxy, tools, histories and workflows. If you'd like to attend or would like to see more information, check out our webinar registration page: http://www.openhelix.com/cgi/webinars.cgi

Trey
(OpenHelix)



Slim-Filter: an interactive windows-based application for illumina genome analyzer data assessment and manipulation.

http://www.ncbi.nlm.nih.gov/pubmed/22800377

Slim-Filter: An interactive windows-based application for Illumina Genome Analyzer data assessment and manipulation

 

G. Golovko1,2, K. Khanipov1, M. Rojas1,2, A. Martinez-Alcántara1, J. J. Howard1, E. Ballesteros1, S. Gupta1, W. Widger1,3, and Y. Fofanov1,2,3

1Center for BioMedical and Environmental Genomics, University of Houston, Houston, TX, USA. Department of Computer Science2 and the Department of Biology and Biochemistry3, University of Houston, Houston, TX, USA 77204

      

    The emergence of Next Generation Sequencing technologies has made it possible for individual investigators to generate gigabases of sequencing data per week.  Effective analysis and manipulation of these data is limited due to large file sizes, so even simple tasks such as data filtration and quality assessment have to be performed in several steps.  This requires (potentially problematic) interaction between the investigator and a bioinformatics/computational service provider.  Furthermore, such services are often performed using specialized computational facilities. 

    We present a windows-based application, Slim-Filter, designed to interactively examine the statistical properties of sequencing reads produced by Illumina Genome Analyzer and to perform a broad spectrum of data manipulation tasks including: filtration of low quality and low complexity reads; filtration of reads containing undesired subsequences (such as parts of adapters and PCR primers used during the sample and sequencing libraries preparation steps); excluding duplicated reads (while keeping each read's copy number information in a specialized data format); and sorting reads by copy numbers allowing for easy access and manual editing of the resulting files.  Slim-Filter is organized as a sequence of windows summarizing the statistical properties of the reads.  Each data manipulation step has roll-back abilities, allowing for return to previous steps of the data analysis process. Slim-Filter is written in C++ and is compatible with fasta, fastq, and specialized AS file formats.

Slim-Filter performance was estimated using the following computer configurations:

       Windows                                            Linux
OS     Windows 2008 Server SP1                            CentOS 5.6
CPU    Dual Quad Core Intel™ Xeon W5590 3.33GHz, 8M L3    AMD Magny Cours 6128 8-Core Processor, 2.0 GHz, 12MB Cache
RAM    128GB, DDR3 RDIMM, 1066MHz, ECC                    512 GB DDR3 1333Mhz ECC
HDD    2TB SATA 3.0Gb/s, 7200 RPM                         2TB SATA 3.0Gb/s, 7200 RPM

Number of reads    RAM required,    RAM required,    Time to apply all possible    Time to apply all possible
(36 bases long)    Linux (Mb)       Windows (Mb)     filter settings,              filter settings,
                                                     Windows (seconds)             Linux (seconds)
10,000             10               32-40            <1                            <1
100,000            25               85-100           6.5                           4.5
1,000,000          300              600-800          66                            40
10,000,000         2,000            3,500            590                           442
50,000,000         13,000           45,000           3,000                         2,156


Crowd-funded Exome Sequencing for Rare Genetic Diseases

Alerted to this by Massgenomics

Crowd-funded Exome Sequencing for Rare Genetic Diseases
http://massgenomics.org/2012/07/crowd-funded-exome-sequencing-for-rare-genetic-diseases.html

Crowdfunding and Families with Rare Diseases

That being said, I'm writing about this because it's a good story. Friends, relatives, and total strangers made cash donations, in tough economic times, to help this little girl.

As Daniel MacArthur (@dgmacarthur) put it on Twitter, this story makes me feel good about humanity.

Exome and whole-genome sequencing have enabled the discovery of many causal variants behind rare disorders, but this is the first time it's been accomplished by raising funds on the internet. The "crowd-funding" model, as it's called, may offer some hope to the thousands of families dealing with a rare genetic disorder. The majority of them won't have the opportunity to be studied in government-funded research. And next-gen sequencing isn't usually covered by health insurance. If they can raise the funds on their own, however, a genetic diagnosis may be possible.

It will not come easy. The analysis and interpretation of sequence data requires considerable time and expertise. And as I've recently written, exome sequencing does not guarantee an answer even for Mendelian diseases. Even so, discoveries are possible. That likely provides a glimmer of hope for those with rare genetic disorders.

Wednesday 18 July 2012

we all have these days ... identify-indel-regions.pl - popoolation2 - Allows comparision of allele frequencies between two ore more populations - Google Project Hosting

http://code.google.com/p/popoolation2/source/browse/trunk/indel_filtering/identify-indel-regions.pl?r=147

my $nucs="";
    while(@ar)
    {
        my $cov=shift @ar;
        my $n=shift @ar;
        my $q=shift @ar;
        die "mpileup fucked" unless(defined($q));
        $nucs.=$n;
    }
    #


lol, chanced on this while trying to find out why mpileup segfaulted on me (exit status 139) ...
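As a side note on the 139: shells report a process killed by signal N as exit status 128 + N, so 139 corresponds to signal 11, which is SIGSEGV (a segmentation fault). A small Python sketch of that arithmetic, with a made-up helper name:

```python
import signal

def describe_exit_status(status):
    """Translate a shell exit status above 128 into the name of the killing signal."""
    if status > 128:
        try:
            # Shell convention: status = 128 + signal number.
            return signal.Signals(status - 128).name
        except ValueError:
            return "unknown signal %d" % (status - 128)
    return "exited with status %d" % status

segfault = describe_exit_status(139)
```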

Principal Components Analysis Using R - P1 - YouTube

"You can learn everything via YouTube these days" - Anonymous quote ...

http://www.youtube.com/watch?v=5zk93CpKYhg&feature=related

Part 1 - This video tutorial guides the user through a manual
principal components analysis of some simple data. The goal is to
acquaint the viewer with the underlying concepts and terminology
associated with the PCA process. This will be helpful when the user
employs one of the "canned" R procedures to do PCA (e.g. princomp,
prcomp), which requires some knowledge of concepts such as loadings
and scores. You may download the R code used in this tutorial from

http://www.bimcore.emory.edu/BB_phys_stats_ex1.R
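The tutorial's own code is in R (see the link above); as a rough companion, here is a pure-Python sketch of a manual PCA on two variables, just to pin down the terms. It is not the video's code: the function name and toy data are invented, and "loadings" is used here in the sense of the unit eigenvectors of the covariance matrix, with "scores" being the centred data projected onto them.

```python
import math

def pca_2d(xs, ys):
    """Manual PCA for two variables; returns (eigenvalues, loadings, scores)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u in cx) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigenvalues = [centre + radius, centre - radius]
    if b == 0:
        # Uncorrelated variables: the original axes are the components.
        loadings = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        loadings = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (A - lam*I) v = 0 when b != 0
            norm = math.hypot(vx, vy)
            loadings.append((vx / norm, vy / norm))
    # Scores: each centred observation projected onto each component.
    scores = [[u * vx + v * vy for vx, vy in loadings] for u, v in zip(cx, cy)]
    return eigenvalues, loadings, scores

# Toy data, purely illustrative.
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
eigenvalues, loadings, scores = pca_2d(xs, ys)
```

The eigenvalues are the variances of the scores along each component, which is the link between the manual calculation and the summaries printed by canned routines.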

Tuesday 17 July 2012

SNAP Sequence Aligner

SNAP is a new sequence aligner that is 10-100x faster and simultaneously more accurate than existing tools like BWA, Bowtie2 and SOAP2. It runs on commodity x86 processors, and supports a rich error model that lets it cheaply match reads with more differences from the reference than other tools. This gives SNAP up to 2x lower error rates than existing tools and lets it match larger mutations that they may miss.
http://snap.cs.berkeley.edu/

Sunday 15 July 2012

Udacity - 21st Century University

http://www.udacity.com/

Hmm seem to be on a roll of OOT blog posts today .. chanced on this link for the perennial FAQ of where do I learn python/R

Udacity is a totally new kind of learning experience. You learn by solving challenging problems and pursuing udacious projects with world-renowned university instructors (not by watching long, boring lectures). At Udacity, we put you, the student, at the center of the universe.


Psychopathy Prediction Based on Twitter Usage - Kaggle

fascinating!
would be cool if it was a model based on this AND sequence data ...

https://www.kaggle.com/c/twitter-psychopathy-prediction

The  aim of the competition is to determine to what degree it's possible to predict people with a sufficiently high degree of Psychopathy based on Twitter usage and Linguistic Inquiry.

The organizers provide all interested participants an anonymised dataset of users' self-assessed psychopathy scores together with 337 variables derived from functions of Twitter information, usage and linguistic analysis. Psychopathy scores are based on a checklist developed by Professor Del Paulhus at the University of British Columbia.

The model should aim to identify people scoring high in Psychopathy, for the purpose of this competition, defined as 2 SD's above a mean of 1.98. This accounts for roughly 3% of the entire sample and therefore the challenge with this dataset is developing a model to work with a highly imbalanced dataset.
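The cutoff described above is simply mean + 2 SD. The competition text gives the mean (1.98) but the SD is not stated in this post, so the value used below is a placeholder, and the scores are toy numbers; treat this as a sketch of the labelling step only. For comparison, a normal distribution would put about 2.3% of values more than 2 SD above the mean, in the same ballpark as the roughly 3% quoted.

```python
import math

def high_score_cutoff(mean, sd, n_sd=2.0):
    """Cutoff for the 'high' class: n_sd standard deviations above the mean."""
    return mean + n_sd * sd

def label_scores(scores, cutoff):
    """1 for scores above the cutoff (rare positive class), 0 otherwise."""
    return [1 if s > cutoff else 0 for s in scores]

def normal_tail_fraction(n_sd):
    """Fraction of a normal distribution lying more than n_sd SDs above the mean."""
    return 0.5 * math.erfc(n_sd / math.sqrt(2))

SD_PLACEHOLDER = 0.6  # assumption for illustration; not taken from the competition
cutoff = high_score_cutoff(1.98, SD_PLACEHOLDER)
toy_scores = [1.2, 1.9, 2.4, 3.3, 2.0, 1.7]
labels = label_scores(toy_scores, cutoff)
positive_rate = sum(labels) / len(labels)
```

With so few positives, plain accuracy is a poor yardstick: a model that always predicts 0 would already be right about 97% of the time on the real data.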

The best performing model(s) will be formally cited in a future paper/papers. The authors of the winning model may also be invited to attend future conferences to discuss their model.


Saturday 14 July 2012

Fwd: An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people.




1. Science. 2012 Jul 6;337(6090):100-4. Epub 2012 May 17.

An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people.

Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall MD, Nangle K, Wang J, Abecasis G, Cardon LR, Zöllner S, Whittaker JC, Chissoe SL, Novembre J, Mooser V.

Source

Department of Quantitative Sciences, GlaxoSmithKline (GSK), Research Triangle Park, NC 27709, USA. 

Abstract

Rare genetic variants contribute to complex disease risk; however, the abundance of rare variants in human populations remains unknown. We explored this spectrum of variation by sequencing 202 genes encoding drug targets in 14,002 individuals. We find rare variants are abundant (1 every 17 bases) and geographically localized, so that even with large sample sizes, rare variant catalogs will be largely incomplete. We used the observed patterns of variation to estimate population growth parameters, the proportion of variants in a given frequency class that are putatively deleterious, and mutation rates for each gene. We conclude that because of rapid population growth and weak purifying selection, human populations harbor an abundance of rare variants, many of which are deleterious and have relevance to understanding disease risk.

PMID: 22604722 [PubMed - in process]


Friday 13 July 2012

Archon Genomics X PRIZE presented by Express Scripts, an incentivized prize competition that will award $10 million

 Shawna McLean <Shawna.McLean xprize.org>


Dear Colleague:

Thank you for signing up to join the June 20th Cambridge Health Institute (CHI) webinar, "Towards a Medically Actionable Whole Genome Sequence."  Since you registered for the session, I wanted to take this opportunity to let you know that the link to listen to the session is still active: http://www.bio-itworldsymposia.com/Bioitsymposia_Content.aspx?id=115850

My name is Grant Campany, Senior Director of the Archon Genomics X PRIZE presented by Express Scripts, an incentivized prize competition that will award $10 million to the first team to rapidly, accurately and economically sequence 100 whole human genomes within 30 days, beginning on 5 Sep 2013.

Since you expressed interest in the webcast, I thought you might like to learn more about our Competition and how it will accelerate WGS.  I would like to invite you to actively participate in our upcoming public phase.  Please visit this link to learn how you can benefit:


Please stay connected with us on Facebook--we have an exciting announcement on Monday, July 23!


Thank you,  

Grant

Grant R. Campany | Senior Director & Prize Lead Archon Genomics X PRIZE

 

 



Thursday 12 July 2012

SQLite Python tutorial

The adage 'use it or lose it' is so true for programming. I often fall back on a language that I have used before (Python/SQLite) and find that I have completely forgotten how to code in it.

My code drawer also doesn't have an appropriate script, so I have to fall back on Google and try to understand the syntax again.

I got stuck in a couple of dead ends until I stumbled on this gem, which has extensive code examples and methods.

Have fun if you are new to this. 


Prerequisites

To work with this tutorial, we must have the Python language, the SQLite database, the pysqlite language binding and the sqlite3 command line tool installed on the system. If we have Python 2.5+ then we only need to install the sqlite3 command line tool. Both the SQLite library and the pysqlite language binding are built into the Python language.
http://zetcode.com/db/sqlitepythontutorial/
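
As a quick memory jogger, here is a minimal sketch of the built-in sqlite3 module in use. It is not taken from the tutorial; the table name, column names and sample values are made up for illustration, and an in-memory database is assumed so that nothing is written to disk.

```python
import sqlite3


def demo_query(min_reads=1000):
    """Create a throwaway table, insert rows and query them back."""
    # ":memory:" gives a temporary database that disappears on close.
    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Hypothetical table used only for this example.
    cur.execute(
        "CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, reads INTEGER)"
    )

    # Use ? placeholders instead of string formatting to pass values.
    rows = [("sampleA", 1200), ("sampleB", 3400), ("sampleC", 50)]
    cur.executemany("INSERT INTO samples (name, reads) VALUES (?, ?)", rows)
    con.commit()

    cur.execute(
        "SELECT name, reads FROM samples WHERE reads > ? ORDER BY reads",
        (min_reads,),
    )
    result = cur.fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for name, reads in demo_query():
        print(name, reads)
```

Swapping ":memory:" for a file name should give a database that persists between runs.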

Monday 9 July 2012

Cufflinks 2.0.2 released


2.0.2 release - 7/8/2012

This release fixes several bugs:

Some users were experiencing a crash on exit in Cufflinks when run with bias correction. The source of the crash has been fixed.
A few minor fixes in the estimation routines for cross-replicate variability.
Providing the same BAM file multiple times was producing inconsistent expression values. This has been corrected.

Bowtie-bio-announce mailing list
https://lists.sourceforge.net/lists/listinfo/bowtie-bio-announce

Saturday 7 July 2012

Installing breakdancer.sourceforge.net

http://breakdancer.sourceforge.net/moreperl.html

Dependencies for breakdancer-1.1_2011_02_21


sudo perl -MCPAN -e 'install Statistics::Descriptive'
sudo perl -MCPAN -e 'install Math::CDF'  
sudo perl -MCPAN -e 'install GD::Graph::histogram'

# If you haven't done so already, install samtools and add the binary to your PATH; bam2cfg.pl needs to call it.
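
A small sketch for checking whether samtools is visible on the PATH before running bam2cfg.pl. The helper name find_tool is my own and is not part of BreakDancer; it only wraps the standard library's shutil.which.

```python
import shutil


def find_tool(name="samtools"):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("samtools not found on PATH; bam2cfg.pl will not be able to call it")
    else:
        print("samtools found at", path)
```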

Writing and Optimizing Parallel Programs — A complete example | Future Chips

Histogram Problem

I am using the histogram kernel because it is very simple and clearly demonstrates some very important concepts in parallel programming: thread spawning, critical sections, atomic operations, barriers, false sharing, and thread join. Here is our problem statement:

Problem: Count the number of times each ASCII character occurs on a page of text.

Input: ASCII text stored as an array of characters.

Output: A histogram with 128 buckets –one for each ascii character– where each entry stores the number of occurrences of the corresponding ascii character on the page.
http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html
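
To make the problem statement concrete, here is a minimal Python sketch of the histogram kernel. It is not the article's code: the function names and the choice of four workers are assumptions for illustration. Each worker counts into its own private histogram and the results are merged at the end, which is one way to avoid a critical section around shared buckets. Threads in standard CPython are unlikely to make this faster because of the global interpreter lock, so treat it as a sketch of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 128  # one bucket per ASCII character


def count_chunk(chunk):
    """Count ASCII bytes in one chunk using a private (per-task) histogram."""
    local = [0] * NUM_BUCKETS
    for byte in chunk:
        if byte < NUM_BUCKETS:  # skip anything outside the ASCII range
            local[byte] += 1
    return local


def histogram(text, workers=4):
    """Split `text` (bytes) into chunks, count each chunk, then merge."""
    size = max(1, -(-len(text) // workers))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]

    total = [0] * NUM_BUCKETS
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Merging happens in the calling thread, so no lock is needed.
        for local in pool.map(count_chunk, chunks):
            for i, n in enumerate(local):
                total[i] += n
    return total


if __name__ == "__main__":
    page = b"hello parallel world"
    counts = histogram(page)
    print("l occurs", counts[ord("l")], "times")
```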

OTT: Stephen Wolfram Blog : A Moment for Particle Physics: The End of a 40-Year Story?

Off topic, but I do wonder whether there are 40-year-old mysteries in biology that might attract this much interest and, well, money to solve.


The announcement early yesterday morning of experimental evidence for what's presumably the Higgs particle brings a certain closure to a story I've watched (and sometimes been a part of) for nearly 40 years. In some ways I felt like a teenager again. Hearing about a new particle being discovered. And asking the same questions I would have asked at age 15. "What's its mass?" "What decay channel?" "What total width?" "How many sigma?" "How many events?"
http://blog.stephenwolfram.com/2012/07/a-moment-for-particle-physics-the-end-of-a-40-year-story/

Friday 6 July 2012

OmegaPlus: A Scalable Tool for Rapid Detection of Selective Sweeps in Whole-Genome Datasets.

1. Bioinformatics. 2012 Jul 3. [Epub ahead of print]

Alachiotis N, Stamatakis A, Pavlidis P.

Source

The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies.

Abstract

MOTIVATION: Recent advances in sequencing technologies have led to the rapid accumulation of molecular sequence data. Analyzing whole-genome data (as obtained from next-generation sequencers) from intra-species samples makes it possible to detect signatures of positive selection along the genome and therefore identify potentially advantageous genes in the course of the evolution of a population. We introduce OmegaPlus, an open-source tool for rapid detection of selective sweeps in whole-genome data based on linkage disequilibrium. The tool is up to two orders of magnitude faster than existing programs for this purpose and also exhibits up to two orders of magnitude smaller memory requirements. AVAILABILITY: OmegaPlus is available under GNU GPL at http://www.exelixis-lab.org/software.html. CONTACT: pavlos.pavlidis@h-its.org SUPPLEMENTARY INFORMATION: Available at Bioinformatics online.

Free Article
PMID: 22760304 [PubMed - as supplied by publisher]
http://bioinformatics.oxfordjournals.org/content/early/2012/07/03/bioinformatics.bts419.long

JoVE: Detection of Rare Genomic Variants from Pooled Sequencing Using SPLINTER

The video tutorial, along with the main text, does make it easier for non-bioinfo ppl to follow the workflow (and perhaps even convinces a few to buy a Mac to work as a bioinformatics console lol). The wet lab portion looks quite pointless, though the graphical explanations are useful. I wonder if this might become a trend: Galaxy-styled tutorial videos for papers. Hmmm



Cite this Article

Vallania, F., Ramos, E., Cresci, S., Mitra, R. D., Druley, T. E. Detection of Rare Genomic Variants from Pooled Sequencing Using SPLINTER. J. Vis. Exp. (64), e3943, DOI: 10.3791/3943 (2012).

Abstract

As DNA sequencing technology has markedly advanced in recent years [2], it has become increasingly evident that the amount of genetic variation between any two individuals is greater than previously thought [3]. In contrast, array-based genotyping has failed to identify a significant contribution of common sequence variants to the phenotypic variability of common disease [4,5]. Taken together, these observations have led to the evolution of the Common Disease / Rare Variant hypothesis suggesting that the majority of the "missing heritability" in common and complex phenotypes is instead due to an individual's personal profile of rare or private DNA variants [6-8]. However, characterizing how rare variation impacts complex phenotypes requires the analysis of many affected individuals at many genomic loci, and is ideally compared to a similar survey in an unaffected cohort. Despite the sequencing power offered by today's platforms, a population-based survey of many genomic loci and the subsequent computational analysis required remains prohibitive for many investigators.
To address this need, we have developed a pooled sequencing approach [1,9] and a novel software package [1] for highly accurate rare variant detection from the resulting data. The ability to pool genomes from entire populations of affected individuals and survey the degree of genetic variation at multiple targeted regions in a single sequencing library provides excellent cost and time savings over traditional single-sample sequencing methodology. With a mean sequencing coverage per allele of 25-fold, our custom algorithm, SPLINTER, uses an internal variant calling control strategy to call insertions, deletions and substitutions up to four base pairs in length with high sensitivity and specificity from pools of up to 1 mutant allele in 500 individuals. Here we describe the method for preparing the pooled sequencing library followed by step-by-step instructions on how to use the SPLINTER package for pooled sequencing analysis (http://www.ibridgenetwork.org/wustl/splinter). We show a comparison between pooled sequencing of 947 individuals, all of whom also underwent genome-wide array, at over 20 kb of sequencing per person. Concordance between genotyping of tagged and novel variants called in the pooled sample was excellent. This method can be easily scaled up to any number of genomic loci and any number of individuals. By incorporating the internal positive and negative amplicon controls at ratios that mimic the population under study, the algorithm can be calibrated for optimal performance. This strategy can also be modified for use with hybridization capture or individual-specific barcodes and can be applied to the sequencing of naturally heterogeneous samples, such as tumor DNA.
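
To get a feel for what "25-fold coverage per allele" means for a pool, here is a back-of-the-envelope sketch. It is not part of SPLINTER; the function name is made up, and it assumes diploid individuals, so a pool of 500 people carries 1,000 alleles at each position.

```python
def pooled_depth(individuals, coverage_per_allele=25, ploidy=2):
    """Rough depth arithmetic for a pooled sequencing library.

    Assumes every individual contributes `ploidy` alleles in equal amounts.
    """
    alleles = individuals * ploidy
    # Total reads needed at each position to average the target per-allele depth.
    total_depth = alleles * coverage_per_allele
    # A variant carried on a single allele makes up this fraction of the pool.
    singleton_fraction = 1 / alleles
    # Reads expected to carry that single mutant allele.
    expected_mutant_reads = total_depth / alleles
    return alleles, total_depth, singleton_fraction, expected_mutant_reads


if __name__ == "__main__":
    alleles, depth, fraction, mutant_reads = pooled_depth(500)
    print("alleles in pool:", alleles)
    print("total depth per position:", depth)
    print("frequency of a single mutant allele:", fraction)
    print("expected reads carrying it:", mutant_reads)
```

Under these assumptions a single mutant allele in a 500-person pool sits at a frequency of 0.1% and is expected to show up in about 25 of roughly 25,000 reads at that position, which is why the caller has to separate true variants from sequencing error.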

Wednesday 4 July 2012

STAT-Seq: Rapid WGS on the HiSeq 2500 - Implications for a Neonatal Intensive Care Unit recorded webinar







STAT-Seq: Rapid WGS on the HiSeq 2500: Implications for a Neonatal Intensive Care Unit
Thank you for registering for our recent webinar with Dr. Stephen Kingsmore, Director, Center for Pediatric Genomic Medicine at Children's Mercy Hospital, Kansas City, MO. Due to unforeseen travel delays, Dr. Kingsmore was unable to make the live event; however, a recording is now available for viewing.

We appreciate your interest in the Illumina Webinar Series and hope you will join us for future events. To have an Illumina representative contact you about our products and services, please complete our request form and someone will be in contact with you shortly.



Datanami, Woe be me