
Tuesday, 22 May 2012

data visualisation: Getting matplotlib on MacOS X

Decided I should probably play around with matplotlib finally since I found a manhattan plot python script that required it.

But omg, I thought installing stuff on Ubuntu was troublesome; Macs actually take the troublesomeness up a notch.

Gonna leave it unfinished for now .. but here are my notes, in chronological order of discovery

python eggs = FAILED (not sure why the script insists that it doesn't have write permissions to create files)

next up was trying this helpful post
Installing matplotlib in Lion
http://the.taoofmac.com/space/blog/2011/07/24/2222

oh okay I need homebrew
http://mxcl.github.com/homebrew/
https://github.com/mxcl/homebrew/wiki/installation

(oh wow I didn't know ruby is installed by default)
but hit another snag as per below ..
continue another day ..

Press enter to continue
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/. /usr/local/bin /usr/local/lib
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/. /usr/local/bin /usr/local/lib
==> Downloading and Installing Homebrew...
==> Installation successful!
You should run `brew doctor' *before* you install anything.
Now type: brew help
------------------------------------------------------------------------------------------------------------------------------------ 03:07:56
k@k:~$ brew doctor

Error: You have no /usr/bin/cc.
This means you probably can't build *anything*. You need to install the Command
Line Tools for Xcode. You can either download this from http://connect.apple.com
or install them from inside Xcode's Download preferences. Homebrew does not
require all of Xcode! You only need the Command Line Tools package!
Error: Git could not be found in your PATH.
Homebrew uses Git for several internal functions, and some formulae use Git
checkouts instead of stable tarballs. You may want to install Git:
brew install git
Error: Your compilers are different from the standard versions for your Xcode.
If you have Xcode 4.3 or newer, you should install the Command Line Tools for
Xcode from within Xcode's Download preferences.
Otherwise, you should reinstall Xcode.
Error: Your Xcode is configured with an invalid path.
You should change it to the correct path. Please note that there is no correct
path at this time if you have *only* installed the Command Line Tools for Xcode.
If your Xcode is pre-4.3 or you installed the whole of Xcode 4.3 then one of
these is (probably) what you want:

sudo xcode-select -switch /Developer
sudo xcode-select -switch /Applications/Xcode.app/Contents/Developer

DO NOT SET / OR EVERYTHING BREAKS!

Saturday, 19 May 2012

[Denovoassembler-users] Ray v2.0.0-rc7 is available online !

---------- Forwarded message ----------
From: Sébastien Boisvert
Date: Thu, May 17, 2012 at 11:02 PM
Subject: [Denovoassembler-users] Ray v2.0.0-rc7 is available online !

Hello !

I am proud to announce the immediate availability of the Ray assembler
version 2.0.0 release candidate 7, code name "Dark Astrocyte of Knowledge".

This version ships with RayPlatform v1.0.2, code name "Timely Gate of
Yields".

Link for download: http://denovoassembler.sourceforge.net/

Changes in Ray

* The CMakeList file was updated.
* GC content for contigs are dumped in XML files.
* New option -one-color-per-file for graph coloring.
* Optimized file system input/output operations.
* Network testing is more verbose.
* Fixed an integer overflow bug in the scaffolder.
* New guide in Documentation/ for software message routing.
* Fixed an integer overflow bug in the profiler.
* Fixed a synchronization bug in the coloring algorithm.
* Increased the sensitivity of the biological profiling algorithms.
* Disabled the plugin for neighbourhoods.
* New plugin to compute gene ontology profiles.
* Added various missing code headers.
* Simplified the plugin creation process.
* Fixed some divisions per 0.
* Fixed a synchronization bug for gene ontology.
* Added simple profile files for sequence abundance, taxonomy profiles
and gene ontology profiles.
* A bug that caused k-mers with >= 65536 coverage to have less coverage
was fixed.
---> This was a long-standing bug that caused some issues.
* Added some datatypes.

Changes in RayPlatform

* Command line arguments can be obtained.
* Simplified the plugin creation process.
* Fixed two divisions per 0.
* Added some datatypes.

seb

_______________________________________________
Denovoassembler-users mailing list

https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Tuesday, 8 May 2012

I love IGV!

Well sorry for the proclamation

But I think it's really nice to have a relatively small footprint for an application that's cross-platform.

Anyway, I needed to view a few regions of a merged BAM file that's aligned to hg18 from the 1000 Genomes Project.
You would think I might have copies of hg18 lying around on every server so that I can just use 'samtools tview in.bam ref.fasta'

But unfortunately I don't, and I am not really looking forward to downloading hg18 just for a verification exercise.

Luckily I remembered that IGV has prebuilt human genome references that are available on the fly.
With a few clicks, I was happily viewing the regions of interest. (I downloaded the compiled binary instead of using the 'launch from browser' option, but that would have been cool too.)

Among the list of available human reference genomes there was actually one specifically for 1KG. How cool is that?

Go on take the application for a run if you haven't!

http://www.broadinstitute.org/software/igv/download

Friday, 18 November 2011

VCFtools BEDtools compare, intersect, merge

A good computational biologist is only as good as the tools he uses (or maybe how good he is at Google) rofl kidding ...
Life is always easier when you find the correct tool.
I also adopt the path of least resistance when trying to solve problems that are more common than I imagine.
There are always the good old Linux tools for comparing SNPs called by different programs / options

grep | sed | awk | cut | diff | comm
see http://ged.msu.edu/angus/tutorials-2011/snp_tutorial.html

And if you are working with NGS data, you most probably already have samtools installed on your system and you might have used bcftools.
Did you know that there's also an (unrelated) set of tools called vcftools?
http://vcftools.sourceforge.net/

The VCFtools package is broadly split into two sections:

The vcftools binary program, generally used to analyse VCF files.
The Vcf.pm perl module, which is a general Perl API containing a core of the utilities vcf-convert, vcf-merge, vcf-compare, vcf-isec, and others.
Documentation
Examples of usage by topic

Then there's also the widely used BEDTools http://code.google.com/p/bedtools/

which I highly recommend keeping as part of your tools collection. Check out the link below

Do watch out for this 'oversight' in vcftools, as pointed out on SEQanswers.
Overlap number discrepancy between VCFTools and BEDTools

Usage	Examples of common usage. Featured

Friday, 10 June 2011

BWA to support multiple hits as separate lines in SAM with addon pl script

This is the reason why I love open source communities / software.
After a brief discussion and a request for BWA to also report multiple hits as separate entries in SAM/BAM files, the author of BWA (Li Heng) promptly released an add-on Perl script to allow for this feature.

Commercial providers: try to beat that for speed of new feature release!

Anyway, if you are interested in the usage:

A new script xa2multi.pl is added to convert XA:Z tag to multiple lines.

bwa samse ref.fa reads.sai reads.fq.gz | xa2multi.pl > out.sam
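
To make the idea concrete, here is a minimal Python sketch of what expanding an XA:Z tag involves. It is not xa2multi.pl itself: the function name expand_xa and the toy SAM line are made up for illustration, and it assumes BWA's XA layout of "chr,(+|-)pos,CIGAR,NM;" for each alternative hit.

```python
def expand_xa(sam_line):
    """Return (chrom, strand, pos, cigar, nm) tuples: primary hit first, then XA hits.

    Illustrative only: a real converter would also have to rewrite flags,
    mate fields and sequence orientation, none of which is attempted here.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    strand = "-" if flag & 0x10 else "+"  # 0x10 = read is on the reverse strand
    nm = None
    xa_value = ""
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
        elif tag.startswith("XA:Z:"):
            xa_value = tag[5:]
    hits = [(chrom, strand, pos, cigar, nm)]
    # Each alternative hit is "chr,(+|-)pos,CIGAR,NM" terminated by ';'
    for entry in filter(None, xa_value.split(";")):
        alt_chrom, signed_pos, alt_cigar, alt_nm = entry.split(",")
        hits.append((alt_chrom, signed_pos[0], int(signed_pos[1:]), alt_cigar, int(alt_nm)))
    return hits


# Hypothetical single-end record with two alternative hits
example = "read1\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\tXA:Z:chr2,-500,4M,2;chr3,+42,4M,1;"
for hit in expand_xa(example):
    print(hit)
```

Each tuple corresponds to one line that a converter would emit, which is why downstream tools that expect one alignment per line can then see every hit.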

A related question was also posted on biostars

Question: How to force 'bwa samse' to output multiple hits in .sam format?
http://www.biostars.org/p/45430/

Saturday, 4 June 2011

Posting of Ion Torrent protocols online is a violation of Terms and Conditions

http://seqanswers.com/forums/showthread.php?t=10400

Just got to know of the rather disturbing fact that the SEQanswers admins were asked (nicely) to take down protocols for the Ion Torrent that had been posted online.

I had wished to post the adaptor sequences for RNA multiplex libraries online as a help for bioinformaticians who might have gotten their data from a service provider or have problems getting a prompt response from the ever-friendly FAS. I mean, if it's online, I need not bother them, yeah?

Now I wonder if I might be violating terms and conditions somewhere out there.

I would argue for posting protocols online.
Lab protocols are meant to be optimised in every lab.
Case in point? You promote active discussion of the product, and once you have that, you have an active support community that beats a whole army of FAS with trained responses to problems in protocols.
See this imaginary conversation:

Researcher A: Making that incubation step 10 secs longer improves your yield? Good for you! But it didn't work for me; any advice on where else I can tweak it?
Researcher B: Yeah sure, you see page 15, step 8A? Don't overdo that step as it affects yield. Be warned that it might affect the quality of the final output, but let's solve one problem at a time. I tried that last week!

Agilent grants for systems biology software development

RE: Agilent grants for systems biology software development

Dear Kevin,
I am writing to you on behalf of Leo Bonilla, Director of Marketing for Integrated Biology, Agilent Technologies, Inc. Leo and the Integrated Biology team at Agilent have been reading your blog, My Weblog on Bioinformatics, Genome Science, Next Generation Sequencing, and thought you may be interested in sharing a funding opportunity with your readers. Agilent is fostering integrated, whole-systems approaches to biological research through two $75,000 US grants (application deadline August 12, 2011). Funds will support academic or nonprofit research projects covering the development of open source software tools for integrating data from different omics platforms—genomics, transcriptomics, proteomics, and metabolomics. For full details on eligibility, submission, and review process, please visit www.Agilent.com/lifesciences/emerginginsights.
If you have any questions or would like to interview Leo about the grant program, I’d be happy to set up a phone call. Just reply to my email and I’ll connect you with Leo.

Readers if you have any questions post them in the comments and I shall pass them on :)

Integrated Biology - eMerging Insights Grants

Fostering integrated, whole-systems approaches to biological research with two $75,000US grants for open source data-integration tool development

The different omics platforms (genomics, transcriptomics, proteomics and metabolomics) are generating new insights into how biological systems work at a molecular level. Although each individual omics approach provides a global view of a specific cellular process, this view is limited to only one aspect of the biological system. In order to gain a comprehensive understanding of the system as a whole, researchers are faced with the challenge of merging these very different data sets.
Agilent is supporting scientists who are taking on this challenge through our eMerging Insights Grant Program. We currently have two open initiatives for academic and non-profit researchers developing and/or improving open source, Agilent-compatible software tools to integrate multi-omics data. Each initiative will provide $75,000US to a single academic or non-profit research lab in fiscal year 2011. A proof-of-concept prototype or working solution must be demonstrated at the end of one year, using either existing data sets from the investigator’s own lab or institution, or from new or existing datasets produced at Agilent.
One of the most important outcomes of our eMerging Insights Grant Program is the development of open source* solutions for the analytical life science community. Any tools developed with this funding will be freely available, open source tools for the research community.
The submission deadline for these two initiatives is August 12, 2011.
Awards will be announced September 30, 2011.
*All free or open source licenses are acceptable except "any license requiring, as a condition of use, modification and/or distribution of the software subject to the license, that the software or other software combined and/or distributed with it be (i) disclosed or distributed in source code form; (ii) licensed for the purpose of making derivative works; or (iii) redistributable at no charge. Excluded licenses include, but are not limited to, the GPLv3 License."

Sunday, 13 March 2011

Using Galaxy for NGS sample submission and tracking for service providers

Over at Blue Collar Bioinformatics:
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.

Saturday, 12 March 2011

RStudio IDE for R ... looks promising!

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.
from "What You’re Doing Is Rather Desperate" (Notes from the life of a bioinformatics researcher)

RNA-Seq Analysis Tools from the Broad Institute from RNA-Seq Blog

from RNA-Seq Blog by admin

GenePattern offers a suite of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, and differential expression. The modules have been adapted from widely-used tools. GenePattern also provides pipelines that allow you to perform a number of multi-step RNA-seq analyses automatically.

Alignment: Bowtie.aligner
Differential Expression: Cufflinks.cuffdiff
Genome Annotation: Cufflinks, Cufflinks.cuffcompare, Scripture
Isoform Detection: Tophat
RNA Quantitation: Cufflinks, Scripture
Utilities: Bowtie.indexer, BamToSam, ExprToGct, SamToBam, SortSam
Visualizers: IGV

Thursday, 24 February 2011

RNA seq analysis workflow on Galaxy (Bristol workflow)

Dr David Matthews has posted a starter thread in the mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with Tophat on Galaxy.

His post and the discussion thread are here.
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404

I thought I'd write to get a discussion of a workflow for people doing RNA seq that I have found very useful and addresses some issues in mapping mRNA derived RNA-seq paired end data to the genome using tophat. Here is the approach I use (I have a human mRNA sample deep sequenced with a 56bp paired end read on an illumina generating 29 million reads):

Bristol Method

1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for each sequence read
2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is mapped in a proper pair"
3. Use "group" to group the filtered sam file on c1 (which is the "bio-sequencer" read number) and set an operation to count on c1 as well. This provides a list of the reads and how many times they map to the human genome, because you have filtered the set for reads that have a mate pair there will be an even number for each read. For most of the reads the number will be 2 (indicating the forward read maps once and the reverse read maps once and in a proper pair) but for reads that map ambiguously the number will be multiples of 2. If you count these up I find that 18 million reads map once, 1.3 million map twice, 400,000 reads map 3 times and so on until you get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
4. Filter the reads to remove any reads that map more than 2 times.
5. Use "compare two datasets" to compare your new list of reads that map only twice to pull out all the reads in your sam file that only map twice (i.e. the mate pairs).
6. You'll need to sort the sam file before you can use it with other applications like IGV.
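
Steps 3 to 5 boil down to counting how many alignment lines each read name has and keeping only the names seen exactly twice. Here is a minimal Python sketch of that idea outside Galaxy. The function name and the toy records are made up for illustration, and it assumes the SAM lines have already been filtered as in step 2.

```python
from collections import Counter


def keep_unique_pairs(sam_lines):
    """Keep alignment lines whose read name (column 1) occurs exactly twice.

    Header lines (starting with '@') are dropped in this sketch.
    """
    records = [line for line in sam_lines if line and not line.startswith("@")]
    # Step 3: group on the read name and count occurrences
    counts = Counter(line.split("\t", 1)[0] for line in records)
    # Steps 4 and 5: keep only the lines of reads that were counted exactly twice
    return [line for line in records if counts[line.split("\t", 1)[0]] == 2]


# Toy, truncated SAM records: readA maps once as a pair, readB maps to two places
toy = [
    "@HD\tVN:1.0",
    "readA\t99\tchr1\t100",
    "readA\t147\tchr1\t300",
    "readB\t99\tchr1\t500",
    "readB\t147\tchr1\t700",
    "readB\t99\tchr2\t900",
    "readB\t147\tchr2\t1100",
]
unique_pairs = keep_unique_pairs(toy)
print(unique_pairs)
```

Swapping the `== 2` test for `> 2` would give the complementary file of non-unique mappings mentioned further down in the post.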

What you end up with is a sam file where all the reads map to one site only and all the reads map as a proper pair. This may seem similar to setting tophat to ignore non-unique reads. However, it is not. This approach gives you 10-15% more reads. I think it is because if tophat finds (for example) that the forward read maps to one site but the reverse read maps to two sites it throws away the whole read. By filtering the sam file to restrict it to only those mappings that make sense you increase the number of unique reads by getting rid of irrational mappings.

Has anyone else found this? Does this make sense to anyone else? Am I making a huge mistake somewhere?

A nice aspect of this (or at least I think so!) is that by filtering in this manner you can also create a sam file of non-unique mappings which you can monitor. This can be useful if one or more genes has a problem of generating a lot of non-unique maps which may give problems accurately estimating its expression. Also, you also get a list of how many multi hits you have in your data so you know the scale of the problem.

Best Wishes,

David.

__________________________________

Dr David A. Matthews

Senior Lecturer in Virology

Room E49

Department of Cellular and Molecular Medicine,

School of Medical Sciences

University Walk,

University of Bristol

Saturday, 29 January 2011

VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R

Visualization of orthogonal (disjoint) or overlapping datasets is a common task in bioinformatics. Few tools exist to automate the generation of extensively-customizable, high-resolution Venn and Euler diagrams in the R statistical environment. To fill this gap we introduce VennDiagram, an R package that enables the automated generation of highly-customizable, high-resolution Venn diagrams with up to four sets and Euler diagrams with up to three sets.

Pybedtools - python wrapper for bedtools

Looks interesting!
URL above links to tutorial for use to generate a 3 way venn diagram.

Wednesday, 27 October 2010

Tophat adds support for strand-specific RNA-Seq alignment and colorspace

Hooray!
testing Tophat 1.1.2 now
1.1.1

On an 8 GB RAM CentOS box, I managed to align 1 million reads to hg18 in 33 mins and 2 million reads in 59 mins, using 4 threads.
Nice scalability! But it was slower than I was used to with bowtie. I kept killing the run on my full set of 90 million reads, thinking there was something wrong. Guess I need to be more patient and wait for 45 hours.
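
The 45-hour figure is roughly what a straight-line extrapolation of the two timings gives. A quick back-of-the-envelope check in Python, assuming run time keeps scaling linearly with read count (which may not hold for the full data set):

```python
# Timings reported above: (millions of reads, minutes)
runs = [(1, 33), (2, 59)]

# Use the larger run as the per-million rate: 59 / 2 = 29.5 minutes per million reads
minutes_per_million = runs[-1][1] / runs[-1][0]

full_set_millions = 90
estimated_hours = full_set_millions * minutes_per_million / 60
print(f"{estimated_hours:.2f} hours")
```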

I do wonder if the process can be split across separate nodes to speed things up.

Friday, 27 August 2010

METAREP is a new open source tool developed for high-performance comparative metagenomics

Found this blog post at JCVI
Are you carrying out large scale metagenomics analyses to identify differences among multiple sample sites? Are you looking for suitable analysis tools?
If you have not yet found the right analysis tool, you may be interested in the latest beta version of JCVI Metagenomics Reports (METAREP).
METAREP is a new open source tool developed for high-performance comparative metagenomics.
It provides a suite of web based tools to help scientists view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies.

Wednesday, 25 August 2010

howto do BWA mapping in colorspace

Here's what I use for bwa alignment (without removing PCR dups).
You can replace the paths with your own and put it into a bash script for automation.
Comments or corrections welcome!

#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta

#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample

#aln using 4 threads
#-l 25        seed length
#-k 2         mismatches allowed in seed
#-n 10      total mismatches allowed

bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai

#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam

#creates bam file from pre-generated .fai file

samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam

#sorts bam file

samtools sort Sample.bwa.hg18.sam.bam{,.sorted}

#From a sorted BAM alignment, raw SNP and indel calls are acquired by:

samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup

#resultant output should be further filtered by:

/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup

#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

Do note the helpful comments below! Reposted here for clarity.

Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.

Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

Also:

1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."

The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

3) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

4) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

5) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

6) View the alignments with samtools tview.
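
For readers who have not met the flag and MAPQ columns before, the filter suggested in the comments ("samtools view -F 4 -q 10") can be pictured with the following Python sketch. It only illustrates the logic on plain-text SAM lines; passes_filter is a hypothetical helper, not part of samtools, and the toy records are made up.

```python
UNMAPPED = 0x4  # SAM flag bit: read is unmapped


def passes_filter(sam_line, min_mapq=10):
    """Mimic `samtools view -F 4 -q 10`: drop unmapped reads and MAPQ < min_mapq."""
    fields = sam_line.split("\t")
    flag, mapq = int(fields[1]), int(fields[4])  # columns 2 and 5 of a SAM record
    return not (flag & UNMAPPED) and mapq >= min_mapq


toy = [
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",  # mapped, good MAPQ
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",         # unmapped
    "r3\t16\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",  # mapped, low MAPQ
]
kept = [line.split("\t")[0] for line in toy if passes_filter(line)]
print(kept)
```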

Wednesday, 18 August 2010

A Programmer’s Discussion: Procedural vs. OO

A Programmer’s Discussion: Procedural vs. OO
Now that's something I didn't expect to pop up: people making a stand for procedural programming vs OO (in the comments; the author of the post stands neutral though).

After so many years, I think I have only managed to write a few pieces of OO code, although I am totally convinced that OO is the way to go. I have attributed this to my poor programming skills, and often I just need the ad hoc script to work and to cut down on development time. It is rare that I have to reuse my code in a way that copy and paste doesn't solve quickly (vs remembering where I deposited that method and how to access it).

Hmmm perhaps I am not alone in this after all!

Thursday, 29 July 2010

Palmapper and Oqtans

Oqtans looks like an interesting tool for analysing RNA-seq data based on the Galaxy framework. Unfortunately the server is down. But it did point to another mapping tool which I am curious to try out: PALMapper. What piqued my interest is this in the abstract:
"align around 7 million reads per hour on a single AMD CPU core (similar speed as TopHat [3]). Our study for C. elegans furthermore shows that PALMapper predicts introns with very high sensitivity (72%) and specificity (82%) when using the annotation as ground truth. PALMapper is considerably more accurate than TopHat (47% and 81%, respectively)."

Wednesday, 21 July 2010

Google Chrome in CentOS? You will have to wait

Rant warning:
Gah!
Another crippling experience of working with CentOS.

I still can't install Chrome, despite Google having an official Linux port, due to an outdated package (lsb) on CentOS 5.4

Others are having the same issues.

Wednesday, 14 July 2010

Shiny new tool to index NGS reads G-SQZ

This is a long-overdue tool for those trying to do non-typical analysis with their reads.
Finally you can index and compress your NGS reads.

http://www.ncbi.nlm.nih.gov/pubmed/20605925

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
Abstract

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe (wtembe@tgen.org).
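
The abstract only says the scheme is Huffman coding-based. As a rough illustration of the general idea (not G-SQZ's actual format), here is a Python sketch that builds a Huffman code over (base, quality) pairs so that frequent pairs get shorter bit strings. Treating each base-quality pair as one symbol is an assumption made for this sketch, and the toy read is made up.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return {symbol: bitstring} for a Huffman code built from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy read: bases and their quality characters, paired position by position
bases = "AAAAAAACCGT"
quals = "IIIIIIIII##"
pairs = list(zip(bases, quals))
code = huffman_code(pairs)
encoded = "".join(code[p] for p in pairs)
print(len(encoded), "bits encoded vs", 8 * 2 * len(pairs), "bits as plain text")
```

The selective access the abstract mentions would need an index on top of this, which the sketch does not attempt.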

Read the discussion thread on SEQanswers for more tips and benchmarks

I am not affiliated with the author btw