Showing posts with label software. Show all posts
Showing posts with label software. Show all posts

Monday, 10 December 2012

Gabe Rudy "GATK is a Research Tool. Clinics Beware." | Our 2 SNPs…(R)

http://blog.goldenhelix.com/?p=1534

Gabe points out in great detail a bug he found in GATK's variant caller which has be widely regarded as a reliable SNP caller. 

I think in general the 'unreliable' nature of next gen seq data has researchers often seeking multiple sources of confirmation for variants before moving to publication. 

though I am frankly surprised that GATK turned up an error but as Gabe points out it might be common to find Heisen Bugs in software

and it's a poignant reminder that DTC genetic testing needs more work to avoid mistakes like these that might be detrimental to personalised medicine 


"But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!
I promptly sent an email off to 23andMe's exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe's bioinformatician behind the report face-to-face a bit at this year's ASHG conference, and it sounds like it was most likely a bug in the tool's multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.
Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real "leaked" variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes where not leaked from other samples but completely synthetic.)" 

Thursday, 27 September 2012

using gzipped geno files for snptest

seems like using gzipped input files doesn't affect the analysis timings of snptest
will do this with binary geno (bgen) files and update

$ grep 'User' 6template-snptest.sh.log.gz.1 6template-snptest.sh.log
6template-snptest.sh.log.gz.1: User time (seconds): 669.87
6template-snptest.sh.log: User time (seconds): 662.10

Tuesday, 4 September 2012

R one-liners : do you have any to contribute?

NOT sure if installing packages to use one-liners count but heck .. anything to improve productivity eh?

excerpted from Jeffrey's blog http://jeffreybreen.wordpress.com/2011/07/21/one-liners-twitter/



One-liners which make me love R: twitteR’s searchTwitter() #rstats


R reminds me a lot of English. It’s easy to get started, but very difficult to master. So for all those times I’ve spent… well, forever… trying to figure out the “R way” of doing something, I’m glad to share these quick wins.
My recent R tutorial on mining Twitter for consumer sentiment wouldn’t have been possible without Jeff Gentry’s amazing twitteR package (available on CRAN). It does so much of the behind-the-scenes heavy lifting to access Twitter’s REST APIs, that one line of code is all you need to perform a search and retrieve the (even paginated) results:
library(twitteR)
tweets = searchTwitter("#rstats", n=1500)
You can search for anything, of course, “#rstats” is just an example. (And if you’re really into that hashtag, the twitteR package even provides an Rtweets() function which hardcodes that search string for you.) The n=1500 specifies the maximum number of tweets supported by the Search API, though you may retrieve fewer as Twitter’s search indices contain only a couple of days’ tweets.


PLoS ONE: Adaptive Ridge Regression for Rare Variant Detection

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044173

Abstract Top

It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.

Monday, 3 September 2012

PLINK gPLINK Haploview WGAS software tutorial S. Purcell

PLINK gPLINK Haploview Whole genome association software tutorial

Tuesday, 31 July 2012

Picard release 1.74


Picard release 1.74
http://picard.sourceforge.net/
30 July 2012
- Added a new "ProgressLogger" class that facilitates more useful and
standard progress logging for any program that iterates through a stream
of SAMRecords.  Adapted most command line programs to use it.
- Add support for targetedPcrMetrics and collected common HsMetrics and
TargetedPcrMetrics behavior into TargetMetricsCollector
- New program CollectTargetedPcrMetrics
- MultiHitAlignedReadIterator.java: Handle case where an alignment
record has no cigar elements that consume both the read and the
reference (e.g. the read is all soft-clipped)

Monday, 30 July 2012

bioawk- AWK for gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.


Alerted to this on biostars.org

https://github.com/ialbert/bioawk/blob/master/README.bio.rst

About bioawk

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.
Bioawk adds a new -c fmt option that specifies the input format. The behavior of bioawk will vary depending on the value of fmt.
For the formats that awk recognizes specially named variables will be created. For example for the supported sequence formats the$name$seq and, if applicable $qual variable names may be used to access the name, sequence and quality string of the sequence record in each iteration. Here is an example of iterating over a fastq file to print the sequences:
    awk -c fastq '{ print $seq }' test.fq  
For known interval formats the columns can be accessed via the variables called $start$end$chrom (etc). For example to print the feature lenght of a file in BED format one could write:
    awk -c bed '{ print $end - $start }' test.bed  
One important change (and innovation) over the original awk is that bioawk will treat sequences that may span multiple lines as a single record. The parsing, implemented in C, may be several orders of magnitude faster than similar code programmed in interpreted languages: Perl, Python, Ruby.
When the format mode is header or hdr, bioawk parses named columns. It automatically adds variables whose names are taken from the first line and values from the column index. Special characters are converted to a underscore.
Bioawk also adds a few built-in functions including, as of now, and(), or(), xor(), and others (see comprehensive list below).
Detailed help is maintained in the bioawk manual page, to access it type:
    man ./awk.1  

Usage Examples

  1. Extract unmapped reads without header:
        awk -c sam 'and($flag,4)' aln.sam.gz  
  2. Extract mapped reads with header:
        awk -c sam -H '!and($flag,4)'

Wednesday, 25 July 2012

howto SKAT R library

The R package is simple to install though you will need a recent version of R (compile as local user if you have no admin rights)

download from 

~/Downloads$ R CMD INSTALL SKAT_0.76.tgz 
* installing to library '/Library/Frameworks/R.framework/Versions/2.14/Resources/library'
* installing *binary* package 'SKAT' ...

* DONE (SKAT) 

Saturday, 7 July 2012

Installing breakdancer.sourceforge.net

http://breakdancer.sourceforge.net/moreperl.html

Dependencies for breakdancer-1.1_2011_02_21


sudo perl -MCPAN -e 'install Statistics::Descriptive'
sudo perl -MCPAN -e 'install Math::CDF'  
sudo perl -MCPAN -e 'install GD::Graph::histogram'

#if you haven't done it already, install samtools and add the binary to your path bam2cfg.pl needs to call it.

Saturday, 19 May 2012

[Denovoassembler-users] Ray v2.0.0-rc7 is available online !



---------- Forwarded message ----------
From: Sébastien Boisvert
Date: Thu, May 17, 2012 at 11:02 PM
Subject: [Denovoassembler-users] Ray v2.0.0-rc7 is available online !



Hello !

I am proud to announce the immediate availability of the Ray assembler
version 2.0.0 release candidate 7, code name "Dark Astrocyte of Knowledge".

This version ships with RayPlatform v1.0.2, code name "Timely Gate of
Yields".


Link for download: http://denovoassembler.sourceforge.net/


Changes in Ray

* The CMakeList file was updated.
* GC content for contigs are dumped in XML files.
* New option -one-color-per-file for graph coloring.
* Optimized file system input/output operations.
* Network testing is more verbose.
* Fixed an integer overflow bug in the scaffolder.
* New guide in Documentation/ for software message routing.
* Fixed an integer overflow bug in the profiler.
* Fixed a synchronization bug in the coloring algorithm.
* Increased the sensitivity of the biological profiling algorithms.
* Disabled the plugin for neighbourhoods.
* New plugin to compute gene ontology profiles.
* Added various missing code headers.
* Simplified the plugin creation process.
* Fixed some divisions per 0.
* Fixed a synchronization bug for gene ontology.
* Added simple profile files for sequence abundance, taxonomy profiles
and gene ontology profiles.
* A bug that caused k-mers with >= 65536 coverage to have less coverage
was fixed.
        ---> This was a long-standing bug that caused some issues.
* Added some datatypes.


Changes in RayPlatform

* Command line arguments can be obtained.
* Simplified the plugin creation process.
* Fixed two divisions per 0.
* Added some datatypes.



                        seb


_______________________________________________
Denovoassembler-users mailing list

https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Friday, 18 November 2011

BED file format - FAQ


UCSC very good description of the BED format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1



Bedtools attempts to auto-detect the file formats used in each command.  Thus, as long as your file conform to the format definitions for each file type, you should be okay.  For example:

- BAM is zero-based, half-open.  SAM is 1-based, closed.
- BED is zero-based, half-open.


zero length intervals (start == end), which in BED format, are interpreted as insertions in the reference.

I can't confirm this but from the top of my head, I recall that

start -1, end in BED format refers to SNPs


(source mostly from BEDTools mailling list)

VCFtools BEDtools compare, intersect, merge

A Good computational biologist is only as good as the tools he uses (or maybe how good he is at google) rofl kidding ...
Life is always easier when you find the correct tool.
I also adopt the path of least resistance when trying to solve problems that are more common than I imagine.
There's always the good old linux tools for comparing SNPs called from different programs / options

   grep | sed | awk | cut | diff | comm
   see http://ged.msu.edu/angus/tutorials-2011/snp_tutorial.html


and if you are working with NGS data, you most probably already have samtools installed on your system and you might have used bcftools
Did you also know that there's also a (unrelated) set of tools called vcftools?
http://vcftools.sourceforge.net/


The VCFtools package is broadly split into two sections: 


Then there's  also the highly used BEDTools http://code.google.com/p/bedtools/

which I highly recommend to keep as part of your tools collection. Check out the link below

Do watch out for this 'oversight' in vcftools as pointed out in seqanswers.
Overlap number discrepancy between VCFTools and BEDTools

UsageExamples of common usage.   Featured

Saturday, 30 April 2011

Evaluation of next-generation sequencing software in mapping and assembly.

Evaluation of next-generation sequencing software in mapping and assembly.

J Hum Genet. 2011 Apr 28;

Authors: Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advices on selecting suitable tools for specific biological applications.Journal of Human Genetics advance online publication, 28 April 2011; doi:10.1038/jhg.2011.43.

PMID: 21525877 [PubMed - as supplied by publisher]



More...

Thursday, 14 April 2011

Want to use samtools for polyploid organisms?

Well the short answer is you can't.
Here's what Heng Li has to say
Samtools is designed for diploid genomes. I would recommend to treat a haploid genome as diploid and filtering heterozygotes afterwards. For higher ploidy (>2), samtools does not work well. For pooled resequencing, specialized SNP callers (such as syzygy and a few others) are better.

Other recommendations are

Sunday, 13 March 2011

RNA-seq of a 12GB dataset in less than 7 hours

Folks at CLCbio have updated the CLC Genomics Machine benchmarks with newer datasets and the new hardware configuration. There are two RNA-Seq data sets and a full genome mapping data set, and more will come. All the benchmarks are now using the CLC Genomics Server rather than the Assembly Cell.

Saturday, 12 March 2011

RStudio IDE for R ... looks promising!

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.
from 

What You’re Doing Is Rather Desperate

Notes from the life of a bioinformatics researcher



Quality control and preprocessing of metagenomic datasets


Quality control and preprocessing of metagenomic datasets

Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
Availability and Implementation: This open-source application was implemented in Perl and can be used as a stand alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at http://prinseq.sourceforge.net/

RNA-Seq Analysis Tools from the Broad Institute from RNA-Seq Blog


RNA-Seq Analysis Tools from the Broad Institute


GenePattern offers a suite of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, and differential expression. The modules have been adapted from widely-used tools. GenePattern also provides pipelines that allow you to perform a number of multi-step RNA-seq analyses automatically.
RNA-Seq Analysis Tools from the Broad Institute is a post from: RNA-Seq Blog More information about RNA-Seq can be here.

Datanami, Woe be me