Showing posts with label software. Show all posts

Monday, 10 December 2012

Gabe Rudy "GATK is a Research Tool. Clinics Beware." | Our 2 SNPs…(R)

http://blog.goldenhelix.com/?p=1534

Gabe points out in great detail a bug he found in GATK's variant caller which has be widely regarded as a reliable SNP caller.

I think in general the 'unreliable' nature of next gen seq data has researchers often seeking multiple sources of confirmation for variants before moving to publication.

though I am frankly surprised that GATK turned up an error but as Gabe points out it might be common to find Heisen Bugs in software.

and it's a poignant reminder that DTC genetic testing needs more work to avoid mistakes like these that might be detrimental to personalised medicine

"But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!

I promptly sent an email off to 23andMe's exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe's bioinformatician behind the report face-to-face a bit at this year's ASHG conference, and it sounds like it was most likely a bug in the tool's multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.

Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real "leaked" variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes where not leaked from other samples but completely synthetic.)"

Thursday, 27 September 2012

using gzipped geno files for snptest

seems like using gzipped input files doesn't affect the analysis timings of snptest

will do this with binary geno (bgen) files and update

$ grep 'User' 6template-snptest.sh.log.gz.1 6template-snptest.sh.log

6template-snptest.sh.log.gz.1: User time (seconds): 669.87

6template-snptest.sh.log: User time (seconds): 662.10

Tuesday, 4 September 2012

R one-liners : do you have any to contribute?

NOT sure if installing packages to use one-liners count but heck .. anything to improve productivity eh?

excerpted from Jeffrey's blog http://jeffreybreen.wordpress.com/2011/07/21/one-liners-twitter/

One-liners which make me love R: twitteR’s searchTwitter() #rstats

July 21, 2011 — Jeffrey Breen

R reminds me a lot of English. It’s easy to get started, but very difficult to master. So for all those times I’ve spent… well, forever… trying to figure out the “R way” of doing something, I’m glad to share these quick wins.

My recent R tutorial on mining Twitter for consumer sentiment wouldn’t have been possible without Jeff Gentry’s amazing twitteR package (available on CRAN). It does so much of the behind-the-scenes heavy lifting to access Twitter’s REST APIs, that one line of code is all you need to perform a search and retrieve the (even paginated) results:

library(twitteR)

tweets = searchTwitter("#rstats", n=1500)

You can search for anything, of course, “#rstats” is just an example. (And if you’re really into that hashtag, the twitteR package even provides an Rtweets() function which hardcodes that search string for you.) The n=1500 specifies the maximum number of tweets supported by the Search API, though you may retrieve fewer as Twitter’s search indices contain only a couple of days’ tweets.

PLoS ONE: Adaptive Ridge Regression for Rare Variant Detection

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044173

Abstract Top

It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.

Citation: Zhan H, Xu S (2012) Adaptive Ridge Regression for Rare Variant Detection. PLoS ONE 7(8): e44173. doi:10.1371/journal.pone.0044173

Editor: Momiao Xiong, University of Texas School of Public Health, United States of America

Received: May 28, 2012; Accepted: July 30, 2012; Published: August 28, 2012

We believe that the high power of the BhGLM is due to (1) appropriate combination of the individual rare variants (the new score) and (2) assignment of the hierarchical priors. After a thorough evaluation of these new methods, we found that there is still some room for improvement. The new score of the BhGLM method is a first moment parameter (shared effect). An alternative score may be a second moment parameter (shared variance). The Yi and Zhi's method requires prior distributions, and thus different priors may produce different results. The hyper-parameters involved in the prior distributions may also affect the results. In this study, we proposed to use a shared variance among rare variants as the new score. The method is originally called ridge regression [27]. It is further modified to discriminate against rare variants with small effects. This modified ridge regression is called the adaptive ridge regression [28]. The adaptive ridge regression (ARR) is performed under the maximum likelihood framework, and thus prior distributions of parameters are not required, equivalent to independent uniform priors for all parameters.

....

The method is called sequence kernel association test (SKAT). After reading this paper, we agreed that our approach is similar to SKAT. However, SKAT only gives the score test and no parameter estimation is provided. This explains why SKAT is fast computationally. There are three major advantages of the adaptive ridge regression. First, a high score test does not mean the effects are large. It may be caused by small effects but large sample size. The score test cannot tell the difference. Our method not only provides a test but also an estimate of the group variance. We can provide a total proportion of the phenotypic variance contributed by the rare variants. Secondly, we introduced an adaptive step to the original ridge regression. This step plays the role of "weighting" of the SKAT method but it can "homogenize" the effect of each rare variant within a group. The ridge regression performs better under the "homogenized" rare variant effect assumption. Thirdly, our method works for both rare and common variants. However, the SKAT method was particularly designed for rare variants because the "weights" for the common variants will be almost zero (excluded from the model), according to the authors of that paper. There is a possibility to use the score test under our adaptive ridge regression framework. The estimation procedure will remain the same, but we may simply replace the likelihood ratio test by the score test. The "weights" obtained from the adaptive ridge regression will be used in the score test. This needs to be further investigated.

The new method is developed for continuous traits under the linear mixed model framework [30]. In many situations, the trait of interest may be a binary trait. The generalized linear mixed model (GLMM), which is an extension of the linear mixed model, can be used to analyze the association of multiple rare variants and a binary trait. This extension is very straight forward because the methodology of GLMM has been well established. The simple extension includes the adaptive steps.
Finally, we performed all the analyses using an R program. The R package is called Adaptive Ridge Regression (ARR) which can be downloaded from the authors' personal website: www.statgen.ucr.edu.

Monday, 3 September 2012

PLINK gPLINK Haploview WGAS software tutorial S. Purcell

PLINK gPLINK Haploview Whole genome association software tutorial

Tuesday, 31 July 2012

Picard release 1.74

Picard release 1.74
http://picard.sourceforge.net/
30 July 2012
- Added a new "ProgressLogger" class that facilitates more useful and
standard progress logging for any program that iterates through a stream
of SAMRecords. Adapted most command line programs to use it.
- Add support for targetedPcrMetrics and collected common HsMetrics and
TargetedPcrMetrics behavior into TargetMetricsCollector
- New program CollectTargetedPcrMetrics
- MultiHitAlignedReadIterator.java: Handle case where an alignment
record has no cigar elements that consume both the read and the
reference (e.g. the read is all soft-clipped)

Monday, 30 July 2012

bioawk- AWK for gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.

Alerted to this on biostars.org

http://www.biostars.org/post/show/47751/bioawk-fasta-fastq-sam-bed-gff-aware-awk-programming/

https://github.com/ialbert/bioawk/blob/master/README.bio.rst

About bioawk

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.

Bioawk adds a new -c fmt option that specifies the input format. The behavior of bioawk will vary depending on the value of fmt.

For the formats that awk recognizes specially named variables will be created. For example for the supported sequence formats the$name, $seq and, if applicable $qual variable names may be used to access the name, sequence and quality string of the sequence record in each iteration. Here is an example of iterating over a fastq file to print the sequences:

    awk -c fastq '{ print $seq }' test.fq

For known interval formats the columns can be accessed via the variables called $start, $end, $chrom (etc). For example to print the feature lenght of a file in BED format one could write:

    awk -c bed '{ print $end - $start }' test.bed

One important change (and innovation) over the original awk is that bioawk will treat sequences that may span multiple lines as a single record. The parsing, implemented in C, may be several orders of magnitude faster than similar code programmed in interpreted languages: Perl, Python, Ruby.

When the format mode is header or hdr, bioawk parses named columns. It automatically adds variables whose names are taken from the first line and values from the column index. Special characters are converted to a underscore.

Bioawk also adds a few built-in functions including, as of now, and(), or(), xor(), and others (see comprehensive list below).

Detailed help is maintained in the bioawk manual page, to access it type:

    man ./awk.1

Usage Examples

Extract unmapped reads without header:

    awk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header:
```
    awk -c sam -H '!and($flag,4)'
```

Wednesday, 25 July 2012

howto SKAT R library

The R package is simple to install though you will need a recent version of R (compile as local user if you have no admin rights)

download from

http://cran.r-project.org/web/packages/SKAT/

~/Downloads$ R CMD INSTALL SKAT_0.76.tgz

* installing to library '/Library/Frameworks/R.framework/Versions/2.14/Resources/library'

* installing *binary* package 'SKAT' ...

* DONE (SKAT)

Saturday, 7 July 2012

Installing breakdancer.sourceforge.net

http://breakdancer.sourceforge.net/moreperl.html

Dependencies for breakdancer-1.1_2011_02_21

sudo perl -MCPAN -e 'install Statistics::Descriptive'
sudo perl -MCPAN -e 'install Math::CDF'

sudo perl -MCPAN -e 'install GD::Graph::histogram'

#if you haven't done it already, install samtools and add the binary to your path bam2cfg.pl needs to call it.

Saturday, 19 May 2012

[Denovoassembler-users] Ray v2.0.0-rc7 is available online !

---------- Forwarded message ----------
From: Sébastien Boisvert
Date: Thu, May 17, 2012 at 11:02 PM
Subject: [Denovoassembler-users] Ray v2.0.0-rc7 is available online !

Hello !

I am proud to announce the immediate availability of the Ray assembler
version 2.0.0 release candidate 7, code name "Dark Astrocyte of Knowledge".

This version ships with RayPlatform v1.0.2, code name "Timely Gate of
Yields".

Link for download: http://denovoassembler.sourceforge.net/

Changes in Ray

* The CMakeList file was updated.
* GC content for contigs are dumped in XML files.
* New option -one-color-per-file for graph coloring.
* Optimized file system input/output operations.
* Network testing is more verbose.
* Fixed an integer overflow bug in the scaffolder.
* New guide in Documentation/ for software message routing.
* Fixed an integer overflow bug in the profiler.
* Fixed a synchronization bug in the coloring algorithm.
* Increased the sensitivity of the biological profiling algorithms.
* Disabled the plugin for neighbourhoods.
* New plugin to compute gene ontology profiles.
* Added various missing code headers.
* Simplified the plugin creation process.
* Fixed some divisions per 0.
* Fixed a synchronization bug for gene ontology.
* Added simple profile files for sequence abundance, taxonomy profiles
and gene ontology profiles.
* A bug that caused k-mers with >= 65536 coverage to have less coverage
was fixed.
---> This was a long-standing bug that caused some issues.
* Added some datatypes.

Changes in RayPlatform

* Command line arguments can be obtained.
* Simplified the plugin creation process.
* Fixed two divisions per 0.
* Added some datatypes.

seb

_______________________________________________
Denovoassembler-users mailing list

https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Friday, 18 November 2011

BED file format - FAQ

UCSC very good description of the BED format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Bedtools attempts to auto-detect the file formats used in each command. Thus, as long as your file conform to the format definitions for each file type, you should be okay. For example:

- BAM is zero-based, half-open. SAM is 1-based, closed.
- BED is zero-based, half-open.

zero length intervals (start == end), which in BED format, are interpreted as insertions in the reference.

I can't confirm this but from the top of my head, I recall that

start -1, end in BED format refers to SNPs

(source mostly from BEDTools mailling list)

VCFtools BEDtools compare, intersect, merge

A Good computational biologist is only as good as the tools he uses (or maybe how good he is at google) rofl kidding ...
Life is always easier when you find the correct tool.
I also adopt the path of least resistance when trying to solve problems that are more common than I imagine.
There's always the good old linux tools for comparing SNPs called from different programs / options

grep | sed | awk | cut | diff | comm
see http://ged.msu.edu/angus/tutorials-2011/snp_tutorial.html

and if you are working with NGS data, you most probably already have samtools installed on your system and you might have used bcftools
Did you also know that there's also a (unrelated) set of tools called vcftools?
http://vcftools.sourceforge.net/

The VCFtools package is broadly split into two sections:

The vcftools binary program, generally used to analyse VCF files.
The Vcf.pm perl module, which is a general Perl API containing a core of the utilities vcf-convert, vcf-merge, vcf-compare, vcf-isec, and others.
Documentation
Examples of usage by topic

Then there's also the highly used BEDTools http://code.google.com/p/bedtools/

which I highly recommend to keep as part of your tools collection. Check out the link below

Do watch out for this 'oversight' in vcftools as pointed out in seqanswers.
Overlap number discrepancy between VCFTools and BEDTools

Usage	Examples of common usage. Featured

Thursday, 18 August 2011

Torrent Suite Software Mobile Browser

Neat!!

Working with Reference Genomes in Torrent Suite Software 1.4

Saturday, 30 April 2011

Evaluation of next-generation sequencing software in mapping and assembly.

Evaluation of next-generation sequencing software in mapping and assembly.

J Hum Genet. 2011 Apr 28;

Authors: Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advices on selecting suitable tools for specific biological applications.Journal of Human Genetics advance online publication, 28 April 2011; doi:10.1038/jhg.2011.43.

PMID: 21525877 [PubMed - as supplied by publisher]

More...

Thursday, 14 April 2011

Want to use samtools for polyploid organisms?

Well the short answer is you can't.
Here's what Heng Li has to say
Samtools is designed for diploid genomes. I would recommend to treat a haploid genome as diploid and filtering heterozygotes afterwards. For higher ploidy (>2), samtools does not work well. For pooled resequencing, specialized SNP callers (such as syzygy and a few others) are better.

Other recommendations are

Freebayes (http://bioinformatics.bc.edu/marthlab/FreeBayes).
CRISP http://polymorphism.scripps.edu/~vbansal/software/CRISP/

Sunday, 13 March 2011

RNA-seq of a 12GB dataset in less than 7 hours

Folks at CLCbio have updated the CLC Genomics Machine benchmarks with newer datasets and the new hardware configuration. There are two RNA-Seq data sets and a full genome mapping data set, and more will come. All the benchmarks are now using the CLC Genomics Server rather than the Assembly Cell.

Saturday, 12 March 2011

RStudio IDE for R ... looks promising!

RStudio, released yesterday, is a new open-source IDE for R. It’s getting a lot of attention at R-bloggers and it’s easy to see why: this is open-source software development done right.
from

What You’re Doing Is Rather Desperate

Notes from the life of a bioinformatics researcher

Quality control and preprocessing of metagenomic datasets

from Bioinformatics - current issue by Schmieder, R., Edwards, R.

Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
Availability and Implementation: This open-source application was implemented in Perl and can be used as a stand alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at http://prinseq.sourceforge.net/

RNA-Seq Analysis Tools from the Broad Institute from RNA-Seq Blog

RNA-Seq Analysis Tools from the Broad Institute

from RNA-Seq Blog by admin

GenePattern offers a suite of tools to support a wide variety of RNA-seq analyses, including short-read mapping, identification of splice junctions, transcript and isoform detection, quantitation, and differential expression. The modules have been adapted from widely-used tools. GenePattern also provides pipelines that allow you to perform a number of multi-step RNA-seq analyses automatically.

Alignment: Bowtie.aligner
Differential Expression: Cufflinks.cuffdiff
Genome Annotation: Cufflinks, Cufflinks.cuffcompare, Scripture
Isoform Detection: Tophat
RNA Quantitation: Cufflinks, Scripture
Utilities: Bowtie.indexer, BamToSam, ExprToGct, SamToBam, SortSam
Visualizers: IGV

RNA-Seq Analysis Tools from the Broad Institute is a post from: RNA-Seq Blog More information about RNA-Seq can be here.