Friday, 31 August 2012

[tech] Attachments.me now lets you email a 100MB file without ever leaving Gmail with new cloud integration

This should be nifty!

Anybody up for running a genomics analysis on a Dropbox-hosted exome that gets sent to the web server via email?

Attachments.me now lets you email a 100MB file without ever leaving Gmail with new cloud integration
http://thenextweb.com/apps/2012/07/31/attachments-me-now-lets-you-email-a-100mb-file-without-ever-leaving-gmail-with-new-cloud-integration/

Attachments.me is one of my favorite services, as it really does a great job of wrangling all of those attachments in my various Gmail accounts.

Today, the service announced Dropbox, Box and Google Drive integration, which allows you to email someone a huge file without having to leave Gmail.

[pub] An Improved Protocol for Sequencing of Repetitive Genomic Regions and Structural Variations Using Mutagenesis and Next Generation Sequencing

An Improved Protocol for Sequencing of Repetitive Genomic Regions and Structural Variations Using Mutagenesis and Next Generation Sequencing
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0043359

The rise of Next Generation Sequencing (NGS) technologies has transformed de novo genome sequencing into an accessible research tool, but obtaining high quality eukaryotic genome assemblies remains a challenge, mostly due to the abundance of repetitive elements. These also make it difficult to study nucleotide polymorphism in repetitive regions, including certain types of structural variations. One solution proposed for resolving such regions is Sequence Assembly aided by Mutagenesis (SAM), which relies on the fact that introducing enough random mutations breaks the repetitive structure, making assembly possible. Sequencing many different mutated copies permits the sequence of the repetitive region to be inferred by consensus methods. However, this approach relies on molecular cloning in order to isolate and amplify individual mutant copies, making it hard to scale-up the approach for use in conjunction with high-throughput sequencing technologies. To address this problem, we propose NG-SAM, a modified version of the SAM protocol that relies on PCR and dilution steps only, coupled to a NGS workflow. NG-SAM therefore has the potential to be scaled-up, e.g. using emerging microfluidics technologies. We built a realistic simulation pipeline to study the feasibility of NG-SAM, and our results suggest that under appropriate experimental conditions the approach might be successfully put into practice. Moreover, our simulations suggest that NG-SAM is capable of reconstructing robustly a wide range of potential target sequences of varying lengths and repetitive structures.
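The consensus step described in the abstract can be illustrated with a toy sketch. This is not the NG-SAM pipeline: it assumes the mutated copies are already aligned to the same length and carry only independent substitutions, so a per-column majority vote recovers the original base. The function name `consensus` and the example sequences are made up for illustration.

```python
from collections import Counter


def consensus(copies):
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    if not copies:
        raise ValueError("need at least one sequence")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("sequences must be aligned to the same length")
    # most_common(1) picks the most frequent base in each column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


# Hypothetical tandem repeat and three mutated copies (one substitution each)
original = "ACGTACGTACGT"
mutants = ["ACCTACGTACGT", "ACGTACGAACGT", "ACGTACGTATGT"]
recovered = consensus(mutants)
```

Because each mutation hits a different column, every column still has a two-to-one majority for the original base.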

Thursday, 30 August 2012

Scrollable Genome Browsing for Ensembl on beta.ensembl.org

Scrollable Genome Browsing

http://www.ensembl.info/blog/2012/08/30/scrollable-genome-browsing/

Posted on August 30, 2012 by Simon Brent

I am pleased to announce the release of a scrollable region view for Ensembl on beta.ensembl.org, available on species location pages (for example, here on Human chromosome 13) by clicking on the Scrollable region link in the navigation menu.

This view is powered by Genoverse, an HTML5 genome browser co-developed by the Ensembl and DECIPHER projects. As such it is only supported by modern browsers – recent versions of Chrome, Firefox and Safari, and IE9.

Wednesday, 29 August 2012

region based calling in SAMtools Mpileup

OK, my new favourite feature of samtools mpileup is that it uses the BAM index, so you can specify exactly which regions to run mpileup on (now I can easily max out that 250-core system in the shared HPC resources!)

Calling SNPs/INDELs in small regions (see http://samtools.sourceforge.net/mpileup.shtml )

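    # splitchr reads the faidx index of the reference (ref.fa.fai) and emits 500 kb regions.
    # echo prints one mpileup | bcftools command per region; pipe the output to sh or a scheduler to run them.
    vcfutils.pl splitchr -l 500000 ref.fa.fai | xargs -i \
        echo samtools mpileup -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list \| bcftools \
        view -bcvg - \> part-{}.bcf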

http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html

Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.
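For anyone who would rather generate the per-region commands from a script, here is a rough Python sketch of the same idea. It assumes the first two tab-separated columns of a `.fai` index are the contig name and length, and that regions are written as 1-based inclusive `chrom:start-end` strings; the exact boundaries that `vcfutils.pl splitchr` emits may differ. The function names and the tiny example index are made up.

```python
def split_regions(fai_text, chunk=500000):
    """Yield 'chrom:start-end' strings (1-based, inclusive) covering each contig."""
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        length = int(length)
        for start in range(1, length + 1, chunk):
            end = min(start + chunk - 1, length)
            yield f"{name}:{start}-{end}"


def mpileup_command(region, ref="ref.fa", bam_list="bam.list"):
    """Build the shell command string for one region, using the flags quoted in the post."""
    return (
        f"samtools mpileup -C50 -m3 -F0.0002 -DSuf {ref} -r {region} -b {bam_list}"
        f" | bcftools view -bcvg - > part-{region}.bcf"
    )


# Made-up index with two small contigs, just to show the chunking
example_fai = "chr1\t1200000\t6\t60\t61\nchr2\t400000\t1220012\t60\t61\n"
regions = list(split_regions(example_fai))
commands = [mpileup_command(r) for r in regions]
```

Each command string is independent of the others, which is what makes it easy to spread them across many cores.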

Picard release 1.76

---------- Forwarded message ----------
From: Alec Wysoker

Picard release 1.76
28 August 2012

- IntervalListTools.java: Added scatter (to support scatter-gather
parallelism) to IntervalList.

- ProgressLogger.java: Synchronized record() method to make class
thread-safe.

- Significant code refactoring in IlluminaBasecallsToSam; overhauled
threading model to improve CPU utilization; added test for Tile comparator.

- Set SortingCollection.destructiveIteration true by default (as was the
original intent), in order to enable earlier GC of no-longer-needed
SAMRecords.

-Alec

With Rise of Gene Sequencing, Ethical Puzzles - NYTimes.com

http://www.nytimes.com/2012/08/26/health/research/with-rise-of-gene-sequencing-ethical-puzzles.html?_r=1

In laboratories around the world, genetic researchers using tools that are ever more sophisticated to peer into the DNA of cells are increasingly finding things they were not looking for, including information that could make a big difference to an anonymous donor.

The question of how, when and whether to return genetic results to study subjects or their families "is one of the thorniest current challenges in clinical research," said Dr. Francis Collins, the director of the National Institutes of Health. "We are living in an awkward interval where our ability to capture the information often exceeds our ability to know what to do with it."

The federal government is hurrying to develop policy options. It has made the issue a priority, holding meetings and workshops and spending millions of dollars on research on how to deal with questions unique to this new genomics era.

The quandaries arise from the conditions that medical research studies typically set out. Volunteers usually sign forms saying that they agree only to provide tissue samples, and that they will not be contacted. Only now have some studies started asking the participants whether they want to be contacted, but that leads to more questions: What sort of information should they get? What if the person dies before the study is completed?

The complications are procedural as well as ethical. Often, the research labs that make the surprise discoveries are not certified to provide clinical information to patients. The consent forms the patients signed were approved by ethics boards, which would have to approve any changes to the agreements — if the patients could even be found.

..... One of the first cases came a decade ago, just as the new age of genetics was beginning. A young woman with a strong family history of breast and ovarian cancer enrolled in a study trying to find cancer genes that, when mutated, greatly increase the risk of breast cancer. But the woman, terrified by her family history, also intended to have her breasts removed prophylactically.

Her consent form said she would not be contacted by the researchers. Consent forms are typically written this way because the purpose of such studies is not to provide medical care but to gain new insights. The researchers are not the patients' doctors.

But in this case, the researchers happened to know about the woman's plan, and they also knew that their study indicated that she did not have her family's breast cancer gene. They were horrified.

"We couldn't sit back and let this woman have her healthy breasts cut off," said Barbara B. Biesecker, the director of the genetic counseling program at the National Human Genome Research Institute, part of the National Institutes of Health.

Monday, 27 August 2012

Qualimap: evaluating next generation sequencing alignment data.

http://www.ncbi.nlm.nih.gov/pubmed/22914218

Bioinformatics. 2012 Aug 22. [Epub ahead of print]

Qualimap: evaluating next generation sequencing alignment data.

García-Alcalde F, Okonechnikov K, Carbonell J, Ruiz LM, Götz S, Tarazona S, Meyer TF, Conesa A.

Source

Bioinformatics and Genomics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain.

Abstract

MOTIVATION:

The sequence alignment/map (SAM) and the binary alignment/map (BAM) formats have become the standard method of representation of nucleotide sequence alignments for next-generation sequencing data. SAM/BAM files usually contain information from tens to hundreds of millions of reads. Often, the sequencing technology, protocol, and/or the selected mapping algorithm introduce some unwanted biases in these data. The systematic detection of such biases is a non-trivial task that is crucial to drive appropriate downstream analyses.

RESULTS:

We have developed Qualimap, a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data. Such quality-control data are vital for highlighting problems in the sequencing and/or mapping processes, which must be addressed prior to further analyses.

AVAILABILITY:

Qualimap is freely available from http://www.qualimap.org

http://qualimap.bioinfo.cipf.es/samples/ERR089819_result/qualimapReport.html

Coverage across reference

Sunday, 26 August 2012

Now in Galaxy Tool Shed: NCBI BLAST+ http://bit.ly/gxyshed #usegalaxy

Galaxy Project (@galaxyproject)
8/26/12 5:02 AM
Now in Galaxy Tool Shed: NCBI BLAST+ http://bit.ly/gxyshed #usegalaxy

Saturday, 25 August 2012

Multiple regression methods show great potential for rare variant association tests.

PLoS One. 2012;7(8):e41694. Epub 2012 Aug 8.

Multiple regression methods show great potential for rare variant association tests.

Xu C, Ladouceur M, Dastani Z, Richards JB, Ciampi A, Greenwood CM.

Source

Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada.

Abstract

The investigation of associations between rare genetic variants and diseases or phenotypes has two goals. Firstly, the identification of which genes or genomic regions are associated, and secondly, discrimination of associated variants from background noise within each region. Over the last few years, many new methods have been developed which associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data: ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.

PMID: 22916111 [PubMed - in process]

Schizophrenia, rare and common disease variants Dr. Norio Ozaki | Nagoya University Global COE Program Integrated Functional Molecular Medicine for Neuronal and Neoplastic Disorders

Genome-wide studies of copy number variation (CNV) have given rise to a new understanding of schizophrenia etiology, bringing rare variants to the forefront: the rare variant-common disease (RDCV) model. Earlier, we conducted low-resolution CNV screening using the Affymetrix 5.0 array in order to catalog CNVs that may increase schizophrenia susceptibility in the Japanese population. In our current study, we are using a high-resolution comparative genomic hybridization array for CNV detection. Besides the known large CNVs previously reported to be associated with schizophrenia, we found hundreds of small to medium-sized novel, exon-disrupting sequence variations in more than 10% of patients with schizophrenia. These findings suggest that a number of genomic variants that may be relevant to the pathoetiology of schizophrenia were below the detection threshold of last-generation CNV typing technologies.

In conclusion, our results showed that in case of schizophrenia, the 'rare high risk variant' vs the 'common variant with low effect' hypotheses should not be viewed as exclusive hypotheses, but more as a continuum. That is why direct resequencing of candidate genes, as well as CNV on the one side, and GWAS on the other side, could be viewed as complementary approaches to dissect the genetic susceptibilities.

http://w3serv.nagoya-u.ac.jp/coemed/en/meetings/meetings/international/dr-norio-ozaki/

Thursday, 23 August 2012

two 'versions' of GATK now? & GATK forum woes

Quite confused by the problems I have logging in to the GATK forums with either my Google account or my Twitter account, using Chrome/Firefox on a Mac ...

I keep getting a prompt asking why I wish to join the GATK forum (well, if you let me join, I think my first post will be a rant about how the new website looks pretty but doesn't work; see below).

This is a beta version of the new GATK forum. Please be patient as we roll in new content and functionality in the next few days.

Tell us why you want to join!

I have registered for a proper account with my email ... shall see if that helps.

http://www.broadinstitute.org/gatk/download

Download GATK 2.0 (beta)

GATK 2.0 includes all of the original GATK 1.x tools as well as many newer and more advanced tools for error modeling, data compression, and variant calling. The version of Queue provided below is built for GATK 2.0.

Please be aware that the GATK 2.0 beta tool chain may be unstable, slow, not scalable, poorly documented, or not interact seamlessly among each other or with other tools in the suite, so could require more effort from users. With these caveats, these tools provide radically improved calling sensitivity, specificity, and performance, so are worth the exposure as beta software.

Download GATK-lite

GATK-lite is a subset of the full GATK 2.0 release that is free-to-use for all entities, including commercial ones. It includes all of the capabilities (if not the exact tools) from GATK 1.6 but none of the exclusive 2.0 tools. The version of Queue provided below is built for GATK-lite.

For the tech-savvy, GATK-lite is the binary distribution corresponding to the public GATK source released in the Github repository. Everything in GATK-lite is licensed under the MIT license.

Wednesday, 22 August 2012

[tool] Efficient Mixed-Model Association eXpedited (EMMAX) to Simultaneously Account for Relatedness and Stratification in Genome-Wide Association Studies

Efficient Mixed-Model Association eXpedited (EMMAX) to Simultaneously Account for Relatedness and Stratification in Genome-Wide Association Studies
http://gettinggeneticsdone.blogspot.sg/2010/06/efficient-mixed-model-association.html

The original EMMA algorithm, however, is computationally infeasible for datasets with thousands of individuals because the variance components parameters are estimated for each marker, which can take about 10 minutes per marker on the authors' large GWAS dataset, which would take over 6 years to complete on a single processor. A new implementation of the algorithm called EMMAX (Efficient Mixed-Model Association eXpedited) makes the simplifying assumption that because the effect of any given SNP on the trait is typically small, then the variance parameters only need to be estimated once for the entire dataset, rather than once for each marker.
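The "10 minutes per marker" and "over 6 years" figures imply a marker count that the excerpt does not state. A quick back-of-the-envelope check, assuming a 365.25-day year and strictly serial processing (the function names are mine, not from EMMA or EMMAX):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def years_for(markers, minutes_per_marker=10):
    """Serial runtime in years when every marker needs its own variance-component fit."""
    return markers * minutes_per_marker / MINUTES_PER_YEAR


def markers_for(years, minutes_per_marker=10):
    """Number of markers that would fill the given number of years."""
    return years * MINUTES_PER_YEAR / minutes_per_marker


implied = markers_for(6)  # roughly 315,576 markers
```

So "over 6 years" corresponds to a little more than about 315,000 markers at 10 minutes each, which is why estimating the variance parameters once for the whole dataset matters so much.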

In the paper the authors take the Northern Finland Birth Cohort and estimate genomic control inflation factors (gamma) for uncorrected test statistics, test statistics adjusted for the top 100 principal components using Eigenstrat, and corrected for structure using the EMMAX algorithm and found that the inflation factors were closest to 1 for the EMMAX-corrected tests. Further, whereas genomic control simply adjusts all test statistics downward without changing the rank of the test statistics, the EMMAX method does result in changes of the ranks of test statistics for each SNP.
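A genomic control inflation factor is commonly computed as the median of the observed 1-df chi-squared statistics divided by the median of the chi-squared distribution with 1 degree of freedom (about 0.4549). Here is a minimal sketch with made-up statistics; the function names are illustrative and this is not the EMMAX or Eigenstrat code.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-squared distribution with 1 degree of freedom


def inflation_factor(chi2_stats):
    """Median observed statistic relative to the median expected under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor; the ordering of SNPs is unchanged."""
    lam = inflation_factor(chi2_stats)
    return [s / lam for s in chi2_stats]


stats = [0.1, 0.3, 0.6, 1.2, 5.0]  # made-up test statistics
lam = inflation_factor(stats)
adjusted = genomic_control(stats)
```

Dividing by a single constant is why genomic control cannot change the ranks, in contrast to the per-SNP changes reported for EMMAX.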

A beta version of EMMAX is available online, with a complete version to be released soon. Conveniently, the software is able to take a PLINK transposed ped file and covariate files as input (tped and tfam documentation here).

Nature Genetics Technical Report - Variance component model to account for sample structure in genome-wide association studies

:/ unfortunately the website still says a complete version is pending

http://genetics.cs.ucla.edu/emmax/install.html

The method is also available in EPACTS.

http://genome.sph.umich.edu/wiki/EPACTS#Single_Variant_EMMAX_Association_Analysis

But I think I ran into trouble getting it to work; I shall explore it again soon.

How To Make the Mac OS X Finder Suck Less - How-To Geek

http://www.howtogeek.com/howto/33414/how-to-make-the-mac-os-x-finder-suck-less/

This irked me for the longest time, but I never bothered to change it until now because I ended up using the command line to save the trouble:

find . -iname '*.list'

Search the Current Folder

As far back as Tiger, when you searched for something with the search bar in a Finder window, it would default to searching your entire computer. With Snow Leopard, Apple finally added the option to change the default to searching in the folder you're currently in.

To change it, go to Finder -> Preferences and then to the "Advanced" tab. Then, on the drop-down menu for "When performing a search:" select "Search the Current Folder":

[pub] Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health.

Found this gem from

Empirical evidence is that 80-90% of the claims made by epidemiologists are false ...

J Clin Epidemiol. 2006 Sep;59(9):964-9. Epub 2006 Jul 11.

Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health.

Austin PC, Mamdani MM, Juurlink DN, Hux JE.

Source

Institute for Clinical Evaluative Sciences, G1 06, 2075 Bayview Avenue, Toronto, Ontario, M4N 3M5 Canada. peter.austin@ices.on.ca

Comment in

Multiple attacks from multiple perspectives. Redelmeier DA. J Clin Epidemiol. 2006 Sep;59(9):871-2. Epub 2006 Jun 19.

Abstract

OBJECTIVES:

To illustrate how multiple hypotheses testing can produce associations with no clinical plausibility.

STUDY DESIGN AND SETTING:

We conducted a study of all 10,674,945 residents of Ontario aged between 18 and 100 years in 2000. Residents were randomly assigned to equally sized derivation and validation cohorts and classified according to their astrological sign. Using the derivation cohort, we searched through 223 of the most common diagnoses for hospitalization until we identified two for which subjects born under one astrological sign had a significantly higher probability of hospitalization compared to subjects born under the remaining signs combined (P<0.05).

RESULTS:

We tested these 24 associations in the independent validation cohort. Residents born under Leo had a higher probability of gastrointestinal hemorrhage (P=0.0447), while Sagittarians had a higher probability of humerus fracture (P=0.0123) compared to all other signs combined. After adjusting the significance level to account for multiple comparisons, none of the identified associations remained significant in either the derivation or validation cohort.

CONCLUSIONS:

Our analyses illustrate how the testing of multiple, non-prespecified hypotheses increases the likelihood of detecting implausible associations. Our findings have important implications for the analysis and interpretation of clinical studies.

PMID: 16895820 [PubMed - indexed for MEDLINE]
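To get a feel for the numbers in that abstract, here is a small sketch. The abstract does not say which adjustment the authors used, so the Bonferroni correction here is my assumption, as is treating the 24 tests as independent when computing the chance of at least one false positive.

```python
def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test significance level that keeps the family-wise error at or below alpha."""
    return alpha / m


m = 24  # associations tested in the validation cohort, per the abstract
p_values = {
    "Leo / gastrointestinal hemorrhage": 0.0447,
    "Sagittarius / humerus fracture": 0.0123,
}

threshold = bonferroni_threshold(m)  # about 0.0021
survivors = [name for name, p in p_values.items() if p < threshold]
fwer = familywise_error(m)  # about 0.71
```

Neither reported P value comes close to the adjusted threshold, which matches the abstract's statement that nothing stayed significant after adjustment.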

Empirical evidence is that 80-90% of the claims made by epidemiologists are false ...

Definitely an interesting read!

http://www.niss.org/sites/default/files/Young_Safety_June_2008.pdf

The basic thesis is quite simple. Epidemiologists have as their statistical analysis/scientific method paradigm not to correct for any multiple testing. Also, as part of their scientific paradigm they ask multiple, often hundreds to thousands, of questions of the same data set. Their position is that it is better to miss nothing real than to control the number of false claims they make. The Statisticians paradigm is to control the probability of making a false claim. We have a clash of paradigms. Empirical evidence is that 80-90% of the claims made by epidemiologists are false; these claims do not replicate when retested under rigorous conditions.

This is a contest for every analyst who has struggled to explain the value of data to his or her boss: Harvard Business Review - 1 week Viz contest

Your Analysis/Visualization featured in the Harvard Business Review

We just launched the Harvard Business Review Visualization Prospect with a very short deadline, so we wanted to get the word out to make sure everyone interested has a chance to compete. The contest ends in 1 WEEK (deadline: 8/27/2012 4:00 AM UTC )

The Harvard Business Review is asking you to turn your data-vision on the archival history of the HBR. The goal of this prospect is to generate analysis and visualizations from the metadata and abstracts of every article they have published over the last 90 years. Winning entries will be featured (with credit to the Kaggler) in the Vision Statement feature of the upcoming 90th anniversary issue.

What makes a great entry? Check out the past 'Vision Statement' features scattered throughout the contest page, and available for download. The HBR wants you to find the story behind the data. Don't just build a latent topic model... show how the important topics have trended over the last 90 years. Once you quantify the impact of an article, can you pick out the most seminal case-studies of the 20th century?

Entries must contain not just an idea, but actual analysis and visualization. You don't have to be a professional graphic designer, but you should keep in mind how your work will make its point to a professional, but possibly non-technical, audience. This is a contest for every analyst who has struggled to explain the value of data to his or her boss. Well, now is your chance to show what you can do to your boss's boss's boss.

[BioRuby] BioRuby 1.4.3 released

---------- Forwarded message ----------
From: Naohisa GOTO

Hi, all,

We are pleased to announce the release of BioRuby 1.4.3.
This new release fixes bugs that existed in 1.4.2 and improves
portability on JRuby and Rubinius.

Tar.gz file: http://bioruby.org/archive/bioruby-1.4.3.tar.gz
Gem file: http://bioruby.org/archive/gems/bio-1.4.3.gem

We also put the RubyGems package at RubyGems.org and RubyForge (*).
(* Files on RubyForge will soon be available.)

You can easily install by using RubyGems. First, check the
version number by using search command:
% gem search --remote bio
and find "bio (1.4.3)" in the list. Then,
% sudo gem install bio

Here is a brief summary of changes.

* Bio::KEGG::KGML bug fixes and new class Bio::KEGG::KGML::Graphics
for storing a graphics element.
* Many failures and errors running on JRuby and Rubinius are resolved.
* Strange behavior related with "circular require" is fixed.
* Fixed: Genomenet remote BLAST does not work.
* Fixed: Bio::NucleicAcid.to_re("s") typo.
* Fixed: Bio::EMBL#os raises RuntimeError.
* Fixed: bin/bioruby: Failed to save object with error message
"can't convert Symbol into String" on Ruby 1.9.

In addition, many changes have been made, including incompatible
changes. For more information, see RELEASE_NOTES.rdoc and
ChangeLog.

Acknowledgments: Thanks to all persons reporting issues and/or
submitting patches.

Hope you enjoy.

P.S. We are having trouble when updating http://bioruby.org/
and http://bioruby.open-bio.org/ webpages now, possibly because
of file/directory permissions in open-bio.org server.

--
Naohisa Goto

Tuesday, 21 August 2012

[pub]: Estimates of Genetic Differentiation Measured by F(ST) Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers.

PLoS One. 2012;7(8):e42649. Epub 2012 Aug 14.

Estimates of Genetic Differentiation Measured by F(ST) Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers.

Willing EM, Dreyer C, van Oosterhout C.

Source

Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany.

Abstract

Population genetic studies provide insights into the evolutionary processes that influence the distribution of sequence variants within and among wild populations. F(ST) is among the most widely used measures for genetic differentiation and plays a central role in ecological and evolutionary genetic studies. It is commonly thought that large sample sizes are required in order to precisely infer F(ST) and that small sample sizes lead to overestimation of genetic differentiation. Until recently, studies in ecological model organisms incorporated a limited number of genetic markers, but since the emergence of next generation sequencing, the panel size of genetic markers available even in non-reference organisms has rapidly increased. In this study we examine whether a large number of genetic markers can substitute for small sample sizes when estimating F(ST). We tested the behavior of three different estimators that infer F(ST) and that are commonly used in population genetic studies. By simulating populations, we assessed the effects of sample size and the number of markers on the various estimates of genetic differentiation. Furthermore, we tested the effect of ascertainment bias on these estimates. We show that the population sample size can be significantly reduced (as small as n = 4-6) when using an appropriate estimator and a large number of bi-allelic genetic markers (k>1,000). Therefore, conservation genetic studies can now obtain almost the same statistical power as studies performed on model organisms using markers developed with next-generation sequencing.

PMID: 22905157 [PubMed - in process]
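As a reminder of what is being estimated, here is a minimal heterozygosity-based F(ST) sketch for two equally weighted populations and bi-allelic markers, using F(ST) = (H_T - H_S) / H_T with both terms summed over markers. The abstract does not name the three estimators the authors tested, so this is not claimed to be one of them; in particular it works on population allele frequencies and has no small-sample correction, which is exactly what an "appropriate estimator" would add. The function name is made up.

```python
def fst(freqs_pop1, freqs_pop2):
    """Heterozygosity-based F_ST from per-marker allele frequencies in two populations."""
    if len(freqs_pop1) != len(freqs_pop2):
        raise ValueError("both populations need a frequency for every marker")
    h_s = 0.0  # mean within-population expected heterozygosity, summed over markers
    h_t = 0.0  # expected heterozygosity of the pooled population, summed over markers
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2
        h_s += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        h_t += 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("no polymorphic markers")
    return (h_t - h_s) / h_t


fixed_difference = fst([0.0], [1.0])            # populations fixed for different alleles
no_difference = fst([0.3, 0.6], [0.3, 0.6])     # identical allele frequencies
```

Summing over markers before taking the ratio is one common way to combine many SNPs into a single estimate.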

[pub]: Compression of next-generation sequencing reads aided by highly efficient de novo assembly

Nucleic Acids Res. 2012 Aug 16. [Epub ahead of print]

Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Jones DC, Ruzzo WL, Peng X, Katze MG.

Source

Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, Department of Genome Sciences, University of Washington, Seattle, WA 98195-5065, Fred Hutchinson Cancer Research Center, Seattle, WA 98109 and Department of Microbiology, University of Washington, Seattle, WA 98195-7242, USA.

Abstract

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.

Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.

PMID: 22904078 [PubMed - as supplied by publisher]
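The "reads stored as positions within the assembled contigs" idea can be shown with a toy round trip. This is only an illustration of the concept and not Quip's format: it assumes exact substring matches, keeps unmatched reads verbatim so nothing is lost, and skips the statistical compression entirely. The function names and sequences are made up.

```python
def encode_reads(reads, contigs):
    """Replace each read by (contig index, offset, length) when it occurs exactly in a contig."""
    encoded = []
    for read in reads:
        for idx, contig in enumerate(contigs):
            pos = contig.find(read)
            if pos != -1:
                encoded.append(("pos", idx, pos, len(read)))
                break
        else:
            # No exact match in any contig: keep the read verbatim so nothing is lost
            encoded.append(("raw", read))
    return encoded


def decode_reads(encoded, contigs):
    """Rebuild the original reads from the encoded records."""
    reads = []
    for item in encoded:
        if item[0] == "pos":
            _, idx, pos, length = item
            reads.append(contigs[idx][pos:pos + length])
        else:
            reads.append(item[1])
    return reads


contigs = ["ACGTTGCAAGGCT"]            # made-up assembled contig
reads = ["GTTGC", "AAGG", "TTTT"]      # the last read does not occur in the contig
encoded = encode_reads(reads, contigs)
restored = decode_reads(encoded, contigs)
```

Three small integers per read are much cheaper to store than the bases themselves once reads are long and coverage is high.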

[howto] ssh to Ubuntu virtualbox machine from Mac Host

Had issues trying to set up shared drives in Guest Additions, so I tried adding a second Ethernet adapter to my VirtualBox Ubuntu guest so that I can SSH into it. (Not sure if others will be able to SSH into the Ubuntu guest without port forwarding on my Mac host.)

Oh, and remember to install the SSH server on the guest:

sudo apt-get install openssh-server

Book: Applied Statistical Genetics with R, with data & code!

Check this out! http://people.umass.edu/foulkes/asg/examples.html

This book is intended to provide fundamental statistical concepts and R tools relevant to the analysis of genetic data arising from population-based association studies. The statistical methods described are broadly relevant to the field of statistical genetics and include a large array of tools for a wide variety of medical and public health applications. Data analytic methods include approaches to handling multiplicity, ambiguity in haplotypic phase and underlying gene-gene and gene-environment interactions. Several publicly available data sets are used for illustration.

Chapter 1
# 1.1: Identifying the minor allele and its frequency
Chapter 2
# 2.1: Chi-squared test for association
# 2.2: Fisher's exact test for association
# 2.3: Cochran-Armitage (C-A) trend test for association
# 2.4: Two-sample tests for association for a quantitative trait
# 2.5: M-sample tests of association for a quantitative trait
# 2.6: Linear Regression
Chapter 3
# 3.1: Measuring LD using D-prime
# 3.2: Measuring LD for a group of SNPs
# 3.3: Measuring LD based on r^2 and the chi^2-statistic
# 3.4: Determining average LD across multiple SNPs
# 3.5: Population substructure and LD
# 3.6: Testing for HWE using Pearson's chi^2-test
# 3.7: Testing for HWE using Fisher's exact test
# 3.8: HWE and geographic origin
# 3.9: Generating a similarity matrix
# 3.10: Multidimensional scaling (MDS) for identifying population substructure
# 3.11: Principal components analysis (PCA) for identifying population substructure
Chapter 4
# 4.1: Bonferroni adjustment
# 4.2: Tukey's single-step method
# 4.3: Benjamini and Hochberg (B-H) adjustment
# 4.4: Benjamini and Yekutieli (B-Y) adjustment
# 4.5: Calculation of the q-value
# 4.6: Free step down resampling adjustment
# 4.7: Null unrestricted bootstrap approach
Chapter 5
# 5.1: EM approach to haplotype frequency estimation
# 5.2: Calculating posterior haplotype probabilities
# 5.3: Testing hypotheses about haplotype frequencies within the EM framework
# 5.4: Application of haplotype trend regression (HTR)
# 5.5: Multiple imputation for haplotype effect estimation and testing
# 5.6: EM for estimation and testing of haplotype-trait association
Chapter 6
# 6.2: Creating a classification tree
# 6.3: Generating a regression tree
# 6.4: Categorical and ordinal predictors in a tree
# 6.5: Cost-complexity pruning
Chapter 7
# 7.1: An application of random forests
# 7.2: RF with missing SNP data - single imputation
# 7.3: RF with missing SNP data - multiple imputation
# 7.4: MIRF
# 7.5: Application of logic regression
# 7.6: Monte Carlo logic regression
# 7.7: An application of MARS

Datanami, Woe be me