
Tuesday, 21 August 2012

Getting Started with R and Hadoop

(This article was first published on Revolutions, and kindly contributed to R-bloggers.)


For newcomers to map-reduce programming with R and Hadoop, Jeffrey's presentation includes a step-by-step example of computing flight times from air traffic data. The last few slides cover some advanced features: how to work directly with files in HDFS from R with the rhdfs package, and how to simulate a Hadoop cluster on the local machine (useful for development, testing, and learning RHadoop). Jeffrey also mentions that the RHadoop tutorial is a good resource for new users.
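The local-simulation trick is easy to try. Below is a minimal sketch (mine, not Jeffrey's, with made-up column names) in the same spirit, using RHadoop's rmr2 package with its local backend, which runs the map-reduce job in-process with no Hadoop installation at all:

    library(rmr2)
    rmr.options(backend = "local")   # simulate Hadoop in-process; no cluster needed

    # Toy stand-in for the air traffic data: carrier codes and flight times
    flights <- data.frame(carrier = c("AA", "AA", "UA", "UA"),
                          minutes = c(190, 210, 250, 245))

    result <- mapreduce(
      input  = to.dfs(flights),
      map    = function(k, v) keyval(v$carrier, v$minutes),
      reduce = function(k, v) keyval(k, mean(v)))

    from.dfs(result)   # $key: carriers, $val: mean flight time per carrier

And once a real cluster is available, rhdfs lets you poke at HDFS directly from the R prompt (the HADOOP_CMD path below is an assumption; point it at wherever your hadoop binary actually lives):

    Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # assumed path to the hadoop binary
    library(rhdfs)
    hdfs.init()
    hdfs.ls("/user")                             # list files under /user in HDFS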
You can find Jeffrey's slides embedded below, and a video of the presentation is also available. You might also want to check out Jeffrey's older presentation Big Data Step-by-Step for tips on setting up a compute environment with Hadoop and R.



Tuesday, 16 November 2010

Pending release of Contrail, Hadoop de novo assembler?

Jermdemo on Twitter

Just noticed the source code for Contrail, the first Hadoop based de-novo assembler, has been uploaded http://bit.ly/96pSbw


Oh the suspense!

Wednesday, 15 September 2010

Myrna: calculate differential gene expression on Elastic MapReduce or local Hadoop

The software, termed “Myrna,” was funded in part by Amazon Web Services (in addition to the Bloomberg School of Public Health and the National Institutes of Health) and, not surprisingly, makes use of compute resources from Amazon. To test Myrna, the researchers rented compute time and storage from AWS and realized solid performance at low cost. According to the study's authors, “Myrna calculated differential expression from 1.1 billion RNA sequencing reads in less than two hours at a cost of about $66.”

Note:
Myrna is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short read alignment and R/Bioconductor for interval calculations, normalization, and statistical testing. These tools are combined in an automatic, parallel pipeline that runs in the cloud (Elastic MapReduce in this case), on a local Hadoop cluster, or on a single computer, exploiting multiple computers and CPUs wherever possible.
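To make the statistical testing step concrete, here is a toy sketch (not Myrna's actual code, and the data are simulated) of the kind of per-gene test such a pipeline runs in R: a Poisson regression of read counts on group, with library size as an offset for normalization:

    set.seed(1)
    # Simulated read counts: 100 genes x 6 samples, two groups of three
    counts  <- matrix(rpois(600, lambda = 50), nrow = 100,
                      dimnames = list(paste0("gene", 1:100), NULL))
    group   <- factor(c("A", "A", "A", "B", "B", "B"))
    libsize <- colSums(counts)          # per-sample sequencing depth

    # One Poisson GLM per gene; the offset normalizes for library size
    pvals <- apply(counts, 1, function(y) {
      fit <- glm(y ~ group, family = poisson, offset = log(libsize))
      summary(fit)$coefficients["groupB", "Pr(>|z|)"]
    })
    head(sort(pvals))   # smallest p-values = candidate differential genes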

Also see:

Cloud computing method greatly increases gene analysis
