Showing posts with label big data.

Saturday, 12 September 2015

Mining Massive Datasets by Stanford University on Coursera.


Let the fun begin ... 


In the next seven weeks, we will present to you many of the important tools for extracting information from very large datasets. Each week there will be a number of videos to watch, and one or more homework assignments to do. The materials are backed up by a free online textbook, published by Cambridge University Press, also called "Mining of Massive Datasets." You can download the book at http://www.mmds.org

The first week is devoted to two topics:

  1. MapReduce: A programming system for easily implementing parallel algorithms on commodity clusters.  This material is in the first four videos available for the week.
  2. Link Analysis: The remaining seven videos discuss the PageRank algorithm that made Google more effective than previous search engines.
There is also a single homework covering both topics.  This homework is classified as "Basic."  See below for an explanation of basic vs. advanced work, and the significance.
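If you want a feel for the link-analysis part before starting the videos, the idea behind PageRank is small enough to sketch in a few lines of plain Python. The graph, damping factor and iteration count below are illustrative assumptions of mine, not anything taken from the course material:

```python
# Toy PageRank by power iteration over a made-up four-page web graph.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, beta=0.85, iterations=50):
    """Return the PageRank of each node with teleport (taxation) factor beta."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}                    # start uniform
    for _ in range(iterations):
        new_rank = {node: (1.0 - beta) / n for node in nodes}   # teleport share
        for node, out_links in graph.items():
            if not out_links:                                   # dead end: spread evenly
                for other in nodes:
                    new_rank[other] += beta * rank[node] / n
            else:
                share = beta * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(graph))  # "C", with the most in-links, ends up with the highest rank
```

The course material, of course, is about running this at web scale, where the rank vector and link matrix no longer fit on one machine and the iteration itself becomes a MapReduce job, which is where the first four videos come in.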

Wednesday, 13 November 2013

Consumer Grade HDDs are OK for Data Backup

Ah, storage, who doesn't need more of it? Cheaply, I might add.
The folks at Backblaze published their own field report on HDD failure rates, which is interesting reading for anyone running a data center.
Earlier I had read about Google's study on how temperature doesn't affect HDD failure rate and promptly removed the noisy HDD cooling fans in my Linux box.
Their latest blog post at http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last/ has me thinking that some of my colleagues elsewhere who are running Backblaze-like setups should switch to consumer-grade HDDs to save on cost.
I do have an 80 GB Seagate HDD that has survived the years. Admittedly, I am not sure what to do with it anymore, as it is too small (80 GB) to be useful and too big (3.5") to be portable. It was used as a main HDD until its size rendered it obsolete, so now it sits in a USB HDD dock that I use occasionally.
Maybe you can find out the age by looking up the serial number, but I use the SMART data that you can see in the Disk Utility in Ubuntu.
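If you prefer the command line to the Disk Utility GUI, the same power-on figure can be pulled from the SMART attributes with smartctl (from the smartmontools package). Here is a minimal sketch, assuming Python 3 and a drive at /dev/sda; the raw-value format varies between drive models, so treat the parsing as a rough guide:

```python
# Read the Power_On_Hours SMART attribute via smartctl (smartmontools).
# /dev/sda is an assumed device path -- change it for your own machine.
import subprocess

def power_on_hours(device="/dev/sda"):
    output = subprocess.run(
        ["sudo", "smartctl", "-A", device],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in output.splitlines():
        if "Power_On_Hours" in line:
            # On many drives the raw value is the last column of the attribute row;
            # some models report it in a different format.
            return int(line.split()[-1])
    return None

hours = power_on_hours()
if hours is not None:
    print(f"{hours} hours powered on (~{hours / 24:.0f} days)")
```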

My ancient 3.5" HDD




However, the age you can see in the screenshot is an estimate of the number of days the drive has been powered on.
Powered on only for 314 days!

Hmm, pretty low mileage for an 80 GB HDD, eh?
Check out this 320 GB IDE HDD:



Even lower mileage! I'm not too sure of the history of this drive, so I can't really comment here.

Completely anecdotal, but I have had 3 Seagate 1 TB HDDs die within a year in a software 5x HDD RAID array under CentOS. When I checked the powered-on days, SMART said they had been running for 3 years. So I am
1) confused about how SMART data records HDD age,
2) in agreement with Backblaze that HDDs have specific failure phases (where usage patterns perhaps play less of a role), and
3) guessing that most of the data on Backblaze is archival in nature, i.e. write once and forget until disaster strikes. So it would be great if Backblaze could 'normalize' the lifespan of the HDDs against per-drive data access patterns, to make the numbers more relevant for a crowd whose usage pattern differs from pure data archival.

That said, I think it's an excellent piece of reading if you are concerned about using consumer-grade HDDs. Kudos to the Backblaze team, who managed to 'shuck' 5.5 petabytes of raw HDDs to weather the Thailand crisis (I wonder how that affected their economics of using consumer-grade HDDs).

As usual, YMMV applies here. Feel free to use consumer-grade HDDs for your archival needs, but be sure to build redundancy and resilience into your system like the folks at Backblaze.

Tuesday, 21 August 2012

Getting Started with R and Hadoop

Getting Started with R and Hadoop: (This article was first published on Revolutions, and kindly contr... 


For newcomers to map-reduce programming with R and Hadoop, Jeffrey's presentation includes a step-by-step example of computing flight times from air traffic data. The last few slides cover some advanced features: how to work directly with files in HDFS from R with the rhdfs package, and how to simulate a Hadoop cluster on the local machine (useful for development, testing and learning RHadoop). Jeffrey also mentions that the RHadoop tutorial is a good resource for new users.
You can find Jeffrey's slides embedded below, and a video of the presentation is also available. You might also want to check out Jeffrey's older presentation Big Data Step-by-Step for tips on setting up a compute environment with Hadoop and R.
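Jeffrey's example uses the RHadoop packages from R, but the shape of the map-reduce job is easy to see in any language. Below is a rough, hypothetical sketch of the same pattern as Hadoop Streaming scripts in Python: the mapper emits (route, air time) pairs from CSV flight records and the reducer averages them per route. The column positions are assumptions for illustration, not the actual schema used in the slides:

```python
#!/usr/bin/env python
# mapper.py -- emit "origin-dest<TAB>airtime" for each flight record.
# The column positions (13, 16, 17) are hypothetical; adjust for the real schema.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    try:
        airtime = float(fields[13])
        origin, dest = fields[16], fields[17]
    except (IndexError, ValueError):
        continue  # skip the header row and malformed records
    print(f"{origin}-{dest}\t{airtime}")
```

```python
#!/usr/bin/env python
# reducer.py -- average the air times per route (Hadoop Streaming delivers
# the mapper output sorted by key, so equal keys arrive contiguously).
import sys

current_key, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if current_key is not None and key != current_key:
        print(f"{current_key}\t{total / count:.1f}")
        total, count = 0.0, 0
    current_key = key
    total += float(value)
    count += 1
if current_key is not None:
    print(f"{current_key}\t{total / count:.1f}")
```

The two scripts can be tested locally with an ordinary shell pipeline (cat flights.csv | ./mapper.py | sort | ./reducer.py) before handing them to the hadoop-streaming jar, which is roughly the workflow that RHadoop's local-simulation mode gives you on the R side.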



Sunday, 6 May 2012

Slides: 2012: Trends from the Trenches


2012: Trends from the Trenches

by Chris Dagdigian on Apr 26, 2012
Talk slides as delivered at the 2012 Bio-IT World Conference in Boston, MA 


Thursday, 3 May 2012

Seven Bridges Genomics - a commercial curated DB for genetic information?

Spotted on Google ads .. 

You HAVE to love the titles for some of the staff/founders

Igor Bogicevic

Founder/CTO
Ultimate Gandalf. The Architect.

... they are in beta now ... I think a lot of people are racing to jump on the same bandwagon ... notably, you do not see a clinician / psychologist / counsellor amidst them ... perhaps they are aiming for a different angle ...

See you at the end of the race! 


Meet Our Team

The mission of Seven Bridges Genomics is to enable people to make sense of the world's biological information, in order to improve lives and to share in the joy of discovery.
https://igor.sbgenomics.com/about/sbg/

Sunday, 29 April 2012

Cloud storage: a pricing and feature guide for consumers

http://arstechnica.com/gadgets/news/2012/04/cloud-storage-a-pricing-and-feature-guide-for-consumers.ars
Storage is always a top concern for NGS data analysis. For quick and dirty file sharing you can't beat these commercial storage providers; almost everyone has a Dropbox account that you can share files with now. Ars Technica has a nice article that summarises the supported platforms and the cost if you wish to upgrade your account.

Wednesday, 29 February 2012

Translational Genomics Research Institute, personalised genomics to improve chemotherapy, cloud computing for pediatric cancer


I think it's fantastic that this is happening right now. Given that the cost of sequencing and computing is still relatively high, I can see how the first wave of personalized medicine will be led by non-profit organizations. I am personally curious how this might pan out and whether it will ultimately be cost-effective for the patients. Would they be able to quantify it?
Kudos to Dell for being a part of this exercise, though I wonder if they could have donated more to the data center, or alternatively set up a mega cloud center and donated compute resources instead, since I think the infrastructure and knowledge gleaned will be useful for their marketing and sales.




http://www.hpcinthecloud.com/hpccloud/2012-02-29/cloud_computing_helps_fight_pediatric_cancer.html

Cloud technology is being used to speed computation, as well as manage and store the resulting data. Cloud also enables the high degree of collaboration that is necessary for science research at this level. The scientists have video-conferences where they work off of "tumor boards" to make clinical decisions for the patients in real-time. Before they'd have to ship hard drives to each other to have that degree of collaboration and now the data is always accessible through the cloud platform.


"We expect to change the way that the clinical medicine is delivered to pediatric cancer patients, and none of this could be done without the cloud," Coffin says emphatically. "With 12 cancer centers collaborating, you have to have the cloud to exchange the data."


Dell relied on donations to build the initial 8.2 teraflop high-performance machine. A second round of donations has meant a doubling in resources for this important work, up to an estimated 13 teraflops of sustained performance.


"Expanding on the size of the footprint means we can treat more and more patients in the clinic trial so this is an exciting time for us. This is the first pediatric clinic trial using genomic data ever done. And Dell is at the leading edge driving this work from an HPC standpoint and from a science standpoint."


The donated platform is comprised of Dell PowerEdge Blade Servers, PowerVault Storage Arrays, Dell Compellent Storage Center arrays and Dell Force10 Network infrastructure. It features 148 CPUs, 1,192 cores, 7.1 TB of RAM, and 265 TB Disk (Data Storage). Dell Precision Workstations are available for data analysis and review. TGen's computation and collaboration capacity has increased by 1,200 percent compared to the site's previous clinical cluster. In addition, the new system has reduced tumor mapping and analysis time from a matter of months to days.

Wednesday, 22 February 2012

Amazon S3 for temporary storage of large datasets?

Just did a rough calculation on the AWS calculator, and the numbers are quite scary!

For a hypothetical 50 TB dataset (I haven't found out the maximum single S3 object size yet; I seem to recall it's 1 GB),
it costs $4,160.27 to store it for a month!

To transfer it out costs $4,807.11!

For 3 years, the cost of storage is about $149,000, which I guess could pay for an enterprise storage solution where transfer costs are zero.
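For anyone who wants to redo the arithmetic without the AWS calculator, a tiered-pricing estimate is only a few lines of Python. The tier boundaries and per-GB rates below are illustrative placeholders (S3 pricing is tiered and changes over time), not AWS's actual price list, so substitute the current published rates:

```python
# Rough tiered-cost estimator for S3-style storage pricing.
# Tier sizes (in GB) and $/GB-month rates are ILLUSTRATIVE ASSUMPTIONS,
# not AWS's published prices -- plug in the real ones before trusting the output.
TIERS = [
    (1 * 1024, 0.125),      # first 1 TB
    (49 * 1024, 0.110),     # next 49 TB
    (float("inf"), 0.095),  # everything beyond 50 TB
]

def monthly_storage_cost(total_gb):
    cost, remaining = 0.0, total_gb
    for tier_gb, rate in TIERS:
        used = min(remaining, tier_gb)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost

dataset_gb = 50 * 1024                      # the hypothetical 50 TB dataset
monthly = monthly_storage_cost(dataset_gb)
print(f"per month: ${monthly:,.2f}")
print(f"3 years  : ${monthly * 36:,.2f}")   # ignores price drops over time
```

With whatever rates the AWS calculator was using at the time, the monthly figure came out to the $4,160.27 above, and the 3-year number is simply that times 36, ignoring any price drops along the way.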

At this point in time, I guess one can't really use AWS S3 for sequence archival. I wonder if data deduplication can help reduce cloud storage costs ... I am sure in terms of bytes, BAM files should be quite similar .. no?


Monday, 6 February 2012

Public Poll: What data storage solutions are you using now?


Just a public poll, created out of curiosity as to what systems are out there and what is being used day to day. Results are immediately viewable. The order of choices is randomized.


Which data storage platform are you using?

http://www.bluearc.com/ BlueArc is your Life Sciences Solution
http://www.isilon.com/ Isilon scale-out NAS, BIG data is no longer a challenge, it’s an opportunity.
http://www.panasas.com/ PANASAS SOLVES BIG DATA PROBLEMS
http://www.ddn.com/ DataDirect Networks’ (DDN) Big Data Storage Technology Powers More
http://aws.amazon.com/ Store data and build dependable backup solutions using AWS’s highly reliable, inexpensive data storage services.
Inhouse solutions using commodity hardware
http://www-03.ibm.com/systems/software/gpfs/ IBM GPFS
Other: (Please specify)

Datanami, Woe be me