Wednesday 28 July 2010

Sort 1 terabyte of data in 60 seconds?

UC San Diego computer scientists did just that, according to HPCwire.

"SAN DIEGO, July 27 -- Computer scientists from the University of California, San Diego broke "the terabyte barrier" -- and set a new world record -- when they sorted more than one terabyte of data (1,000 gigabytes or 1 million megabytes) in just 60 seconds on a computing cluster at Calit2. During this 2010 "Sort Benchmark" competition -- the "World Cup of data sorting" -- the computer scientists from the UC San Diego Jacobs School of Engineering also tied a world record for fastest data sorting rate. They sorted one trillion data records in 172 minutes – and did so using just a quarter of the computing resources of the other record holder....."

Gosh, and here I thought I was having trouble sorting mere gigabytes of data ...


"Sorting is also an interesting proxy for a whole bunch of other data processing problems. Generally, sorting is a great way to measure how fast you can read a lot of data off a set of disks, do some basic processing on it, shuffle it around a network and write it to another set of disks," explained Rasmussen. "Sorting puts a lot of stress on the entire input/output subsystem, from the hard drives and the networking hardware to the operating system and application software."


I would love to find out how you can efficiently sort datasets that are bigger than RAM, and to see how such techniques could be applied to bioinformatics problems.
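Out of curiosity I sketched roughly how the standard out-of-core approach (external merge sort) works: sort chunks that do fit in memory, spill each sorted chunk to disk as a "run", then stream-merge the runs. This is just my own sketch; the file names and chunk size are placeholders, not anything from the UCSD system.

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, lines_per_chunk=1_000_000):
    """Sort a newline-terminated text file that may be larger than RAM.

    Phase 1: read chunks that fit in memory, sort each, write it to a
    temporary 'run' file. Phase 2: k-way merge all runs with a heap.
    """
    runs = []
    with open(input_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    # heapq.merge streams lazily over the already-sorted runs, so memory
    # use grows with the number of runs, not with the total data size.
    with open(output_path, "w") as outfile:
        outfile.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()

# Hypothetical usage -- the file names are just placeholders:
# external_sort("reads.txt", "reads.sorted.txt")
```

Real record-setting systems presumably layer a lot of careful I/O scheduling and parallelism on top of this, but the sort-the-runs-then-merge shape is the basic idea.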


Do check out http://sortbenchmark.org/
They have various metrics for sorting, most of which I had never thought of before!
I love the category of PennySort in particular!

Metric: Amount of data that can be sorted for a penny's worth of system time.
Originally defined in the AlphaSort paper.
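Just to get a feel for the PennySort arithmetic (as I understand the benchmark, the system price is amortized over three years, following the AlphaSort paper), here is a toy calculation; the $1,500 price tag is entirely made up.

```python
# Hypothetical PennySort budget calculation. The $1,500 system price is an
# assumption for illustration; the 3-year amortization follows the
# AlphaSort-style definition of the benchmark.
system_price_dollars = 1500
seconds_in_three_years = 3 * 365 * 24 * 3600
price_per_second = system_price_dollars / seconds_in_three_years
penny_budget_seconds = 0.01 / price_per_second
print(f"One penny buys about {penny_budget_seconds:.0f} s of system time")
# ~631 s on this hypothetical machine; the score is how much data the
# sort can get through in that window.
```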

There's also JouleSort

Metric: Amount of energy required to sort either 10^8, 10^9 or 10^10 records (10 GB, 100 GB or 1 TB).
Originally defined in the JouleSort paper.
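The back-of-the-envelope version of the JouleSort arithmetic is just average power times runtime, divided by records sorted; the wattage and runtime below are invented purely to show the units.

```python
# Hypothetical JouleSort-style energy estimate. The 250 W draw and the
# 600 s runtime are assumptions for illustration, not measured numbers.
average_power_watts = 250
runtime_seconds = 600
energy_joules = average_power_watts * runtime_seconds  # 150,000 J
records_sorted = 10**8                                 # the 10 GB class
print(f"{energy_joules / records_sorted * 1000:.1f} mJ per record")
```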
