Wednesday, 28 July 2010

Sort 1 terabyte of data in 60 seconds?

UC San Diego computer scientists did just that, according to HPCwire.

"SAN DIEGO, July 27 -- Computer scientists from the University of California, San Diego broke "the terabyte barrier" -- and set a new world record -- when they sorted more than one terabyte of data (1,000 gigabytes or 1 million megabytes) in just 60 seconds on a computing cluster at Calit2. During this 2010 "Sort Benchmark" competition -- the "World Cup of data sorting" -- the computer scientists from the UC San Diego Jacobs School of Engineering also tied a world record for fastest data sorting rate. They sorted one trillion data records in 172 minutes – and did so using just a quarter of the computing resources of the other record holder....."

Gosh, and here I was feeling I had trouble sorting mere gigabytes of data ...
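Just for fun, a quick back-of-the-envelope check of what those numbers mean as throughput. The figures below are my own rough arithmetic, not from the article, and I'm assuming the benchmark's standard 100-byte records for the trillion-record run:

```python
# Rough throughput implied by the two results quoted above.
# Assumes 100-byte records for the trillion-record run -- my
# assumption, not stated in the article.

tb_sorted = 1.0                      # terabytes sorted in the 60-second run
seconds = 60.0
print(f"Minute-scale run: {tb_sorted * 1000 / seconds:.1f} GB/s")    # ~16.7 GB/s

records = 1e12                       # one trillion records
record_bytes = 100                   # assumed benchmark record size
minutes = 172.0
gb_total = records * record_bytes / 1e9
print(f"Trillion-record run: {gb_total / (minutes * 60):.1f} GB/s")  # ~9.7 GB/s
```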

"Sorting is also an interesting proxy for a whole bunch of other data processing problems. Generally, sorting is a great way to measure how fast you can read a lot of data off a set of disks, do some basic processing on it, shuffle it around a network and write it to another set of disks," explained Rasmussen. "Sorting puts a lot of stress on the entire input/output subsystem, from the hard drives and the networking hardware to the operating system and application software."

I would love to find out how you can efficiently sort data sets bigger than RAM, and to see how such techniques can be applied to bioinformatics problems.
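The classic answer is external (merge) sorting: sort chunks that fit in memory, spill each sorted chunk to disk as a "run", then do a streaming k-way merge over the run files. Here is a minimal sketch in Python, assuming plain text with one record per line; the chunk size is a placeholder, and real tools (GNU sort, Hadoop, or the UCSD system itself) handle binary records, buffering and parallelism far more carefully:

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_ram=1_000_000):
    """Sort a large line-based text file that does not fit in RAM.

    Phase 1: read chunks of at most `max_lines_in_ram` lines, sort each
    chunk in memory, and write it to a temporary sorted "run" file.
    Phase 2: stream-merge all runs with a heap (k-way merge).
    """
    run_paths = []
    with open(input_path) as infile:
        while True:
            chunk = [line for _, line in zip(range(max_lines_in_ram), infile)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)

    run_files = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        # heapq.merge lazily merges the already-sorted run files,
        # so only one line per run needs to be in memory at a time.
        out.writelines(heapq.merge(*run_files))

    for f in run_files:
        f.close()
    for p in run_paths:
        os.remove(p)
```

For bioinformatics files you would typically sort on a key extracted from each record (say, chromosome and position) rather than on whole lines, but the run-then-merge structure is the same.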

Do check out the Sort Benchmark site (sortbenchmark.org).
They have various metrics for sorting, most of which I had never thought of before!
I love the category of PennySort in particular!

Metric: Amount of data that can be sorted for a penny's worth of system time.
Originally defined in the AlphaSort paper.
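The arithmetic behind PennySort is simple: price out the machine, amortise that cost over its useful life to get a cost per second, and a penny then buys you a fixed slice of time in which to sort. A rough sketch with made-up numbers (the $1,500 price and 3-year lifetime are my own assumptions, not from the benchmark rules):

```python
# How much time does one penny buy on a cheap box? (illustrative numbers only)
system_cost_usd = 1500.0                 # assumed total system price
lifetime_seconds = 3 * 365 * 24 * 3600   # assumed 3-year amortisation
cost_per_second = system_cost_usd / lifetime_seconds
penny_seconds = 0.01 / cost_per_second
print(f"A penny buys ~{penny_seconds:.0f} s of system time")  # ~631 s
```

The benchmark then asks how much data you can sort in that many seconds.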

There's also JouleSort:

Metric: Amount of energy required to sort either 10^8, 10^9 or 10^10 records (10 GB, 100 GB or 1 TB).
Originally defined in the JouleSort paper.
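JouleSort just swaps dollars for energy: measure the system's average power draw while it sorts, multiply by the elapsed time to get joules, and report how many records were sorted per joule. A hedged sketch (the 300 W and 1,000 s figures are invented purely for illustration):

```python
# Energy for a JouleSort-style run (illustrative numbers only).
records_sorted = 10**9        # the 100 GB class: 10^9 records of 100 bytes
avg_power_watts = 300.0       # assumed average system power draw
elapsed_seconds = 1000.0      # assumed wall-clock time for the sort
energy_joules = avg_power_watts * elapsed_seconds
print(f"Energy used: {energy_joules:.0f} J "
      f"({records_sorted / energy_joules:.0f} records/joule)")
```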
