Wednesday, 4 July 2012

Sort::Rank - search.cpan.org


I had a problem where I needed the non-unique rank of SNP IDs based on a score. I thought this would be a terribly popular thing to do, but so far most of the options that turned up are Excel-based, and this was the only Perl module that seems able to do it.
Sort::Rank - Sort arrays by some score and organise into ranks. http://search.cpan.org/~andya/Sort-Rank-v0.0.2/lib/Sort/Rank.pm
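From what I can tell from the module's synopsis, usage goes roughly like the sketch below. Take it as a sketch only: the score key is what the module ranks on, but the other record fields are made up here, and the exact layout of what rank_sort returns is an assumption to double-check against the POD.

#!/usr/bin/perl
use strict;
use warnings;
use Sort::Rank qw(rank_sort);
use Data::Dumper;

# Toy records shaped like the input further down: a score plus
# whatever else needs carrying along. Only the 'score' key matters
# to the module; the id/label keys are just illustrative.
my @snps = (
    { id => 'rs1', label => 'bla', score => 0.82e2 },
    { id => 'rs2', label => 'bla', score => 1.92e2 },
    { id => 'rs3', label => 'bla', score => 1.72e2 },
    { id => 'rs4', label => 'bla', score => 1.82e2 },
    { id => 'rs5', label => 'bla', score => 1.82e2 },
);

# rank_sort pairs each record with its rank, and records with equal
# scores share a rank; the exact shape of each returned element is
# best confirmed against the module's documentation.
print Dumper( rank_sort( \@snps ) );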
If I needed a unique rank, I could have done it this way:
sort -n -k 3 snp_score.csv | nl > ranksorted-snp-score.csv
I could conceivably parse the file again and adjust the ranking based on whether the score value repeats (a sketch of that second pass follows the example output below) ...
Does anyone have a better / faster way to sort 2 million records?
The input file looks like this:
rs  bla 0.8200E+02
rs  bla 1.9200E+02
rs  bla 1.7200E+02
rs  bla 1.8200E+02   
rs  bla 1.8200E+02
I want to get something like
 1  rs  bla 0.8200E+02
 2  rs  bla 1.7200E+02
 3  rs  bla 1.8200E+02
 3  rs  bla 1.8200E+02
 4  rs  bla 1.9200E+02
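For the record, here is a rough Perl sketch of that second pass: read rows already sorted by the score in column 3 and bump the rank only when the score changes, so tied scores share a rank as in the output above. The script and output file names are just placeholders.

#!/usr/bin/perl
# rank.pl - add a non-unique (dense) rank column to rows that are
# already sorted by the score in column 3.
use strict;
use warnings;

my $rank = 0;
my $prev;    # score seen on the previous row

while ( my $line = <> ) {
    chomp $line;
    my @fields = split ' ', $line;    # whitespace-separated columns
    my $score  = $fields[2];          # third column holds the score
    if ( !defined $prev or $score != $prev ) {
        $rank++;                      # score changed, so advance the rank
        $prev = $score;
    }
    print "$rank\t$line\n";
}

Used as a filter, e.g. sort -n -k 3 snp_score.csv | perl rank.pl > ranked-snp-score.csv, it is a single streaming pass over the sorted data, so the sort itself should dominate the runtime even at 2 million rows.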

1 comment:

  1. Interesting.

    I know R can do this but I don't know how fast or memory efficient it would be on this many rows.

    Off the top of my head while not at a computer, I'd benchmark "sort | awk" where awk would increment the rank each time the score changed.

    Also, try using --stable to disable sort's last-resort comparison; this might speed things up if many of the scores are the same.

    GNU sort can be parallelised, but I don't think you can then take advantage of running awk in parallel through a pipe, since the output of the sort is held back until the sort has finished.

    I might have a dabble today if I get the chance!

    Nathan

