Wednesday, 14 July 2010

Shiny new tool to index NGS reads G-SQZ

This is a long over due tool for those trying to do non-typical analysis with your reads.
Finally you can index and compress your NGS reads

Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.

Tembe W, Lowey J, Suh E.

Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.

SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site. CONTACT: Waibhav Tembe (

read the discussion thread in seqanswers for more tips and benchmarks

I am not affliated with the author btw

No comments:

Post a Comment

Datanami, Woe be me