This is a long-overdue tool for anyone trying to do non-typical analyses with their reads.
Finally, you can index and compress your NGS reads.
Bioinformatics. 2010 Jul 6. [Epub ahead of print]
G-SQZ: Compact Encoding of Genomic Sequence and Quality Data.
Tembe W, Lowey J, Suh E.
Translational Genomics Research Institute, 445 N 5th Street, Phoenix, AZ 85004, USA.
SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman coding-based sequencing-reads specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved from 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from start. This paper focuses on describing the underlying encoding scheme and its software implementation, and a more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.
AVAILABILITY: http://public.tgen.org/sqz Academic/non-profit: Source: available at no cost under a non-open-source license by requesting from the web-site; Binary: available for direct download at no cost. For-Profit: Submit request for for-profit license from the web-site.
CONTACT: Waibhav Tembe (firstname.lastname@example.org).
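To give a feel for what "Huffman coding-based" means in the summary above, here is a minimal Python sketch of plain textbook Huffman coding applied to reads. It is not the G-SQZ file format or its source code. Treating each (base, quality) pair as one symbol is my assumption for illustration, and the helper names (build_huffman_codes, encode, decode) and the toy read are made up. Frequent pairs get short bit strings and rare pairs get longer ones, while the order of the symbols is preserved.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(symbols):
    """Return a dict mapping each distinct symbol to a prefix-free bit string."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, codes):
    """Concatenate the code of every symbol, keeping the original order."""
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Walk the bit string and emit a symbol each time a full code is matched."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Toy read (hypothetical data): bases plus per-base quality characters.
bases = "ACGTACGTAAAA"
quals = "IIIIIIIIHHHH"
pairs = list(zip(bases, quals))  # assumption: one symbol per (base, quality)

codes = build_huffman_codes(pairs)
bits = encode(pairs, codes)
assert decode(bits, codes) == pairs  # lossless round trip
print(len(bits), "bits instead of", 8 * 2 * len(pairs), "bits as plain text")
```

A real tool would pack the bits into bytes and store the code table in a header. The selective access mentioned in the summary would also need some kind of index into the compressed stream, which this sketch does not attempt.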
Read the discussion thread on SEQanswers for more tips and benchmarks.
I am not affiliated with the authors, by the way.