Kevin's GATTACA World: PROGRAM: seqtk for sampling, trimming, fastq2fasta, subsequence, reverse complement and more

Monday, 28 May 2012

Following the discussion on subsampling sequences from fasta/fastq, I think it is perhaps time to advertise my in-house tool more openly: seqtk. Currently, seqtk supports quality-based trimming with the Phred algorithm, converting fastq to fasta, reverse complementing sequences, extracting or masking subsequences in regions given in a BED or name-list file, and more. I have just added a subsampling module to sample exactly n sequences or a fraction of sequences.
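To give a feel for what "sample exactly n sequences" involves, here is a minimal Python sketch of reservoir sampling, one common way to draw a fixed-size uniform sample in a single pass over a file. It is only an illustration and not seqtk's own code (seqtk is written in C); the function name `reservoir_sample` and its `seed` parameter are made up for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], n: int,
                     seed: Optional[int] = None) -> List[T]:
    """Pick n records uniformly at random in one pass (hypothetical helper).

    If the input holds fewer than n records, all of them are returned.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: List[T] = []
    for i, rec in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(rec)
        else:
            # Keep record i with probability n / (i + 1) by drawing a slot
            # in [0, i] and replacing only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = rec
    return reservoir


if __name__ == "__main__":
    names = [f"read{i}" for i in range(1000)]
    print(reservoir_sample(names, 5, seed=100))
```

Because the random draws depend only on the record index and the seed, running this sketch with the same seed over two files with the same number of records selects the same positions in both, which is what you want for keeping paired-end files in sync.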

Seqtk supports both fasta and fastq input files, which can optionally be gzip-compressed. Each module is perhaps the most efficient among tools with the same functionality. For example, I know its fastq-to-fasta conversion is 10X faster than another converter, while being more flexible.

Seqtk is implemented in a single .c file and two header files and only depends on zlib. The source code is freely available here (MIT license):

https://github.com/lh3/seqtk

Heng

4 comments:

Anonymous, 16 August 2012 at 19:55
I would like to know what are considered low-quality bases by the Phred algorithm.
Regards,
Dennis
Kevin, 16 August 2012 at 21:10
Check out http://en.wikipedia.org/wiki/Phred_quality_score#Reliability
Phred quality scores are defined as a property which is logarithmically related to the base-calling error probabilities.

or

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above.
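The logarithmic relationship quoted above is Q = -10 * log10(P), where P is the probability that the base call is wrong. A small Python sketch of the arithmetic follows; the function names are made up for this illustration and are not part of seqtk.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q30 corresponds to about 1 error in 1000 calls, Q20 to about 1 in 100.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

So a cut-off expressed as an error rate and a cut-off expressed as a Phred score are two views of the same number, and either can be converted to the other.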
Anonymous, 17 August 2012 at 17:09
Thanks for your reply. So I understand what the Phred score means, but what is then considered low quality by "seqtk trimfq"? What is the default, and is it possible to choose your own cut-off?
Kevin, 17 August 2012 at 17:49
Ah! Sorry!
I pasted Li Heng's reply here, so perhaps you have the mistaken impression that I am the author of this software. Perhaps that's a question you can direct to the mailing list there?