Following the discussion on subsampling sequences from fasta/fastq files, I think perhaps it is time to more openly advertise my in-house tool: seqtk. Currently, seqtk supports quality-based trimming with the phred algorithm, converting fastq to fasta, reverse complementing sequences, extracting or masking subsequences in regions given in a BED or name list file, and more. I have just added a subsampling module to sample exactly n sequences or a fraction of sequences.
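To illustrate the two sampling modes mentioned above, here is a minimal Python sketch. It is not seqtk's code: the function names, the `seed` parameter and the use of reservoir sampling for the "exactly n" case are assumptions made for illustration only.

```python
import random

def reservoir_sample(records, n, seed=11):
    """Keep exactly n records (or all of them if fewer exist) in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(rec)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = rec
    return reservoir

def fraction_sample(records, frac, seed=11):
    """Keep each record independently with probability frac."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < frac]
```

Using the same seed on both files of a paired-end run keeps the mates together in this sketch, because the same record positions are selected in each file.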
Seqtk supports both fasta and fastq input files, which can be optionally gzip-compressed. Each module is perhaps the most efficient among tools of the same functionality. For example, I know fastq-to-fasta is 10X faster than another converter, while being more flexible.
Seqtk is implemented in a single .c file and two header files and only depends on zlib. The source code is freely available here (MIT license):
https://github.com/lh3/seqtk
Heng
I would like to know which bases the Phred algorithm considers low quality.
Regards,
Dennis
Check out http://en.wikipedia.org/wiki/Phred_quality_score#Reliability
Phred quality scores are defined as a property which is logarithmically related to the base-calling error probabilities.
or
For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000. The most commonly used method is to count the bases with a quality score of 20 and above.
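The relationship quoted above is Q = -10 * log10(P), where P is the error probability. A small Python sketch of the conversion follows; the function names are made up for this example, and the ASCII offset of 33 assumes Sanger-style fastq quality encoding.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Convert a fastq quality string to Phred scores (offset 33 assumed)."""
    return [ord(c) - offset for c in qual]
```

So a score of 20 corresponds to a 1 in 100 chance of a wrong call, and a score of 30 to 1 in 1000.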
Thanks for your reply. So I understand what the Phred score means, but what is then considered low quality by "seqtk trimfq"? What is the default, and is it possible to choose your own cut-off?
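For context, Phred-style trimming is commonly described as a modified Mott algorithm: each base scores an error threshold minus its error probability, and the read is trimmed to the highest-scoring contiguous segment. The Python sketch below shows that general idea only. It is not taken from seqtk's source, and the `max_error` value of 0.05 is an assumed illustration rather than a confirmed default; the actual default and options should be checked in the usage message of "seqtk trimfq".

```python
def trim_window(quals, max_error=0.05):
    """Return (start, end) of the best-scoring segment, end exclusive.

    quals is a list of Phred scores; max_error is a hypothetical cut-off.
    """
    best = score = 0.0
    start = end = 0
    candidate = 0  # where the current running segment began
    for i, q in enumerate(quals):
        # Bases with error probability above max_error lower the score.
        score += max_error - 10 ** (-q / 10)
        if score > best:
            best, start, end = score, candidate, i + 1
        if score < 0:
            # Restart the running segment after this base.
            score, candidate = 0.0, i + 1
    return start, end
```

With a cut-off of 0.05, bases of roughly Q13 or lower pull the score down, so in this sketch "low quality" depends on the chosen error threshold rather than on a fixed Phred score.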
Ah! Sorry!
I pasted Li Heng's reply here, so perhaps you have the mistaken impression that I am the author of this software. Perhaps that's a question you can direct to the mailing list there?