Monday, 30 July 2012

bioawk: AWK for gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names

Alerted to this on

About bioawk

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.
Bioawk adds a new -c fmt option that specifies the input format. The behavior of bioawk will vary depending on the value of fmt.
For the formats that bioawk recognizes, specially named variables are created. For example, for the supported sequence formats the $name, $seq and, if applicable, $qual variables may be used to access the name, sequence and quality string of the sequence record in each iteration. Here is an example of iterating over a fastq file to print the sequences:
    awk -c fastq '{ print $seq }' test.fq  
For known interval formats the columns can be accessed via variables such as $start, $end and $chrom (etc.). For example, to print the feature length for each line of a file in BED format one could write:
    awk -c bed '{ print $end - $start }' test.bed  
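For comparison, here is a rough plain-awk sketch of the same calculation without the named variables. It assumes the standard BED layout, where column 2 is the start and column 3 is the end; `bed_lengths` is a hypothetical helper name and the two input lines are made-up sample data.

```shell
# Plain-awk equivalent of the BED example above (illustrative sketch only).
# Assumes tab-separated BED: column 1 = chrom, column 2 = start, column 3 = end.
bed_lengths() {
    awk -F'\t' '{ print $3 - $2 }'   # feature length = end - start
}

# Made-up sample data, piped in so no file is needed.
printf 'chr1\t100\t250\tfeatA\nchr2\t0\t10\tfeatB\n' | bed_lengths
```

The named variables save you from remembering which column holds which field, which matters more for wider formats such as SAM or VCF.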
One important change (and innovation) over the original awk is that bioawk will treat sequences that may span multiple lines as a single record. The parsing, implemented in C, may be several orders of magnitude faster than similar code written in interpreted languages such as Perl, Python or Ruby.
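To see what the single-record behaviour saves you, here is a minimal plain-awk sketch that joins wrapped FASTA lines by hand before reporting each sequence length. It only imitates the effect described above and is not bioawk's parser; `fasta_lengths` is a hypothetical helper name and the input is made-up sample data.

```shell
# Plain-awk sketch: treat a multi-line FASTA entry as one record.
# Illustrative only; it does not handle FASTQ or quality strings.
fasta_lengths() {
    awk '
        /^>/ {                            # a header line starts a new record
            if (name != "") print name "\t" length(seq)
            name = substr($1, 2)          # drop ">" and any text after the first space
            seq = ""
            next
        }
        { seq = seq $0 }                  # append each wrapped sequence line
        END { if (name != "") print name "\t" length(seq) }
    '
}

# Made-up sample: seq1 is wrapped over two lines.
printf '>seq1 demo\nACGT\nAC\n>seq2\nGGG\n' | fasta_lengths
```

With the sequence mode all of this bookkeeping disappears, because each awk iteration already sees the whole sequence.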
When the format mode is header or hdr, bioawk parses named columns. It automatically adds variables whose names are taken from the first line and whose values are the corresponding column indices, so a column can be referenced by its name. Special characters are converted to an underscore.
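The idea behind header mode can be sketched in plain awk by building a name-to-index map from the first line. This is only an illustration of the concept, not how bioawk implements it: `col_by_name` is a hypothetical helper, the sample table is made up, and the sketch assumes that "special characters" means anything other than letters, digits and the underscore.

```shell
# Plain-awk sketch of header mode: look up a column by its header name.
# Usage: col_by_name NAME < table.tsv
col_by_name() {
    awk -F'\t' -v want="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                key = $i
                gsub(/[^A-Za-z0-9_]/, "_", key)   # assumed rule: special characters become "_"
                col[key] = i                      # header name -> column index
            }
            next
        }
        (want in col) { print $(col[want]) }      # print the requested column
    '
}

# Made-up sample table; "log2-fc" is addressed as "log2_fc".
printf 'gene id\tlog2-fc\nBRCA1\t1.5\nTP53\t-0.7\n' | col_by_name log2_fc
```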
Bioawk also adds a few built-in functions including, as of now, and(), or(), xor(), and others (see comprehensive list below).
Detailed help is maintained in the bioawk manual page. To access it, type:
    man ./awk.1  

Usage Examples

  1. Extract unmapped reads without header:
        awk -c sam 'and($flag,4)' aln.sam.gz  
  2. Extract mapped reads with header:
        awk -c sam -H '!and($flag,4)'
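Both examples test bit 4 of the SAM FLAG field, which marks a read as unmapped. As a rough illustration of what and($flag,4) checks, the sketch below performs the same test with integer arithmetic in plain awk, which has no built-in and(). The helper name `unmapped_reads` and the truncated three-column records are made up for the example; real SAM lines have at least eleven columns.

```shell
# Plain-awk sketch of the unmapped-read filter (illustrative only).
# Column 2 of a SAM record is FLAG; bit 4 (0x4) set means the read is unmapped.
unmapped_reads() {
    awk -F'\t' '
        /^@/ { next }              # skip SAM header lines
        int($2 / 4) % 2 == 1       # true when bit 4 of FLAG is set
    '
}

# Made-up, truncated records: r1 (flag 4) and r3 (flag 77) are unmapped.
printf '@HD\tVN:1.0\nr1\t4\t*\nr2\t0\tchr1\nr3\t77\t*\n' | unmapped_reads
```

Dividing by 4 shifts the flag right by two bits, and the remainder modulo 2 isolates the bit that and() would return.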

1 comment:

  1. Looks interesting.

    I'm in a situation where I want to use my GFF file to extract the exons from a SAM file. But I have pooled data, so each of my paired reads is of interest, so I really want to be able to look at my data in its context, say for instance I have SNPs that are 1 bp apart, they could be resulting in different AA codings. Depending upon recombination I could have 4 different amino acids .... any idea how to extract that?

