Friday, 29 January 2010

gawk Sequence Alignment/Map (SAM) Format files for mapping info

There's lots of stuff you can do with sed awk grep in linux.
Below is a few commands that I modified from samtools helplist that will help others in trying to sieve out mapping info from sam files
Basically it works by looking at col 2 of the sam file for the flag to see if the read is mapped, unmapped. Look at above link for more info on the sam format.

#extract unmapped reads from sam file
gawk '!/^@/&&and($2,4)' aln.sam > aln.sam.unmapped.reads

#extracts readname & name of seq in reference where read is mapped
gawk '!/^@/&&!and($2,4){print  $1 $3}' aln.sam > aln.sam.mapped.reads

#counts the number of reads mapped to seq in reference
gawk '{print $3}' aln.sam.mapped.reads|sort  |uniq -c |sort -n > aln.sam.mapped.reads.stats

No comments:

Post a Comment

Datanami, Woe be me