Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes


To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:
  awk '{ print >> "my." $1 ".bed" }' my.bed


BSD's AWK 
 awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed





credit: Assaf Gordon on BedTools mailing list for pointing this out






On a side note, 
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: https://github.com/pezmaster31/bamtools ) .


There is a SAMtools filter option as well. 
Anyone benchmarked both to see which runs faster? 




this discussion arose from someone trying to compute the mean coverage 

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt


it doesn't scale too well with the no. of entries in the bed file 
1,000 lines takes a few minutes
1,200,000 lines " it's been running for 12 hours and still not done yet " on a Mac Pro with speed of 2.66 GHz and 8 GB of Memory.


So splitting the file by chromosome helps to parallelize the process although the job might scale linearly

1 comment:

  1. What if I want to keep headers?

    ReplyDelete

Datanami, Woe be me