Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes

To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:
  awk '{ print >> "my." $1 ".bed" }' my.bed

 awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed

credit: Assaf Gordon on BedTools mailing list for pointing this out

On a side note, 
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: ) .

There is a SAMtools filter option as well. 
Anyone benchmarked both to see which runs faster? 

this discussion arose from someone trying to compute the mean coverage 

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt

it doesn't scale too well with the no. of entries in the bed file 
1,000 lines takes a few minutes
1,200,000 lines " it's been running for 12 hours and still not done yet " on a Mac Pro with speed of 2.66 GHz and 8 GB of Memory.

So splitting the file by chromosome helps to parallelize the process although the job might scale linearly

1 comment:

  1. What if I want to keep headers?


Datanami, Woe be me