Kevin's GATTACA World: Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes

Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes

To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:

awk '{ print >> "my." $1 ".bed" }' my.bed

BSD's AWK

awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed

credit: Assaf Gordon on BedTools mailing list for pointing this out

On a side note,
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: https://github.com/pezmaster31/bamtools ) .

There is a SAMtools filter option as well.
Anyone benchmarked both to see which runs faster?

this discussion arose from someone trying to compute the mean coverage

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt

it doesn't scale too well with the no. of entries in the bed file
1,000 lines takes a few minutes
1,200,000 lines " it's been running for 12 hours and still not done yet " on a Mac Pro with speed of 2.66 GHz and 8 GB of Memory.

So splitting the file by chromosome helps to parallelize the process although the job might scale linearly

1 comment:

Anonymous21 December 2016 at 00:55
What if I want to keep headers?
ReplyDelete
Replies

Add comment

Kevin's GATTACA World

Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes

1 comment:

Datanami, Woe be me

Analytics code

Contributors