Kevin's GATTACA World: bedtools

Showing posts with label bedtools. Show all posts

Friday, 17 September 2021

Benchmarking variants and comparing truth sets: List of useful tools and publications

Just realised that other than vcf-compare and bedtools intersect

there's other options

https://github.com/RealTimeGenomics/rtg-tools

Also there's actually new variant callers ..

Molina-Mora, J.A., Solano-Vargas, M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 22, 20 (2021). https://doi.org/10.1186/s12859-020-03926-3

Krishnan, V., Utiramerur, S., Ng, Z. et al. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 22, 85 (2021). https://doi.org/10.1186/s12859-020-03934-3

Additional file 23: File 3

. verify_variants.py

Zook, Justin M et al. “An open resource for accurately benchmarking small variant and reference calls.” Nature biotechnology vol. 37,5 (2019): 561-566. doi:10.1038/s41587-019-0074-6

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`

hgvs.readthedocs.io/

hgvs: A Python package for manipulating sequence variants using HGVS nomenclature: 2018 Update.
Wang M, Callenberg KM, Dalgleish R, Fedtsov A, Fox NK, Freeman PJ, Jacobs KB, Kaleta P, McMurry AJ, Prlić A, Rajaraman V, Hart RK.Hum Mutat. 2018 Dec;39(12):1803-1813. doi: 10.1002/humu.23615. Epub 2018 Sep 5.PMID: 30129167 Free PMC article.
- Sequence Variant Descriptions: HGVS Nomenclature and Mutalyzer.
  den Dunnen JT.Curr Protoc Hum Genet. 2016 Jul 1;90:7.13.1-7.13.19. doi: 10.1002/cphg.2.PMID: 27367167

A Python package for parsing, validating, mapping and formatting sequence variants using HGVS nomenclature.
Hart RK, Rico R, Hare E, Garcia J, Westbrook J, Fusaro VA.Bioinformatics. 2015 Jan 15;31(2):268-70. doi: 10.1093/bioinformatics/btu630. Epub 2014 Sep 30.PMID: 25273102 Free PMC article.
VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions.
Freeman PJ, Hart RK, Gretton LJ, Brookes AJ, Dalgleish R.Hum Mutat. 2018 Jan;39(1):61-68. doi: 10.1002/humu.23348. Epub 2017 Oct 17.PMID: 28967166 Free PMC article.
Clinical Implementation and Validation of Automated Human Genome Variation Society (HGVS) Nomenclature System for Next-Generation Sequencing-Based Assays for Cancer.
Callenberg KM, Santana-Santos L, Chen L, Ernst WL, De Moura MB, Nikiforov YE, Nikiforova MN, Roy S.J Mol Diagn. 2018 Sep;20(5):628-634. doi: 10.1016/j.jmoldx.2018.05.006. Epub 2018 Jun 21.PMID: 29936258

Wednesday, 2 May 2012

BED file format ...

[$bash] coverageBed -a xc.bed -b xb.bed
Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?

Can't believe I wasted 2 hours trying all options to debug the above ... ONLY to suddenly remember that my colleague probably passed me the xb.bed which she generated in Windows ...

I was witch hunting for double tabs, chr names ... etc when

dos2unix xb.bed

solved it in 5 secs ... :/

Thursday, 1 March 2012

Mac (BSD) awk != Linux (GNU) awk, split BED file by chromosomes

To split your BED file by chromosome, this simple GNU AWK script will create "my.chrNNN.bed" file for each chromosome:

awk '{ print >> "my." $1 ".bed" }' my.bed

BSD's AWK

awk '{ file = "TFBS." $1 ".bed" ; print >> file }' TFBS.bed

credit: Assaf Gordon on BedTools mailing list for pointing this out

On a side note,
To split your BAM file by chromosome, you can use "bamtools split" ( bamtools here: https://github.com/pezmaster31/bamtools ) .

There is a SAMtools filter option as well.
Anyone benchmarked both to see which runs faster?

this discussion arose from someone trying to compute the mean coverage

bedtools coverage -abam my.bam -b my.bed -d | sort -k1,1 -k2,2n | groupby -g 1,2,3,4,5,6 -c 8 -o mean > my.txt

it doesn't scale too well with the no. of entries in the bed file
1,000 lines takes a few minutes
1,200,000 lines " it's been running for 12 hours and still not done yet " on a Mac Pro with speed of 2.66 GHz and 8 GB of Memory.

So splitting the file by chromosome helps to parallelize the process although the job might scale linearly

Thursday, 29 December 2011

multiBamCov (BAM) Counts sequence coverage for multiple position-sorted bams at specific loci defined in a BED/GFF/VCF file BedTools

multiBamCov will report a count of alignments for each BAM at each interval

Program: multiBamCov (v2.14.2)
Summary: Counts sequence coverage for multiple bams at specific loci.

Usage:   multiBamCov [OPTIONS] -bams aln.1.bam aln.2.bam ... aln.n.bam -bed

Options:
        -bams   The bam files.

        -bed    The bed file.

        -q      Minimum mapping quality allowed. Default is 0.

        -D      Include duplicate-marked reads. Default is to count non-duplicates only

        -F      Include failed-QC reads. Default is to count pass-QC reads only

        -p      Only count proper pairs. Default is to count all alignments with MAPQ
                greater than the -q argument, regardless of the BAM FLAG field.

the sample output looks like

chr20 0 100000 19312 16844
chr20 100000 200000 43910 37579
chr20 200000 300000 43245 43215
chr20 300000 400000 41653 47556
chr20 400000 500000 42929 43165
chr20 500000 600000 44265 45325

The last two columns are the count of alignments in the first and second BAM, respectively.

FAQ - How to get average depth of coverage for targeted seq regions using BedTools

Question
If I have a list of targeted resequencing regions and a bam file is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist but that seems messy. In addition is there any way to
adjust the minimum number of hits at a location for it to be included in
the fraction of bases in B that had non-zero coverage into bases in B
that had more-than-X coverage?

Reply
There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (https://github.com/arq5x/filo

) to compute the mean. For example, here's a 5 column BED of the UCSC simple repeat track:

$ head simRep.chr22.bed
22 16052163 16052199 trf 65
22 16059465 16059496 trf 53
22 16060479 16060524 trf 90
22 16060479 16060592 trf 120
22 16060517 16060609 trf 129
22 16071925 16071962 trf 74
22 16078616 16078686 trf 86
22 16079642 16079731 trf 61
22 16079644 16079696 trf 59
22 16079648 16079724 trf 52

I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22 17825583 17825839 trf 372 1 2
22 17825583 17825839 trf 372 2 2
22 17825583 17825839 trf 372 3 2
22 17825583 17825839 trf 372 4 2
22 17825583 17825839 trf 372 5 2
22 17825583 17825839 trf 372 6 2
22 17825583 17825839 trf 372 7 3
22 17825583 17825839 trf 372 8 3
22 17825583 17825839 trf 372 9 3
22 17825583 17825839 trf 372 10 3

Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22 17825583 17825839 trf 372 2.96875
22 16506905 16558419 trf 20099 7.6543852156695271205
22 16506905 16558419 trf 23903 7.6543852156695271205
22 16506941 16537880 trf 14401 7.47002165551569241586
22 16515001 16516375 trf 832 4.88427947598253275885
22 17039134 17049375 trf 5493 6.38755980861244054836
22 17039134 17049375 trf 4544 6.38755980861244054836
22 17039134 17049375 trf 5071 6.38755980861244054836
22 18087713 18088319 trf 759 5.42904290429042912791
22 19529662 19529964 trf 469 5.91721854304635730415

Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:

$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean

> In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?

There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.

Best,
Aaron

Friday, 18 November 2011

BED file format - FAQ

UCSC very good description of the BED format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Bedtools attempts to auto-detect the file formats used in each command. Thus, as long as your file conform to the format definitions for each file type, you should be okay. For example:

- BAM is zero-based, half-open. SAM is 1-based, closed.
- BED is zero-based, half-open.

zero length intervals (start == end), which in BED format, are interpreted as insertions in the reference.

I can't confirm this but from the top of my head, I recall that

start -1, end in BED format refers to SNPs

(source mostly from BEDTools mailling list)

Monday, 4 July 2011

genomeCoverageBed to look at coverage of your WGS

BEDTools is a very useful set of programs for looking at your NGS sequencing result.

one of which I use regularly is

$ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv

so you only need your bam file and a genome file (tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome.
This file can be done automatically with

$ samtools faidx reference_genome.fasta

which generates a .fai file which contains the info required. (I very nearly went and wrote a bioperl script to generate this, luckily I remembered the contents of the fai file.

Update:
Do read the helpful comments by Micheal for using samtools idxstats on the bam file. And the shortcomings of looking at the coverage this way.
He has a helpful recent post here

Accurate genome-wide read depth calculation (how-to)

I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacteria reseq project. I like the coverage plot in IGV but am not sure if there are opensource tools to do the same.
Any tips to share?

Other references
Discussion at BioStar

Saturday, 29 January 2011

Pybedtools - python wrapper for bedtools

Looks interesting!
URL above links to tutorial for use to generate a 3 way venn diagram.

Kevin's GATTACA World