Showing posts with label FAQ. Show all posts

Saturday, 9 March 2013

Linux CLI gems

Was alerted to this 'shell script' on github that contains several lines of linux commandline gems. That's how I store my code snippets too so that I can see them in contextual colors when you open in VIM or some other editor that is aware of the content (see example below)


# remove spaces from filenames in current directory

rename -n 's/[\s]/''/g' *

another good resource for picking up new CLI magic is at commandlinefu.com

Enjoy!

Thursday, 29 December 2011

FAQ - How to get average depth of coverage for targeted seq regions using BedTools

Question
If I have a list of targeted resequencing regions and a bam file is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist but that seems messy. In addition is there any way to
adjust the minimum number of hits at a location for it to be included in
the fraction of bases in B that had non-zero coverage into bases in B
that had more-than-X coverage?

Reply
There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (https://github.com/arq5x/filo

) to compute the mean. For example, here's a 5 column BED of the UCSC simple repeat track:

$ head simRep.chr22.bed
22 16052163 16052199 trf 65
22 16059465 16059496 trf 53
22 16060479 16060524 trf 90
22 16060479 16060592 trf 120
22 16060517 16060609 trf 129
22 16071925 16071962 trf 74
22 16078616 16078686 trf 86
22 16079642 16079731 trf 61
22 16079644 16079696 trf 59
22 16079648 16079724 trf 52

I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22 17825583 17825839 trf 372 1 2
22 17825583 17825839 trf 372 2 2
22 17825583 17825839 trf 372 3 2
22 17825583 17825839 trf 372 4 2
22 17825583 17825839 trf 372 5 2
22 17825583 17825839 trf 372 6 2
22 17825583 17825839 trf 372 7 3
22 17825583 17825839 trf 372 8 3
22 17825583 17825839 trf 372 9 3
22 17825583 17825839 trf 372 10 3

Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22 17825583 17825839 trf 372 2.96875
22 16506905 16558419 trf 20099 7.6543852156695271205
22 16506905 16558419 trf 23903 7.6543852156695271205
22 16506941 16537880 trf 14401 7.47002165551569241586
22 16515001 16516375 trf 832 4.88427947598253275885
22 17039134 17049375 trf 5493 6.38755980861244054836
22 17039134 17049375 trf 4544 6.38755980861244054836
22 17039134 17049375 trf 5071 6.38755980861244054836
22 18087713 18088319 trf 759 5.42904290429042912791
22 19529662 19529964 trf 469 5.91721854304635730415

Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:

$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean

> In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?

There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.

Best,
Aaron

Friday, 18 November 2011

BED file format - FAQ

UCSC very good description of the BED format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Bedtools attempts to auto-detect the file formats used in each command. Thus, as long as your file conform to the format definitions for each file type, you should be okay. For example:

- BAM is zero-based, half-open. SAM is 1-based, closed.
- BED is zero-based, half-open.

zero length intervals (start == end), which in BED format, are interpreted as insertions in the reference.

I can't confirm this but from the top of my head, I recall that

start -1, end in BED format refers to SNPs

(source mostly from BEDTools mailling list)

Thursday, 31 March 2011

Convert SAM / BAM to fasta / fastq

Probably one of the most freq FAQ

latest thread in biostar
http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file

Samtofastq using Picard
http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq

Samtools and awk to make fasta from sam

samtools view filename.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta

Biopython and pysam (code contributed by Brad Chapman)

http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file/6994#6994

Tuesday, 22 March 2011

FAQ-SNPs missing when called with more samples

Question
Using mpileup called with 2 different samples, Im getting this
particular SNP which has a gd coverage (DP=55) only in one particular
sample. This is good.

However when mpileup was called with 10 samples, the SNP got lost. Im
just trying to figure out if the SNP got 'drowned' out by the other 9
samples which doesnt have the SNP and hence, wasnt called. Is this how
mpileup works

Excellent answer by Heng Li

With more samples, you gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for one sample. If we call SNPs from 10 samples separately and then combine the calls, the FPR would be around 5% (not 10% because more SNPs are found given 10 samples). To retain a low FPR on singletons we have to be more stringent. Nonetheless, with more samples, we can usually get overall better calls than calling SNPs in each sample separately because information between samples is used more effectively.

Monday, 11 October 2010

Genome sizes

Compendium of links for gauging genome sizes (useful for calculating genome coverage)

My fav diagram to get a relative feel for the genome sizes
is from

For which the image is reproduced here Figure 1-38 Genome sizes compared

There's other sources like

Wikipedia http://en.wikipedia.org/wiki/Genome_size
also see Comparison of different genome sizes

DOGS - Database Of Genome Sizes last modified Tuesday 16th of May 2006

Google images search for "genome sizes" for other lucky finds.

Molecular Biology of the Cell, Fourth Edition

Wednesday, 15 September 2010

Should I remove duplicates in my NGS data?

interesting posts here about PCR duplicates in NGS data

links compiled by malachig

Redundant reads are removed from ChIP-seq, what about RNA-seq
Duplicate reads removal

Threshold for duplicate removal

Heavy read stacking on the solexa platform
Should identical reads be reduced to a single count?

Removing duplicate reads from multigig .csfasta

Source of duplicate reads and possible ways of reducing

Friday, 27 August 2010

What hardware do I get for NGS data analysis?

Another that should belong in a NGS FAQ

Computer Hardware: CPU vs. Memory
http://seqanswers.com/forums/showthread.php?t=6564
In forum: Bioinformatics

Why not use BLAST for NGS reads?

This should be in a FAQ somewhere

Why do we use mapping programs instead of blast for mapping to a reference?
http://seqanswers.com/forums/showthread.php?t=6568
In forum: Bioinformatics

Kevin's GATTACA World