Showing posts with label FAQ. Show all posts
Showing posts with label FAQ. Show all posts

Saturday, 9 March 2013

Linux CLI gems

Was alerted to this 'shell script' on github that contains several lines of linux commandline gems. That's how I store my code snippets too so that I can see them in contextual colors when you open in VIM or some other editor that is aware of the content (see example below)


# remove spaces from filenames in current directory
rename -n 's/[\s]/''/g' *



another good resource for picking up new CLI magic is at commandlinefu.com

Enjoy!

Thursday, 29 December 2011

FAQ - How to get average depth of coverage for targeted seq regions using BedTools

Question
If I have a list of targeted resequencing regions and a bam file is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist but that seems messy. In addition is there any way to
adjust the minimum number of hits at a location for it to be included in
the fraction of bases in B that had non-zero coverage into bases in B
that had more-than-X coverage?

Reply
There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (https://github.com/arq5x/filo
) to compute the mean.  For example, here's a 5 column BED of the UCSC simple repeat track:

$ head simRep.chr22.bed
22      16052163        16052199        trf     65
22      16059465        16059496        trf     53
22      16060479        16060524        trf     90
22      16060479        16060592        trf     120
22      16060517        16060609        trf     129
22      16071925        16071962        trf     74
22      16078616        16078686        trf     86
22      16079642        16079731        trf     61
22      16079644        16079696        trf     59
22      16079648        16079724        trf     52

I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22      17825583        17825839        trf     372     1       2
22      17825583        17825839        trf     372     2       2
22      17825583        17825839        trf     372     3       2
22      17825583        17825839        trf     372     4       2
22      17825583        17825839        trf     372     5       2
22      17825583        17825839        trf     372     6       2
22      17825583        17825839        trf     372     7       3
22      17825583        17825839        trf     372     8       3
22      17825583        17825839        trf     372     9       3
22      17825583        17825839        trf     372     10      3

Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):

$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22      17825583        17825839        trf     372     2.96875
22      16506905        16558419        trf     20099   7.6543852156695271205
22      16506905        16558419        trf     23903   7.6543852156695271205
22      16506941        16537880        trf     14401   7.47002165551569241586
22      16515001        16516375        trf     832     4.88427947598253275885
22      17039134        17049375        trf     5493    6.38755980861244054836
22      17039134        17049375        trf     4544    6.38755980861244054836
22      17039134        17049375        trf     5071    6.38755980861244054836
22      18087713        18088319        trf     759     5.42904290429042912791
22      19529662        19529964        trf     469     5.91721854304635730415

Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:

$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean

>  In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?

There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.

Best,
Aaron

Friday, 18 November 2011

BED file format - FAQ


UCSC very good description of the BED format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1



Bedtools attempts to auto-detect the file formats used in each command.  Thus, as long as your file conform to the format definitions for each file type, you should be okay.  For example:

- BAM is zero-based, half-open.  SAM is 1-based, closed.
- BED is zero-based, half-open.


zero length intervals (start == end), which in BED format, are interpreted as insertions in the reference.

I can't confirm this but from the top of my head, I recall that

start -1, end in BED format refers to SNPs


(source mostly from BEDTools mailling list)

Thursday, 31 March 2011

Convert SAM / BAM to fasta / fastq

Probably one of the most freq FAQ

latest thread in biostar
http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file

Samtofastq using Picard
http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq

Samtools and awk to make fasta from sam
samtools view filename.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta
 
Biopython and pysam (code contributed by Brad Chapman)
http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file/6994#6994 

Tuesday, 22 March 2011

FAQ-SNPs missing when called with more samples

Question
Using mpileup called with 2 different samples, Im getting this
particular SNP which has a gd coverage (DP=55) only in one particular
sample. This is good.

However when mpileup was called with 10 samples, the SNP got lost. Im
just trying to figure out if the SNP got 'drowned' out by the other 9
samples which doesnt have the SNP and hence, wasnt called. Is this how
mpileup works

Excellent answer by Heng Li 

With more samples, you gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for one sample. If we call SNPs from 10 samples separately and then combine the calls, the FPR would be around 5% (not 10% because more SNPs are found given 10 samples). To retain a low FPR on singletons we have to be more stringent. Nonetheless, with more samples, we can usually get overall better calls than calling SNPs in each sample separately because information between samples is used more effectively.

Monday, 11 October 2010

Genome sizes

Compendium of links for gauging genome sizes (useful for calculating genome coverage)

My fav diagram to get a relative feel for the genome sizes
is from Molecular Biology of the Cell
  For which the image is reproduced here Figure 1-38   Genome sizes compared

There's other sources like

Wikipedia http://en.wikipedia.org/wiki/Genome_size
also see Comparison of different genome sizes

DOGS - Database Of Genome Sizes last modified Tuesday 16th of May 2006


Google images search for "genome sizes" for other lucky finds.



Molecular Biology of the Cell, Fourth Edition

Friday, 27 August 2010

What hardware do I get for NGS data analysis?

Another that should belong in a NGS FAQ

Computer Hardware: CPU vs. Memory
http://seqanswers.com/forums/showthread.php?t=6564
In forum: Bioinformatics

Why not use BLAST for NGS reads?

This should be in a FAQ somewhere

Why do we use mapping programs instead of blast for mapping to a reference?
http://seqanswers.com/forums/showthread.php?t=6568
In forum: Bioinformatics


Datanami, Woe be me