PLINK gPLINK Haploview Whole genome association software tutorial
Tuesday, 21 August 2012
Getting Started with R and Hadoop
(This article was first published on Revolutions: http://bit.ly/POPaK6)
For newcomers to map-reduce programming with R and Hadoop, Jeffrey's presentation includes a step-by-step example of computing flight times from air traffic data. The last few slides cover some advanced features: how to work directly with files in HDFS from R with the rhdfs package, and how to simulate a Hadoop cluster on the local machine (useful for development, testing, and learning RHadoop). Jeffrey also mentions that the RHadoop tutorial is a good resource for new users.
You can find Jeffrey's slides embedded below, and a video of the presentation is also available. You might also want to check out Jeffrey's older presentation Big Data Step-by-Step for tips on setting up a compute environment with Hadoop and R.
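If you want to poke at HDFS from R yourself, here is a minimal sketch of the rhdfs workflow mentioned above; it assumes the RHadoop packages are installed and that HADOOP_CMD points at your hadoop binary (the path below is a placeholder):
$ HADOOP_CMD=/usr/bin/hadoop Rscript -e '
library(rhdfs)
hdfs.init()          # connect using the cluster settings from the environment
print(hdfs.ls("/"))  # list the HDFS root to confirm the connection works
'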
Wednesday, 1 August 2012
SSD / HDD benchmarking on Linux
Found this gem of a wiki on Arch Linux, the distro for speed.
https://wiki.archlinux.org/index.php/SSD_Benchmarking
There are several HDD benchmarking utilities and approaches. Some, like dd (see the excerpt below), are available on all Linux systems by default and are good for quick-and-dirty benchmarking. I recently found that running on an HDD RAID that was twice as fast might have saved me half the time on the samtools mpileup command!
Using dd
Note: This method requires the command to be executed from a mounted partition on the device of interest!
First, enter a directory on the SSD with at least 1.1 GB of free space (one that obviously gives your user rwx permissions) and write a test file to measure write speeds and to give the device something to read:
$ cd /path/to/SSD
$ dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z MB/s
Next, clear the buffer-cache to accurately measure read speeds directly from the device:
# echo 3 > /proc/sys/vm/drop_caches
$ dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z MB/s
Now that the last file is in the buffer, repeat the command to see the speed of the buffer-cache:
$ dd if=tempfile of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
w bytes (x GB) copied, y s, z GB/s
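If you benchmark several drives, wrapping the three tests in a throwaway script saves typing. A minimal sketch of the wiki's steps (the mount point is a placeholder; run it as root so drop_caches works):
#!/bin/bash
# minimal sketch bundling the three dd tests from the wiki excerpt above
cd /path/to/disk || exit 1
dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc  # write speed
echo 3 > /proc/sys/vm/drop_caches                                    # clear the buffer cache
dd if=tempfile of=/dev/null bs=1M count=1024                         # read speed from the device
dd if=tempfile of=/dev/null bs=1M count=1024                         # read speed from the buffer cache
rm -f tempfile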
Wednesday, 25 July 2012
howto SKAT R library
The R package is simple to install, though you will need a recent version of R (compile R as a local user if you have no admin rights).
download from
~/Downloads$ R CMD INSTALL SKAT_0.76.tgz
* installing to library '/Library/Frameworks/R.framework/Versions/2.14/Resources/library'
* installing *binary* package 'SKAT' ...
* DONE (SKAT)
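Once installed, here is a minimal sketch of a SKAT run, using a simulated genotype matrix and phenotype as placeholders (not real data):
$ Rscript -e '
library(SKAT)
set.seed(1)
Z <- matrix(rbinom(200 * 10, 2, 0.05), nrow = 200)  # 200 samples x 10 rare variants (simulated)
y <- rnorm(200)                                      # continuous phenotype (simulated)
obj <- SKAT_Null_Model(y ~ 1, out_type = "C")        # null model, no covariates
print(SKAT(Z, obj)$p.value)                          # SNP-set p-value
'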
Thursday, 29 December 2011
multiBamCov (BedTools): counts sequence coverage for multiple position-sorted BAMs at specific loci defined in a BED/GFF/VCF file
multiBamCov will report a count of alignments for each BAM at each interval
Program: multiBamCov (v2.14.2)
Summary: Counts sequence coverage for multiple bams at specific loci.
Usage: multiBamCov [OPTIONS] -bams aln.1.bam aln.2.bam ... aln.n.bam -bed
Options:
-bams The bam files.
-bed The bed file.
-q Minimum mapping quality allowed. Default is 0.
-D Include duplicate-marked reads. Default is to count non-duplicates only
-F Include failed-QC reads. Default is to count pass-QC reads only
-p Only count proper pairs. Default is to count all alignments with MAPQ
greater than the -q argument, regardless of the BAM FLAG field.
The sample output looks like:
chr20 0 100000 19312 16844
chr20 100000 200000 43910 37579
chr20 200000 300000 43245 43215
chr20 300000 400000 41653 47556
chr20 400000 500000 42929 43165
chr20 500000 600000 44265 45325
The last two columns are the count of alignments in the first and second BAM, respectively.
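For example, a hypothetical invocation over the two BAMs above and a BED of 100 kb windows, keeping only alignments with MAPQ >= 20 (file names are placeholders):
multiBamCov -q 20 -bams aln.1.bam aln.2.bam -bed windows.100kb.bed > counts.txt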
FAQ - How to get average depth of coverage for targeted seq regions using BedTools
Question
If I have a list of targeted resequencing regions and a BAM file, is
there any easy way to get the average depth of coverage for each region?
I could calculate it from all of the values in the histogram output from
coverageBed -hist, but that seems messy. In addition, is there any way to
adjust the minimum number of hits required at a location for it to be counted,
i.e. to turn the fraction of bases in B that had non-zero coverage into the
fraction of bases in B that had more-than-X coverage?
Reply
There is no option for directly computing the mean coverage. However, one can combine the -d option (reports the depth at each base in each interval) with the groupBy tool found in the filo package (https://github.com/arq5x/filo) to compute the mean. For example, here's a 5-column BED of the UCSC simple repeat track:
$ head simRep.22.bed
22 16052163 16052199 trf 65
22 16059465 16059496 trf 53
22 16060479 16060524 trf 90
22 16060479 16060592 trf 120
22 16060517 16060609 trf 129
22 16071925 16071962 trf 74
22 16078616 16078686 trf 86
22 16079642 16079731 trf 61
22 16079644 16079696 trf 59
22 16079648 16079724 trf 52
I can compute the coverage at each base in each of these intervals as follows (column 6 is the position in the interval, column 7 is the depth):
$ coverageBed -abam aln.bam -b simRep.22.bed -d | head
22 17825583 17825839 trf 372 1 2
22 17825583 17825839 trf 372 2 2
22 17825583 17825839 trf 372 3 2
22 17825583 17825839 trf 372 4 2
22 17825583 17825839 trf 372 5 2
22 17825583 17825839 trf 372 6 2
22 17825583 17825839 trf 372 7 3
22 17825583 17825839 trf 372 8 3
22 17825583 17825839 trf 372 9 3
22 17825583 17825839 trf 372 10 3
Now, I can compute the mean by grouping on columns 1-5 of the output (the orig. BED entry) and computing the mean on column 7 for each grouped interval (the last column in the output below is the mean):
$ coverageBed -abam aln.bam -b simRep.22.bed -d | groupBy -g 1,2,3,4,5 -c 7 -o mean | head
22 17825583 17825839 trf 372 2.96875
22 16506905 16558419 trf 20099 7.6543852156695271205
22 16506905 16558419 trf 23903 7.6543852156695271205
22 16506941 16537880 trf 14401 7.47002165551569241586
22 16515001 16516375 trf 832 4.88427947598253275885
22 17039134 17049375 trf 5493 6.38755980861244054836
22 17039134 17049375 trf 4544 6.38755980861244054836
22 17039134 17049375 trf 5071 6.38755980861244054836
22 18087713 18088319 trf 759 5.42904290429042912791
22 19529662 19529964 trf 469 5.91721854304635730415
Note that coverageBed does not maintain the order of the intervals in the original file, so you might want to resort the output first:
$ coverageBed -abam aln.bam -b simRep.22.bed -d | sort -k1,1 -k2,2n | groupBy -g 1,2,3,4,5 -c 7 -o mean
> In addition is there any way to
> adjust the minimum number of hits at a location for it to be included in
> the fraction of bases in B that had non-zero coverage into bases in B
> that had more-than-X coverage?
There is no way to do this except by taking the raw output of the -d or -hist and computing this yourself.
Best,
Aaron
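Following Aaron's do-it-yourself pointer, here is a quick awk sketch over the -d output (column 7 is the per-base depth, as above; the threshold X=10 is an arbitrary placeholder):
$ coverageBed -abam aln.bam -b simRep.22.bed -d \
  | awk -v X=10 '{ k = $1"\t"$2"\t"$3; tot[k]++; if ($7 > X) hit[k]++ }
                 END { for (k in tot) print k "\t" (hit[k] + 0) / tot[k] }'
This prints, for each interval, the fraction of its bases with depth greater than X.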
Saturday, 17 September 2011
FAQ - Howto do RNA-seq Bioinformatics analysis on Galaxy
One of the top questions posted in the Galaxy User mailing list; I've reposted the summary links here for convenience.
Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")
http://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
FAQ to help with troubleshooting (if needed):
http://usegalaxy.org/u/jeremy/p/transcriptome-analysis-faq
For visualization, an update that allows the use of a user-specified
FASTA reference genome is coming out very soon. For now, you can view
annotation by creating a custom genome build, but the actual reference
will not be included. Use "Visualization -> New Track Browser" and
follow the instructions for "Is the build not listed here? Add a Custom
Build".
Help for using the tool is available here:
http://galaxyproject.org/Learn/Visualization
Currently, RNA-seq analysis for SOLiD data is available only on Galaxy test server:
http://test.g2.bx.psu.edu/
Please note that there are quotas associated with the test server:
http://galaxyproject.org/wiki/News/Galaxy%20Public%20Servers%20Usage%20Quotas
[Credit: Jennifer Jackson]
http://usegalaxy.org
http://galaxyproject.org/Support
Another helpful resource (non-Galaxy related though) is
http://seqanswers.com/wiki/How-to/RNASeq_analysis written by Matthew Young
and the discussion on this wiki @ seqanswers
http://seqanswers.com/forums/showthread.php?t=7068
As well as this review paper in Genome Biology: RNA-seq Review.
Stephen mentions this tutorial as well in his blog: RNA seq analysis workflow on Galaxy (Bristol workflow).
Dr David Matthews has posted a starter thread to discuss an RNA-seq analysis workflow for paired-end sequencing with TopHat on Galaxy in the mailing list.
His post and the discussion thread are here:
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404
kevin: waiting for the next common question to come: is there Ion Torrent support on Galaxy?
Monday, 4 July 2011
genomeCoverageBed to look at coverage of your WGS
BEDTools is a very useful set of programs for looking at your NGS sequencing results.
One that I use regularly is:
$ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv
so you only need your BAM file and a genome file (a tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome).
This file can be generated automatically with
$ samtools faidx reference_genome.fasta
which generates a .fai file containing the info required. (I very nearly went and wrote a BioPerl script to generate this; luckily I remembered the contents of the .fai file.)
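For the record, a minimal sketch of that shortcut; the .fai index is tab-delimited with contig name and length in its first two columns, which is exactly what the genome file needs:
$ samtools faidx reference_genome.fasta
$ cut -f 1,2 reference_genome.fasta.fai > genome.csv
$ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv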
Update:
Do read the helpful comments by Michael on using samtools idxstats on the BAM file, and the shortcomings of looking at coverage this way.
He has a helpful recent post here: Accurate genome-wide read depth calculation (how-to).
I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacterial reseq project. I like the coverage plot in IGV but am not sure if there are open-source tools to do the same.
Any tips to share?
Other references
Discussion at BioStar
Thursday, 9 June 2011
BlueSEQ- an excellent place to start understanding NGS
The folks at BlueSEQ have nicely summarized which applications each platform is most suited for in an easy-to-understand table. (Your views may vary, especially if you are aligned with any of the platforms.)
There is also a glossary that is helpful in explaining terms to newbies (read: I will point students to this link instead of having to explain it myself).
Alerted to this resource by this post: Goldmine of unbiased expert knowledge on next generation sequencing.
Monday, 25 April 2011
Webinar Introduction to Ion Torrent Informatics
Webinar by Life Technologies
Introduction to Ion Torrent Informatics
4 May 2011
Register via this web link
Event Title: Introduction to Ion Torrent Informatics
Event Description: The Ion Torrent semiconductor sequencing platform includes a preconfigured Torrent Server that processes data from the Ion PGM™ Sequencer. With each semiconductor sequencing run, data analysis occurs on the Torrent Server. Scientists can interact with the Torrent Server through the remotely accessible Torrent Browser web interface. Both processing status and run performance are easily viewable through these web pages. From Torrent Browser, detailed analysis reports can be viewed or sequencing data can be downloaded to your local computer for downstream analysis. The Torrent Suite Software formats base call and alignment data using industry-standard data formats, giving users the flexibility to use a wide variety of analysis tools. For scientists who are looking for analysis solutions, several downstream software packages are available and will be briefly demonstrated to show how DNA variations can be identified.
Monday, 11 October 2010
Installing SOLiD™ System de novo Accessory Tools 2.0 with Velvet and MUMmer
howto install on CentOS 5.4
download the example data to run through the pipeline
http://solidsoftwaretools.com/gf/project/ecoli50x50/
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_F3.qual.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.csfasta.gz
http://download.solidsoftwaretools.com/denovo/ecoli_600x_R3.qual.gz
Description
This is a 50x50 mate-pair library from DH10B produced by the SOLiD™ system. The set includes .csfasta and .qual files for F3 and R3. The insert size of the library is 1300 bp and it is about 600x coverage of the DH10B genome. The results from the MP library in the de novo documents are generated from this dataset.
YMMV


Next step:
wget http://solidsoftwaretools.com/gf/project/denovo/ #just to keep a record
wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_0.7.55.tgz
wget http://downloads.sourceforge.net/project/mummer/mummer/3.22/MUMmer3.22.tar.gz
tar zxvf denovo2.tgz
cp velvet_0.7.55.tgz denovo2 #you can use mv if you don’t mind downloading again
cp MUMmer3.22.tar.gz denovo2
cd denovo2
tar zxvf velvet_0.7.55.tgz
tar zxvf MUMmer3.22.tar.gz
cd MUMmer3.22/src/kurtz #this was the part where I deviated from instructions
gmake mummer #Might be redundant but running gmake at root dir gave no binary
gmake |tee gmake-install.log
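To sanity-check the Velvet build before wiring it into the accessory-tools pipeline, a hypothetical smoke test (assuming velveth/velvetg were built in the velvet_0.7.55 directory; the reads file is a placeholder, and the colorspace-specific invocation is covered by the de novo documents):
$ ./velvet_0.7.55/velveth test_assembly 21 -fasta -short example_reads.fa
$ ./velvet_0.7.55/velvetg test_assembly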
Wednesday, 25 August 2010
howto do BWA mapping in colorspace
Here's what I use for BWA alignment (without removing PCR duplicates).
You can replace the paths with your own and put it all into a bash script for automation.
Comments or corrections welcome!
#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta
#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample
#aln using 4 threads
#-l 25 seed length
#-k 2 mismatches allowed in seed
#-n 10 total mismatches allowed
bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai
#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam
#creates bam file from pre-generated .fai file
samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam
#sorts bam file
samtools sort Sample.bwa.hg18.sam.bam{,.sorted}
#From a sorted BAM alignment, raw SNP and indel calls are acquired by:
samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup
#resultant output should be further filtered by:
/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup
#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
Do note the helpful comments below! Reposted here for clarity.
Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.
Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.
Also:
1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".
2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ...". The -T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).
3) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.
4) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)
5) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.
6) View the alignments with samtools tview.
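Putting the commenters' suggestions together, a hedged sketch of a modified aln/samse step, with the same placeholder paths as above (flags per samtools 0.1.x, where -S marks SAM input and -T supplies the reference):
bwa aln -c -q 10 -t 4 -l 25 -k 2 -n 3 -e 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.sai
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.sai Sample.single.fastq.gz | samtools view -bST /data/public/hg18.fasta -F 4 -q 10 - > Sample.filtered.bam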
Thursday, 11 March 2010
Compiling BFAST and DNAA in CentOS 5.4
Update finally got it to work.
#prereqs
GNU Autoconf version 2.59
GNU Automake version 1.9.6
GNU Libtool version 1.5.22
also requires bzlib.h, found in the bzip2 development package (libbz2-dev on Debian); in CentOS 5.4 it's bzip2-devel-1.0.3-4.el5_2.x86_64.rpm
#bfast prereq
download bfast source
tar zxvf bfast-0.6.3c.tar.gz
sh autogen.sh
./configure
make
#samtools source
download samtools source
tar jxvf samtools-0.1.7a.tar.bz2
#install ncurses-devel (needed for tview) via yum
/var/cache/yum/base/packages/ncurses-devel-5.5-24.20060715.x86_64.rpm
/var/cache/yum/base/packages/ncurses-devel-5.5-24.20060715.i386.rpm
cd samtools-0.1.7a
make
git clone git://dnaa.git.sourceforge.net/gitroot/dnaa/dnaa
#symbolic links to the bfast and samtools dirs, created in the dnaa dir and one level up (.. relative to the dnaa dir)
cd /home/username/bin/source/dnaa/dnaa
ln -s /home/username/bin/source/bfast-0.6.3c/ bfast
ln -s /home/username/bin/source/samtools/samtools-0.1.7a samtools
cd ..
ln -s /home/username/bin/source/bfast-0.6.3c/ bfast
ln -s /home/username/bin/source/samtools/samtools-0.1.7a samtools
cd /home/username/bin/source/dnaa/dnaa
sh autogen.sh
./configure
make
Update: used checkinstall to create an rpm package so it's easier for me to uninstall and recompile updates. With checkinstall 1.6.2 I had to softlink a library:
ln -s /usr/local/lib/installwatch.so /usr/local/lib64/installwatch.so
For bfast the install method is now:
tar zxvf bfast-*.tar.gz
cd bfast-*
sh autogen.sh
./configure
make
sudo checkinstall
rpm -ivv bfast-0.6.4a-1.x86_64.rpm
Friday, 29 January 2010
gawk Sequence Alignment/Map (SAM) Format files for mapping info
There's lots of stuff you can do with sed, awk, and grep in Linux.
Below are a few commands, modified from the samtools help list, that will help others trying to sieve mapping info out of SAM files.
Basically, it works by looking at column 2 of the SAM file for the FLAG to see whether the read is mapped or unmapped. See the SAM format specification for more info.
#extract unmapped reads from sam file
gawk '!/^@/&&and($2,4)' aln.sam > aln.sam.unmapped.reads
#extracts readname & name of seq in reference where read is mapped
gawk '!/^@/&&!and($2,4){print $1,$3}' aln.sam > aln.sam.mapped.reads
#counts the number of reads mapped to seq in reference
gawk '{print $3}' aln.sam.mapped.reads|sort |uniq -c |sort -n > aln.sam.mapped.reads.stats
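In the same spirit, a one-pass sketch that tallies mapped vs unmapped reads (FLAG bit 0x4 set means unmapped):
#count mapped and unmapped reads in one pass
gawk '!/^@/{if(and($2,4)) u++; else m++} END{print "mapped:", m+0, "unmapped:", u+0}' aln.sam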
Thursday, 17 December 2009
ABySS: A parallel assembler for short read sequence data
Update: new version available; reinstall in CentOS:
sudo yum install openmpi
./configure --prefix=/opt/ABySS --with-mpi=/usr/lib/openmpi CPPFLAGS=-I/usr/include && make
sudo make install
Trying to install this for de novo transcriptome assembly.
checking how to run the C++ preprocessor... /lib/cpp
configure: error: in `/home/k/bin/source/abyss-1.0.16':
configure: error: C++ preprocessor "/lib/cpp" fails sanity check
See `config.log' for more details.
From config.log:
configure:6628: /lib/cpp conftest.cpp
cpp: error trying to exec 'cc1plus': execvp: No such file or directory
configure:6628: $? = 1
configure: failed program was:
| /* confdefs.h */
| Syntax error
Had the above error, then tried
yum install gcc-c++
fingers crossed.
Part 2 of install woes: well, the above worked, but I got this at the end of configure:
warning: ABySS should be compiled with Google sparsehash to
reduce memory usage. It may be downloaded here:
http://code.google.com/p/google-sparsehash
Hmmm, you would think they would mention this in the first line of the README.
Two options I want:
If you wish to build the parallel assembler with MPI support,
MPI should be found in /usr/include and /usr/lib or its location
specified to configure:
./configure --with-mpi=/usr/lib/openmpi && make
ABySS should be built using Google sparsehash to reduce memory usage,
although it will build without. Google sparsehash should be found in
/usr/include or its location specified to configure:
./configure CPPFLAGS=-I/usr/local/include
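After a successful install, a hypothetical smoke test with the abyss-pe driver (the read files and k value here are placeholders):
/opt/ABySS/bin/abyss-pe k=25 name=test in='reads_1.fa reads_2.fa'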