Kevin's GATTACA World: colorspace

Tuesday, 16 November 2010

Uniqueome a uniquely ... omics word

Spotted this post on the Tree of Life blog

Another good paper, but bad omics word of the day: uniqueome

From "The Uniqueome: A mappability resource for short-tag sequencing", Ryan Koehler, Hadar Issac, Nicole Cloonan, and Sean M. Grimmond. Bioinformatics (2010) doi: 10.1093/bioinformatics

Paper does look interesting though!

Summary: Quantification applications of short-tag sequencing data (such as CNVseq and RNAseq) depend on knowing the uniqueness of specific genomic regions at a given threshold of error. Here we present the “uniqueome”, a genomic resource for understanding the uniquely mappable proportion of genomic sequences. Pre-computed data is available for human, mouse, fly, and worm genomes in both color-space and nucleotide-space, and we demonstrate the utility of this resource as applied to the quantification of RNAseq data.

Availability: Files, scripts, and supplementary data is available from http://grimmond.imb.uq.edu.au/uniqueome/; the ISAS uniqueome aligner is freely available from http://www.imagenix.com/

Monday, 8 November 2010

Trimming adaptor seq in colorspace (SOLiD)

Needed to do some research on small RNA-seq using SOLiD.
I wasn't clear on the adaptor trimming procedure (it's dead easy with basespace fastq files, but oh well; SOLiD reads have directionality, and read length doesn't really matter for small RNA).

novoalign suggests cutadapt as a colorspace adaptor trimming tool.
I was going to script one in Python if it didn't exist.
Check their wiki page.

Sadly, on CentOS I will most probably run into this:

If you get this error:

   File "./cutadapt", line 62
    print("# There are %7d sequences in this data set." % stats.n, file=outfile)
                                                                       ^
SyntaxError: invalid syntax

Then your Python is too old. At least Python 2.6 is needed for cutadapt.

Have to dig up how to run two versions of Python side by side on a CentOS box.
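
Before going down that road, it helps to check which Python a box actually has. Below is a rough bash sketch of a version comparison. Note that `version_at_least` is a helper name made up for this example (it is not part of cutadapt or CentOS), and it assumes plain dotted version numbers such as 2.4.3.

```shell
#!/usr/bin/env bash
# Sketch only: version_at_least is a hypothetical helper, not a standard tool.
# Usage: version_at_least HAVE NEED
# Succeeds (exit 0) when dotted version HAVE is >= NEED.
version_at_least() {
    local IFS=.                      # split the version strings on dots
    local -a have=($1) need=($2)
    local i h n
    for ((i = 0; i < ${#need[@]}; i++)); do
        h=${have[i]:-0}; n=${need[i]:-0}
        h=${h%%[!0-9]*}; n=${n%%[!0-9]*}   # drop suffixes such as "rc1" or "+"
        h=${h:-0}; n=${n:-0}
        if ((10#$h > 10#$n)); then return 0; fi
        if ((10#$h < 10#$n)); then return 1; fi
    done
    return 0                         # every compared field was equal
}

# Example (assumes "python -V" prints something like "Python 2.4.3";
# Python 2 writes it to stderr, hence the 2>&1):
#   ver=$(python -V 2>&1 | awk '{print $2}')
#   version_at_least "$ver" 2.6 || echo "Python $ver is too old for cutadapt" >&2
```

If the check fails, the script can bail out with a readable message instead of the SyntaxError above.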

Wednesday, 25 August 2010

howto do BWA mapping in colorspace

Here's what I use for bwa alignment (without removing PCR dups).
You can replace the paths with your own and put it into a bash script for automation.
Comments or corrections welcome!

#Visit kevin-gattaca.blogspot.com to see updates of this template!
#http://kevin-gattaca.blogspot.com/2010/08/howto-do-bwa-mapping-in-colorspace.html
#updated 16th Mar 2011
#Creates colorspace index
bwa index -a bwtsw -c hg18.fasta

#convert to fastq.gz
perl /opt/bwa-0.5.7/solid2fastq.pl Sample-input-prefix-name Sample

#aln using 4 threads
#-l 25        seed length
#-k 2         mismatches allowed in seed
#-n 10        maximum edit distance (mismatches plus gaps) over the whole read

bwa aln -c -t 4 -l 25 -k 2 -n 10 /data/public/bwa-color-index/hg18.fasta Sample.single.fastq.gz > Sample.bwa.hg18.sai

#for bwa samse
bwa samse /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sai Sample.single.fastq.gz > Sample.bwa.hg18.sam

#converts sam to bam using a pre-generated .fai file (samtools faidx hg18.fasta creates one)

samtools view -bt /data/public/hg18.fasta.fai -o Sample.bwa.hg18.sam.bam Sample.bwa.hg18.sam

#sorts bam file

samtools sort Sample.bwa.hg18.sam.bam{,.sorted}

#From a sorted BAM alignment, raw SNP and indel calls are acquired by:

samtools pileup -vcf /data/public/bwa-color-index/hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup

#resultant output should be further filtered by:

/opt/samtools/misc/samtools.pl varFilter Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup | awk '$6>=20' > Sample.bwa.hg18.sam.bam.sorted.bam.raw.pileup.final.pileup

#new section using mpileup and bcftools to generate vcf files
samtools mpileup -ugf hg18.fasta Sample.bwa.hg18.sam.bam.sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
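
Since the template is meant to be dropped into a bash script, here is one possible way to wrap the two alignment steps in a function so that only the sample prefix and reference path change between runs. This is a sketch, not a tested pipeline: `run_to` and `bwa_cs_align` are names invented for this example, the bwa options are simply the ones used above, and the `.bwa.hg18.` naming is kept from the template. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: run_to and bwa_cs_align are hypothetical helper names.

# run_to OUTFILE CMD...: run CMD with stdout redirected to OUTFILE,
# or just print the command line when DRY_RUN=1.
run_to() {
    local out=$1; shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$* > $out"
    else
        "$@" > "$out"
    fi
}

# bwa_cs_align SAMPLE REF: SAMPLE.single.fastq.gz -> SAMPLE.bwa.hg18.sam
# REF is the fasta that was indexed with "bwa index -a bwtsw -c".
bwa_cs_align() {
    local sample=$1 ref=$2
    local fq=$sample.single.fastq.gz
    local sai=$sample.bwa.hg18.sai
    local sam=$sample.bwa.hg18.sam
    # same options as the template: colorspace, 4 threads, seed 25, 2 seed mismatches
    run_to "$sai" bwa aln -c -t 4 -l 25 -k 2 -n 10 "$ref" "$fq" || return 1
    run_to "$sam" bwa samse "$ref" "$sai" "$fq" || return 1
}

# Example:
#   DRY_RUN=1 bwa_cs_align Sample /data/public/bwa-color-index/hg18.fasta
```

Running it with DRY_RUN=1 first is a cheap way to check the derived file names before committing hours of CPU time.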

Do note the helpful comments below! Reposted here for clarity.

Different anon here. But try -n 3 and -e 10 and see how that works for you. Then filter out low quality alignments (MAPQ < 10) before you do any variant calling.

Also, depending on your task, you might consider disabling seeding altogether to get an even more sensitive alignment. -l 1000 should do that.

Also:

1) bwa is a global aligner with respect to reads, so consider trimming low-quality bases off the end of your reads with "bwa aln -q 10".

2) For user comprehension, it's easier if you replace "samtools view -bt /data/public/hg18.fasta.fai ..." with "samtools view -bT /data/public/hg18.fasta ..."

The T option handles reference files directly rather than having to deal with a .fai index file (which you haven't told people how to create in this guide).

3) Use "samtools view -F 4 -q 10" to get rid of unaligned reads (which are still in double-encoded color space) and dodgy alignments.

4) Use "samtools calmd" to correct MD and NM tags. (However, I'm not sure if this is necessary/helpful.)

5) Use Picard's SortSam and MarkDuplicates to take care of PCR duplicates.

6) View the alignments with samtools tview.
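
To make the "-F 4 -q 10" suggestion above a bit more concrete, here is a rough awk equivalent that works on plain SAM text. It is only an illustration of what the two flags filter on (FLAG bit 0x4 and the MAPQ column); `sam_filter` is a made-up name, and unlike samtools view without -h it keeps the header lines. For real data, stick with samtools itself.

```shell
#!/usr/bin/env bash
# Sketch only: sam_filter is a hypothetical helper that mimics the effect of
# "samtools view -F 4 -q MINQ" on SAM text read from stdin.
# Usage: sam_filter [MINQ] < in.sam > out.sam   (MINQ defaults to 10)
sam_filter() {
    awk -F'\t' -v minq="${1:-10}" '
        /^@/                 { print; next }   # keep header lines
        int($2 / 4) % 2 == 1 { next }          # FLAG bit 0x4 set: read is unmapped
        $5 + 0 >= minq       { print }         # keep alignments with MAPQ >= minq
    '
}

# Example:
#   sam_filter 10 < Sample.bwa.hg18.sam > Sample.bwa.hg18.filtered.sam
```

Column 2 of a SAM record is the bitwise FLAG and column 5 is the mapping quality, which is all the two options look at.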
