Kevin's GATTACA World: 1000genomes phase1 alignment process

Friday, 17 August 2012

This README is rather detailed!

Still waiting for phase 2 to be released.

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/README.phase1_alignment_data

B. Alignment Process

The alignments themselves are produced by Richard Durbin's group at the Sanger
Institute (Illumina and LS454) and David Craig's group at The Translational
Genomics Institute (Solid). Different aligners are used for the 3 sequencing
platforms, described below. Alignments are carried out on 'run level' fastqs,
that is fastq files that share the same column 3 in the sequence.index (which
becomes known as the 'readgroup' in the eventual released bam files).
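
To make the 'run level' grouping concrete, here is a minimal sketch that collects fastq files sharing the same column 3. It assumes sequence.index is tab-delimited, with the fastq path in column 1 and an optional header line starting with FASTQ_FILE; the function name group_by_run is hypothetical and not part of the 1000 Genomes tooling.

```shell
# Group fastq files by run ID (column 3 of a tab-delimited sequence.index).
# Prints one line per run: the run ID, a tab, then the fastq files sharing it.
# Assumption: column 1 holds the fastq path; a header line is skipped if present.
group_by_run() {
    awk -F '\t' '
        NR == 1 && $1 == "FASTQ_FILE" { next }      # skip the header line
        {
            if ($3 in files) {
                files[$3] = files[$3] " " $1        # append to an existing run
            } else {
                order[++n] = $3                     # remember first-seen order
                files[$3] = $1
            }
        }
        END { for (i = 1; i <= n; i++) print order[i] "\t" files[order[i]] }
    ' "$1"
}
```

Calling group_by_run sequence.index would then give one readgroup per output line, which is the unit the aligners below work on.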

B1. Illumina

Aligned with bwa v0.5.5 in 3 steps:
1. bwa index -a bwtsw $ref (where $ref is the reference fasta file)
2. bwa aln -q 15 $ref $fq > $sai (for each fastq file $fq)
3. bwa sampe $ref @sais @fqs > $bam (for pairs of fastq files @fqs), or
bwa samse $ref $sai $fq > $bam (for unpaired fastq files)
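
The three steps above can be sketched as a small dry-run helper that prints the bwa commands for one run instead of executing them. The function name bwa_commands and the .sai naming are my own assumptions. Note that bwa sampe/samse write SAM, so the $bam in the listing presumably implies a later conversion step that the README does not spell out. The index step (bwa index -a bwtsw $ref) is assumed to have been run once already.

```shell
# Print (do not run) the bwa 0.5.x commands for one run-level set of fastqs.
# usage: bwa_commands REF OUT FQ1 [FQ2]
# Assumption: each .sai file is named after its fastq file.
bwa_commands() {
    ref=$1; out=$2; shift 2
    sais=""
    for fq in "$@"; do
        sai="${fq%.fastq}.sai"
        echo "bwa aln -q 15 $ref $fq > $sai"    # step 2, once per fastq
        sais="$sais $sai"
    done
    if [ "$#" -eq 2 ]; then
        echo "bwa sampe $ref$sais $* > $out"    # step 3, paired fastqs
    else
        echo "bwa samse $ref$sais $* > $out"    # step 3, unpaired fastq
    fi
}
```

For example, bwa_commands ref.fa run.sam r_1.fastq r_2.fastq prints two aln lines followed by one sampe line; piping the output to sh would actually run them.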

B2. LS454

Aligned with ssaha v2.5 in 5 steps:
1. ssaha2Build -skip 3 -save $ref_fa $ref_fa (where $ref is the reference
fasta file)
2. reads in fastq file $fq are filtered to remove those less than 30 bp in
length
3. ssaha2 -disk -454 -output cigar -diff 10 -save $ref | tee |
4. the above pipe is filtered to get the top 10 hits per read and output to file
5. cigar output is then converted to sam format, taking into account if the
fastq was paired with another, and what the expected library insert size
was
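
Step 2 (dropping reads shorter than 30 bp) is not shown as a command in the README. One way it could be done is sketched below; this is an illustration, not the script the Sanger group used. It assumes plain 4-line fastq records, and the function name filter_short_reads is hypothetical.

```shell
# Keep only fastq records whose sequence is at least MIN_LEN bases long.
# usage: filter_short_reads FASTQ [MIN_LEN]   (MIN_LEN defaults to 30)
# Assumption: every record is exactly 4 lines (header, sequence, +, quality).
filter_short_reads() {
    min_len=${2:-30}
    awk -v min="$min_len" '
        { rec[NR % 4] = $0 }                    # buffer the current record
        NR % 4 == 0 && length(rec[2]) >= min {  # record complete and long enough
            print rec[1]; print rec[2]; print rec[3]; print rec[0]
        }
    ' "$1"
}
```

Usage would be along the lines of filter_short_reads reads.fastq > reads.filtered.fastq before handing the file to ssaha2.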

B3. Solid

Aligned with BFAST 0.6.3d:
1. merge into single fastq file based on SRR ID
2. bfast match -f human_g1k_v37.fasta -r SRRxxxxxx.fastq -A 1 -K 8 -M 384 -n 8 -Q 25000 -T /tmp/ -t > SRRxxxxxx.bmf
3. bfast localalign -f human_g1k_v37.fasta -m SRRxxxxxx.bmf -A 1 -o 20 -n 8 -t > SRRxxxxxx.baf
4. bfast postprocess -f human_g1k_v37.fasta -i SRRxxxxxx.baf -r SRRxxxxxx.ReadGroup.txt -n 8 -Q 1000 -t > bfast.reported.file.SRRxxxxxx.sam
5. java -Xmx2g -jar /bin/picard/SortSam.jar I=SRRxxxxxx.sam O=SRRxxxxxx.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT
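
Since the same four commands are repeated for every SRR ID, they can be parameterized. The sketch below is a dry run that only prints the commands for a given run; the flags and file names are copied from the listing above as-is (including the differing .sam names in steps 4 and 5), and the function name bfast_commands is hypothetical.

```shell
# Print (do not run) the BFAST and Picard commands for one merged run.
# usage: bfast_commands SRR_ID [REF]   (REF defaults to human_g1k_v37.fasta)
# Flags and file names are taken verbatim from the README listing.
bfast_commands() {
    run=$1
    ref=${2:-human_g1k_v37.fasta}
    echo "bfast match -f $ref -r ${run}.fastq -A 1 -K 8 -M 384 -n 8 -Q 25000 -T /tmp/ -t > ${run}.bmf"
    echo "bfast localalign -f $ref -m ${run}.bmf -A 1 -o 20 -n 8 -t > ${run}.baf"
    echo "bfast postprocess -f $ref -i ${run}.baf -r ${run}.ReadGroup.txt -n 8 -Q 1000 -t > bfast.reported.file.${run}.sam"
    echo "java -Xmx2g -jar /bin/picard/SortSam.jar I=${run}.sam O=${run}.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT"
}
```

Looping over a list of run IDs and calling bfast_commands on each would print the full command set for a batch.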

B4. Bam Improvement

The run-level alignment bams are improved in various ways to help increase
the quality and speed of subsequent SNP calling that may be carried out on
them.
1. reads undergo local realignment around known (pilot) indels using GATK
2. resulting bams have their mate information fixed and are coordinate
sorted by Picard.
3. run-level bams have their read qualities recalibrated with GATK
4. samtools calmd -r is run, which fixes the NM tags and introduces BQ tags
which can be used during SNP calling.
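
Only step 4 names an explicit command, so that is the only one sketched here; the README does not give the GATK and Picard invocations for steps 1 to 3. The helper below prints a samtools calmd command for one bam. The -b flag (write BAM rather than SAM) and the .calmd.bam output name are my assumptions, and calmd_command is a hypothetical name.

```shell
# Print (do not run) the samtools calmd command for one run-level bam.
# usage: calmd_command IN_BAM REF
# -r computes the BQ tag as described in step 4; -b (BAM output) is assumed.
calmd_command() {
    in_bam=$1; ref=$2
    out_bam="${in_bam%.bam}.calmd.bam"          # assumed output naming
    echo "samtools calmd -rb $in_bam $ref > $out_bam"
}
```

For instance, calmd_command run.bam ref.fa prints the command that would write run.calmd.bam.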
