This is rather detailed!
waiting for phase 2 to be released.
B. Alignment Process

The alignments themselves are produced by Richard Durbin's group at the Sanger Institute (Illumina and LS454) and David Craig's group at the Translational Genomics Research Institute (Solid). A different aligner is used for each of the 3 sequencing platforms, as described below. Alignments are carried out on 'run level' fastqs, that is, fastq files that share the same column 3 in the sequence.index (this value becomes the 'readgroup' in the eventual released bam files).

B1. Illumina

Aligned with bwa v0.5.5 in 3 steps:
1. bwa index -a bwtsw $ref
   (where $ref is the reference fasta file)
2. bwa aln -q 15 $ref $fq > $sai
   (for each fastq file $fq)
3. bwa sampe $ref @sais @fqs > $bam
   (for pairs of fastq files @fqs), or
   bwa samse $ref $sai $fq > $bam
   (for unpaired fastq files)

B2. LS454

Aligned with ssaha2 v2.5 in 5 steps:
1. ssaha2Build -skip 3 -save $ref_fa $ref_fa
   (where $ref_fa is the reference fasta file)
2. reads in fastq file $fq are filtered to remove those less than 30 bp in length
3. ssaha2 -disk -454 -output cigar -diff 10 -save $ref | tee |
4. the above pipe is filtered to get the top 10 hits per read and output to file
5. cigar output is then converted to sam format, taking into account whether the fastq was paired with another, and what the expected library insert size was

B3. Solid

Aligned with BFAST 0.6.3d:
1. merge into a single fastq file based on SRR ID
2. bfast match -f human_g1k_v37.fasta -r SRRxxxxxx.fastq -A 1 -K 8 -M 384 -n 8 -Q 25000 -T /tmp/ -t > SRRxxxxxx.bmf
3. bfast localalign -f human_g1k_v37.fasta -m SRRxxxxxx.bmf -A 1 -o 20 -n 8 -t > SRRxxxxxx.baf
4. bfast postprocess -f human_g1k_v37.fasta -i SRRxxxxxx.baf -r SRRxxxxxx.ReadGroup.txt -n 8 -Q 1000 -t > bfast.reported.file.SRRxxxxxx.sam
5. java -Xmx2g -jar /bin/picard/SortSam.jar I=SRRxxxxxx.sam O=SRRxxxxxx.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT

B4. Bam Improvement

The run-level alignment bams are improved in various ways to help increase the quality and speed of subsequent SNP calling that may be carried out on them.
1. reads undergo local realignment around known (pilot) indels using GATK
2. the resulting bams have their mate information fixed and are coordinate sorted by Picard
3. run-level bams have their read qualities recalibrated with GATK
4. samtools calmd -r is run, which fixes the NM tags and introduces BQ tags that can be used during SNP calling
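
As an illustration of the read-length filter in step 2 of B2 (LS454), here is a minimal Python sketch. It is not the script used in the actual pipeline, which is not shown in this document. The function names read_fastq and filter_short_reads are hypothetical, and the sketch assumes plain 4-line fastq records with no wrapped sequence lines. Only the 30 bp threshold comes from the text above.

```python
from typing import Iterable, Iterator, Tuple

# One fastq record: header, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]

# Threshold taken from step 2 of B2: reads shorter than 30 bp are removed.
MIN_READ_LENGTH = 30


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group lines into 4-line fastq records (assumes unwrapped sequences).

    A trailing incomplete record, if any, is silently dropped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # Passing the same iterator four times makes zip consume 4 lines per record.
    for header, seq, sep, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, sep, qual


def filter_short_reads(
    records: Iterable[FastqRecord], min_length: int = MIN_READ_LENGTH
) -> Iterator[FastqRecord]:
    """Keep only records whose sequence is at least min_length bases long."""
    for record in records:
        if len(record[1]) >= min_length:
            yield record


# Small in-memory example: a 29 bp read is dropped, a 30 bp read is kept.
example = [
    "@read1", "A" * 29, "+", "I" * 29,
    "@read2", "A" * 30, "+", "I" * 30,
]
kept = list(filter_short_reads(read_fastq(example)))
print([record[0] for record in kept])  # prints ['@read2']
```

To filter a real file, pass an open file handle to read_fastq in place of the example list. The sketch treats each fastq file on its own; how the pipeline kept paired files in step with each other after filtering is not described here, so that part is left out.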