Friday, 17 August 2012

1000genomes phase1 alignment process

This is rather detailed! Waiting for phase 2 to be released.

B. Alignment Process

The alignments themselves are produced by Richard Durbin's group at the Sanger
Institute (Illumina and LS454) and David Craig's group at The Translational
Genomics Institute (Solid). Different aligners are used for the 3 sequencing
platforms, described below. Alignments are carried out on 'run level' fastqs,
that is fastq files that share the same column 3 in the sequence.index (which
becomes known as the 'readgroup' in the eventual released bam files).
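
As a rough illustration of what "run level" means, the sketch below groups fastq paths by the value in column 3 of a sequence.index file. It assumes the file is tab-separated, that column 1 holds the fastq path, and that the first line is a header; the function name `group_by_run` is made up for this example and is not part of the project's pipeline.

```shell
# Group fastq paths by run ID (column 3 of sequence.index).
# Assumptions: tab-separated input, fastq path in column 1, header on line 1.
# group_by_run is a hypothetical helper name.
group_by_run() {
    awk -F'\t' '
        NR > 1 {
            if ($3 in files) files[$3] = files[$3] " " $1
            else             files[$3] = $1
        }
        END { for (run in files) print run "\t" files[run] }
    ' "$1" | sort
}
```

Each output line is a run ID followed by the fastq files that would be aligned together and later share a readgroup.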

    B1. Illumina

    Aligned with bwa v0.5.5 in 3 steps:
    1. bwa index -a bwtsw $ref (where $ref is the reference fasta file)
    2. bwa aln -q 15 $ref $fq > $sai (for each fastq file $fq)
    3. bwa sampe $ref @sais @fqs > $bam (for pairs of fastq files @fqs), or
        bwa samse $ref $sai $fq > $bam (for unpaired fastq files)
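
The per-run part of the steps above (2 and 3) can be sketched as a small dry-run helper that prints the bwa commands instead of running them. The function name `bwa_commands` and the `.sai`/`.sam` output names are assumptions made for this example. Note that bwa sampe and samse write SAM to standard output, so a conversion to BAM would still be needed afterwards; that step is left out here.

```shell
# Print (do not run) the bwa commands for one run.
# $1 = reference fasta (already indexed with: bwa index -a bwtsw $ref)
# $2 [$3] = one fastq (unpaired) or two fastqs (paired)
# bwa_commands and the derived output names are hypothetical.
bwa_commands() {
    ref=$1
    shift
    sais=""
    for fq in "$@"; do
        sai="${fq%.fastq}.sai"
        echo "bwa aln -q 15 $ref $fq > $sai"
        sais="$sais $sai"
    done
    sam="${1%.fastq}.sam"
    if [ "$#" -eq 2 ]; then
        echo "bwa sampe $ref$sais $1 $2 > $sam"
    else
        echo "bwa samse $ref$sais $1 > $sam"
    fi
}
```

For example, `bwa_commands ref.fa run_1.fastq run_2.fastq` prints two `bwa aln` lines followed by one `bwa sampe` line; piping the output to `sh` would execute them.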

    B2. LS454

    Aligned with ssaha v2.5 in 5 steps:
    1. ssaha2Build -skip 3 -save $ref_fa $ref_fa (where $ref is the reference
        fasta file)
    2. reads in fastq file $fq are filtered to remove those less than 30 bp in
        length
    3. ssaha2 -disk -454 -output cigar -diff 10 -save $ref | tee |
    4. the above pipe is filtered to get the top 10 hits per read and output to file
    5. cigar output is then converted to sam format, taking into account if the
        fastq was paired with another, and what the expected library insert size
        was
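
The read-length filter in step 2 could look something like the sketch below. It is not the script the project used; it assumes strict 4-line fastq records (no wrapped sequence lines), and the function name `filter_short_reads` is invented for this example.

```shell
# Keep only fastq records whose sequence is at least $2 bases long
# (default 30, matching the 30 bp threshold described above).
# Assumption: every record is exactly 4 lines. Hypothetical helper name.
filter_short_reads() {
    awk -v min="${2:-30}" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            if (length(seq) >= min)
                print header "\n" seq "\n" plus "\n" $0
        }
    ' "$1"
}
```

Calling `filter_short_reads reads.fastq > reads.filtered.fastq` would write only the records of 30 bp or more; the optional second argument overrides the threshold.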

    B3. Solid

    Aligned with BFAST 0.6.3d:
    1. merge into single fastq file based on SRR ID
    2. bfast match -f human_g1k_v37.fasta -r SRRxxxxxx.fastq -A 1 -K 8 -M 384 -n 8 -Q 25000 -T /tmp/ -t > SRRxxxxxx.bmf
    3. bfast localalign -f human_g1k_v37.fasta -m SRRxxxxxx.bmf -A 1 -o 20 -n 8 -t > SRRxxxxxx.baf
    4. bfast postprocess -f human_g1k_v37.fasta -i SRRxxxxxx.baf -r SRRxxxxxx.ReadGroup.txt -n 8 -Q 1000 -t > bfast.reported.file.SRRxxxxxx.sam
    5. java -Xmx2g -jar /bin/picard/SortSam.jar I=SRRxxxxxx.sam O=SRRxxxxxx.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT

    B4. Bam Improvement

    The run-level alignment bams are improved in various ways to help increase
    the quality and speed of subsequent SNP calling that may be carried out on
    them:
    1.  reads undergo local realignment around known (pilot) indels using GATK
    2.  resulting bams have their mate information fixed and are coordinate
        sorted by Picard.
    3.  run-level bams have their read qualities recalibrated with GATK
    4.  samtools calmd -r is run, which fixes the NM tags and introduces BQ tags
        which can be used during SNP calling.
