Kevin's GATTACA World

Monday, 18 November 2024

Connecting the Dots: Biology At Scale In The Age Of AI (Broad Institute)

Collected the videos in a playlist

Nov 9, 2024

SESSION 1 BRIDGING THE GAP: MODELS FOR BIOLOGICAL PREDICTION
Chair: Wengong Jin (MIT)
Hilary Finucane (Broad)
Mehrtash Babadi (Broad)
Keynote II: Alex Rives (Broad)
For more information visit: https://www.broadinstitute.org

Saturday, 16 November 2024

Meet Evo, the DNA-trained AI that creates genomes from scratch

https://www.science.org/content/article/meet-evo-dna-trained-ai-creates-genomes-scratch

ChatGPT, the famous artificial intelligence (AI) chatbot, can summarize Moby Dick, write computer code, and serve up a recipe for chicken à la king because it has much of the written information on the internet at its silicon fingertips. What if it could do the same for DNA?

That’s the advance behind a new study published today in Science. Researchers describe an AI model, schooled on billions of lines of genetic sequences, that can deduce how bacterial and viral genomes operate and use that information to design new proteins and even whole microbial genomes. The model, known as Evo, could help scientists probe evolution, investigate diseases, develop new treatments, and potentially answer a host of other biomedical questions.

“This work is extremely significant,” says computational biologist Arvind Ramanathan of Argonne National Laboratory, who wasn’t connected to the study. The tests the authors put Evo through, he says, provide “a great showcase of applications” for the AI.

Tuesday, 5 October 2021

HGVS nomenclature

Was recently asked about HGVS nomenclature reporting. The fun thing about biology is that there are always going to be exceptions to the rule, or some shenanigans that you didn't expect when you set out the rule.

"The Human Genome Variation Society (HGVS) provides standardized recommendations for describing human sequence variants, which are widely accepted in the scientific community, especially in the practice of clinical molecular pathology. Use of the HGVS nomenclature system is a de facto recommendation for clinical reporting of sequence variants. Being a core component of the clinical report, incorrect HGVS nomenclature can have a negative impact on patient care, such as misdiagnosis or clinical trial ineligibility. HGVS nomenclature has been traditionally computed manually by pathologists from Sanger sequencing electropherograms. However, manually computing HGVS nomenclature is time consuming, complex, and error prone, particularly with insertion and deletion (indel) variants, resulting in inconsistencies across laboratories."

Source: Clinical Implementation and Validation of Automated Human Genome Variation Society (HGVS) Nomenclature System for Next-Generation Sequencing–Based Assays for Cancer

In the 25th Anniversary Special Issue of Human Mutation, Den Dunnen et al. (2016) publish an update of the Human Genome Variation Society (HGVS) recommendations for the description of sequence variants (http://www.HGVS.org/varnomen). One of the issues discussed is how widespread HGVS nomenclature is used and, when used, whether published variant descriptions correctly follow the recommendations. An EGFR (OMIM# 131550) lung cancer testing scheme assessed in January 2016 by the United Kingdom National External Quality Assessment Scheme (UK NEQAS) for Molecular Genetics demonstrates the current variability in the use and interpretation of the HGVS guidelines by diagnostic laboratories based across the globe.

Source: HGVS Nomenclature in Practice: An Example from the United Kingdom National External Quality Assessment Scheme

Shall explore this tool

hgvs: A Python package for manipulating sequence variants using HGVS nomenclature: 2018 Update

https://github.com/biocommons/hgvs
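
To get a feel for what an HGVS description encodes before trying the package, here is a minimal sketch in plain Python. It is not the hgvs package's API: `parse_simple_substitution` is a hypothetical helper, and the regular expression is my own assumption that only covers simple single-base substitutions of the form `accession:c.123A>G` (also `g.`, `n.`, `m.`). It has no intronic offsets, UTR positions or indels, which is exactly where manual nomenclature gets error prone. The EGFR strings are there only as illustrations.

```python
import re
from typing import NamedTuple, Optional


class SimpleSubstitution(NamedTuple):
    accession: str   # reference sequence, e.g. a RefSeq transcript
    coord_type: str  # "c" (coding), "g" (genomic), "n" or "m"
    position: int
    ref: str
    alt: str


# Toy pattern (assumption): accession ":" type "." position REF ">" ALT
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Za-z0-9_.]+):(?P<coord_type>[cgnm])\."
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_substitution(description: str) -> Optional[SimpleSubstitution]:
    """Split a simple substitution into its parts; return None if it does not fit."""
    match = _SUBSTITUTION.match(description.strip())
    if match is None:
        return None
    return SimpleSubstitution(
        accession=match["accession"],
        coord_type=match["coord_type"],
        position=int(match["position"]),
        ref=match["ref"],
        alt=match["alt"],
    )


if __name__ == "__main__":
    # A substitution fits the toy pattern
    print(parse_simple_substitution("NM_005228.5:c.2573T>G"))
    # A deletion does not, so this prints None
    print(parse_simple_substitution("NM_005228.5:c.2235_2249del"))
```

Anything this sketch rejects (indels, duplications, intronic positions) is a reason to use a proper library rather than extend the regex.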

Tuesday, 28 September 2021

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers | Nature Methods

https://www.nature.com/articles/s41592-021-01254-9

The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not ...

https://github.com/GoekeLab/bioinformatics-workflows

GitHub - GoekeLab/bioinformatics-workflows: minimal example implementations for bioinformatics workflow managers

Workflow managers provide an easy and intuitive way to simplify pipeline development. Here we provide basic proof-of-concept implementations for selected workflow managers. The analysis workflow is based on a small portion of an RNA-seq pipeline, using fastqc for quality controls and salmon for ...

Monday, 27 September 2021

Orientation Bias artifact

strand bias and orientation bias – GATK (broadinstitute.org)

The read orientation artifact, also known as the orientation bias artifact, arises due to a chemical change in the nucleotide during library prep that results in, for example, G base-pairing with A. This kind of artifact has a clear signature (e.g. C to A SNP that occurs predominantly for the middle C in the DNA sequence CCG), and it's single-stranded in nature. Downstream, this artifact manifests as low allele fraction SNPs whose evidence for the alt allele consists almost entirely of F1R2 reads or F2R1 reads. A read pair is F1R2 (forward 1st, reverse 2nd) if the sequence of bases in Read 1 maps to the forward strand of the reference (F1), and the sequence of Read 2 to the reverse strand of the reference (R2). F2R1 is defined similarly.
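
The F1R2 / F2R1 definition above can be sketched from the SAM flag alone. This is only a rough illustration, assuming properly paired, inward-facing reads; `pair_orientation` and `orientation_counts` are hypothetical helper names, and a simple tally like this is not GATK's orientation bias model. The flag bit values come from the SAM specification.

```python
from collections import Counter
from typing import Iterable

# SAM flag bits (SAM specification)
FLAG_PAIRED = 0x1
FLAG_REVERSE = 0x10
FLAG_READ1 = 0x40
FLAG_READ2 = 0x80


def pair_orientation(flag: int) -> str:
    """Return "F1R2" or "F2R1" for one read of a pair, given its SAM flag."""
    if not flag & FLAG_PAIRED:
        raise ValueError("orientation is only defined for paired reads")
    is_read1 = bool(flag & FLAG_READ1)
    is_read2 = bool(flag & FLAG_READ2)
    if is_read1 == is_read2:
        raise ValueError("exactly one of the READ1/READ2 bits must be set")
    is_reverse = bool(flag & FLAG_REVERSE)
    # Read 1 on the forward strand, or Read 2 on the reverse strand: F1R2.
    # Read 1 on the reverse strand, or Read 2 on the forward strand: F2R1.
    return "F1R2" if is_read1 != is_reverse else "F2R1"


def orientation_counts(flags: Iterable[int]) -> Counter:
    """Tally orientations, e.g. over the reads supporting an alt allele."""
    return Counter(pair_orientation(flag) for flag in flags)


if __name__ == "__main__":
    # Flags of five made-up alt-supporting reads
    alt_read_flags = [99, 99, 147, 99, 83]
    print(orientation_counts(alt_read_flags))
```

If nearly all alt-supporting reads at a low allele fraction site fall into one of the two classes, that is the signature described in the quote.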

If you have read the Illumina DRAGEN Bio-IT user guide, it only mentions orientation bias and does not cover strand bias.

Friday, 17 September 2021

Benchmarking variants and comparing truth sets: List of useful tools and publications

Just realised that, other than vcf-compare and bedtools intersect, there are other options:

https://github.com/RealTimeGenomics/rtg-tools

https://github.com/Illumina/hap.py
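
For intuition about what a truth-set comparison computes, here is a naive sketch in plain Python. It assumes variants match only when chromosome, position, REF and ALT are identical; it ignores genotypes, confident regions and differences in variant representation, which is what tools such as rtg-tools and hap.py are built to handle. `load_variants` and `benchmark` are hypothetical helper names, and the VCF lines are made up.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def load_variants(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, one per ALT allele."""
    variants: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            variants.add((chrom, int(pos), ref, alt))
    return variants


def benchmark(truth: Set[Variant], query: Set[Variant]) -> Dict[str, float]:
    """Exact-match comparison of a query call set against a truth set."""
    tp = len(truth & query)   # called and in the truth set
    fp = len(query - truth)   # called but not in the truth set
    fn = len(truth - query)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth_lines = ["#CHROM\tPOS\tID\tREF\tALT",
                   "chr1\t100\t.\tA\tG",
                   "chr1\t200\t.\tC\tT",
                   "chr2\t50\t.\tG\tA"]
    query_lines = ["chr1\t100\t.\tA\tG",
                   "chr1\t200\t.\tC\tT,G"]
    print(benchmark(load_variants(truth_lines), load_variants(query_lines)))
```

The same indel can be written in more than one way in a VCF, so exact matching like this will undercount true positives on real data.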

Also, there are actually new variant callers...

Molina-Mora, J.A., Solano-Vargas, M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 22, 20 (2021). https://doi.org/10.1186/s12859-020-03926-3

Krishnan, V., Utiramerur, S., Ng, Z. et al. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 22, 85 (2021). https://doi.org/10.1186/s12859-020-03934-3

Additional file 23: File 3. verify_variants.py

Zook, Justin M et al. “An open resource for accurately benchmarking small variant and reference calls.” Nature biotechnology vol. 37,5 (2019): 561-566. doi:10.1038/s41587-019-0074-6

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`

hgvs.readthedocs.io/

hgvs: A Python package for manipulating sequence variants using HGVS nomenclature: 2018 Update.
Wang M, Callenberg KM, Dalgleish R, Fedtsov A, Fox NK, Freeman PJ, Jacobs KB, Kaleta P, McMurry AJ, Prlić A, Rajaraman V, Hart RK. Hum Mutat. 2018 Dec;39(12):1803-1813. doi: 10.1002/humu.23615. Epub 2018 Sep 5. PMID: 30129167

Sequence Variant Descriptions: HGVS Nomenclature and Mutalyzer.
den Dunnen JT. Curr Protoc Hum Genet. 2016 Jul 1;90:7.13.1-7.13.19. doi: 10.1002/cphg.2. PMID: 27367167

A Python package for parsing, validating, mapping and formatting sequence variants using HGVS nomenclature.
Hart RK, Rico R, Hare E, Garcia J, Westbrook J, Fusaro VA. Bioinformatics. 2015 Jan 15;31(2):268-70. doi: 10.1093/bioinformatics/btu630. Epub 2014 Sep 30. PMID: 25273102

VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions.
Freeman PJ, Hart RK, Gretton LJ, Brookes AJ, Dalgleish R. Hum Mutat. 2018 Jan;39(1):61-68. doi: 10.1002/humu.23348. Epub 2017 Oct 17. PMID: 28967166

Clinical Implementation and Validation of Automated Human Genome Variation Society (HGVS) Nomenclature System for Next-Generation Sequencing-Based Assays for Cancer.
Callenberg KM, Santana-Santos L, Chen L, Ernst WL, De Moura MB, Nikiforov YE, Nikiforova MN, Roy S. J Mol Diagn. 2018 Sep;20(5):628-634. doi: 10.1016/j.jmoldx.2018.05.006. Epub 2018 Jun 21. PMID: 29936258

Thursday, 15 July 2021

Running Kraken2 and creating a Krona report

Had to work with Ion Torrent BAMs for this but I think it's applicable to everything

Needed to run this on unmapped reads so running this first.

After that the next script is fairly simple

Will share the install steps when I have time. A major hiccup for me was realising that not all pre-built databases work with Kraken2.
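
The scripts themselves are not reproduced in this post, so what follows is not them. It is only a rough sketch of how the three steps (unmapped reads to FASTQ, Kraken2, Krona) might be chained, written as a small Python helper that builds the command lines. It assumes single-end reads (as with Ion Torrent) and that samtools, kraken2 and KronaTools are installed with the Krona taxonomy already downloaded. All file names, the database path and the helper function names are placeholders.

```python
import shlex
import subprocess
from typing import List


def unmapped_fastq_cmd(bam: str, fastq: str) -> List[str]:
    # -f 4 keeps only reads with the "unmapped" flag set.
    # -0 receives reads with neither READ1 nor READ2 set, i.e. single-end reads.
    return ["samtools", "fastq", "-f", "4", "-0", fastq, bam]


def kraken2_cmd(db: str, fastq: str, output: str, report: str, threads: int = 4) -> List[str]:
    # --output is the per-read classification, --report the per-taxon summary.
    return ["kraken2", "--db", db, "--threads", str(threads),
            "--output", output, "--report", report, fastq]


def krona_cmd(kraken_output: str, html: str) -> List[str]:
    # In Kraken2's per-read output, column 2 is the read ID and column 3 the taxid.
    return ["ktImportTaxonomy", "-q", "2", "-t", "3", kraken_output, "-o", html]


def build_pipeline(bam: str, db: str, prefix: str) -> List[List[str]]:
    fastq = f"{prefix}.unmapped.fastq"
    output = f"{prefix}.kraken2.out"
    report = f"{prefix}.kraken2.report"
    return [
        unmapped_fastq_cmd(bam, fastq),
        kraken2_cmd(db, fastq, output, report),
        krona_cmd(output, f"{prefix}.krona.html"),
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # Placeholder paths; print the commands rather than running them.
    for command in build_pipeline("sample.bam", "/path/to/kraken2_db", "sample"):
        print(shlex.join(command))
```

Printing the commands first makes it easy to check them by hand before calling `run_pipeline`.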
