Showing posts with label bioinformatics. Show all posts

Tuesday, 28 September 2021

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers | Nature Methods


 


https://github.com/GoekeLab/bioinformatics-workflows    


Friday, 17 September 2021

Benchmarking variants and comparing truth sets: List of useful tools and publications

Just realised that, other than vcf-compare and bedtools intersect, there are other options:

 

https://github.com/RealTimeGenomics/rtg-tools

https://github.com/Illumina/hap.py

 

Also, there are actually new variant callers and benchmarking papers:

Molina-Mora, J.A., Solano-Vargas, M. Set-theory based benchmarking of three different variant callers for targeted sequencing. BMC Bioinformatics 22, 20 (2021). https://doi.org/10.1186/s12859-020-03926-3
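The set-theory approach in the paper above boils down to intersecting call sets. As a toy illustration (my own sketch, not the paper's code; note that serious comparators like hap.py first normalise representations and haplotype-match, which naive exact matching ignores), keying each variant by (chrom, pos, ref, alt):

```python
def compare_callsets(truth, query):
    """Set-theory comparison of two variant call sets.

    Each variant is a (chrom, pos, ref, alt) tuple; exact-match only."""
    truth, query = set(truth), set(query)
    tp = truth & query          # called in both: true positives
    fp = query - truth          # called only in query: false positives
    fn = truth - query          # missed truth calls: false negatives
    return {
        "TP": len(tp), "FP": len(fp), "FN": len(fn),
        "precision": len(tp) / len(query) if query else 0.0,
        "recall": len(tp) / len(truth) if truth else 0.0,
    }
```

Exact tuple matching counts, say, differently left-aligned indels as mismatches, which is exactly why the dedicated tools above exist.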

 

Krishnan, V., Utiramerur, S., Ng, Z. et al. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 22, 85 (2021). https://doi.org/10.1186/s12859-020-03934-3

Additional file 23: File 3, verify_variants.py

 

 

Zook, Justin M et al. “An open resource for accurately benchmarking small variant and reference calls.” Nature Biotechnology 37, 5 (2019): 561-566. doi:10.1038/s41587-019-0074-6



Also, the `hgvs` Python library parses, formats, validates, normalizes, and maps sequence variants: `pip install hgvs`
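To give a flavour of the notation the library handles, here is a toy regex that splits a simple RefSeq-style HGVS string into accession, coordinate type and position/edit (illustration only; the real `hgvs` package does full grammar-based parsing, validation and transcript mapping):

```python
import re

# toy pattern: RefSeq-style accession, one coordinate-type letter, then the edit
HGVS_RE = re.compile(r"^(?P<ac>[A-Z]{2}_\d+\.\d+):(?P<type>[cgmnpr])\.(?P<posedit>.+)$")

def parse_simple_hgvs(s):
    m = HGVS_RE.match(s)
    if not m:
        raise ValueError(f"not a recognisable HGVS string: {s}")
    return m.groupdict()
```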



Thursday, 15 July 2021

Running Kraken2 and creating a Krona report

Had to work with Ion Torrent BAMs for this, but I think it's applicable to any BAM input.

I needed to run Kraken2 on the unmapped reads, so that extraction step runs first.

After that, the next script is fairly simple.

Will share the install notes when I have time. A major hiccup for me was realising that not all pre-built databases work with Kraken2.
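The glue between Kraken2 and Krona is small. A minimal sketch of mine (assuming Kraken2's default per-read output, where column 1 is the C/U classification flag, column 2 the read ID and column 3 the taxid; the two-column result is the read_id/taxid form that Krona's ktImportTaxonomy consumes):

```python
def kraken2_to_krona(kraken_lines):
    """Reduce Kraken2 per-read output (C/U, read_id, taxid, length, LCA map)
    to two-column read_id<TAB>taxid lines for Krona's ktImportTaxonomy."""
    out = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        read_id, taxid = fields[1], fields[2]  # unclassified reads carry taxid 0
        out.append(f"{read_id}\t{taxid}")
    return out
```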

Tuesday, 11 April 2017

github-based, community-maintained list of cancer clinical informatics resources


Sean Davis created a github-based, community-maintained list of cancer clinical informatics resources. 
"Contributions are welcome!" https://lnkd.in/d-uphUc

For now, it's named ci4cc-informatics-resources.

Friday, 29 January 2016

Freelancing in Bioinformatics? It's happening here...uBiome FASTQ

http://www.guru.com/jobs/ubiome-raw-data-fastq-files-analysis/1210516#proposalModal

Any takers to help walk this guy through analysing uBiome raw FASTQ files?

p.s. I wasn't aware that uBiome gives out FASTQ files.


Friday, 28 June 2013

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Each time I change jobs, I have to go through the adventure (and sometimes pain) of relearning the computing resources available to me (personal), to my lab (small shared pool), and to the entire institute/company/school (usually not enough to go around).
Depending on the job scope, number of cores, and length of the job, I then set up the run on whichever of the three resources fits.
Sometimes grant money appears magically and my boss asks me what I need to buy (OK, to be honest, this is rare). Hence it's always nice to keep a lookout on what's available on the market and who's using what to do what, so that one day, when grant money magically appears, I won't be stumped for an answer.

Excerpted from the provisional PDF are three points with which I fully agree:

Three GiB of RAM per core is not enough
You won't believe the number of things I tried in order to outsmart the 'system' just to squeeze enough RAM for my jobs: looking for parallel queues, which often have a bigger RAM allocation, or running small test jobs to make sure everything works before scaling up, only to have the full job fail after two days due to insufficient RAM.
MPI is not widely used in NGS analysis
A lot of the queues in the university's shared resource had ample resources for my jobs but were reserved for MPI jobs, so I couldn't touch those at all.
A central file system helps keep redundancy to a minimum
Balancing RAM and compute cores to make the job splitting efficient was one thing. The other pain in the aXX was having to move files out of the compute node as soon as the job was done and clear all intermediate files. There were times when the job might have failed, but since I deleted the intermediate files in the last step of the pipeline bash script, I couldn't be sure it had run to completion. In the end I had to rerun the job, keeping the intermediate files.


Anyway, for more info you can check out the article below.

http://www.gigasciencejournal.com/content/2/1/9/abstract

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data

Samuel Lampa, Martin Dahlö, Pall I Olason, Jonas Hagberg and Ola Spjuth
GigaScience 2013, 2:9 doi:10.1186/2047-217X-2-9
Published: 25 June 2013

Abstract (provisional)

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

The complete article is available as a provisional PDF. The fully formatted PDF and HTML versions are in production.

Friday, 22 March 2013

Adventures with my WD My Book Live (A PowerPC Debian Linux Server with 2 TB HDD)


I should have googled before probing at the CLI, and I would have found out what I needed to know. But oh well, damage done. What I needed to know was that it's a Debian Linux (quite up to date!) with the standard Perl/Python/SQLite installed. CPU and RAM aren't super impressive, but if you are just looping through text files I doubt that matters a lot. Heck, it's roughly equivalent to an older generation of Raspberry Pi with 256 MB RAM.

The My Book Live is based upon the APM82181, an 800 MHz PowerPC 464 based platform (PDF). It has a host of features which are not utilized by the My Book Live. For example, the PCI-E ports as well as the USB 2.0 OTG ports are fully disabled. The SATA port and GbE MAC are the only active components. The unit also has 256 MB of DRAM. (Source: anandtech.com)

It's such a shame that the PCI-E ports and USB ports are disabled, but at least the root account isn't, which opens up possibilities to install and hack the system into a low-power device with a 2 TB HDD to do a bit of bioinformatics, eh?

Imagine shipping someone's genomic data in one of these babies, letting you slice and dice the FASTQ file to extract pertinent info! After all, it already is a web server; it won't be too much of a strain to make web apps or just a simple web interface as a wrapper for scripts that generate graphical reports (*dreams of putting the Galaxy web server on the WD My Book Live*), or perhaps use HTSeq or ERANGE to do something that doesn't strain the 256 MB of DRAM.

Post in the comments what you might do with an 800 MHz CPU and 256 MB RAM with Debian under its hood.

UPDATE: Unfortunately I have managed to brick my WD My Book Live by being overzealous in installing stuff that required the HTTP web server as well. Doing that to a headless server with NO terminal/keyboard access is a BAD, BAD idea, especially if it breaks the SSH login when it hangs at boot up :(

Sigh. Hope to fix it soon; I will be more careful to test packages on my Ubuntu box before trying them on the My Book Live.


MyBookLive:~# cat /proc/cpuinfo
processor       : 0
cpu             : APM82181
clock           : 800.000008MHz
revision        : 28.130 (pvr 12c4 1c82)
bogomips        : 1600.00
timebase        : 800000008
platform        : PowerPC 44x Platform
model           : amcc,apollo3g
Memory          : 256 MB




MyBookLive:~# apt-get update
Get:1 http://ftp.us.debian.org squeeze Release.gpg [1672B]
Get:2 http://ftp.us.debian.org wheezy Release.gpg [836B]
Get:3 http://ftp.us.debian.org squeeze Release [99.8kB]
Ign http://ftp.us.debian.org squeeze Release
Get:4 http://ftp.us.debian.org wheezy Release [223kB]
Ign http://ftp.us.debian.org wheezy Release
Get:5 http://ftp.us.debian.org squeeze/main Packages [6493kB]
Get:6 http://ftp.us.debian.org wheezy/main Packages [5754kB]
Fetched 12.6MB in 1min17s (163kB/s)
Reading package lists... Done



MyBookLive:~# perl -v

This is perl, v5.10.1 (*) built for powerpc-linux-gnu-thread-multi
(with 51 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.


MyBookLive:~# python
Python 2.5.2 (r252:60911, Jan 24 2010, 18:51:01)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


MyBookLive:~# sqlite3
SQLite version 3.7.3
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>


MyBookLive:~# free
             total       used       free     shared    buffers     cached
Mem:        253632     250112       3520          0      53568      52352
-/+ buffers/cache:     144192     109440
Swap:       500608     146048     354560

Ok if you are interested below is the exact model of WD MyBookLive that I own right now.




Related Links
Hacking WD My Book Live
http://mybookworld.wikidot.com/mybook-live

Wednesday, 5 September 2012

[pub] SEED: efficient clustering of next-generation sequences.


Bioinformatics. 2011 Sep 15;27(18):2502-9. Epub 2011 Aug 2.

SEED: efficient clustering of next-generation sequences.

Source

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION:

Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

RESULTS:

Here, we introduce SEED - an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome and transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other tools, SEED showed the best performance in generating clusters of NGS data similar to true clusters, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

AVAILABILITY:

The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

CONTACT:

thomas.girke@ucr.edu

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
PMID: 21810899 [PubMed - indexed for MEDLINE]
PMCID: PMC3167058 (Free PMC Article)
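SEED's "block spaced seed" hashing builds on the ordinary spaced seed idea: only the positions where a mask is set contribute to the hash key, so sequences that differ only at the wildcard positions land in the same bucket. A generic sketch (my illustration of spaced seeds in general, not SEED's exact block variant):

```python
def spaced_seed_key(seq, mask):
    """Project a sequence through a spaced seed mask.

    '1' positions are kept in the key; '0' positions are wildcards where
    mismatches between sequences are tolerated."""
    return "".join(base for base, m in zip(seq, mask) if m == "1")
```

Two reads that differ only at masked positions hash together; candidate cluster members are then verified against the actual similarity thresholds.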

Friday, 3 August 2012

Genegames.org: who said online games are COMPLETE time wasters?

Tweet from OpenHelix, found at a poster session (along with lots of blogging fodder).

Down with fever, so I am burning time with http://sulab.scripps.edu/dizeez/index.html


The Rules

  1. You are shown one gene name
  2. You are also shown five diseases
  3. Pick the disease that is linked to the gene to get points
  4. Get as many points as you can in one minute

Monday, 30 July 2012

bioawk- AWK for gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.


Alerted to this on biostars.org

https://github.com/ialbert/bioawk/blob/master/README.bio.rst

About bioawk

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names.
Bioawk adds a new -c fmt option that specifies the input format. The behavior of bioawk will vary depending on the value of fmt.
For the formats that bioawk recognizes, specially named variables will be created. For example, for the supported sequence formats, the $name, $seq and, if applicable, $qual variable names may be used to access the name, sequence and quality string of the sequence record in each iteration. Here is an example of iterating over a fastq file to print the sequences:
    awk -c fastq '{ print $seq }' test.fq  
For known interval formats the columns can be accessed via the variables called $start, $end, $chrom (etc). For example, to print the feature length of a file in BED format one could write:
    awk -c bed '{ print $end - $start }' test.bed  
One important change (and innovation) over the original awk is that bioawk will treat sequences that may span multiple lines as a single record. The parsing, implemented in C, may be several orders of magnitude faster than similar code programmed in interpreted languages: Perl, Python, Ruby.
When the format mode is header or hdr, bioawk parses named columns. It automatically adds variables whose names are taken from the first line and values from the column index. Special characters are converted to an underscore.
Bioawk also adds a few built-in functions including, as of now, and(), or(), xor(), and others (see comprehensive list below).
Detailed help is maintained in the bioawk manual page, to access it type:
    man ./awk.1  

Usage Examples

  1. Extract unmapped reads without header:
        awk -c sam 'and($flag,4)' aln.sam.gz  
  2. Extract mapped reads with header:
        awk -c sam -H '!and($flag,4)' aln.sam.gz
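The `and($flag,4)` trick is plain SAM FLAG arithmetic: bit 0x4 marks an unmapped segment. For reference, the same test in Python:

```python
UNMAPPED = 0x4  # SAM FLAG bit for "segment unmapped"

def is_unmapped(flag):
    # bitwise AND, exactly what bioawk's and($flag, 4) computes
    return bool(flag & UNMAPPED)
```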

Saturday, 7 July 2012

Installing breakdancer.sourceforge.net

http://breakdancer.sourceforge.net/moreperl.html

Dependencies for breakdancer-1.1_2011_02_21


sudo perl -MCPAN -e 'install Statistics::Descriptive'
sudo perl -MCPAN -e 'install Math::CDF'  
sudo perl -MCPAN -e 'install GD::Graph::histogram'

# If you haven't done it already, install samtools and add the binary to your PATH; bam2cfg.pl needs to call it.

Tuesday, 29 May 2012

How Not To Be A Bioinformatician | Source Code for Biology and Medicine 2012, 7:3

How Not To Be A Bioinformatician
Source Code for Biology and Medicine 2012, 7:3 doi:10.1186/1751-0473-7-3

Abstract
Although published material exists about the skills required for a successful bioinformatics career, strangely enough no work to date has addressed the matter of how to excel at not being a bioinformatician. A set of basic guidelines and a code of conduct is hereby presented to re-address that imbalance for fellow practitioners whose aim is not to succeed in their chosen bioinformatics field. By scrupulously following these guidelines one can be sure to regress at a highly satisfactory rate.

http://www.scfbm.org/content/pdf/1751-0473-7-3.pdf


LMAO

"Be unreachable and isolated. Configure your contact email to either bounce back or
permanently set it to vacation. Miss key meetings or seminars where other colleagues may be presenting their seminal results and never, ever make any attempt at remembering their names or where they work. Reinvent the wheel. Do not keep up with the literature on current methods of research if you possibly can. "


Was this even necessary to put in the paper?

Friday, 17 February 2012

a tour of various bioinformatics functions in Avadis NGS

Not affiliated with Avadis, but this might be useful for you.




We are hosting an online seminar series on the alignment and analysis of genomics data from “benchtop” sequencers, i.e. MiSeq and Ion Torrent. Our webinar panelists will give a tour of various bioinformatics functions in Avadis NGS that will enable researchers and clinicians to derive biological insights from their benchtop sequencing data.

Seminar #1: MiSeq Data Analysis

Avadis NGS 1.3 provides special support for analyzing data generated by MiSeq™ sequencers. In this webinar, we will describe how the data in a MiSeq generated “run folder” is automatically loaded into the Avadis NGS software during small RNA alignment and DNA variant analysis. This is especially helpful in processing the large number of files generated when the TruSeq™ Amplicon Kits are used. We will describe how to use the Quality Control steps in Avadis NGS to check if the amplicons have sufficient coverage in all the samples. Regions with unexpected coverages can easily be identified using the new region list clustering feature. Webinar attendees will learn how to use the “Find Significant SNPs” feature to quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.


Seminar #2: Ion Torrent Data Analysis

Avadis NGS 1.3 includes a new aligner – COBWeb – that is fully capable of aligning the long, variable-length reads generated by Ion Torrent sequencers. In this webinar, we will show the pre-alignment QC plots and illustrate how they can be used to set appropriate alignment parameters for aligning Ion Torrent reads. For users who choose to import the BAM format files generated by the Ion Torrent Server, we will describe the steps needed for importing amplicon sequencing data into Avadis NGS. Users of the Ion AmpliSeq™ Cancer Panel will learn how to easily import the targeted mutation list and verify the genotype call at the mutation sites. We will also show the new “Find Significant SNPs” feature which helps quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.


Free registration - http://www.avadis-ngs.com/webinar
