Showing posts with label AWS.
Saturday, 29 September 2018
Koala Genome assembled on AWS
Excerpted from the AWS blog:
Five years ago, a research team led by Dr. Rebecca Johnson (Director of the Australian Museum Research Institute) set out to learn more about koala populations, genetics, and diseases. Because the koala is a biologically unique animal with a limited appetite, maintaining a healthy and genetically diverse population is a key element of any conservation plan. In addition to characterizing the genetic diversity of koala populations, the team wanted to strengthen Australia’s ability to lead large-scale genome sequencing projects.
Inside the Koala Genome
Last month the team published their results in Nature Genetics. Their paper (Adaptation and Conservation Insights from the Koala Genome) identifies the genomic basis for the koala’s unique biology.
This work was performed on AWS. The research team used cfnCluster to create multiple clusters, each with 500 to 1,000 vCPUs, to run FALCON from Pacific Biosciences. All in all, the team used 3 million EC2 core hours, most of them on EC2 Spot Instances.
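For readers curious about the mechanics: cfnCluster (the predecessor of today's AWS ParallelCluster) is driven by a simple INI-style configuration file, and Spot usage is a one-line setting. The sketch below is a minimal, hypothetical configuration in that format; the region, instance types, queue sizes, and Spot bid are my own illustrative assumptions, not the research team's actual settings.

[aws]
# assumption: a region close to the data; Sydney shown here
aws_region_name = ap-southeast-2

[cluster default]
key_name = my-keypair
master_instance_type = c4.xlarge
# c4.8xlarge has 36 vCPUs, so 14-28 compute nodes spans roughly 500-1,000 vCPUs
compute_instance_type = c4.8xlarge
initial_queue_size = 14
max_queue_size = 28
# run the compute fleet on EC2 Spot Instances, with a hypothetical maximum bid
cluster_type = spot
spot_price = 0.50
vpc_settings = public

[vpc public]
# placeholders; cfnCluster expects an existing VPC and subnet
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx

A cluster built from a template like this is created with "cfncluster create koala" and, once the assembly jobs finish, torn down with "cfncluster delete koala", so compute is only paid for while it is actually in use.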
Friday, 23 February 2018
Exploring the 1000 genome dataset with Hail on Amazon EMR and Amazon Athena
Blog post from Roy Hasson
https://aws.amazon.com/blogs/big-data/genomic-analysis-with-hail-on-amazon-emr-and-amazon-athena/?nc1=b_rp
Genomic analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
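To give a flavor of the processing step, here is a minimal sketch in Hail's Python API. It is written against the current Hail 0.2 interface rather than the release the original post used, and the S3 paths are illustrative assumptions:

import hail as hl

# Attach Hail to the Spark context already running on the EMR cluster
hl.init()

# Import 1000 Genomes genotypes from S3 (path is illustrative;
# force_bgz=True because the files are block-gzipped despite the .gz suffix)
mt = hl.import_vcf(
    's3://my-bucket/1000genomes/ALL.chr22.genotypes.vcf.gz',
    force_bgz=True,
    reference_genome='GRCh37',
)

# Annotate every variant with basic QC metrics (call rate, allele frequencies, etc.)
mt = hl.variant_qc(mt)

# Convert the variant-level table to a Spark DataFrame and write Parquet to S3
mt.rows().to_spark().write.parquet('s3://my-bucket/1kg/variants.parquet')

Once the Parquet files land in S3, registering them with an AWS Glue crawler (or a manual CREATE EXTERNAL TABLE statement) makes the variant dataset queryable from Athena with plain SQL.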
Wednesday, 22 February 2012
Amazon S3 for temporary storage of large datasets?
Just did a rough calculation with the AWS calculator, and the numbers are quite scary!
For a hypothetical 50 TB dataset (for the record, the maximum size of a single S3 object is 5 TB, not the 1 GB I seemed to recall, though anything over 5 GB has to go up via multipart upload):
it costs $4,160.27 to store it for a month!
And another $4,807.11 to transfer it all back out!
Over 3 years, the storage alone comes to roughly $149,000 ($4,160.27 a month for 36 months), which I guess could instead pay for an enterprise storage solution, where transfer costs are zero.
At this point in time, I guess one can't really use Amazon S3 for sequence archival. I wonder if data deduplication could help reduce cloud storage costs; in terms of raw bytes, BAM files should be quite similar to one another, no?
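To make the back-of-the-envelope arithmetic explicit, here it is as a short Python sketch; the two dollar figures are the calculator outputs quoted above, and the per-GB rates are simply derived from them:

# Rough S3 cost check for a hypothetical 50 TB dataset (2012 pricing)
DATASET_GB = 50 * 1024      # 50 TB expressed in GB (binary units)

storage_per_month = 4160.27  # USD per month, from the AWS calculator
egress_once = 4807.11        # USD to transfer the full dataset out

# Implied blended rates
print("storage per GB-month: $%.4f" % (storage_per_month / DATASET_GB))  # ~$0.0813
print("egress per GB:        $%.4f" % (egress_once / DATASET_GB))        # ~$0.0939

# Three years of just keeping the data in S3
print("3-year storage:       $%.2f" % (36 * storage_per_month))          # ~$149,769.72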
Labels: Amazon Web Services, AWS, bam, big data, cloud, costs, fastq, Next Generation Sequencing, storage
Wednesday, 15 September 2010
Myrna: calculate differential gene expression on Elastic MapReduce or local Hadoop
The software, termed “Myrna,” was funded in part by Amazon Web Services (along with the Bloomberg School of Public Health and the National Institutes of Health) and, not surprisingly, makes use of compute resources from Amazon. To test Myrna, the researchers rented compute time and storage from AWS and were able to realize solid performance and cost savings. According to the study's authors, “Myrna calculated differential expression from 1.1 billion RNA sequence reads in less than two hours at a cost of about $66.”
Note:
Myrna is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short read alignment and R/Bioconductor for interval calculations, normalization, and statistical testing. These tools are combined in an automatic, parallel pipeline that runs in the cloud (Elastic MapReduce in this case), on a local Hadoop cluster, or on a single computer, exploiting multiple computers and CPUs wherever possible.
Also see:
Cloud computing method greatly increases gene analysis
Labels: Amazon Web Services, AWS, Bioconductor, bioinformatics, bowtie, cloud, Hadoop, news, Next Generation Sequencing, software