Showing posts with label AWS.
Saturday, 29 September 2018
Koala Genome assembled on AWS
Excerpted from the AWS blog:
Five years ago, a research team led by Dr. Rebecca Johnson (Director of the Australian Museum Research Institute) set out to learn more about koala populations, genetics, and diseases. Because the koala is a biologically unique animal with a limited appetite, maintaining a healthy and genetically diverse population is a key element of any conservation plan. In addition to characterizing the genetic diversity of koala populations, the team wanted to strengthen Australia’s ability to lead large-scale genome sequencing projects.
Inside the Koala Genome
Last month the team published their results in Nature Genetics. Their paper (Adaptation and Conservation Insights from the Koala Genome) identifies the genomic basis for the koala’s unique biology.
This work was performed on AWS. The research team used cfnCluster to create multiple clusters, each with 500 to 1,000 vCPUs, to run FALCON from Pacific Biosciences. All in all, the team used 3 million EC2 core hours, most of them on EC2 Spot Instances.
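For readers curious about the mechanics: cfnCluster (the predecessor of today's AWS ParallelCluster) is driven by a simple INI-style configuration file, and Spot usage is a one-line setting. The sketch below is a minimal, hypothetical configuration in that format; the region, instance types, queue sizes, and Spot bid are my own illustrative assumptions, not the research team's actual settings.

[aws]
# assumption: a region close to the data; Sydney shown here
aws_region_name = ap-southeast-2

[cluster default]
key_name = my-keypair
master_instance_type = c4.xlarge
# c4.8xlarge has 36 vCPUs, so 14-28 compute nodes spans roughly 500-1,000 vCPUs
compute_instance_type = c4.8xlarge
initial_queue_size = 14
max_queue_size = 28
# run the compute fleet on EC2 Spot Instances, with a hypothetical maximum bid
cluster_type = spot
spot_price = 0.50
vpc_settings = public

[vpc public]
# placeholders; cfnCluster expects an existing VPC and subnet
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx

A cluster built from a template like this is created with "cfncluster create koala" and, once the assembly jobs finish, torn down with "cfncluster delete koala", so compute is only paid for while it is actually in use.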
Friday, 23 February 2018
Exploring the 1000 genome dataset with Hail on Amazon EMR and Amazon Athena
Blog post from Roy Hasson
https://aws.amazon.com/blogs/big-data/genomic-analysis-with-hail-on-amazon-emr-and-amazon-athena/?nc1=b_rp
Genomic analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
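To give a flavor of the processing step, here is a minimal sketch in Hail's Python API. It is written against the current Hail 0.2 interface rather than the release the original post used, and the S3 paths are illustrative assumptions:

import hail as hl

# Attach Hail to the Spark context already running on the EMR cluster
hl.init()

# Import 1000 Genomes genotypes from S3 (path is illustrative;
# force_bgz=True because the files are block-gzipped despite the .gz suffix)
mt = hl.import_vcf(
    's3://my-bucket/1000genomes/ALL.chr22.genotypes.vcf.gz',
    force_bgz=True,
    reference_genome='GRCh37',
)

# Annotate every variant with basic QC metrics (call rate, allele frequencies, etc.)
mt = hl.variant_qc(mt)

# Convert the variant-level table to a Spark DataFrame and write Parquet to S3
mt.rows().to_spark().write.parquet('s3://my-bucket/1kg/variants.parquet')

Once the Parquet files land in S3, registering them with an AWS Glue crawler (or a manual CREATE EXTERNAL TABLE statement) makes the variant dataset queryable from Athena with plain SQL.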
Wednesday, 22 February 2012
Amazon S3 for temporary storage of large datasets?
Just did a rough calculation with the AWS calculator, and the numbers are quite scary!
For a hypothetical 50 TB dataset (for the record, the maximum size of a single S3 object is 5 TB, not the 1 GB I seemed to recall, though anything over 5 GB has to go up via multipart upload):
it costs $4,160.27 to store it for a month!
And another $4,807.11 to transfer it all back out!
Over 3 years, the storage alone comes to roughly $149,000 ($4,160.27 a month for 36 months), which I guess could instead pay for an enterprise storage solution, where transfer costs are zero.
At this point in time, I guess one can't really use Amazon S3 for sequence archival. I wonder if data deduplication could help reduce cloud storage costs; in terms of raw bytes, BAM files should be quite similar to one another, no?
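To make the back-of-the-envelope arithmetic explicit, here it is as a short Python sketch; the two dollar figures are the calculator outputs quoted above, and the per-GB rates are simply derived from them:

# Rough S3 cost check for a hypothetical 50 TB dataset (2012 pricing)
DATASET_GB = 50 * 1024      # 50 TB expressed in GB (binary units)

storage_per_month = 4160.27  # USD per month, from the AWS calculator
egress_once = 4807.11        # USD to transfer the full dataset out

# Implied blended rates
print("storage per GB-month: $%.4f" % (storage_per_month / DATASET_GB))  # ~$0.0813
print("egress per GB:        $%.4f" % (egress_once / DATASET_GB))        # ~$0.0939

# Three years of just keeping the data in S3
print("3-year storage:       $%.2f" % (36 * storage_per_month))          # ~$149,769.72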
Labels: Amazon Web Services, AWS, bam, big data, cloud, costs, fastq, Next Generation Sequencing, storage
Wednesday, 15 September 2010
Myrna: calculate differential gene expression on Elastic MapReduce or local Hadoop
The software, termed “Myrna,” was funded in part by Amazon Web Services (along with the Bloomberg School of Public Health and the National Institutes of Health) and, not surprisingly, makes use of compute resources from Amazon. To test Myrna, the researchers rented compute time and storage from AWS and were able to realize solid performance and cost savings. According to the study's authors, “Myrna calculated differential expression from 1.1 billion RNA sequence reads in less than two hours at a cost of about $66.”
Note:
Myrna is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short read alignment and R/Bioconductor for interval calculations, normalization, and statistical testing. These tools are combined in an automatic, parallel pipeline that runs in the cloud (Elastic MapReduce in this case), on a local Hadoop cluster, or on a single computer, exploiting multiple computers and CPUs wherever possible.
Also see:
Cloud computing method greatly increases gene analysis
Labels: Amazon Web Services, AWS, Bioconductor, bioinformatics, bowtie, cloud, Hadoop, news, Next Generation Sequencing, software