Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse, Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet-formatted variant dataset and explore it using Amazon Athena.