Important: This Guidance requires the use of AWS CodeCommit, which is no longer available to new customers. Existing customers of AWS CodeCommit can continue using and deploying this Guidance as normal.
This Guidance shows how you can build and run production-grade bioinformatics workflows at scale. Using AWS services for automation, workflow analysis, storage, and operational and cost observability, you can follow DevOps best practices to manage the lifecycle of your bioinformatics workflows. You can use this architecture as the foundation for your own infrastructure and update certain aspects as needed to integrate it with your environment and meet your needs.
Architecture Diagram
[Architecture diagram description]
Step 1
Transfer sequence data to Amazon Simple Storage Service (Amazon S3) using AWS DataSync. If data is in FASTQ format, it can be imported into a sequence store in AWS HealthOmics (successor to Amazon Omics) for cost savings.
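As a sketch of the import step, the AWS SDK for Python (Boto3) can start a read set import job once DataSync has delivered FASTQ files to Amazon S3. The store ID, role ARN, sample identifiers, and S3 URIs below are placeholders for your own values:

```python
# Sketch: import paired FASTQ files from Amazon S3 into a HealthOmics
# sequence store. All identifiers below are illustrative placeholders.

def build_import_sources(sample_id, subject_id, fastq_pairs):
    """Build the 'sources' list for start_read_set_import_job from
    (read1_uri, read2_uri) S3 URI pairs."""
    return [
        {
            "sourceFiles": {"source1": r1, "source2": r2},
            "sourceFileType": "FASTQ",
            "subjectId": subject_id,
            "sampleId": sample_id,
        }
        for r1, r2 in fastq_pairs
    ]

def start_import(sequence_store_id, role_arn, sources):
    """Submit the import job; the role must allow HealthOmics to read the
    source bucket."""
    import boto3  # imported here so the helper above stays dependency-free
    omics = boto3.client("omics")
    return omics.start_read_set_import_job(
        sequenceStoreId=sequence_store_id,
        roleArn=role_arn,
        sources=sources,
    )
```

A usage call would pass, for example, `build_import_sources("SAMPLE1", "SUBJ1", [("s3://bucket/a_R1.fastq.gz", "s3://bucket/a_R2.fastq.gz")])` to `start_import`.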
Step 2
HealthOmics runs bioinformatics workflows written in Workflow Description Language (WDL), Nextflow, or Common Workflow Language (CWL) to analyze raw data. Workflows can be private (authored and maintained by you) or Ready2Run (prebuilt and hosted by HealthOmics).
The tools invoked by workflow tasks are packaged as Docker images stored in Amazon Elastic Container Registry (Amazon ECR). Workflow outputs are uploaded to Amazon S3.
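A minimal sketch of launching a private workflow run with Boto3 follows; the workflow ID, role ARN, output URI, and parameter names are placeholders that depend on your workflow definition:

```python
# Sketch: start a HealthOmics workflow run. The workflow ID, role ARN,
# and parameter names below are placeholders for your own.

def build_run_request(workflow_id, role_arn, output_uri, parameters, name):
    """Assemble the start_run request for a private workflow."""
    return {
        "workflowId": workflow_id,
        "workflowType": "PRIVATE",  # use "READY2RUN" for hosted workflows
        "roleArn": role_arn,
        "outputUri": output_uri,    # S3 prefix for workflow outputs
        "parameters": parameters,   # keys match the workflow's inputs
        "name": name,
    }

def start_run(request):
    """Submit the run; HealthOmics provisions compute for each task."""
    import boto3
    omics = boto3.client("omics")
    return omics.start_run(**request)
```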
Step 3
HealthOmics publishes workflow engine logs, task logs, and workflow run logs to Amazon CloudWatch for troubleshooting and monitoring.
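For troubleshooting, task logs can be pulled from CloudWatch Logs with Boto3. The log group name and `run/<runId>/task/<taskId>` stream layout below reflect the HealthOmics logging convention but should be verified against the log streams in your account:

```python
# Sketch: read a HealthOmics task log from Amazon CloudWatch Logs.
# The log group and stream naming are assumptions to verify in your account.

def task_log_stream(run_id, task_id):
    """Log stream name for one task of one run (assumed layout)."""
    return f"run/{run_id}/task/{task_id}"

def fetch_task_log(run_id, task_id, log_group="/aws/omics/WorkflowLog"):
    """Return the task's log messages, oldest first."""
    import boto3
    logs = boto3.client("logs")
    resp = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=task_log_stream(run_id, task_id),
        startFromHead=True,
    )
    return [event["message"] for event in resp["events"]]
```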
Step 4
HealthOmics publishes events using Amazon EventBridge, which can automate downstream actions, such as using AWS Lambda functions to launch more bioinformatics workflows or notifying users or groups about workflow failures using Amazon Simple Notification Service (Amazon SNS).
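The failure-notification path can be sketched as a Lambda handler that inspects the EventBridge event and publishes to an SNS topic. The event shape (`source`, `detail-type`, `detail.status`) follows EventBridge conventions for HealthOmics run status changes but should be confirmed against real events; the topic ARN is a placeholder:

```python
# Sketch: Lambda handler that notifies an SNS topic when a HealthOmics
# run fails. Event field names are assumptions to verify in your account.

def is_failed_run(event):
    """True for a HealthOmics run status-change event whose run failed."""
    return (
        event.get("source") == "aws.omics"
        and event.get("detail-type") == "Run Status Change"
        and event.get("detail", {}).get("status") == "FAILED"
    )

def lambda_handler(event, context, sns=None,
                   topic_arn="arn:aws:sns:us-east-1:111122223333:run-alerts"):
    """Publish a notification for failed runs; ignore everything else.
    `sns` is injectable for testing; defaults to a real Boto3 client."""
    if not is_failed_run(event):
        return {"notified": False}
    if sns is None:
        import boto3
        sns = boto3.client("sns")
    sns.publish(
        TopicArn=topic_arn,
        Subject="HealthOmics workflow run failed",
        Message=str(event["detail"]),
    )
    return {"notified": True}
```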
Step 5
Useful metadata from HealthOmics workflows—such as workflow run ID, tags, sample ID, and workflow output file locations—can be tracked in Amazon DynamoDB tables. An AWS Glue crawler ingests this data into the AWS Glue Data Catalog, which can be queried using Amazon Athena.
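One way to shape such a tracking record is shown below; the table name and attribute names are illustrative, not a required schema:

```python
# Sketch: record HealthOmics run metadata in a DynamoDB table.
# Attribute names are illustrative; design the schema around your queries.

def run_metadata_item(run_id, sample_id, output_uri, tags=None):
    """Build a DynamoDB item (low-level attribute-value format) for one run."""
    return {
        "run_id": {"S": run_id},          # partition key
        "sample_id": {"S": sample_id},
        "output_uri": {"S": output_uri},  # S3 prefix of the run's outputs
        "tags": {"M": {k: {"S": v} for k, v in (tags or {}).items()}},
    }

def record_run(table_name, item):
    """Write the item so a Glue crawler / Athena can later query it."""
    import boto3
    boto3.client("dynamodb").put_item(TableName=table_name, Item=item)
```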
Step 6
Workflow developers and bioinformaticians can iterate on new and existing workflows using AWS CodeCommit for version control and continuous integration and continuous delivery. AWS CodePipeline can invoke an AWS CodeBuild job to automate the creation of new workflow versions in HealthOmics.
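The CodeBuild step of such a pipeline might register a workflow with a call like the following; the workflow name, definition bundle URI (for example, a zip produced by the build), and parameter template are placeholders:

```python
# Sketch: register a new HealthOmics workflow version from a CI/CD build.
# The name, S3 definition URI, and parameter template are placeholders.

def build_workflow_request(name, definition_uri, parameter_template):
    """Assemble a create_workflow request for a WDL workflow bundle."""
    return {
        "name": name,
        "engine": "WDL",                  # or "NEXTFLOW" / "CWL"
        "definitionUri": definition_uri,  # S3 URI of the zipped definition
        "parameterTemplate": parameter_template,
    }

def create_workflow(request):
    """Create the workflow; returns its ID and ARN for downstream use."""
    import boto3
    return boto3.client("omics").create_workflow(**request)
```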
Step 7
AWS Cost and Usage Reports (AWS CUR) facilitates cost monitoring. Configure AWS CUR to create reports and deliver them to an Amazon S3 bucket, where an AWS Glue crawler ingests them into the AWS Glue Data Catalog; the data can then be queried using Amazon Athena to derive cost-related insights.
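An Athena query over the crawled CUR table might look like the sketch below. The table and database names are placeholders, and the CUR column names and `AmazonOmics` product code are assumptions to verify against your report's schema:

```python
# Sketch: query crawled AWS CUR data with Amazon Athena to break down
# HealthOmics spend by usage type. Column names vary by CUR configuration.

def cur_cost_query(table, product_code="AmazonOmics"):
    """Illustrative cost-by-usage-type query over a crawled CUR table."""
    return (
        "SELECT line_item_usage_type, "
        "SUM(line_item_unblended_cost) AS cost "
        f"FROM {table} "
        f"WHERE line_item_product_code = '{product_code}' "
        "GROUP BY line_item_usage_type "
        "ORDER BY cost DESC"
    )

def run_query(sql, database, output_s3):
    """Start the Athena query; results land at the given S3 location."""
    import boto3
    return boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
```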
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
This Guidance uses AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline to create version control and automate the build and deployment of your bioinformatics workflow’s source code. Additionally, DynamoDB lets you track HealthOmics output files and run metadata. Because this Guidance uses DevOps best practices to manage your workflow code and give you visibility into workflow run metadata, you can make incremental changes to achieve accurate results. By tracking workflow run metadata, you can easily find relevant workflow run status and output files to perform downstream reporting or scientific analysis.
-
Security
This Guidance provides encryption at rest using AWS Key Management Service (AWS KMS) and encryption in transit for all network traffic using DataSync. Additionally, AWS Identity and Access Management (IAM) provides fine-grained access control over potentially sensitive data so that only authorized users can perform specific actions to process and analyze it.
-
Reliability
This Guidance lets you orchestrate computationally intensive bioinformatics workflows at scale by using HealthOmics. This service has certain service quotas, such as the number of virtual CPUs, to prevent accidental overprovisioning. Additionally, Amazon S3 and DynamoDB provide high availability with built-in backup. This Guidance also uses EventBridge to capture events, such as failures, and Amazon SNS can provide real-time notifications in response so that you can take appropriate action. You can quickly investigate events using Amazon CloudWatch, which provides detailed logs to give you visibility into your HealthOmics workflows and underlying tools.
-
Performance Efficiency
This Guidance lets you run concurrent workflows with different CPU and memory configurations for specific tasks. You can request resources by specifying the CPUs, memory, and storage you need, and HealthOmics provisions the appropriate infrastructure. This helps you scale based on your business needs with the right resources.
-
Cost Optimization
This Guidance uses a HealthOmics sequence store, which lets you store and share petabyte-scale genomics data files efficiently and at a low cost per gigabase, providing additional cost savings over Amazon S3. Additionally, you can use AWS CUR to access the most detailed information about your AWS costs and usage, identify areas for optimization, and understand your business’s trends based on attributes such as projects, departments, or users.
-
Sustainability
This Guidance uses managed and serverless services that help you avoid provisioning and managing your own infrastructure, helping you minimize the environmental impact of your projects. HealthOmics provisions resources only when you request a workflow run and tears them down when the run completes. Similarly, Lambda lets you run smaller tasks as functions without provisioning your own servers.
Implementation Resources
A detailed guide is provided for experimenting with this Guidance in your AWS account. It walks through each stage, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Designing an event-driven architecture for Bioinformatics workflows using AWS HealthOmics and Amazon EventBridge
Guidance for a Laboratory Data Mesh on AWS
Guidance for Migration & Storage of Sequence Data with AWS HealthOmics
Multimodal Data Analysis with AWS Health and Machine Learning Services
Secure Your Genomic Workflows and Data with AWS HealthOmics
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.