This Guidance helps you connect life sciences instrument data and laboratory system files to the AWS Cloud, either through the internet or through a low-latency direct connection. You can reduce storage costs for infrequently accessed data or make it available to high-performance computing for genomics, imaging, and other intensive workloads, all on AWS.
Architecture Diagram
Step 1
A lab technician runs an experiment or test, and results are written to a folder on an on-premises file server. An AWS DataSync task is set up to sync the data from local storage to a bucket in Amazon Simple Storage Service (Amazon S3).
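For example, a DataSync task between a registered on-premises location and an S3 location can be created and started with the AWS SDK for Python (Boto3). The following is a minimal sketch; the location ARNs are placeholders for resources you have already set up with a DataSync agent.

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder ARNs for the on-premises file server location (registered
# through a DataSync agent) and the destination Amazon S3 location.
SOURCE_LOCATION_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-onprem"
DEST_LOCATION_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-s3"

# Create a task that syncs instrument results to Amazon S3, verifying
# only the files transferred to keep recurring runs fast.
task = datasync.create_task(
    SourceLocationArn=SOURCE_LOCATION_ARN,
    DestinationLocationArn=DEST_LOCATION_ARN,
    Name="lab-instrument-sync",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)

# Start an execution whenever new results are written to the folder.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print("Started:", execution["TaskExecutionArn"])
```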
Step 2
Data is transferred to the AWS Cloud either through the internet, or through a low-latency direct connection that avoids the internet, such as AWS Direct Connect.
Step 3
Electronic lab notebooks (ELN) and lab information management systems (LIMS) share experiment and test metadata bidirectionally with the AWS Cloud through events and APIs. Learn more about this integration in Guidance for a Laboratory Data Mesh on AWS.
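As one illustration of event-driven metadata exchange, a "result registered" event could be published to a custom Amazon EventBridge bus that ELN/LIMS integration rules subscribe to. The bus name, event source, and detail schema below are assumptions for the sketch, not a prescribed contract.

```python
import json
import boto3

events = boto3.client("events")

# Publish a hypothetical experiment-metadata event; EventBridge rules can
# route it to the ELN/LIMS integration or to downstream pipelines.
events.put_events(
    Entries=[
        {
            "EventBusName": "lab-data-bus",  # assumed custom event bus
            "Source": "lab.instruments",     # assumed event source
            "DetailType": "ExperimentResultRegistered",
            "Detail": json.dumps(
                {
                    "experimentId": "EXP-0042",
                    "instrument": "plate-reader-3",
                    "s3Uri": "s3://lab-results-bucket/EXP-0042/",
                }
            ),
        }
    ]
)
```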
Step 4
Partner entities, such as contract research organizations (CROs), can upload study results to Amazon S3 by using AWS Transfer Family for SFTP, FTPS, or FTP.
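A Transfer Family SFTP endpoint with a service-managed user scoped to a study prefix might be provisioned as follows. This is a sketch; the role ARN, bucket path, and SSH key are placeholders, and production setups typically add logging and stricter access policies.

```python
import boto3

transfer = boto3.client("transfer")

# Create an SFTP endpoint backed by Amazon S3 with service-managed users.
server = transfer.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
    EndpointType="PUBLIC",
)

# Grant a CRO user access limited to its study prefix in the bucket.
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="cro-acme",                                    # placeholder
    Role="arn:aws:iam::111122223333:role/cro-upload-role",  # placeholder
    HomeDirectory="/lab-results-bucket/cro-acme",
    SshPublicKeyBody="ssh-rsa AAAA...",                     # placeholder key
)
```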
Step 5
You can optimize storage costs by writing instrument data to an S3 bucket configured for infrequent access. Identify your S3 storage access patterns to optimally configure your S3 bucket lifecycle policy and transition data to Amazon S3 Glacier storage classes.
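For instance, a lifecycle configuration that tiers instrument data to S3 Standard-Infrequent Access and then to S3 Glacier Flexible Retrieval could look like the following; the bucket name, prefix, and day thresholds are assumptions to adapt to your observed access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Tier instrument data: Standard-IA after 30 days, Glacier after 180.
s3.put_bucket_lifecycle_configuration(
    Bucket="lab-results-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "instrument-data-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "instruments/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```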
Step 6
Amazon FSx for Lustre provides a sub-millisecond-latency shared file system that makes the data accessible to high-performance computing (HPC) in the cloud for genomics, imaging, and other intensive workloads.
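As a sketch, a scratch FSx for Lustre file system linked to the results bucket can be created so HPC jobs see the S3 objects as files; the subnet ID, capacity, and import path are placeholder values.

```python
import boto3

fsx = boto3.client("fsx")

# Create a scratch Lustre file system that lazy-loads data from S3.
fs = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,  # GiB; minimum size for SCRATCH_2
    SubnetIds=["subnet-0123456789abcdef0"],  # placeholder subnet
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://lab-results-bucket/instruments/",
    },
)
print(fs["FileSystem"]["FileSystemId"])
```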
Step 7
Bioinformatics pipelines are orchestrated with AWS Step Functions, AWS HealthOmics, and AWS Batch for flexible CPU and GPU computing.
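For example, one task in such a pipeline might start an AWS HealthOmics workflow run; a Step Functions state machine can make this same call as a task state and then submit AWS Batch jobs for CPU- or GPU-bound steps. The workflow ID, role ARN, and parameter names here are hypothetical.

```python
import boto3

omics = boto3.client("omics")

# Start a (hypothetical) private HealthOmics workflow for one experiment.
run = omics.start_run(
    workflowId="1234567",                                     # placeholder
    workflowType="PRIVATE",
    roleArn="arn:aws:iam::111122223333:role/omics-run-role",  # placeholder
    name="exp-0042-secondary-analysis",
    parameters={"sample_fastq": "s3://lab-results-bucket/EXP-0042/reads.fastq.gz"},
    outputUri="s3://lab-results-bucket/analyses/EXP-0042/",
)
print("Run ID:", run["id"])
```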
Step 8
Machine learning is conducted with an artificial intelligence and machine learning (AI/ML) toolkit that uses Amazon SageMaker for feature engineering, data labeling, model training, deployment, and ML operations (MLOps). Amazon Athena is used for flexible SQL queries.
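As an example of the SQL side, experiment metadata cataloged in AWS Glue can be queried with Athena; the database, table, columns, and output location below are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Run an ad hoc SQL query over cataloged experiment metadata.
query = athena.start_query_execution(
    QueryString=(
        "SELECT experiment_id, instrument, result_uri "
        "FROM experiments "
        "WHERE run_date >= DATE '2024-01-01'"
    ),
    QueryExecutionContext={"Database": "lab_metadata"},  # placeholder
    ResultConfiguration={"OutputLocation": "s3://lab-athena-results/"},
)
print("Query ID:", query["QueryExecutionId"])
```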
Step 9
Researchers who use on-premises applications for data analysis and reporting can view and access data in Amazon S3 over Network File System (NFS) or Server Message Block (SMB) through Amazon S3 File Gateway.
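A sketch of exposing the results bucket through an existing S3 File Gateway as an NFS share follows; the gateway ARN, role, and client CIDR range are placeholders.

```python
import uuid
import boto3

sgw = boto3.client("storagegateway")

# Share the results bucket with on-premises analysis tools over NFS.
share = sgw.create_nfs_file_share(
    ClientToken=str(uuid.uuid4()),
    GatewayARN="arn:aws:storagegateway:us-east-1:111122223333:gateway/sgw-12345678",
    Role="arn:aws:iam::111122223333:role/file-gateway-s3-role",  # placeholder
    LocationARN="arn:aws:s3:::lab-results-bucket",
    ClientList=["10.0.0.0/16"],  # restrict mounts to the lab network
)
print(share["FileShareARN"])
```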
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
As new data sources and partners emerge, a variety of data transfer services can be used to adapt to changing access patterns. For multi-site environments, S3 File Gateway can be used to transfer data while you retain an on-site cache for other applications. Transfer Family lets partner entities such as CROs easily upload study results.
Security
For data protection, we recommend that you safeguard your AWS account credentials and set up individual users with AWS Identity and Access Management (IAM) so that each user is granted only the permissions necessary to fulfill their job duties. We also recommend that you encrypt data at rest; the services in this Guidance encrypt data in transit by default.
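For example, default at-rest encryption with a customer-managed AWS KMS key can be enforced on the results bucket as follows; the bucket name and key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Enforce SSE-KMS as the default encryption for all new objects.
s3.put_bucket_encryption(
    Bucket="lab-results-bucket",  # placeholder bucket
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-ef00",  # placeholder
                },
                "BucketKeyEnabled": True,  # reduces KMS request costs
            }
        ]
    },
)
```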
Reliability
DataSync uses one or more VPC endpoints so that if an Availability Zone becomes unavailable, the agent can reach another endpoint. DataSync is a scalable service that uses sets of agents to move data; tasks and agents can be scaled based on the amount of data that needs to be migrated.
DataSync logs all events to Amazon CloudWatch. If a job fails, you can use these logs to understand the issue and where the task is failing. After tasks complete, post-processing jobs can be initiated to begin the next phase of the pipeline.
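A minimal sketch of that pattern, assuming a known task execution ARN: poll the execution status and hand off to post-processing on success. (In practice, an EventBridge rule on DataSync state changes avoids polling.)

```python
import time
import boto3

datasync = boto3.client("datasync")

EXECUTION_ARN = (
    "arn:aws:datasync:us-east-1:111122223333:task/task-1/execution/exec-1"
)  # placeholder

# Wait for the transfer to finish, then trigger the next pipeline phase.
while True:
    status = datasync.describe_task_execution(TaskExecutionArn=EXECUTION_ARN)["Status"]
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(30)

if status == "SUCCESS":
    print("Transfer complete; start post-processing")
else:
    print("Transfer failed; inspect the task's CloudWatch logs")
```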
Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage.
Performance Efficiency
FSx for Lustre storage provides sub-millisecond latencies, up to hundreds of GB/s of throughput, and millions of IOPS.
Cost Optimization
By using serverless technologies that scale on demand, you pay only for the resources you use. To further optimize cost, stop the SageMaker notebook environments when they are not in use. If you don't intend to use the Amazon QuickSight visualization dashboard, you can choose not to deploy it to save costs.
Data transfer charges consist of two main areas: DataSync, which is charged per GB transferred, and Direct Connect or VPN data transfer. Additionally, cross-Availability Zone charges might apply if VPC endpoints are used.
Sustainability
CloudWatch metrics allow users to make data-driven decisions based on alerts and trends. By extensively using managed services and dynamic scaling, you minimize the environmental impact of the backend services; most components require no dedicated, always-on infrastructure.
Implementation Resources
A detailed guide is provided for you to experiment with and use within your AWS account. It walks through each stage of the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Building Digitally Connected Labs with AWS
This post discusses the tools, best practices, and partners helping Life Sciences labs take full advantage of the scale and performance of AWS Cloud.
Guidance for a Laboratory Data Mesh on AWS
This Guidance demonstrates how to build a scientific data management system that integrates both laboratory instrument data and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way.
Resilience Builds a Global Data Mesh for Lab Connectivity on AWS
This case study describes how biomanufacturing innovator Resilience revolutionizes the way novel medicines are produced with a connected network for data transfer on AWS.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.