Workflows

  • HealthOmics offers two types of workflows: private workflows and Ready2Run workflows. Private workflows are custom workflows that enable you to bring your own bioinformatics scripts written in the most commonly used workflow languages. Ready2Run workflows are prebuilt bioinformatics pipelines based on common industry analyses that allow you to get started quickly without writing code. 

  • HealthOmics private workflows can be written in Nextflow, WDL, and CWL. For supported language version information, see the HealthOmics documentation.

  • HealthOmics offers a wide variety of Ready2Run workflows ranging from the Broad Institute’s GATK and AlphaFold to workflows from third-party publishers such as NVIDIA, Element Biosciences, Sentieon, and Ultima. You can view the full list of available Ready2Run workflows here.

  • Yes, HealthOmics can run bioFMs, such as NVIDIA NIMs, AlphaFold, and ESMFold. You can orchestrate multiple bioFMs within a workflow, unlocking drug discovery pipelines at scale. For examples of drug discovery workflows that use bioFMs, see the drug discovery workflows repository on GitHub.

  • To run your first private workflow, you need a workflow script written in Nextflow, WDL, or CWL. Additionally, all tools and dependencies must be containerized and stored in a private Amazon ECR repository. Input data can be provided from S3 or from a HealthOmics sequence store.
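
    As a minimal sketch of these steps with the AWS SDK for Python (boto3), registering and starting a private workflow might look like the following; the workflow name, role ARN, bucket paths, and parameter names are illustrative placeholders, and your workflow's own parameter template determines the actual inputs:

      import boto3

      omics = boto3.client("omics")

      # Register the workflow definition (a zipped Nextflow/WDL/CWL project).
      with open("workflow.zip", "rb") as f:
          workflow = omics.create_workflow(
              name="my-first-workflow",            # illustrative name
              engine="WDL",                        # or "NEXTFLOW" / "CWL"
              definitionZip=f.read(),
          )

      # Start a run; inputs reference S3 (or sequence store) URIs, and the IAM role
      # must be able to read inputs, pull the ECR images, and write outputs.
      run = omics.start_run(
          workflowId=workflow["id"],
          workflowType="PRIVATE",
          roleArn="arn:aws:iam::123456789012:role/OmicsRunRole",           # placeholder
          parameters={"input_fastq": "s3://my-bucket/sample_R1.fastq.gz"},  # placeholder
          outputUri="s3://my-bucket/run-outputs/",
      )
      print(run["id"], run["status"])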

  • You can manage private workflow resources with run groups. Run groups let you cap the maximum number of concurrent runs, the maximum run duration, and the maximum vCPUs and GPUs for runs assigned to the group. Additionally, HealthOmics provides rightsizing tools, such as Run Analyzer, that help you optimize resource allocations and improve run efficiency.
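
    A sketch of creating a run group and assigning a run to it with boto3 (the names, IDs, and limits below are illustrative):

      import boto3

      omics = boto3.client("omics")

      # Create a run group that caps concurrency and resource usage.
      group = omics.create_run_group(
          name="production-wgs",       # hypothetical name
          maxRuns=10,                  # maximum concurrent runs
          maxDuration=2880,            # maximum run duration, in minutes
          maxCpus=512,
          maxGpus=4,
      )

      # Assign a run to the group so its limits apply.
      omics.start_run(
          workflowId="1234567",                                    # placeholder
          workflowType="PRIVATE",
          roleArn="arn:aws:iam::123456789012:role/OmicsRunRole",   # placeholder
          runGroupId=group["id"],
          outputUri="s3://my-bucket/run-outputs/",
      )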

  • HealthOmics private workflows offer two run storage options: static run storage and dynamic run storage. With static run storage, a fixed-size file system is provisioned at the start of the run and is used by tasks for intermediate file storage during the run. When the run completes, the run outputs are exported to S3 and the file system is deprovisioned. Dynamic run storage scales up and down automatically with your storage needs over the duration of the run and offers faster provisioning times. Dynamic run storage is recommended for fast, iterative development cycles and for small, short-running pipelines. Static run storage is suitable for large workflows; it provides higher file system throughput per GiB and lower cost per GiB than dynamic run storage.
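
    A sketch of selecting either storage option when starting a run with boto3 (IDs, role ARN, sizes, and paths are placeholders):

      import boto3

      omics = boto3.client("omics")

      # Dynamic run storage: no capacity to size up front; good for iteration.
      omics.start_run(
          workflowId="1234567",                                    # placeholder
          workflowType="PRIVATE",
          roleArn="arn:aws:iam::123456789012:role/OmicsRunRole",   # placeholder
          storageType="DYNAMIC",
          outputUri="s3://my-bucket/run-outputs/",
      )

      # Static run storage: provision a fixed file system size (in GiB) at run start.
      omics.start_run(
          workflowId="1234567",
          workflowType="PRIVATE",
          roleArn="arn:aws:iam::123456789012:role/OmicsRunRole",
          storageType="STATIC",
          storageCapacity=4800,        # illustrative size in GiB
          outputUri="s3://my-bucket/run-outputs/",
      )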

  • HealthOmics workflows deliver real-time logs to CloudWatch during the run and additional logs after the run has completed. You can use EventBridge to build automated alerts for conditions you define.
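
    For example, an EventBridge rule that forwards failed-run events to an SNS topic might look like the sketch below; the event source, detail-type, and detail fields are assumptions to confirm against the HealthOmics event reference, and the ARNs are placeholders:

      import json
      import boto3

      events = boto3.client("events")

      # Match HealthOmics run status changes that end in failure
      # (the event shape shown here is an assumption; verify it in the docs).
      events.put_rule(
          Name="healthomics-run-failed",
          EventPattern=json.dumps({
              "source": ["aws.omics"],
              "detail-type": ["Run Status Change"],
              "detail": {"status": ["FAILED"]},
          }),
      )

      # Route matching events to an SNS topic (placeholder ARN) for alerting.
      events.put_targets(
          Rule="healthomics-run-failed",
          Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:run-alerts"}],
      )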

  • Yes, HealthOmics workflows can be shared with different AWS accounts in the same region by using the resource sharing feature. To share a workflow, you need the account ID of the AWS account you want to share with. Sharing a workflow will send a share invitation to the recipient. The recipient must accept the share request before they can run the shared workflow. The workflow owner can revoke access at any time and the recipient cannot modify or delete the shared workflow. 
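
    A sketch of the sharing flow with boto3 (the account IDs, ARN, and share name are placeholders, and the accept call must be made with the recipient account's credentials):

      import boto3

      omics = boto3.client("omics")

      # Owner account: share a workflow with another account in the same Region.
      share = omics.create_share(
          resourceArn="arn:aws:omics:us-east-1:111111111111:workflow/1234567",  # placeholder
          principalSubscriber="222222222222",    # recipient account ID
          shareName="shared-wgs-pipeline",
      )

      # Recipient account: accept the invitation before running the shared workflow.
      # omics.accept_share(shareId=share["shareId"])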

  • Files used as run inputs from S3 and the HealthOmics sequence store are assigned a unique ETag for file identification, containers stored in your private ECR repository are assigned a unique hash, and workflows are immutable once they are created, ensuring full reproducibility of runs. Every run is assigned a globally unique UUID that identifies the run, its results, and its associated logs. This UUID can be connected to your laboratory information management system (LIMS), electronic lab notebook (ELN), or sample management system to meet traceability and run reproducibility requirements.
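
    For example, a hedged sketch of retrieving the run UUID with boto3 so it can be recorded in a LIMS or ELN (the run ID and the LIMS call are placeholders):

      import boto3

      omics = boto3.client("omics")

      # Fetch the run's globally unique UUID and record it alongside the
      # sample identifier in a LIMS/ELN (the LIMS client below is hypothetical).
      run = omics.get_run(id="1234567")            # placeholder run ID
      record = {
          "run_uuid": run["uuid"],
          "output_uri": run.get("runOutputUri"),
          "workflow_id": run.get("workflowId"),
      }
      # lims_client.register_analysis(sample_id="SAMPLE-001", **record)  # hypothetical call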

  • Customers can use workflows and data stores together or as standalone solutions. HealthOmics workflows are compatible with S3 and the HealthOmics sequence and reference store. The HealthOmics sequence and reference stores can be used with HealthOmics workflows, AWS Batch, and other compute solutions.

Data Stores

  • HealthOmics offers two types of data stores: object-focused stores and queryable stores. The object-focused stores are the reference and sequence stores; they are designed for cost-effectively storing and organizing molecular files. The queryable stores are the variant and annotation stores; they are designed to cost-effectively turn variant and annotation data into an optimized store for querying and cohorting. Together these stores are designed to deliver FAIR (findable, accessible, interoperable, reusable) sample storage, querying, cohorting, and retrieval at petabyte scale.

  • HealthOmics data stores drive savings in several ways. The sequence store uses usage-driven tiering and compression to reduce storage costs for objects that have not been accessed for 30 days, which can lead to significant savings compared to traditional AWS object storage.

    The HealthOmics variant and annotation stores are zero-ETL stores, so you pay only for storage and for the data scanned when querying. Savings come from removing the cost of the ETL and from keeping variant and annotation data separate, so variant data does not have to be replicated when annotations change. Additionally, because variant stores are partitioned by sample, sample-based queries scan less data, leading to further downstream cost savings.
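
    As an illustration of a sample-scoped query, the sketch below runs Athena against a variant store through boto3; the database, table, and column names (my_omics_db, my_variant_store, sampleid, and so on), as well as the required Lake Formation and Athena workgroup setup, are assumptions for illustration only:

      import boto3

      athena = boto3.client("athena")

      # A sample-scoped query scans only that sample's partitions.
      athena.start_query_execution(
          QueryString="""
              SELECT contigname, start, referenceallele, alternatealleles
              FROM my_variant_store
              WHERE sampleid = 'SAMPLE-001'
          """,
          QueryExecutionContext={"Database": "my_omics_db"},   # placeholder resource link
          WorkGroup="primary",                                  # assumes an output location is configured
      )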

  • Each data store is designed for different data types. HealthOmics reference stores support FASTA files. HealthOmics sequence stores support FASTQ, uBAM, BAM, and CRAM files. Variant stores support extracting data from VCF files. Annotation stores support extracting data from GFF, TSV, CSV, and VCF files.
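
    A sketch of importing paired FASTQ files into an existing sequence store with boto3 (the store ID, role ARN, S3 paths, and subject/sample IDs are placeholders):

      import boto3

      omics = boto3.client("omics")

      # Import paired FASTQ files from S3 as a new read set in the sequence store.
      omics.start_read_set_import_job(
          sequenceStoreId="1234567890",                                     # placeholder
          roleArn="arn:aws:iam::123456789012:role/OmicsImportRole",         # placeholder
          sources=[{
              "sourceFiles": {
                  "source1": "s3://my-bucket/sample_R1.fastq.gz",
                  "source2": "s3://my-bucket/sample_R2.fastq.gz",
              },
              "sourceFileType": "FASTQ",
              "subjectId": "SUBJECT-001",
              "sampleId": "SAMPLE-001",
              "name": "SAMPLE-001-wgs",
          }],
      )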

  • The total volume of data and number of objects you can store in AWS HealthOmics is virtually unlimited. While each store has adjustable quotas on supported file sizes and counts, files can continue to be added as needed, and customers routinely store tens of petabytes in a single store.

  • HealthOmics data stores are built on top of Amazon S3's durability and resiliency, which includes storing objects redundantly on multiple devices across Availability Zones in an AWS Region. The sequence store preserves and monitors object semantic identity, ensuring that the contents of each file are preserved throughout activation and archiving cycles.

  • HealthOmics sequence stores can be integrated directly with most analytical tools through either the S3 access URI for objects or companion tools. Each object stored in the sequence store has a unique S3 URI that can be read by most S3-compatible systems. If a system requires a file-based interface, Mountpoint for Amazon S3 can be used to make a read set or sequence store prefix available as a mounted file for reading. If customizations are needed, integrations can be built using the AWS SDKs or the HealthOmics transfer manager.
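
    A hedged sketch of resolving a read set's S3 access URI with boto3 and reading the object through the standard S3 API; the IDs are placeholders, and the exact response fields should be confirmed against the GetReadSetMetadata documentation:

      import boto3

      omics = boto3.client("omics")
      s3 = boto3.client("s3")

      # Look up the read set's S3 access URI (field names reflect the S3 access
      # feature and may vary; store and read set IDs are placeholders).
      meta = omics.get_read_set_metadata(sequenceStoreId="1234567890", id="9876543210")
      s3_uri = meta["files"]["source1"]["s3Access"]["s3Uri"]

      # Read the object through any S3-compatible path, here the AWS SDK.
      bucket, key = s3_uri.replace("s3://", "").split("/", 1)
      header = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-1023")["Body"].read()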

  • The HealthOmics sequence store is designed for storing static molecular data that is accessed periodically or frequently. The sequence store has built-in compression and tiering, and its object reads scale on S3, so it is suitable for data at any scale and with a range of access frequencies, from daily to yearly. Each ingestion creates a new read set, and the sequence store charges for a minimum storage duration of 30 days, so it is not meant for temporary, scratch, or frequently updated files.

    Amazon S3 is a great fit for dynamic files that change frequently, files that are short lived, and non-molecular files that do not match the supported formats. For files that need to be retained for archiving and compliance reasons but have very low access needs, Amazon S3 Glacier provides a range of archival storage options.

Security & Privacy