High Performance Computing FAQs

AWS Parallel Computing Service

AWS Parallel Computing Service (AWS PCS) is a managed service that makes it easy to run and scale high performance computing (HPC) workloads, and build scientific and engineering models on AWS using Slurm. Use AWS PCS to build compute clusters that integrate AWS compute, storage, networking, and visualization. Run simulations or build scientific and engineering models. Streamline and simplify your cluster operations using built-in management and observability capabilities. Empower your users to focus on research and innovation by enabling them to run their applications and jobs in a familiar environment.

AWS PCS is currently available in the following Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm).

AWS PCS currently supports Slurm, a popular open-source job scheduler and workload manager.

Slurm is a popular open-source scheduler for managing distributed HPC workloads.

AWS PCS works by provisioning a managed Slurm controller, operating the scaling logic, and launching compute nodes for you.

Without AWS PCS, you need to run a Slurm controller on a provisioned head node, launch several compute nodes, and manage fleet operations to scale capacity to match the demand present in your job queues. With AWS PCS, you can simply define your job queues and compute preferences. The service is built to manage the Slurm controller and handles fleet scaling in a highly available and secure configuration. This helps remove operational burden and allows you to focus on simulations or science instead of managing AWS infrastructure.
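
For example, the core resources map directly to API calls. The sketch below, using the AWS SDK for Python (boto3), shows roughly how a cluster with a managed Slurm controller could be created; the parameter names, enum values, and IDs are illustrative assumptions and should be checked against the current AWS PCS API reference.

```python
# Rough sketch: create an AWS PCS cluster with a managed Slurm controller.
# Parameter names, versions, and IDs below are assumptions/placeholders;
# verify them against the current AWS PCS API reference before use.
import boto3

pcs = boto3.client("pcs", region_name="us-east-1")

cluster = pcs.create_cluster(
    clusterName="demo-cluster",
    scheduler={"type": "SLURM", "version": "23.11"},   # Slurm version is illustrative
    size="SMALL",                                      # controller size
    networking={
        "subnetIds": ["subnet-0123456789abcdef0"],     # placeholder subnet
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder security group
    },
)
print(cluster["cluster"]["id"])  # response shape is also an assumption

# Compute node groups (instance types, scaling limits) and job queues are then
# defined with create_compute_node_group() and create_queue().
```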

AWS PCS provisions Amazon Elastic Compute Cloud (Amazon EC2) instances in your account. This means you can take advantage of Amazon EC2 purchase options (On-Demand, Spot) and pricing constructs (Instance Savings Plans and other discounts) to optimize that capacity through AWS PCS.

AWS PCS builds environments using services such as Amazon EC2, Amazon Elastic Block Store (Amazon EBS), Elastic Fabric Adapter (EFA), Amazon Elastic File System (Amazon EFS), Amazon FSx, Amazon DCV, and Amazon Simple Storage Service (Amazon S3) to configure the compute, visualization, storage, and networking infrastructure to run HPC workloads on AWS.

AWS PCS uses service-linked roles and managed AWS Identity and Access Management (IAM) policies for fine-grained access control. It delivers metrics and application logs to Amazon CloudWatch and emits auditable events to AWS CloudTrail. The service supports LDAP-based user authentication and authorization for Amazon EC2 instances. It can integrate with EC2 Image Builder for Amazon Machine Image (AMI) build automation. Finally, the service supports AWS CloudFormation so you can deploy and manage AWS PCS clusters and associated infrastructure.

AWS PCS is designed for a wide range of scientific and engineering workloads such as computational fluid dynamics, weather modeling, finite element analysis, electronic design automation, and reservoir simulations. AWS PCS is built to support traditional HPC customers across verticals (such as mechanical, energy, aerospace, electronics, oil and gas, weather, and public sector) that run compute or data-intensive simulations to validate their models and designs.

Scientific and engineering modeling and simulation, and high performance data analytics (HPDA) workloads are a good fit for AWS PCS.

The AWS PCS SLA can be found here.

AWS PCS supports nearly all of the EC2 instance types available in the Region in which you are using AWS PCS.

If you have a Savings Plan, it will automatically be applied to the EC2 instances that AWS PCS launches in your account. If you have one or more capacity reservations, you can configure AWS PCS to use them through API parameters.

Yes, you can use AWS PCS to run workloads on GPU, AWS Trainium, and AWS Inferentia instance types.

AWS PCS supports Amazon EFS, Amazon EBS, Amazon FSx for Lustre, Amazon FSx for NetApp ONTAP, Amazon FSx for OpenZFS, Amazon S3, Mountpoint for Amazon S3, and Amazon File Cache. You can also connect to your own self-managed storage resources.

AWS PCS supports a wide range of EC2 instances with advanced networking options, including use of EFA. The service supports isolated subnets, AWS PrivateLink, and Amazon Virtual Private Cloud (Amazon VPC) endpoints.

With AWS PCS, you can create compute and login node groups that launch EC2 instances either in a single Availability Zone or across multiple Availability Zones.

Yes, you can configure your AWS PCS compute node groups to work with directory services such as Microsoft Active Directory, Microsoft Entra ID, and OpenLDAP.

Yes. You can start with any AMI that meets the AWS PCS AMI specification and install the AWS PCS client on it. You can review the AWS PCS AMI specification in the documentation. We also provide a sample AMI that you can use to try out the service, as described in the documentation.

AWS PCS is compatible with Amazon Linux 2, Ubuntu 22.04, Red Hat Enterprise Linux 9 (RHEL9), and Rocky Linux 9.

Yes. You can install the AWS PCS client packages on a Deep Learning AMI (DLAMI) by following the best practices in the documentation so that it works with AWS PCS.

Yes, AWS PCS sets AWS tags at both the cluster and compute node group level, so you can track historical Amazon EC2 spend at those granularities.
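
As one way to use those tags, the hedged sketch below queries recent EC2 spend grouped by a tag key with the Cost Explorer API; the tag key shown is a placeholder (check the AWS PCS documentation for the exact keys the service applies), and the tag must be activated as a cost allocation tag before it appears in Cost Explorer.

```python
# Sketch: group recent EC2 spend by a cluster-level tag using Cost Explorer.
# The tag key below is a placeholder; substitute the tag AWS PCS applies
# (see the PCS documentation) after activating it as a cost allocation tag.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-09-01", "End": "2024-10-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "pcs-cluster-name"}],  # placeholder key
)

for group in response["ResultsByTime"][0]["Groups"]:
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(group["Keys"], amount)
```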

Yes, you can use an on-premises node as a login node in an AWS PCS cluster and have users directly submit jobs to their AWS PCS cluster to run workloads on AWS from there. AWS PCS does not currently support Slurm federated scheduling or multi-cluster operation.

Amazon CloudWatch provides monitoring of your AWS PCS cluster health and performance by collecting metrics from the cluster at intervals. You can access historical data and gain insights into your cluster's performance over time. With CloudWatch, you can also monitor the EC2 instances launched by AWS PCS to meet your scaling requirements.
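
For example, the following boto3 sketch reads average CPU utilization over the last hour for one of the EC2 instances that AWS PCS launched; the instance ID is a placeholder, and CPUUtilization is a standard Amazon EC2 metric rather than anything specific to AWS PCS.

```python
# Sketch: read average CPU utilization for one compute instance over the
# last hour using CloudWatch. The instance ID is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # 5-minute data points
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```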

To get started, visit the AWS PCS console. You must have an AWS account to access this service. If you do not have an account, you will be prompted to create one. After signing in, visit the AWS PCS documentation page to access the getting started guide.

Research and Engineering Studio on AWS

Research and Engineering Studio on AWS (RES) is an open source, easy-to-use web-based portal for administrators to create and manage secure cloud-based research and engineering environments. Using RES, scientists and engineers can visualize data and run interactive applications without the need for cloud expertise.

You should use RES if you run engineering and research workloads and prefer to use a simple web-based portal to create and manage your virtual desktops on AWS. RES enables you to set up a virtual desktop environment; allow researchers and engineers to create and connect to Windows and Linux virtual desktops; monitor, budget, and manage a virtual desktop fleet from a single interface; manage your VDI environment through a web-based portal; and mount shared storage according to virtual desktop requirements for easy access to data. If researchers and engineers need to interact with and discuss outputs and designs, or simulate a test case before scaling an engineering workload, RES provides powerful virtual desktops to do so.

It’s the RES administrators’ responsibility to create and maintain file systems so users have the data they need. RES supports the Amazon EFS and Amazon FSx for NetApp ONTAP file system types; administrators can either create these through RES or onboard existing file systems. For further details on managing and creating storage, please refer to the documentation.

RES is available at no additional charge, and you pay only for the AWS resources needed to run your applications.

RES is available in a subset of Regions. You can find the list in the documentation.

You are responsible for required maintenance on EC2 instances and batch schedulers, security patching, user management, and software running on Virtual Desktop instances. RES support is limited to issues related to the build-out of the resources. If you use a custom AMI instead of one of RES's default AMIs, please note that RES doesn’t support any OS issues related to the use of a custom AMI.

RES is currently compatible with Windows and Linux operating systems. For Linux, RES supports the following distributions: Amazon Linux 2, CentOS 7, Red Hat Enterprise Linux 7, Red Hat Enterprise Linux 8, and Red Hat Enterprise Linux 9.

Each Amazon EC2 instance comes with two Remote Desktop Services (also known as Terminal Services) licenses for administration purposes. This Quickstart is available to help you provision these licenses for your administrators. You can also use AWS Systems Manager Session Manager, which enables remoting into EC2 instances without RDP and without a need for RDP licenses. If additional Remote Desktop Services licenses are needed, Remote Desktop user CALs should be purchased from Microsoft or a Microsoft license reseller. Remote Desktop user CALs with active Software Assurance have License Mobility benefits and can be brought to AWS default (shared) tenant environments. For information on bringing licenses without Software Assurance or License Mobility benefits, please see this section of the FAQ.

No. Virtual desktops within RES only support On-Demand Instances.

RES is released through the Amazon Web Services repository on GitHub. You can find installation options there.

Elastic Fabric Adapter (EFA)

EFA brings the scalability, flexibility, and elasticity of the cloud to tightly coupled high performance computing (HPC) applications. With EFA, tightly coupled HPC applications have access to lower and more consistent latency and higher throughput than traditional TCP channels, enabling them to scale better. EFA support can be enabled dynamically, on demand on any supported EC2 instance without pre-reservation, giving you the flexibility to respond to changing business and workload priorities.

HPC applications distribute computational workloads across a cluster of instances for parallel processing. Examples of HPC applications include computational fluid dynamics (CFD), crash simulations, and weather simulations. HPC applications are generally written using the Message Passing Interface (MPI) and impose stringent requirements for inter-instance communication in terms of both latency and bandwidth. Applications using MPI and other HPC middleware that supports the libfabric communication stack can benefit from EFA.
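
To make that concrete, here is a minimal mpi4py sketch of the kind of tightly coupled collective communication (an allreduce across ranks) whose scaling depends on inter-instance latency and bandwidth; it assumes mpi4py, NumPy, and an MPI implementation such as Open MPI are installed.

```python
# Minimal MPI example (run with: mpirun -n 4 python allreduce_demo.py).
# Each rank contributes a value; Allreduce sums them across all ranks.
# Assumes mpi4py and an MPI library (e.g., Open MPI built with libfabric/EFA).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local = np.array([float(rank)], dtype="d")   # this rank's contribution
total = np.empty(1, dtype="d")

# Collective operation: latency-sensitive at scale, which is where EFA helps.
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print(f"sum of ranks 0..{size - 1} = {total[0]}")
```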

EFA devices provide all of the functionality of Elastic Network Adapter (ENA) devices, plus a new OS-bypass hardware interface that allows user-space applications to communicate directly with the hardware-provided reliable transport functionality. Most applications will use existing middleware, such as the Message Passing Interface (MPI), to interface with EFA. AWS has worked with a number of middleware providers to ensure support for the OS-bypass functionality of EFA. Please note that communication using the OS-bypass functionality is limited to instances within a single Availability Zone (AZ).

For a full list of supported EC2 instances, refer to this page in our documentation.

An ENA elastic network interface (ENI) provides traditional IP networking features necessary to support VPC networking. An EFA ENI provides all the functionality of an ENA ENI, plus hardware support for applications to communicate directly with the EFA ENI without involving the instance kernel (OS-bypass communication) using an extended programming interface. Due to the advanced capabilities of the EFA ENI, EFA ENIs can only be attached at launch or to stopped instances.
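
As a hedged illustration of attaching an EFA at launch, the boto3 sketch below requests an EFA network interface for a new instance in a cluster placement group; the AMI, subnet, security group, and placement group values are placeholders, and the instance type must be EFA-capable.

```python
# Sketch: launch an EFA-enabled instance into a cluster placement group.
# AMI, subnet, security group, and placement group values are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # EFA-ready AMI (placeholder)
    InstanceType="c5n.18xlarge",              # an EFA-capable instance type
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "my-cluster-pg"}, # cluster placement group (placeholder)
    NetworkInterfaces=[
        {
            "DeviceIndex": 0,
            "SubnetId": "subnet-0123456789abcdef0",
            "Groups": ["sg-0123456789abcdef0"],
            "InterfaceType": "efa",           # request an EFA rather than a standard ENA ENI
        }
    ],
)
print(response["Instances"][0]["InstanceId"])
```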

EFA and ENA Express both use the SRD protocol, built by AWS. EFA is purpose-built for tightly coupled workloads to have direct, hardware-provided transport communication to the application layer. ENA Express is designed to use the SRD protocol for traditional networking applications that use the TCP and UDP protocols. ENA Express also works within an Availability Zone.

EFA support can be enabled either at the launch of the instance or added to a stopped instance. EFA devices cannot be attached to a running instance.

Amazon DCV

Amazon DCV is a graphics-optimized streaming protocol that is well suited for a wide range of usage scenarios, from streaming productivity applications on mobile devices to HPC simulation visualization. On the server side, Amazon DCV supports Windows and Linux. On the client side, it supports Windows, Linux, and macOS, and also provides a Web Client for HTML5 browser-based access across devices.

No. Amazon DCV works with any HTML5 web browser. However, native clients offer additional features such as multi-monitor support, and the Windows native client also supports USB devices such as 3D mice, storage devices, and smart cards. For workflows needing these features, you can download Amazon DCV native clients for Windows, Linux, and macOS here.

While Amazon DCV's performance is application agnostic, customers observe a perceptible streaming performance benefit when using Amazon DCV with 3D graphics-intensive applications that require low latency. HPC applications like seismic and reservoir simulations, computational fluid dynamics (CFD) analyses, 3D molecular modeling, VFX compositing, and Game Engine based 3D rendering are some examples of applications wherein Amazon DCV's performance benefit is apparent.

Yes. Amazon DCV's custom protocol takes care of securely and efficiently transferring images generated on the server to the client and, conversely, allows the client to control the server's keyboard and mouse. The transport layer of the protocol leverages the standard WebSocket and TLS technologies, ensuring the highest level of security and compatibility with any network infrastructure.

Amazon DCV is supported on all Amazon EC2 x86-64 architecture based instance types. When used with NVIDIA GRID compatible GPU instances (such as G2, G3, and G4), Amazon DCV will leverage hardware encoding to improve performance and reduce system load.

Enabling Amazon DCV

No, you do not need a license server to install and use the Amazon DCV server on an EC2 instance. However, you need to configure your instance to ensure it can access an Amazon S3 bucket. The Amazon DCV server automatically detects that it is running on an Amazon EC2 instance and periodically connects to the Amazon S3 bucket to determine whether a valid license is available. For further instructions on Amazon DCV license setup on Amazon EC2, refer to the document here.

Yes. Amazon DCV is downloadable software that can be installed on running instances. The Amazon DCV download page is available here.

The Amazon DCV server's OS support is documented here.

Using Amazon DCV

Amazon DCV clients display a toolbar at the top of the remote session when not in full-screen mode. Click Settings >> Streaming Mode. This opens a window that lets users choose between “Best responsiveness” (the default) and “Best quality”. Click “Display Streaming Metrics” at the bottom of the pop-up window to monitor real-time frame rate, network latency, and bandwidth usage.

The Amazon DCV server runs as an operating system service. You must be logged in as the administrator (Windows) or root (Linux) to start, stop, or configure the Amazon DCV server. For more information, refer to the document here.

By default, the Amazon DCV server is configured to communicate over port 8443. You can specify a custom TCP port after you have installed the Amazon DCV server. The port must be greater than 1024.

GPU sharing enables you to share one or more physical GPUs between multiple Amazon DCV virtual sessions. With GPU sharing, you can use a single Amazon DCV server to host multiple virtual sessions that share the server's physical GPU resources. For more details on how to enable GPU sharing, refer to the document here.

No, Amazon DCV GPU sharing is only available on Linux Amazon DCV servers.

Virtual sessions are supported on Linux Amazon DCV servers only. An Amazon DCV server can host multiple virtual sessions simultaneously. Virtual sessions are created and managed by Amazon DCV users. Amazon DCV users can only manage sessions that they have created. The root user can manage all virtual sessions that are currently running on the Amazon DCV server. For instructions on managing virtual sessions, refer to the document here.

AWS ParallelCluster

You should use AWS ParallelCluster if you want to run and operate self-managed HPC clusters on AWS. You can use AWS ParallelCluster to build test environments for HPC applications as well as use it as the starting point for building HPC infrastructure in the Cloud.

High performance computing applications that require a familiar cluster-like environment in the Cloud, such as MPI applications and machine learning applications using NCCL, are most likely to benefit from AWS ParallelCluster.

AWS ParallelCluster is integrated with AWS Batch, a fully managed AWS batch scheduler. AWS Batch can be thought of as a "cloud native" replacement for on-premises batch schedulers, with the added benefit of resource provisioning.

AWS ParallelCluster also integrates with Elastic Fabric Adapter (EFA) for applications that require low-latency networking between nodes of HPC clusters. AWS ParallelCluster is also integrated with Amazon FSx for Lustre, a high-performance file system with scalable storage for compute workloads, and Amazon Elastic File System.

AWS ParallelCluster provisions a head node for build and control, a cluster of compute instances, a shared filesystem, and a batch scheduler. You can also extend and customize your use cases using custom pre-install and post-install bootstrap actions.

AWS ParallelCluster supports AWS Batch, AWS’ fully managed, cloud-native batch scheduler, and is also compatible with Slurm.

AWS ParallelCluster is currently compatible with Amazon Linux 2, Ubuntu 18.04, CentOS 7, and CentOS 8. AWS ParallelCluster provides a list of default AMIs (one per compatible Linux distribution per region) for you to use. Note that Linux distribution availability is more limited in the GovCloud and China partitions. You can learn more about distribution compatibility by reviewing the AWS ParallelCluster User Guide at https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#base-os .

Additionally, while your cluster runs on one of the supported Linux distributions, you can run the AWS ParallelCluster command line tool to create and manage your clusters from any computer capable of running Python and downloading the AWS ParallelCluster package.

There are three ways in which you can customize AWS ParallelCluster AMIs. You can take and modify an existing AWS ParallelCluster AMI, you can take your existing customized AMI and apply the changes needed by AWS ParallelCluster on top of it, or you can use your own custom AMI at runtime. For more information, please visit https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/02_ami_customization.html .

AWS ParallelCluster does not support building Windows clusters. However, you can run the AWS ParallelCluster command line tool on your Windows machine. For more information, please visit https://docs.aws.amazon.com/parallelcluster/latest/ug/install-windows.html .

Yes. AWS ParallelCluster supports On-Demand, Reserved, and Spot Instances. Please note that work done on Spot Instances can be interrupted. We recommend that you only use Spot Instances for fault-tolerant and flexible applications.

Yes. You can have multiple queues and multiple instances per queue.

There is no built-in limit to the size of the cluster you can build with AWS ParallelCluster. There are, however, some constraints you should consider such as the instance limits that exist for your account. For some instance types, the default limits may be smaller than expected HPC cluster sizes and limit increase requests will be necessary before building your cluster. For more information on EC2 limits, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html .
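
If you want to check those limits programmatically before sizing a cluster, one option is the Service Quotas API, as in the sketch below; the string filter is only an example of the quota names to look for.

```python
# Sketch: list EC2 service quotas and print the ones related to running
# On-Demand instances, which commonly cap HPC cluster size.
import boto3

quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "On-Demand" in quota["QuotaName"]:
            print(f'{quota["QuotaName"]}: {quota["Value"]}')

# Quota increases can be requested with request_service_quota_increase(),
# specifying the ServiceCode, QuotaCode, and DesiredValue.
```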

Yes. Although AWS ParallelCluster doesn't use a placement group by default, you can enable it by either providing an existing placement group to AWS ParallelCluster or allowing AWS ParallelCluster to create a new placement group at launch. You can also configure the whole cluster or only the compute nodes to use the placement group. For more information, please see https://cfncluster.readthedocs.io/en/latest/configuration.html#placement-group.

By default, AWS ParallelCluster automatically configures a 15 GB Amazon Elastic Block Store (Amazon EBS) volume attached to the cluster’s head node and exported to the cluster’s compute nodes via Network File System (NFS). You can learn more about configuring EBS storage at https://docs.aws.amazon.com/parallelcluster/latest/ug/ebs-section.html . The size of this shared storage volume can be configured to suit your needs.

AWS ParallelCluster is also compatible with Amazon Elastic File System (EFS), RAID, and Amazon FSx for Lustre file systems. It is also possible to configure AWS ParallelCluster with Amazon S3 object storage as the source of job inputs or as a destination for job output. For more information on configuring all of these storage options with AWS ParallelCluster, please visit https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration.html .

AWS ParallelCluster is available at no additional charge, and you pay only for the AWS resources needed to run your applications.

AWS ParallelCluster is available in the following Regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), EU (Stockholm), EU (Paris), EU (London), EU (Frankfurt), EU (Ireland), EU (Milan), Africa (Cape Town), Middle East (Bahrain), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Hong Kong), AWS GovCloud (US-Gov-East), AWS GovCloud (US-Gov-West), China (Beijing), and China (Ningxia).

You are responsible for operating the cluster, including required maintenance on EC2 instances and batch schedulers, security patching, user management, and MPI troubleshooting. AWS ParallelCluster support is limited to issues related to the build-out of the resources and AWS Batch integration. AWS Batch scheduler problems are supported by the AWS Batch service team. Questions regarding other non-AWS schedulers should be directed toward their own support communities. If you use a custom AMI instead of one of AWS ParallelCluster's default AMIs, please note that AWS ParallelCluster doesn't support any OS issues related to the use of a custom AMI.

AWS ParallelCluster is released via the Python Package Index (PyPI) and can be installed via pip. AWS ParallelCluster's source code is hosted in the Amazon Web Services organization on GitHub at https://github.com/aws/aws-parallelcluster .

Amazon EnginFrame

You should use EnginFrame because it can increase the productivity of domain specialists (such as scientists, engineers, and analysts) by letting them easily extend their workflows to the cloud and reduce their time-to-results. EnginFrame reduces overhead for administrators in managing AWS resources, as well as your users’ permissions and access to those resources. These features will help save you time, reduce mistakes, and let your teams focus more on performing innovative research and development rather than worrying about infrastructure management.

EnginFrame AWS HPC Connector is supported in EnginFrame version 2021.0 or later. Once you install EnginFrame in your environment, administrators can begin defining AWS cluster configurations from the Administrator Portal.

EnginFrame administrators can use AWS ParallelCluster to create HPC clusters on AWS that are ready to accept jobs from users. To do this within EnginFrame, administrators can start by creating, editing, or uploading a ParallelCluster cluster configuration. As part of the cluster creation step, administrators create a unique name for a given AWS cluster and specify whether it is accessible to all users, to a specific set of users and/or user groups, or to no users. Once an AWS cluster has been created, it remains available to accept submitted jobs until an administrator removes it. By default, an AWS cluster in the created state will use only the minimal set of required resources in order to be ready to accept submitted jobs and will scale up elastically as jobs are submitted.

For EnginFrame services for which your administrator has enabled AWS as an option, you can use a drop-down menu to select from any of the available compute queues across on-premises and AWS. Administrators can include text descriptions to help you choose which of these queues is appropriate to run your workload.

EnginFrame supports Slurm for clusters that are created on AWS. You can also choose to use a different scheduler on-premises than on AWS (for example, use LSF on-premises and Slurm on AWS). For EnginFrame services that you set up to submit jobs both on-premises and on AWS using different job schedulers, administrators will need to ensure that any job submission scripts support submission through each of these schedulers.

EnginFrame supports Amazon Linux 2, CentOS 7, Ubuntu 18.04, and Ubuntu 20.04 operating systems on AWS. You can choose to use a different operating system on-premises than what you use on AWS. However, if you intend to use EnginFrame to run the same workload across both on-premises and AWS, we recommend using the same operating system to reduce environment difference and to simplify the portability of your workloads.

There is no additional charge for using EnginFrame on AWS. You pay for any AWS resources used to store and run your applications.

When using EnginFrame on-premises, you will be asked for a license file. To obtain an evaluation license, or to purchase new production licenses, please reach out to one of the authorized NICE distributors or resellers, who can provide sales, installation services, and support in your country.
