Why Inferentia?
AWS designs Inferentia accelerators to deliver high performance at the lowest cost in Amazon EC2 for deep learning (DL) and generative AI inference applications.
The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances. Many customers, including Finch AI, Sprinklr, Money Forward, and Amazon Alexa, have adopted Inf1 instances and realized their performance and cost benefits.
The AWS Inferentia2 accelerator delivers up to 4x higher throughput and up to 10x lower latency than first-generation Inferentia. Inferentia2-based Amazon EC2 Inf2 instances are optimized to deploy increasingly complex models, such as large language models (LLMs) and latent diffusion models, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. Many customers, including Leonardo.ai, Deutsche Telekom, and Qualtrics, have adopted Inf2 instances for their DL and generative AI applications.
The AWS Neuron SDK helps developers deploy models on AWS Inferentia accelerators (and train them on AWS Trainium accelerators). It integrates natively with popular frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows to run models on Inferentia accelerators.
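To give a feel for what that integration looks like, here is a minimal sketch of compiling an off-the-shelf PyTorch model for Inferentia2 with the Neuron SDK's torch-neuronx package. It assumes an Inf2 instance with torch-neuronx installed; the choice of ResNet-50, the input shape, and the file name are illustrative, not from the text above.

```python
# Minimal sketch: compiling a standard PyTorch model for Inferentia2.
# Assumes an Inf2 instance with the Neuron SDK (torch-neuronx) installed.
import torch
import torch_neuronx
from torchvision import models

# Load a pretrained model -- the existing PyTorch code is unchanged.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Trace-compile the model for the Neuron accelerator using an example input.
example = torch.rand(1, 3, 224, 224)  # illustrative input shape
model_neuron = torch_neuronx.trace(model, example)

# Save the compiled artifact; it can be reloaded later with torch.jit.load().
torch.jit.save(model_neuron, "resnet50_neuron.pt")

# Run inference on the accelerator like any TorchScript model.
output = model_neuron(example)
```

The workflow mirrors standard TorchScript tracing: the compiled model is invoked like a regular PyTorch module, which is what lets existing inference code carry over with few changes.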