How do I handle Spot termination notices in AWS Fargate Spot tasks?

I want to know how to handle Spot termination notices in AWS Fargate Spot tasks.

Short description

You can use Fargate Spot to run interruption-tolerant Amazon Elastic Container Service (Amazon ECS) tasks. A termination notice is a two-minute warning that you receive before a Fargate Spot task is terminated. This warning helps you manage Spot interruptions by giving your applications time to prepare for a graceful shutdown. The termination notice is created as soon as the Fargate Spot task is marked for termination, and it indicates the time when the running task will be terminated. The warning is sent as a task state change event to Amazon EventBridge and as a SIGTERM signal to the running task.
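Because the warning reaches the container as a SIGTERM signal, the application has to trap that signal to make use of the two minutes. The following is a minimal Python sketch, assuming a worker that can stop between units of work. The names run_worker and process_one_item are hypothetical and not part of any AWS API.

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop checks it between units of work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request and return quickly."""
    shutdown_requested.set()


def install_sigterm_handler():
    """Register the handler. Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one_item, max_items=None):
    """Run process_one_item (a hypothetical unit of work) until shutdown.

    Stops when SIGTERM has been received or max_items is reached.
    Returns the number of items processed.
    """
    processed = 0
    while not shutdown_requested.is_set():
        if max_items is not None and processed >= max_items:
            break
        process_one_item()
        processed += 1
    # Checkpoint state, flush buffers, or close connections here.
    return processed
```

In the container's entry point, you would call install_sigterm_handler() once and then run_worker with your own work function. Keeping the handler to a single flag update leaves the actual cleanup to the main loop, where it is easier to bound within the stop timeout.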

To be sure that the containers on Fargate Spot exit before the task stops, specify a stopTimeout value of 120 seconds or less in the container definition that the task uses. The stopTimeout value gives the container time to exit normally after it receives SIGTERM. After this duration elapses, the container is forcefully stopped.

Note: You can specify a maximum value of 120 seconds for stopTimeout. If you don't specify any value for this parameter, then the default value of 30 seconds is used.
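To show where stopTimeout sits, here is a small Python sketch that builds a container definition as a dictionary. The helper name container_definition and the example image are hypothetical. The helper rejects values above the 120-second maximum and, as a sanity check that is an assumption of this sketch, negative values.

```python
MAX_STOP_TIMEOUT_SECONDS = 120  # maximum stated in the note above


def container_definition(name, image, stop_timeout=MAX_STOP_TIMEOUT_SECONDS):
    """Build a minimal container definition dict with stopTimeout set."""
    if not 0 <= stop_timeout <= MAX_STOP_TIMEOUT_SECONDS:
        raise ValueError(
            f"stopTimeout must be between 0 and {MAX_STOP_TIMEOUT_SECONDS} seconds"
        )
    return {
        "name": name,
        "image": image,
        "essential": True,
        # Seconds the container gets to exit after SIGTERM before it is killed.
        "stopTimeout": stop_timeout,
    }


definition = container_definition("worker", "example/worker:latest", stop_timeout=110)
```

The resulting dictionary would be one entry in the containerDefinitions list of a task definition. Choosing a value slightly under 120 leaves some margin inside the two-minute warning.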

When the interruption signal is received on Amazon ECS services using Fargate Spot, the service scheduler determines if additional capacity is available. The service scheduler uses the minimumHealthyPercent and maximumPercent values to make this determination. If capacity is available, then the service scheduler attempts to launch additional tasks on Fargate Spot. However, if the service scheduler fails to find the capacity for new tasks, then the old tasks are terminated after the stopTimeout duration elapses.
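As a worked example of how minimumHealthyPercent and maximumPercent bound the number of tasks, the following Python sketch computes the lower and upper task counts for a service. The function name is hypothetical, and the rounding (up for the minimum, down for the maximum) is an assumption taken from the Amazon ECS deployment configuration documentation rather than from this article.

```python
def task_count_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (lower, upper) running-task limits for a service.

    lower: desired_count * minimum_healthy_percent / 100, rounded up.
    upper: desired_count * maximum_percent / 100, rounded down.
    """
    lower = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    upper = desired_count * maximum_percent // 100  # floor division
    return lower, upper


# A service with 4 desired tasks, 50% minimum healthy, and 200% maximum.
lower, upper = task_count_bounds(4, 50, 200)
```

With these values the service may run between 2 and 8 tasks, so the scheduler has room to start up to 4 replacement tasks while interrupted tasks are still shutting down. If maximumPercent were 100, there would be no headroom to start replacements before the old tasks stop.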

Consider the following when you use Fargate Spot with a load balancer:

  • Tasks that are run as FARGATE_SPOT might not be deregistered from a load balancer’s target group until the task transitions to a STOPPED state.
  • With FARGATE_SPOT, you get only two minutes to deregister the task from the target group before the task is shut down. Therefore, set the deregistration delay for any target group associated with FARGATE_SPOT tasks to a value of less than two minutes (120 seconds).
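The deregistration delay is a target group attribute, so it can be set programmatically. The sketch below assumes an Elastic Load Balancing v2 client such as the one returned by boto3.client("elbv2") and passes it in as a parameter. The function name and the 90-second value are illustrative choices, not recommendations from this article.

```python
SPOT_WARNING_SECONDS = 120  # length of the Fargate Spot termination notice


def set_deregistration_delay(elbv2_client, target_group_arn, delay_seconds):
    """Set a target group's deregistration delay below the Spot warning window.

    elbv2_client is assumed to behave like boto3.client("elbv2").
    """
    if not 0 <= delay_seconds < SPOT_WARNING_SECONDS:
        raise ValueError(
            f"delay must be at least 0 and less than {SPOT_WARNING_SECONDS} seconds"
        )
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=[
            {
                "Key": "deregistration_delay.timeout_seconds",
                # Attribute values are passed as strings.
                "Value": str(delay_seconds),
            }
        ],
    )
```

A call might look like set_deregistration_delay(boto3.client("elbv2"), target_group_arn, 90), where target_group_arn is the ARN of your own target group.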

Resolution

When capacity is unavailable, or the capacity is taken back for Fargate Spot, the ECS service scheduler can't launch new tasks. Instead, the scheduler terminates the existing tasks after providing a two-minute notice. However, these events aren't reported in the Amazon ECS console.

Amazon ECS delivers events to EventBridge in near real time. Therefore, it's a best practice to create rules that match the events that you're interested in and that trigger automated actions when an event matches. This article covers EventBridge rules for the following use cases:

  • A FARGATE_SPOT task is shut down due to Fargate Spot interruption.
  • A FARGATE_SPOT task can't be placed due to the unavailability of Fargate Spot capacity.

A FARGATE_SPOT task is shut down due to Fargate Spot interruption

The following is a snippet of a task state change event displaying the stopped reason and stop code for a Fargate Spot interruption:

{
  "version": "0",
  "id": "a99d3f53-4a7c-4153-a1a5-48957fc83b8f",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "1111222233334444",
  "resources": [
    "arn:aws:ecs:ap-southeast-2:1111222233334444:task/4be29e5b-b05c-42a2-a596-be62090eea9b"
  ],
  "detail": {
    "clusterArn": "arn:aws:ecs:ap-southeast-2:1111222233334444:cluster/default",
    "createdAt": "2022-02-25T10:13:08.455Z",
    "desiredStatus": "STOPPED",
    "lastStatus": "RUNNING",
    "stoppedReason": "Your Spot Task was interrupted.",
    "stopCode": "SpotInterruption",
    "taskArn": "arn:aws:ecs:ap-southeast-2:1111222233334444:task/4be29e5b-b05c-42a2-a596-be62090eea9bEXAMPLE",
    ...
  }
}

Note that stopCode is set to SpotInterruption when a task is stopped due to a Fargate Spot interruption. You can create an EventBridge rule that sends an Amazon Simple Notification Service (Amazon SNS) alert whenever a FARGATE_SPOT task is stopped with the SpotInterruption stop code.
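If you process these events in your own code, for example in a function that subscribes to the alert, the check comes down to three fields. The following Python sketch assumes the event has already been parsed into a dictionary. Both function names are hypothetical.

```python
def is_spot_interruption(event):
    """Return True if the event is an ECS task state change caused by Spot."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") == "ECS Task State Change"
        and detail.get("stopCode") == "SpotInterruption"
    )


def describe_interruption(event):
    """Return a one-line summary, or None if this is not a Spot interruption."""
    if not is_spot_interruption(event):
        return None
    detail = event["detail"]
    task = detail.get("taskArn", "unknown task")
    reason = detail.get("stoppedReason", "no reason given")
    return f"{task}: {reason}"
```

Using .get with defaults keeps the summary tolerant of events where optional fields are missing.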

Do the following:

1.    Create an Amazon SNS topic.

2.    Create an EventBridge rule for this use case.

To create an EventBridge rule for this use case, do the following:

1.    Open the Amazon EventBridge console.

2.    In the navigation pane, choose Rules.

3.    Choose Create rule.

4.    Enter a name and description for the rule.

5.    For Event bus, select AWS default event bus.

6.    For Rule type, select Rule with an event pattern.

7.    Choose Next.

8.    For Event source, select AWS services.

9.    For Event pattern, choose Custom patterns (JSON editor), and add the following pattern:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "stopCode": [
      "SpotInterruption"
    ],
    "clusterArn": [
      "arn:aws:ecs:exampleregion:1111222233334444:cluster/examplecluster"
    ]
  }
}

10.    Choose Next.

11.    For Target types, select AWS service.

12.    For Select a target, select SNS topic.

13.    For Topic, select the SNS topic that you created.

14.    Choose Next.

15.    On the Configure tags - optional page, choose Next.

16.    Review the options and choose Create rule.
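The same rule can also be created programmatically. The following Python sketch assumes an EventBridge client such as the one returned by boto3.client("events") and passes it in as a parameter. The function name, rule name, and target ID are hypothetical. Unlike the console flow, this sketch doesn't update the SNS topic's access policy, so it assumes that the topic already allows EventBridge to publish to it.

```python
import json


def create_spot_interruption_rule(events_client, rule_name, cluster_arn, topic_arn):
    """Create a rule that forwards Spot interruption events to an SNS topic.

    events_client is assumed to behave like boto3.client("events").
    Returns the ARN of the created rule.
    """
    pattern = {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "stopCode": ["SpotInterruption"],
            "clusterArn": [cluster_arn],
        },
    }
    rule = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),  # the pattern is passed as a JSON string
        State="ENABLED",
    )
    result = events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "spot-interruption-sns", "Arn": topic_arn}],
    )
    if result.get("FailedEntryCount", 0):
        raise RuntimeError(f"could not add target: {result.get('FailedEntries')}")
    return rule["RuleArn"]
```

Passing the client in, rather than creating it inside the function, makes it straightforward to point the sketch at a specific Region or to substitute a stub in tests.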

A FARGATE_SPOT task can't be placed due to the unavailability of Fargate Spot capacity

The following is a service task placement failure event that occurred under the following conditions:

  • The task was trying to use the FARGATE_SPOT capacity provider.
  • The service scheduler couldn't acquire any Fargate Spot capacity.

{
  "version": "0",
  "id": "403b98b2-616e-4ec7-8dff-b2cba8d5bf64",
  "detail-type": "ECS Service Action",
  "source": "aws.ecs",
  "account": "1111222233334444",
  "time": "2022-02-25T14:56:32.756Z",
  "region": "ap-southeast-2",
  "resources": [
    "arn:aws:ecs:ap-southeast-2:1111222233334444:service/default/servicetest"
  ],
  "detail": {
    "eventType": "ERROR",
    "eventName": "SERVICE_TASK_PLACEMENT_FAILURE",
    "clusterArn": "arn:aws:ecs:ap-southeast-2:1111222233334444:cluster/default",
    "capacityProviderArns": [
      "arn:aws:ecs:ap-southeast-2:1111222233334444:capacity-provider/FARGATE_SPOT"
    ],
    "reason": "RESOURCE:FARGATE",
    "createdAt": "2022-02-25T14:21:04.163Z"
  }
}

When a task can't be placed due to unavailable Fargate Spot capacity, the eventName is set to SERVICE_TASK_PLACEMENT_FAILURE and the event is delivered with the detail-type ECS Service Action. This means that you can create an EventBridge rule that sends an SNS alert whenever a FARGATE_SPOT task can't be placed.

Do the following:

1.    Create an SNS topic.

2.    Create an Amazon EventBridge rule for this use case. To do this, use the instructions provided in the section A FARGATE_SPOT task is shut down due to Fargate Spot interruption except for the following change:

For Event pattern, choose Custom patterns (JSON editor), and add the following pattern:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Service Action"
  ],
  "detail": {
    "eventName": [
      "SERVICE_TASK_PLACEMENT_FAILURE"
    ],
    "clusterArn": [
      "arn:aws:ecs:example-region:1111222233334444:cluster/example-cluster"
    ],
    "reason": [
      "RESOURCE:FARGATE"
    ]
  }
}
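Before relying on a rule, it can help to check a pattern against a sample event. The following Python sketch implements only a simplified subset of event pattern matching (exact values and nested objects) and is not EventBridge's own matcher. The function names are hypothetical. It builds a pattern for the placement failure event shown earlier in this section, whose detail-type is ECS Service Action.

```python
def matches(pattern, event):
    """Simplified event pattern match: exact values and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A pattern leaf is a list of allowed values. An event array
            # matches when any of its elements is an allowed value.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


def placement_failure_pattern(cluster_arn):
    """Pattern for Fargate Spot placement failures in one cluster."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Service Action"],
        "detail": {
            "eventName": ["SERVICE_TASK_PLACEMENT_FAILURE"],
            "clusterArn": [cluster_arn],
            "reason": ["RESOURCE:FARGATE"],
        },
    }


# Trimmed copy of the sample event from this section.
sample_event = {
    "detail-type": "ECS Service Action",
    "source": "aws.ecs",
    "detail": {
        "eventType": "ERROR",
        "eventName": "SERVICE_TASK_PLACEMENT_FAILURE",
        "clusterArn": "arn:aws:ecs:ap-southeast-2:1111222233334444:cluster/default",
        "reason": "RESOURCE:FARGATE",
    },
}

is_match = matches(
    placement_failure_pattern(sample_event["detail"]["clusterArn"]), sample_event
)
```

A pattern whose detail-type differs from the event's detail-type never matches, which is a common reason for a rule that stays silent. To check against the service itself, EventBridge also provides a TestEventPattern API.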

Related information

Handling Fargate Spot termination notices

Creating Amazon EventBridge rules that react to events

AWS OFFICIAL