How can I be sure that CloudWatch alarms activate actions?

My Amazon CloudWatch alarm isn't activated even though I can see from my CloudWatch graphs that the alarm metric exceeds the configured threshold. How can I be sure that my CloudWatch alarms are activated and the alarm actions are performed?

Short description

CloudWatch alarms that measure time-aggregated metrics (such as five-minute averages) evaluate those metrics continuously over a rolling window. The alarm is activated only if every data point in the evaluation window exceeds the configured threshold. If even one data point in the window doesn't exceed the threshold, then the CloudWatch alarm isn't activated.

CloudWatch alarms start actions only when the alarm changes state, and the alarm changes state only after the new condition is maintained for the specified number of periods. For more information, see Creating CloudWatch alarms.

Important: There is an exception to this behavior for CloudWatch alarms that are associated with Amazon EC2 Auto Scaling actions. A CloudWatch alarm keeps activating Auto Scaling actions when that alarm is in a specified state. This happens even if there are no state changes and the alarm remains in that state.

Resolution

Be sure to consider the mechanism used by CloudWatch to measure time-aggregated metrics when you create alarms.

If the alarm isn't activated when you expect it to be, consider lowering the alarm threshold so that every data point in the evaluation window exceeds it.
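For reference, the alarm used in the troubleshooting example below could be defined roughly as follows. This is a minimal sketch, assuming the AWS SDK for Python (boto3) and its put_metric_alarm call. The alarm name, instance ID, and SNS topic ARN are hypothetical placeholders, and the EC2 namespace and metric name are assumptions because the article only says "average CPU utilization".

```python
# Parameters for a CPU alarm: Average > 45 for 3 consecutive 300-second periods.
# AlarmName, the InstanceId value, and the SNS topic ARN are hypothetical placeholders.
ALARM_PARAMS = {
    "AlarmName": "example-high-cpu",
    "Namespace": "AWS/EC2",              # assumed: the metric comes from Amazon EC2
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Average",
    "Period": 300,                        # length of one period, in seconds
    "EvaluationPeriods": 3,               # number of periods in the rolling window
    "Threshold": 45.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:example-topic"],
}


def evaluation_window_seconds(params=ALARM_PARAMS):
    """Return the total length of the rolling evaluation window in seconds."""
    return params["Period"] * params["EvaluationPeriods"]


def create_cpu_alarm(cloudwatch_client, params=ALARM_PARAMS):
    """Create or update the alarm through a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**params)
```

With boto3 installed and credentials configured, the alarm could be created with create_cpu_alarm(boto3.client("cloudwatch")). The evaluation window here is 900 seconds, so a single non-breaching five-minute data point keeps the alarm out of the ALARM state for up to 15 minutes.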

Troubleshooting example

In this example, there is an alarm based on average CPU utilization. The alarm is configured with a threshold of > 45, a period of 300 seconds, and three evaluation periods. This means that the average must exceed the threshold for three consecutive five-minute periods before the alarm is activated. The alarm receives the following time-aggregated metrics:

  • 05:25:00: data: {Avg=61.123}
  • 05:30:00: data: {Avg=57.847}
  • 05:35:00: data: {Avg=60.503}
  • 05:40:00: data: {Avg=55.473}
  • 05:45:00: data: {Avg=41.685}
  • 05:50:00: data: {Avg=58.390}
  • 05:55:00: data: {Avg=57.846}
  • 06:00:00: data: {Avg=61.123}

These data points result in the following alarm states:

  • 05:35 ALARM
  • 05:40 ALARM
  • 05:45 ALARM to OK
  • 05:50 OK
  • 05:55 OK
  • 06:00 OK to ALARM

The data point collected at 05:55 exceeds the Average CPU Utilization threshold of 45%. However, the alarm remains in the OK state and doesn't activate the action at 05:55. This happens because the evaluation at 05:55 covers the three data points from 05:45, 05:50, and 05:55, and the data point collected at 05:45:00 doesn't exceed the threshold. Five minutes later, the 05:45 data point drops out of the window, the alarm state changes from OK to ALARM at 06:00, and the alarm starts the action.
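The rolling-window evaluation described above can be reproduced with a short simulation. This is a simplified sketch, not CloudWatch's actual implementation: it assumes that every data point in the window must breach the threshold and ignores missing data and the INSUFFICIENT_DATA state. The function and variable names are illustrative.

```python
from collections import deque

THRESHOLD = 45.0        # alarm condition: average CPU utilization > 45
EVALUATION_PERIODS = 3  # consecutive breaching periods required

# (timestamp, five-minute average) pairs from the example above
DATAPOINTS = [
    ("05:25", 61.123),
    ("05:30", 57.847),
    ("05:35", 60.503),
    ("05:40", 55.473),
    ("05:45", 41.685),
    ("05:50", 58.390),
    ("05:55", 57.846),
    ("06:00", 61.123),
]


def evaluate(datapoints, threshold=THRESHOLD, periods=EVALUATION_PERIODS):
    """Return (timestamp, state) for each data point once the window is full."""
    window = deque(maxlen=periods)  # keeps only the most recent `periods` values
    states = []
    for timestamp, value in datapoints:
        window.append(value)
        if len(window) < periods:
            continue  # not enough data points yet to evaluate
        breaching = all(v > threshold for v in window)
        states.append((timestamp, "ALARM" if breaching else "OK"))
    return states


for timestamp, state in evaluate(DATAPOINTS):
    print(timestamp, state)
```

The printed states match the list above: ALARM at 05:35 and 05:40, OK from 05:45 through 05:55, and ALARM again at 06:00.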

For the following time-aggregated metrics, the alarm state remains ALARM from 05:35 onward because all the data points exceed the Average CPU Utilization threshold of 45%, including the 05:45 data point of 45.075. Because there are no state changes after 05:35, the alarm action isn't activated again.

  • 05:25:00: data: {Avg=61.123}
  • 05:30:00: data: {Avg=57.847}
  • 05:35:00: data: {Avg=60.503}
  • 05:40:00: data: {Avg=55.473}
  • 05:45:00: data: {Avg=45.075}
  • 05:50:00: data: {Avg=58.390}
  • 05:55:00: data: {Avg=57.847}
  • 06:00:00: data: {Avg=61.123}
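Because actions are tied to state changes, it can help to list only the transitions for a data set. The following sketch does this for the second data set under the same simplified model (all three data points in the window must breach the threshold). The initial OK state is an assumption, because the article doesn't say which state the alarm was in before 05:35.

```python
THRESHOLD = 45.0        # alarm condition: average CPU utilization > 45
EVALUATION_PERIODS = 3  # consecutive breaching periods required

TIMESTAMPS = ["05:25", "05:30", "05:35", "05:40", "05:45", "05:50", "05:55", "06:00"]
VALUES = [61.123, 57.847, 60.503, 55.473, 45.075, 58.390, 57.847, 61.123]


def state_changes(values, timestamps, initial_state="OK"):
    """Return (timestamp, old_state, new_state) for every state transition.

    initial_state is an assumption about the alarm's state before the
    first full evaluation window.
    """
    changes = []
    state = initial_state
    for i in range(EVALUATION_PERIODS - 1, len(values)):
        # The window is the current data point plus the two before it.
        window = values[i - EVALUATION_PERIODS + 1 : i + 1]
        new_state = "ALARM" if all(v > THRESHOLD for v in window) else "OK"
        if new_state != state:
            changes.append((timestamps[i], state, new_state))
            state = new_state
    return changes


print(state_changes(VALUES, TIMESTAMPS))
```

Under these assumptions the only transition is OK to ALARM at 05:35. No later evaluation changes the state, so a state-change action has no further opportunity to run.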

Related information

Dynamic scaling for Amazon EC2 Auto Scaling

Viewing available metrics

AWS OFFICIAL
Updated 2 years ago