Why does my Amazon Redshift query exceed the WLM timeout that I set?

4 minute read

I set a workload management (WLM) timeout for an Amazon Redshift query, but the query continues to run after this period expires

Short description

A WLM timeout applies to queries only during the query running phase. If WLM doesn't terminate a query when expected, it's usually because the query spent time in stages other than the execution stage. For example, the query might wait to be parsed or rewritten, wait on a lock, or wait for a spot in the WLM queue. Or, the query might hit the return stage or hop to another queue.

Resolution

When STV_RECENTS is queried, starttime is the time the query entered the cluster, not the time that the query begins to run. When the query is in the Running state in STV_RECENTS, it's live in the system. However, the query doesn't use compute node resources until it enters STV_INFLIGHT status. For more information about query planning, see Query planning and execution workflow.

To view the status of a running query, query STV_INFLIGHT instead of STV_RECENTS:

select \* from STV\_INFLIGHT where query = your\_query\_id;

For more information about query stages, run the following query:

select \* from SVL\_QUERY\_REPORT where query = your\_query\_id ORDER BY segment, step, slice;

Use the STV_EXEC_STATE table to find the current state of any queries that are actively running on compute nodes:

select \* from STV\_EXEC\_STATE where query = your\_query\_id ORDER BY segment, step, slice;

The following are common reasons why a query might appear to run longer than the WLM timeout period.

The query is in the "return" phase

There are two "return" steps. Check STV_EXEC_STATE to see if the query has entered one of the following return phases:

The return to the leader node from the compute nodes
The return to the client from the leader node

A rollback is in progress

A data manipulation language (DML) operation might encounter an error and roll back. This operation might not appear as "stopped" because it's already in the process of rolling back. You can query STV_EXEC_STATE to view rollbacks, and find more information in STL_UNDONE.

The query spends time queuing before running

Query STV_WLM_QUERY_STATE to see queuing time:

select \* from STV\_WLM\_QUERY\_STATE where query = your\_query\_id;

The query is waiting on a lock

If the query is visible in STV_RECENTS, but not in STV_WLM_QUERY_STATE, then the query might be waiting on a lock and hasn't entered the queue. For more information, see How do I detect and release locks in Amazon Redshift?

A query hopped to another queue

If a read query reaches the timeout limit for its current WLM queue, then the query is pushed to the next WLM queue. Or, if there's a query monitoring rule that specifies a hop action, then the query is pushed to the next WLM queue. To confirm if the query hopped to the next queue, complete the following query based on your scenario:

If the query is currently running, then query STV_WLM_QUERY_STATE.
If the query completed, then query STL_WLM_QUERY

To prevent queries from hopping to another queue, configure the WLM queue or WLM query monitoring rules. For more information about query hopping, see WLM query queue hopping.

A networking or firewall issue

If an Amazon Redshift server has a problem communicating with your client, then the server might get stuck in the "return to client" state. Check for conflicts with networking components, such as inbound on-premises firewall settings, outbound security group rules, or outbound network access control list (network ACL) rules. For more information, see Connecting from outside of Amazon EC2 --firewall timeout issue.

An issue with the cluster

Issues on the cluster itself, such as hardware issues, might cause the query to freeze. When the query freezes because of cluster issues, the cluster is in "hardware-failure" status. To recover a single-node cluster, restore a snapshot. In multi-node clusters, failed nodes are automatically replaced.

Related information

Tuning query performance

Analyzing and improving queries

Topics

Analytics

Relevant content

Redshift Serverless WLM rules for max query execution time does not update - bug or am I missing something?
AWS-user
asked 9 months ago
Amazon Redshift concurrency scaling - How much time it takes to complete scaling and setting threshold to trigger it
Accepted Answer
rePost-User-9985749
asked 2 years ago
Redshift bad short query performance
alifeson
asked 3 years ago
Redshift out of memory when running query
Accepted Answer
AWS-User-3510348
asked 2 years ago
Amazon Redshift
SumitJaiswal
asked 2 years ago
How do I use automatic WLM to manage my workload in Amazon Redshift?
AWS OFFICIALUpdated 2 years ago
Why was my query canceled in Amazon Redshift?
AWS OFFICIALUpdated 10 months ago
How do I use and manage Amazon Redshift WLM memory allocation?
AWS OFFICIALUpdated 2 years ago
How do I create and prioritize query queues in my Amazon Redshift cluster?
AWS OFFICIALUpdated 4 months ago
Why doesn't S3 respect the TLS settings in my IAM policy?
EXPERT
Brettski-AWS
published 2 years ago