Why can't I delete an index or upgrade my Amazon OpenSearch Service cluster?

I'm unable to delete an index, or upgrade my Amazon OpenSearch Service cluster. Why is this happening?

Short description

If you try to delete an index or upgrade your OpenSearch Service cluster, the change can fail for the following reasons:

  • Snapshot is already in progress.
  • Snapshot in progress is stuck.
  • Snapshot in progress has a cluster in red status.
  • Snapshot timeout or failure.

For more information about OpenSearch Service upgrade failures, see Troubleshooting an upgrade. For more information about the red health status of an OpenSearch Service cluster, see Red cluster status.

Resolution

Snapshot is already in progress

While a snapshot is in progress, you might encounter one of the following error messages:

  • "Prior snapshot operation has not yet completed" (during a cluster upgrade)
  • "Cannot delete indices that are being snapshotted" (while deleting an index)

If you received an error, try the following:

1.    For encrypted domains, use the following syntax to check whether an automated snapshot is in progress:

curl -XGET "https://domain-endpoint/_snapshot/cs-automated-enc/_status"

2.    For unencrypted domains, use the following syntax to check whether an automated snapshot is in progress:

curl -XGET "https://domain-endpoint/_snapshot/cs-automated/_status"

If there are no running snapshots, then the following output appears:

{
    "snapshots": []
}

The empty brackets indicate that no snapshot is running, so you can safely delete the index or perform an upgrade. If OpenSearch Service is unable to check whether a snapshot is in progress, then the operation can fail.
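As a minimal sketch, the check above can be wrapped in a small shell helper. The function name no_snapshot_running is hypothetical, and the helper assumes that a response with no running snapshots is exactly the empty "snapshots" array shown above, ignoring whitespace:

```shell
# Hypothetical helper: succeeds (exit status 0) only when the _status
# response body is the empty "snapshots" array, ignoring whitespace.
no_snapshot_running() {
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  [ "$compact" = '{"snapshots":[]}' ]
}

# Example usage (domain-endpoint is a placeholder; use cs-automated-enc
# for encrypted domains):
#   response=$(curl -s -XGET "https://domain-endpoint/_snapshot/cs-automated/_status")
#   if no_snapshot_running "$response"; then echo "No snapshot in progress"; fi
```

Any other response, including an empty body from a failed request, is treated as "not safe yet", which errs on the side of caution.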

Snapshot in progress is stuck

1.    Use the following command syntax to check the start and end times of your hourly snapshots:

curl -XGET "https://domain-endpoint/_cat/snapshots/cs-automated?v&s=id"

2.    Print only the start times by piping the cURL output to the awk command:

curl -XGET "https://domain-endpoint/_cat/snapshots/cs-automated?v&s=id" | awk -F" " ' { print $4 } '

The output shows the times when the hourly snapshots started. For example, this output indicates that the snapshot starts about 51 minutes past each hour:

22:51:11
23:51:18
00:51:19
01:51:14
02:51:16
03:51:18
04:51:16
05:51:11

3.    Check your OpenSearch Service upgrade eligibility.

Important: Don't run the upgrade eligibility check until the snapshot is complete.

Use the snapshot status API to check whether the snapshot is completed. The snapshot status API returns an empty set when your snapshot is captured. If the current status is in progress and doesn't change for a while, then your snapshot might be stuck. The same applies to snapshots that are stopped, which can prevent other snapshots from being taken. If the cluster is in red status, or there is a write block, clear the status or block to resolve the failure.
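One way to notice a snapshot that stays in progress is to poll the snapshot status API a fixed number of times. The following is a sketch only: wait_for_snapshot and fetch_status are hypothetical names, the retry count and interval are arbitrary, and any response other than the empty "snapshots" array is treated as "still running":

```shell
# Hypothetical helper: wait_for_snapshot FETCH_CMD MAX_TRIES PAUSE_SECONDS
# FETCH_CMD is a command or function that prints the _status response.
# Succeeds when the response is the empty "snapshots" array; fails if a
# snapshot is still reported after MAX_TRIES checks (it might be stuck).
wait_for_snapshot() {
  fetch=$1
  max_tries=$2
  pause=$3
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    body=$("$fetch" | tr -d '[:space:]')
    if [ "$body" = '{"snapshots":[]}' ]; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$pause"
  done
  return 1
}

# Example usage (domain-endpoint is a placeholder):
#   fetch_status() { curl -s -XGET "https://domain-endpoint/_snapshot/cs-automated/_status"; }
#   wait_for_snapshot fetch_status 10 60 || echo "Snapshot still in progress after 10 checks"
```

Passing the fetch command as an argument keeps the polling logic separate from the endpoint, so the same loop works for the cs-automated and cs-automated-enc repositories.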

Note: The data from your snapshot can change after configuration changes are made. Therefore, don't use the snapshot for scheduled jobs.
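To see at a glance when the hourly snapshot usually starts, the start times from step 2 can be grouped by minute of the hour. This is a rough sketch: start_minute_counts is a hypothetical name, and it assumes the column layout shown in this article, where the fourth column of the _cat/snapshots output is the start time in HH:MM:SS form:

```shell
# Hypothetical helper: reads "_cat/snapshots/<repository>?v" output on stdin,
# skips the header row, and prints each start minute followed by the
# number of snapshots that started in that minute.
start_minute_counts() {
  awk 'NR > 1 { split($4, t, ":"); print t[2] }' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (domain-endpoint is a placeholder):
#   curl -s -XGET "https://domain-endpoint/_cat/snapshots/cs-automated?v&s=id" | start_minute_counts
```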

Snapshot in progress has a cluster in red status

1.    To list only the repository names registered to your domain, use the following syntax:

curl -XGET "https://domain-endpoint/_cat/repositories?v&h=id"

2.    To list the repository names, types, and other settings registered to your domain, use the following syntax:

curl -XGET "https://domain-endpoint/_snapshot?pretty"
curl -XGET "https://domain-endpoint/_cluster/state/metadata"

3.    Check if you can list snapshots in each of the repositories, excluding the cs-automated or cs-automated-enc repositories. If you have several repositories, use a bash script like this:

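#!/bin/bash
# List the snapshots in every manual repository and save the output to /tmp/snapshot.
# The automated repositories (cs-automated, cs-automated-enc) are skipped.
repos=$(curl -s "https://domain-endpoint/_cat/repositories" | grep -v "cs-automated" | awk '{print $1}')

for i in $repos; do
  echo "Snapshots in ... : $i" >> /tmp/snapshot
  curl -s -XGET "https://domain-endpoint/_cat/snapshots/$i" >> /tmp/snapshot
  echo "done..."
done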

Important: Stuck snapshots can't be manually deleted in the cs-automated or cs-automated-enc repository.

4.    To view the output saved in the /tmp/snapshot file, use the following syntax:

cat /tmp/snapshot

The command returns a response similar to this:

Snapshots in ... : snapshot-manual-repo
axa_snapshot-1557497454881 SUCCESS 1557639400 05:36:40 1557639405 05:36:45  4.6s  7 31 0 31
2019-05-15                 SUCCESS 1560503610 09:13:30 1560503622 09:13:42 11.8s  4 16 0 16
epoch_test                 SUCCESS 1569151317 11:21:57 1569151335 11:22:15 18.1s 15 56 0 56

If the command instead returns an error message similar to the following, then the Amazon Simple Storage Service (Amazon S3) bucket was deleted but is still registered as a snapshot repository:

Snapshots in ... : snapshot-manual-repo
{
    "error": {
        "root_cause": [{
            "type": "repository_exception",
            "reason": "[snapshot-manual-repo] could not read repository data from index blob"
        }],
        "type": "repository_exception",
        "reason": "[snapshot-manual-repo] could not read repository data from index blob",
        "caused_by": {
            "type": "i_o_exception",
            "reason": "Exception when listing blobs by prefix [index-]",
            "caused_by": {
                "type": "a_w_s_security_token_service_exception",
                "reason": "a_w_s_security_token_service_exception: User: arn:aws:sts::999999999999:assumed-role/cp-sts-grant-role/swift-us-east-1-prod-666666666666 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::666666666666:policy/my-manual-es-snapshot-creator-policy (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 6b9374fx-11xy-11yz-ff66-918z9bb08193)"
            }
        }
    },
    "status": 500
}

5.    Verify that the Amazon S3 bucket for the manual snapshot repository is deleted:

aws s3 ls | grep -i "snapshot-manual-repo"

Note: Replace snapshot-manual-repo with your bucket name.

6.    Delete the repository from your cluster:

curl -XDELETE "https://domain-endpoint/_snapshot/snapshot-manual-repo"
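When the domain has many repositories, it can help to pull out only the ones that returned an error in step 4 before deleting anything. The following sketch assumes that /tmp/snapshot was written by the script in step 3, so each repository's output starts with a "Snapshots in ... : " line. The function name failed_repos is hypothetical:

```shell
# Hypothetical helper: prints the name of each repository whose section
# in the given file contains a repository_exception error.
failed_repos() {
  awk '
    /^Snapshots in \.\.\. : / { repo = $NF; next }
    /repository_exception/ && !seen[repo]++ { print repo }
  ' "$1"
}

# Example usage:
#   failed_repos /tmp/snapshot
```

Each repository name is printed once, even though the error type appears more than once in the JSON response.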

Snapshot timeout or failure

If you received a snapshot timeout or failure, perform the following steps:

1.    Check whether you can take a manual snapshot. If you get a "Can't take manual snapshot" error, call the _cat/snapshots API:

curl -XGET "https://domain-endpoint/_cat/snapshots/s3_repository"

2.    Replace s3_repository with the name of your manual snapshot repository. The output shows how long the current snapshot has been running. If the duration seems reasonable, wait for the snapshot to complete, and then try taking the snapshot again.

Note: Your snapshot duration can take longer depending on the size of your indices or the resource consumption of your cluster.

3.    Check the health status of your cluster:

curl -XGET "https://domain-endpoint/_cluster/health?pretty"

If your cluster's health status is red, then first identify and address the root cause of your red cluster status. If OpenSearch Service is relocating or initializing shards, then wait for the process to complete before configuring any access policies. Note that shard reallocation can significantly strain the computing resources of your cluster. For more information about troubleshooting a red cluster, see Red cluster status.
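The two checks in this section can be sketched as small filters over the cURL output. Both function names are hypothetical. The first assumes the _cat/snapshots column layout shown earlier in this article (ID, status, and duration in columns 1, 2, and 7), and the second assumes that the health response contains a single "status" field:

```shell
# Hypothetical helper: reads "_cat/snapshots/<repository>" output on stdin
# and prints the ID, status, and duration of every snapshot whose status
# is not SUCCESS (for example, IN_PROGRESS, PARTIAL, or FAILED).
unfinished_snapshots() {
  awk '$2 != "SUCCESS" { print $1, $2, $7 }'
}

# Hypothetical helper: reads the "_cluster/health" JSON on stdin and prints
# the value of its "status" field (green, yellow, or red).
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Example usage (domain-endpoint and s3_repository are placeholders):
#   curl -s -XGET "https://domain-endpoint/_cat/snapshots/s3_repository" | unfinished_snapshots
#   curl -s -XGET "https://domain-endpoint/_cluster/health?pretty" | cluster_status
```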


Related information

How can I improve the indexing performance on my Amazon OpenSearch Service cluster?

AWS OFFICIAL
Updated 3 years ago