Why does my Amazon SageMaker pipeline execution fail?

2 minute read
1

I want to troubleshoot why my Amazon SageMaker pipeline execution failed.

Resolution

To troubleshoot the failed pipeline execution in SageMaker, do the following:

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

1.    Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions.

Note: Use the AWS CloudShell console if you don't have AWS CLI configured in your local machine.

$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk

The command returns a list of pipeline executions for your pipeline that looks similar to the following:

"PipelineExecutionSummaries": [
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
            "StartTime": "2022-09-27T12:56:44.646000+00:00",
            "PipelineExecutionStatus": "Failed",
            "PipelineExecutionDisplayName": "execution-1664283404791",
            "PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
        },
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
            "StartTime": "2022-09-27T12:13:28.762000+00:00",
            "PipelineExecutionStatus": "Succeeded",
            "PipelineExecutionDisplayName": "execution-1664280808943"
        }
    ]
}

2.    Run the list-pipeline-executions-steps command to view the steps that failed:

$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b

The output looks similar to the following:

{
    "PipelineExecutionSteps": [
        {
            "StepName": "TrainAbaloneModel",
            "StartTime": "2022-09-27T13:00:49.235000+00:00",
            "EndTime": "2022-09-27T13:01:50.056000+00:00",
            "StepStatus": "Failed",
            "AttemptCount": 0,
            "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
            "Metadata": {
                "TrainingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
                }
            }
        },
        {
            "StepName": "PreprocessAbaloneData",
            "StartTime": "2022-09-27T12:56:45.595000+00:00",
            "EndTime": "2022-09-27T13:00:48.638000+00:00",
            "StepStatus": "Succeeded",
            "AttemptCount": 0,
            "Metadata": {
                "ProcessingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
                }
            }
        }
    ]
}

In this case, the training job step failed because a non-existent security group was specified in the job's VpcConfig object.

If the FailureReason for the failed step isn't clear, check the Amazon CloudWatch logs for the failed SageMaker job or endpoint to troubleshoot further. You can see the logs for the training jobs in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream looks similar to the following:

example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp


Related information

Log Amazon SageMaker events with Amazon CloudWatch

AWS OFFICIAL
AWS OFFICIALUpdated a year ago