How do I troubleshoot the "Command failed with exit code" error in AWS Glue?

My AWS Glue job fails and throws the "Command failed with exit code" error.

Short description

"Command failed with exit code X" is a generic error message that's returned when your AWS Glue application is shut down. This error occurs when one or more of the following conditions are true:

  • The driver or executor in the AWS Glue job ran out of memory.
  • The job script has code-related issues.
  • The AWS Identity and Access Management (IAM) role lacks the required permissions to access the script path.

To further investigate this error, review the Amazon CloudWatch logs and metrics.
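
For example, you can retrieve a failed run's status and error message with the AWS SDK for Python (Boto3) before digging into the logs. This is a minimal sketch; the job name and run ID are placeholders.

import boto3

glue = boto3.client("glue")

# Placeholder job name and run ID; replace with your own values.
response = glue.get_job_run(JobName="my-glue-job", RunId="jr_0123456789abcdef")

job_run = response["JobRun"]
print(job_run["JobRunState"])           # For example, FAILED
print(job_run.get("ErrorMessage", ""))  # For example, "Command failed with exit code 1"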

Resolution

The AWS Glue Spark job fails with the error "Command failed with exit code 1" and the CloudWatch logs show the error "java.lang.OutOfMemoryError: Java heap space"

This AWS Glue error indicates that a driver or executor process in the job is running out of memory. To check whether the driver or an executor is causing the out-of-memory (OOM) exception, review the CloudWatch metrics glue.driver.jvm.heap.usage and glue.executorId.jvm.heap.usage. For more information, see Monitoring AWS Glue using Amazon CloudWatch metrics.
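
If you want to pull these metrics programmatically instead of viewing them in the console, the following Boto3 sketch shows one way to do it. The job name and run ID are placeholders, and the dimension names (JobName, JobRunId, Type) follow the pattern in the AWS Glue metrics documentation; verify them against the metrics that your job actually emits.

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder job name and aggregated run ID ("ALL"); replace with your own values.
response = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.jvm.heap.usage",
    Dimensions=[
        {"Name": "JobName", "Value": "my-glue-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},
    ],
    StartTime=datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.now(datetime.timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)

# Heap usage is reported as a fraction; values close to 1.0 indicate memory pressure.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])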

To troubleshoot an OOM exception caused by the driver, see How do I resolve the "java.lang.OutOfMemoryError: Java heap space" error in an AWS Glue Spark job? and Debugging a driver OOM exception.

To troubleshoot an OOM exception caused by executors, see Debugging an executor OOM exception.

Executors fail with "Command failed with exit code 1"

This error occurs when you cancel an AWS Glue job. It also occurs when the executors are forcefully shut down and the driver is terminated. For more details on the error, review the CloudWatch logs for the job.
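
If you prefer to search the logs from a script, the following Boto3 sketch filters the default AWS Glue error log group. The log group name and run ID are placeholders; adjust them if your job uses continuous logging or a custom log group.

import boto3

logs = boto3.client("logs")

# By default, AWS Glue writes logs to /aws-glue/jobs/output and /aws-glue/jobs/error,
# with the job run ID as the log stream name. Replace the placeholder run ID below.
response = logs.filter_log_events(
    logGroupName="/aws-glue/jobs/error",
    logStreamNamePrefix="jr_0123456789abcdef",
    filterPattern="ERROR",
)

for event in response["events"]:
    print(event["message"])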

AWS Glue version 0.9/1.0 Spark job fails with the error "Command failed with exit code 1" and the CloudWatch logs show the error "Container killed by YARN for exceeding memory limits"

Note: AWS Glue version 1.0 Spark jobs have reached end of support. Upgrade to AWS Glue version 2.0 or later for improved performance.

This AWS Glue error indicates that the executor is causing an OOM exception. To troubleshoot this error, see Debugging an executor OOM exception.

AWS Glue Python Shell job fails with the error "Command failed with exit code 1"

This error indicates that the AWS Glue IAM role doesn't have permission to access the AWS Glue script from the Amazon Simple Storage Service (Amazon S3) path. Review the permissions that the AWS Glue IAM role must have to access the script location path. Then, attach these permissions to the IAM role.
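
For illustration only, the following sketch attaches an inline policy that grants read access to the script object. The role name, bucket, and key are placeholders for your own values, and your organization might prefer a managed policy instead of an inline one.

import json
import boto3

iam = boto3.client("iam")

# Placeholder role name and script location; replace with your job's IAM role
# and the S3 path of your script.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::my-glue-assets/scripts/my_script.py"],
        }
    ],
}

iam.put_role_policy(
    RoleName="MyGlueJobRole",
    PolicyName="GlueScriptReadAccess",
    PolicyDocument=json.dumps(policy),
)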

The AWS Glue job fails with the error "Command failed with exit code 1" and doesn't start

Check the CloudWatch job logs for errors that are related to Amazon S3. In the logs, you might receive an error similar to the following one:

com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: xxxxxxxxxxxxx)

This error occurs when the AWS Glue IAM role doesn't have permission to access the AWS Glue ETL script from the Amazon S3 path. Review the permissions that the AWS Glue IAM role must have to access the script location path. Then, attach these permissions to the IAM role.
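
To confirm whether the role itself allows the read, one option is to simulate the call with IAM, as in the sketch below. The role and object ARNs are placeholders, and note that this simulation, as written, checks only the role's identity-based policies, not the S3 bucket policy.

import boto3

iam = boto3.client("iam")

# Placeholder ARNs; replace with your Glue job role and script object.
response = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::111122223333:role/MyGlueJobRole",
    ActionNames=["s3:GetObject"],
    ResourceArns=["arn:aws:s3:::my-glue-assets/scripts/my_script.py"],
)

# EvalDecision is allowed, explicitDeny, or implicitDeny.
for result in response["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])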

The AWS Glue job occasionally fails with "Command failed with exit code 10", and the CloudWatch logs show the following error even though the IAM role and S3 bucket permissions are correct

com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: xxxxxxxxxxxxx)

This error occurs in AWS Glue versions 3.0 and 4.0 when you use an AWS Glue security configuration but the S3 bucket policy denies non-encrypted s3:PutObject requests.

To resolve this issue, run job.init() at the beginning of the script to bring the AWS Glue security configuration into effect. If you start the Spark session before job.init(), then the Spark security configuration properties are overridden, and the error occurs.

See the following example:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Use one of the following, depending on whether additional Spark configuration is needed:
spark = glueContext.spark_session
spark = glueContext.spark_session.builder.enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()

For more information on AWS Glue security configuration, see Encrypting data written by crawlers, jobs, and development endpoints.

The AWS Glue job fails with the error "Command failed with exit code X" and JAR files are passed to the job

You might observe one of the following errors in the CloudWatch logs:

"Exception in thread "main" java.lang.NoSuchMethodError"

"Exception in thread "main” java.lang.ExceptionInInitializerError"

These errors indicate a JAR dependency conflict or a Spark version conflict. Check the main JAR file and any extra JAR files that are passed to the job for conflicts. If you pass multiple JAR files, then remove one JAR file at a time and rerun the AWS Glue job. This way, you can isolate the file that's causing the issue.
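
To narrow down the conflict before rerunning the job repeatedly, one option is to scan the JAR files locally for duplicate class entries. The following is a rough sketch that uses only the Python standard library; the file names that you pass to it are placeholders.

import sys
import zipfile
from collections import defaultdict

def find_duplicate_classes(jar_paths):
    # Report .class entries that appear in more than one JAR, a common sign of a dependency conflict.
    owners = defaultdict(list)
    for jar_path in jar_paths:
        with zipfile.ZipFile(jar_path) as jar:
            for name in jar.namelist():
                if name.endswith(".class"):
                    owners[name].append(jar_path)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}

if __name__ == "__main__":
    # Usage example: python find_jar_conflicts.py dependency1.jar dependency2.jar
    for class_name, jars in sorted(find_duplicate_classes(sys.argv[1:]).items()):
        print(class_name, "->", ", ".join(jars))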


Related information

Monitoring jobs using the Apache Spark web UI

Why does my AWS Glue ETL job fail with the error "Container killed by YARN for exceeding memory limits"?
