How can I troubleshoot EMR job failures when trying to connect to the Glue Data Catalog?


My Amazon EMR jobs can't connect to the AWS Glue Data Catalog.

Short description

Amazon EMR can use the AWS Glue Data Catalog as a persistent metastore for Apache Spark, Apache Hive, and Presto/Trino. You can share the Data Catalog across different clusters, services, applications, or AWS accounts.

However, the connection to the Data Catalog might fail for the following reasons:

  • Insufficient permissions to the Glue Data Catalog.
  • Insufficient permissions to the Amazon Simple Storage Service (Amazon S3) objects specified as the table location.
  • Insufficient permissions to the AWS Key Management Service (AWS KMS) service for encrypted objects.
  • Insufficient permissions in AWS Lake Formation.
  • Missing or incorrect EMR cluster configuration parameters.
  • Incorrect query formatting.

Resolution

The EC2 instance profile doesn't have sufficient permissions for the Data Catalog or the S3 bucket

To access the Data Catalog from the same account or across accounts, the following must have permissions to AWS Glue actions and to the S3 bucket:

  • The Amazon Elastic Compute Cloud (Amazon EC2) instance profile.
  • The AWS Identity and Access Management (IAM) role calling the Data Catalog.

If permissions are missing, then you see an error similar to the following:

Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: 
User: arn:aws:sts::Acct-id:assumed-role/Role/instance-id is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:region:Acct-id:catalog because no identity-based policy allows the glue:GetDatabase action 
(Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: request-id; Proxy: null)
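When a log contains many of these messages, it can help to pull out the denied action and resource so that you know which permission to add. The following is a minimal sketch that assumes the "is not authorized to perform" wording shown in the preceding error. The function name parse_access_denied and the regular expression are illustrative and aren't part of any AWS SDK.

```python
import re

# Matches the "is not authorized to perform" wording used in the error above.
# The pattern is an assumption based on that sample, not an official format.
DENIED = re.compile(
    r"User: (?P<principal>\S+) is not authorized to perform: "
    r"(?P<action>[\w-]+:\w+) on resource: (?P<resource>\S+)"
)


def parse_access_denied(message):
    """Return the principal, action, and resource from the message, or None."""
    match = DENIED.search(message)
    return match.groupdict() if match else None


sample = (
    "User: arn:aws:sts::Acct-id:assumed-role/Role/instance-id is not authorized "
    "to perform: glue:GetDatabase on resource: arn:aws:glue:region:Acct-id:catalog "
    "because no identity-based policy allows the glue:GetDatabase action"
)
details = parse_access_denied(sample)
print(details)
```

The same pattern also matches the AWS KMS error shown later in this article, because that message uses the same wording.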

To troubleshoot issues when accessing the Data Catalog from the same account, check the permissions for the instance profile or the IAM user.

To troubleshoot issues when accessing the Data Catalog across accounts, check the permissions and configuration for the calling account. Then, verify that cross-account S3 access is granted.
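As an illustration of the permissions described in this section, the following sketch builds a read-only identity-based policy for the instance profile role. The Region, account ID, and bucket name are placeholders, and the action list is an assumed minimum for read queries. Jobs that create or modify tables, or that write data, need additional AWS Glue and Amazon S3 actions.

```python
import json


def build_catalog_read_policy(region, account_id, bucket):
    """Build a read-only policy for Glue metadata and the table data in S3."""
    glue_prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GlueCatalogRead",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{glue_prefix}:catalog",
                    f"{glue_prefix}:database/*",
                    f"{glue_prefix}:table/*/*",
                ],
            },
            {
                "Sid": "TableLocationRead",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # Bucket ARN for ListBucket, object ARN for GetObject.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


# Placeholder values; replace them with your own Region, account, and bucket.
policy = build_catalog_read_policy("us-east-1", "111122223333", "amzn-s3-demo-bucket")
print(json.dumps(policy, indent=2))
```

For cross-account access, the account that owns the Data Catalog and the S3 bucket must also allow the calling role, for example through resource-based policies. The sketch covers only the calling role's identity-based policy.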

The EC2 instance profile doesn't have the necessary AWS KMS permissions

If the Data Catalog is encrypted using a customer managed key, then the EC2 instance profile must have the necessary permissions to access the key. If permissions are missing, then you might see an error similar to the following. If you're using spark-shell, the Hive CLI, or the Presto/Trino CLI, then the error appears in your EMR console. If you're submitting your code programmatically, then the error appears in your container logs.

Caused by: MetaException(message:User: arn:aws:sts::acct-id:assumed-role/Role/instance-id is not authorized to perform: kms:GenerateDataKey on resource: arn:aws:kms:region:acct-id:key/fe90458f-beba-460e-8cae-25782ea9f8b3 because no identity-based policy allows the kms:GenerateDataKey action (Service: AWSKMS; Status Code: 400; Error Code: AccessDeniedException; Request ID: request-id; Proxy: null) 
(Service: AWSGlue; Status Code: 400; Error Code: GlueEncryptionException; Request ID: request-id; Proxy: null))

To avoid the preceding error, add the necessary AWS KMS permissions to allow access to the key.

If the AWS account calling the service isn't the same account where the Data Catalog is present, then do the following:

  • Turn on key sharing if the calling AWS account is in the same Region as the Data Catalog.
  • For multi-Region access, create a multi-Region key for sharing with other accounts.
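The two sides of AWS KMS access described in this section can be sketched as policy statements. The first function builds a statement for the instance profile role's identity-based policy. The second builds a key policy statement that the key owner could add to share the key with a role in another account. The ARNs are placeholders, and the action list is an assumption: kms:GenerateDataKey comes from the preceding error, and kms:Decrypt and kms:Encrypt are typically needed to read and write encrypted catalog metadata.

```python
import json

# Assumed set of actions; trim it to what your jobs actually need.
KMS_ACTIONS = ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"]


def identity_statement(key_arn):
    """Statement for the instance profile role's identity-based policy."""
    return {
        "Sid": "UseCatalogKey",
        "Effect": "Allow",
        "Action": KMS_ACTIONS,
        "Resource": key_arn,
    }


def key_policy_statement(calling_role_arn):
    """Statement for the key policy in the account that owns the key."""
    return {
        "Sid": "AllowCallingAccountRole",
        "Effect": "Allow",
        "Principal": {"AWS": calling_role_arn},
        "Action": KMS_ACTIONS,
        # In a key policy, "*" refers to the key that the policy is attached to.
        "Resource": "*",
    }


# Placeholder ARNs for illustration only.
key_arn = "arn:aws:kms:us-east-1:444455556666:key/example-key-id"
role_arn = "arn:aws:iam::111122223333:role/ExampleEmrInstanceRole"

print(json.dumps(identity_statement(key_arn), indent=2))
print(json.dumps(key_policy_statement(role_arn), indent=2))
```

Cross-account use of a key requires both statements: the key policy in the owning account and the identity-based policy in the calling account.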

The instance profile doesn't have access to AWS Lake Formation or the Glue tables don't have the required grants

When Data Catalog permissions are managed or registered in AWS Lake Formation, the role must have Lake Formation permissions on the object. If Lake Formation permissions are missing on the role, then you might see the following error:

pyspark.sql.utils.AnalysisException: Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: 
Insufficient Lake Formation permission(s) on default (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: request-id; Proxy: null)

To resolve the preceding error, grant the required Lake Formation permissions to the EC2 instance profile role. Provide grants on the AWS Glue database as well as on the tables that the job accesses.
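The grants described above could be prepared as follows. This is a sketch under stated assumptions: the role ARN and table name are placeholders, DESCRIBE on the database and SELECT plus DESCRIBE on each table are assumed to be sufficient for read queries, and the helper names are illustrative. The apply_grants function uses the boto3 Lake Formation client's grant_permissions call and isn't invoked here, so the block runs without AWS credentials.

```python
def build_grant_requests(role_arn, database, tables):
    """Build keyword arguments for one grant per database and table."""
    principal = {"DataLakePrincipalIdentifier": role_arn}
    requests = [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
        }
    ]
    for table in tables:
        requests.append(
            {
                "Principal": principal,
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
                "Permissions": ["SELECT", "DESCRIBE"],
            }
        )
    return requests


def apply_grants(requests):
    """Send each grant to Lake Formation. Requires boto3 and AWS credentials."""
    import boto3  # imported here so that the sketch runs without boto3 installed

    client = boto3.client("lakeformation")
    for request in requests:
        client.grant_permissions(**request)


# Placeholder role and table names for illustration only.
grant_requests = build_grant_requests(
    "arn:aws:iam::111122223333:role/ExampleEmrInstanceRole",
    "default",
    ["example_table"],
)
for request in grant_requests:
    print(request["Resource"], request["Permissions"])
```

To apply the grants, call apply_grants(grant_requests) with credentials for a Lake Formation administrator or another principal that is allowed to grant these permissions.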

The EMR cluster doesn't have the correct configurations or the query string is incorrect

If the permissions are correct, but the configuration is incorrect, then you might see one of the following errors in spark-shell when you attempt cross-account AWS Glue access:

An error occurred (EntityNotFoundException) when calling the GetTables operation: Database db-name not found.

or

org.apache.spark.sql.AnalysisException: Table or view not found: acct-id/db.table-name line 2 pos 14

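To resolve this error, add the required Data Catalog parameters to the cluster configuration for each application that you use. Also, confirm that the database and table names in your query are formatted correctly.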
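As an example of such parameters, the following sketch builds configuration classifications for Hive and Spark. It assumes that the cluster uses the hive-site and spark-hive-site classifications, the AWS Glue client factory class, and the hive.metastore.glue.catalogid property to point at a Data Catalog in another account. The account ID is a placeholder. Presto/Trino use different classifications that aren't shown, and your cluster might need further settings, so check the parameters against the Amazon EMR documentation for your release.

```python
import json

GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(catalog_account_id=None):
    """Build hive-site and spark-hive-site classifications for the Data Catalog."""
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    if catalog_account_id:
        # Only needed when the Data Catalog is in a different account.
        properties["hive.metastore.glue.catalogid"] = catalog_account_id
    return [
        {"Classification": name, "Properties": dict(properties)}
        for name in ("hive-site", "spark-hive-site")
    ]


# Placeholder account ID of the account that owns the Data Catalog.
configurations = glue_catalog_configurations("444455556666")
print(json.dumps(configurations, indent=2))
```

The printed JSON can be supplied as the cluster's configurations when you create the cluster.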


AWS OFFICIAL
Updated a year ago