How do I resolve the "java.lang.OutOfMemoryError: Java heap space" error in an AWS Glue Spark job?

5 minute read
0

My AWS Glue job fails with "Command failed with exit code 1". The Amazon CloudWatch Logs shows the "java.lang.OutOfMemoryError: Java heap space" error.

Short description

The "java.lang.OutOfMemoryError: Java heap space" error indicates that that a driver or executor is running out of JVM memory. To determine whether a driver or an executor causes the OOM, see Debugging OOM exceptions and job abnormalities.

Note: The following resolution is for driver OOM exceptions only.

Driver OOM exceptions are caused by the following:

  • The AWS Glue Spark job reads a large number of small files from Amazon Simple Storage Service (Amazon S3)
  • Driver intensive operations like collect(), Broadcast joins and Shared variable

Resolution

Resolve driver OOM exceptions caused by a large number of small files

To resolve driver OOM exceptions that are caused by a large number of small files with DynamicFrames, use one or more of the following methods.

Activate useS3ListImplementation

When it lists files, AWS Glue creates a file index in the driver memory lists. When you set useS3ListImplementation to True, AWS Glue doesn't cache the list of files in the memory all at once. Instead, AWS Glue caches the list in batches. This means that the driver is less likely to run out of memory.

See the following example of how to activate useS3ListImplementation with from_catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database", table_name = "table", additional_options = {'useS3ListImplementation': True}, transformation_ctx = "datasource0")

See the following example that activates useS3ListImplementation with from_options:

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": ["s3://input_path"], "useS3ListImplementation":True,"recurse":True}, format="json")

The useS3ListImplementation feature is an implementation of the Amazon S3 ListKeys operation. This splits large results sets into multiple responses. It's a best practice to use useS3ListImplementation with job bookmarks.

Grouping

A Spark application processes every small file using a different Spark task. This can lead to OOM because the driver stores and keeps track of the location and task information. When you activate the grouping feature, tasks process a group of multiple files instead of individual files. Grouping is automatically turned on when you use dynamic frames, and when the Amazon S3 dataset has more than 50,000 files. For more information, see Reading input files in larger groups.

Filtering with Push Down Predicates

Reduce the number of Amazon S3 files and Amazon S3 partitions that the AWS Glue job reads by using push down predicates. This prunes the unnecessary partitions from the AWS Glue table before the underlying data is read. For more information, see Pre-filtering using pushdown predicates.

Driver OOM exceptions caused by driver heavy operations

Resolve driver OOM exceptions that are caused by driver heavy operations by using one of the following methods.

Be mindful of driver intensive operations

collect() is a Spark operation that collects the results from workers, and then returns them back to the driver as a single object. The results can be very large, and that overwhelms the driver. By default, the Spark configuration spark.driver.maxResultSize is set to 1 GB, and helps to protect the driver from being overwhelmed.

So, limit these actions, and instead use actions like take(), takeSample() or isEmpty() where possible.

Also, be aware that broadcast joins in Spark can cause OOM errors if the relation (table) is bigger than driver’s available memory. Before a relation is broadcasted to executors, it's materialized at the driver node. If multiple tables are being broadcasted, or the relation is too large, then the driver might run into memory shortage. Use the Spark configuration spark.sql.autoBroadcastJoinThreshold and Spark Join Hints to control this.

Regularly destroy shared variables

Be sure to use shared variables carefully. Destroy shared variables when you no longer need them because they can cause Spark driver OOM exceptions. There are two types of shared variables, broadcast variables and accumulators.

  • Broadcast variables are read-only data that are sent to executors only once. They are a good solution for storing immutable reference data like small dictionary or small tables shared among all executors.
  • Accumulators provide a writable copy across Spark executors, and can be used to implement distributed counters (as in MapReduce) or sums.

Additional troubleshooting

Related information

Optimize memory management in AWS Glue

Reading input files in larger groups

Best practices for successfully managing memory for Apache Spark applications on Amazon EMR

Monitoring jobs using the Apache Spark web UI

AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago