Why does my AWS Glue job fail with lost nodes when I migrate a large data set from Amazon RDS to Amazon S3?

I'm migrating a large dataset from Amazon Relational Database Service (Amazon RDS) or an on-premises JDBC database to Amazon Simple Storage Service (Amazon S3) using AWS Glue. My ETL job runs for a long time, and then fails with lost nodes.

Short description

AWS Glue uses a single connection to read the entire dataset. If you're migrating a large JDBC table, the ETL job might run for a long time without signs of progress on the AWS Glue side. Then, the job might eventually fail because of disk space issues (lost nodes). To resolve this issue, read the JDBC table in parallel. If the job still fails with lost nodes, use an SQL expression as a pushdown predicate.

Resolution

Use one or more of the following methods to resolve lost node errors for JDBC datasets.

Read the JDBC table in parallel

If the table doesn't have numeric columns (INT or BIGINT), then use the hashfield option to partition the data. Set hashfield to the name of a column in the JDBC table. For best results, choose a column with an even distribution of values.
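
For example, the following is a minimal sketch of setting hashfield when creating a DynamicFrame from the Data Catalog. The database name, table name, and the shipment_date column are placeholders for illustration.

# A minimal sketch, assuming a catalog table backed by a JDBC connection.
# "your_database", "your_table", and "shipment_date" are placeholder names.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_table",
    additional_options={
        "hashfield": "shipment_date",  # non-numeric column with evenly distributed values
        "hashpartitions": "10"         # optional: number of parallel read tasks
    },
    transformation_ctx="datasource0"
)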

If the table has numeric columns, set the hashpartitions and hashexpression options in the table or while creating the DynamicFrame:

  • hashpartitions: defines the number of tasks that AWS Glue creates to read the data
  • hashexpression: divides rows evenly among tasks

The following is an example of how to set hashpartitions and hashexpression while creating a DynamicFrame with a JDBC connection. In the connection_option, replace the JDBC URL, user name, password, table name, and column name.

connection_option = {
    "url": "jdbc:mysql://mysql-instance1.123456789012.us-east-1.rds.amazonaws.com:3306/database",
    "user": "your_user_name",
    "password": "your_password",
    "dbtable": "your_table",
    "hashexpression": "column_name",
    "hashpartitions": "10"
}

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_option, transformation_ctx="datasource0")

Here's an example of how to set hashpartitions and hashexpression while creating a DynamicFrame from the AWS Glue catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name="your_table", additional_options={"hashexpression": "column_name", "hashpartitions": "10"}, transformation_ctx="datasource0")

Note: Setting hashpartitions too high can degrade read performance. That's because each task reads the entire table and then returns only its set of rows to the executor.

For more information, see Reading from JDBC tables in parallel.

Use a SQL expression as a pushdown predicate

Note: The following SQL expression doesn't work as a pushdown predicate for Oracle databases. However, this expression does work as a pushdown predicate for all other databases that are natively supported by AWS Glue (Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, and PostgreSQL).

If the table contains billions of records and tebibytes (TiB) of data, the job might take a long time to complete or fail with lost nodes, even after you set hashpartitions and hashexpression. To resolve these issues, use a SQL expression similar to the following with the hashexpression option:

column_name > 1000 AND column_name < 2000 AND column_name

The SQL expression acts as a pushdown predicate and forces the job to read one set of rows per job run, rather than reading all the data at once. The full statement looks similar to the following:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="sampledb", table_name="test_table", additional_options={"hashexpression": "column_name > 1000 AND column_name < 2000 AND column_name", "hashpartitions": "10"}, transformation_ctx="datasource0")
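
To vary the row range between runs without editing the script, one option is to pass the range boundaries as job parameters. The following is a minimal sketch that assumes hypothetical job arguments named lower_bound and upper_bound; the sampledb database, test_table table, and column_name column follow the example above.

import sys
from awsglue.utils import getResolvedOptions

# A sketch, assuming the job is started with hypothetical --lower_bound and --upper_bound arguments.
args = getResolvedOptions(sys.argv, ["lower_bound", "upper_bound"])
predicate = "column_name > {} AND column_name < {} AND column_name".format(args["lower_bound"], args["upper_bound"])

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="sampledb",
    table_name="test_table",
    additional_options={"hashexpression": predicate, "hashpartitions": "10"},
    transformation_ctx="datasource0"
)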

Note: Be sure to turn off job bookmarks for the initial job runs with this configuration. When you run a job with a job bookmark, AWS Glue records the maximum value of the column. When you run the job again, AWS Glue processes only the rows whose values are greater than the previous bookmark value. You can turn job bookmarks back on for the last job run, if needed.
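
For example, the following is a minimal sketch of turning off job bookmarks for a single run by passing the --job-bookmark-option argument when you start the job with boto3. The job name is a placeholder.

import boto3

# A sketch, assuming a Glue job named "your_job_name" (placeholder).
glue = boto3.client("glue")

# Turn off job bookmarks for this run so the full row range is read.
response = glue.start_job_run(
    JobName="your_job_name",
    Arguments={"--job-bookmark-option": "job-bookmark-disable"}
)
print(response["JobRunId"])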


Related information

Why is my AWS Glue job failing with the error "Exit status: -100. Diagnostics: Container released on a *lost* node"?

Defining connections in the AWS Glue Data Catalog
