How do I troubleshoot a failed or stuck Spark SQL query in Amazon EMR?

I want to collect information to troubleshoot issues with my Spark SQL queries in Amazon EMR.

Resolution

Completed Spark SQL queries are listed in the Spark History Server, which you can open from the Application user interfaces tab of your EMR cluster. For more information, see View persistent application user interfaces.

To access completed Spark SQL queries, do the following:

  1. On the Spark History Server, select SQL/DataFrame to view completed Spark SQL queries.
  2. Select a job ID listed for the query to open the details of that completed job on the Jobs tab. Note that a single SQL query might have more than one job ID.
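
The same job information can also be read through the REST API that the Spark History Server exposes under /api/v1. The following Python sketch is illustrative only. It assumes that the on-cluster Spark History Server is reachable at the address in HISTORY_SERVER (18080 is Spark's default port), and the helper names jobs_url, fetch_jobs, and unfinished_or_failed are hypothetical names chosen for this example.

```python
import json
from urllib.request import urlopen

# Assumption: address of the Spark History Server (18080 is Spark's default port).
HISTORY_SERVER = "http://localhost:18080"


def jobs_url(base_url, app_id):
    """Build the REST endpoint that lists the jobs of one application."""
    return f"{base_url.rstrip('/')}/api/v1/applications/{app_id}/jobs"


def fetch_jobs(base_url, app_id, timeout=10):
    """Download the job list as parsed JSON (a list of dictionaries)."""
    with urlopen(jobs_url(base_url, app_id), timeout=timeout) as response:
        return json.load(response)


def unfinished_or_failed(jobs):
    """Keep jobs whose status is not SUCCEEDED, for example FAILED or RUNNING."""
    return [job for job in jobs if job.get("status") != "SUCCEEDED"]
```

To use the sketch, pass your application ID to fetch_jobs and filter the result with unfinished_or_failed. Each returned record includes a jobId and a stageIds list that you can match against the Jobs tab.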

The following information is available from the Jobs tab:

  • You can see the status of the job, the job duration, and the associated SQL query.
  • You can review the application's event timeline. The timeline displays the addition and removal of the Spark executors in chronological order.
  • Scroll down to see the DAG (directed acyclic graph). The DAG is a visualization of the Spark SQL query. You can also see the chain of RDD dependencies.
  • Scroll further to see the completed stages of the Spark SQL job.
  • Select the stage ID description to see the query's total time across all tasks, a locality level summary, and the associated job ID. The Stages view provides details of all the RDDs that correspond to this stage. This view also shows the lower-level RDD operations that correspond to the higher-level Spark SQL operation.
  • Expand Aggregated Metrics by Executor to view the Executors log. The Executors log provides additional details about the Spark SQL job.
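
When a query has many stages, it can help to rank them before you open each one. The following Python sketch assumes that you already downloaded the stage list from the Spark History Server REST endpoint /api/v1/applications/[app-id]/stages and parsed it into a list of dictionaries. The helper names are hypothetical, and the sketch treats the executorRunTime field as the stage's total task time in milliseconds, so confirm that against your Spark version.

```python
def slowest_stages(stages, top=3):
    """Return (stageId, status, executorRunTime) tuples, longest run time first."""
    ranked = sorted(stages, key=lambda s: s.get("executorRunTime", 0), reverse=True)
    return [
        (s["stageId"], s.get("status"), s.get("executorRunTime", 0))
        for s in ranked[:top]
    ]


def stages_with_failed_tasks(stages):
    """Return the IDs of stages that report at least one failed task."""
    return [s["stageId"] for s in stages if s.get("numFailedTasks", 0) > 0]


# Hypothetical sample records that only illustrate the expected shape.
example_stages = [
    {"stageId": 16, "status": "COMPLETE", "executorRunTime": 120, "numFailedTasks": 0},
    {"stageId": 17, "status": "FAILED", "executorRunTime": 950, "numFailedTasks": 2},
]
print(slowest_stages(example_stages, top=1))
print(stages_with_failed_tasks(example_stages))
```

Stages that rank high in both lists are reasonable candidates to inspect first in the Stages view.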

The following is an example log:

23/01/17 18:10:17 INFO Executor: Finished task 0.0 in stage 16.0 (TID 16). 1882 bytes result sent to driver
23/01/17 18:10:17 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 17
23/01/17 18:10:17 INFO Executor: Running task 0.0 in stage 17.0 (TID 17)
23/01/17 18:10:17 INFO TorrentBroadcast: Started reading broadcast variable 17 with 1 pieces (estimated total size 4.0 MiB)
23/01/17 18:10:17 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 7.2 KiB, free 4.8 GiB)
23/01/17 18:10:17 INFO TorrentBroadcast: Reading broadcast variable 17 took 8 ms
23/01/17 18:10:17 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 13.8 KiB, free 4.8 GiB)
23/01/17 18:10:17 INFO PythonRunner: Times: total = 52, boot = -31, init = 83, finish = 0
23/01/17 18:10:17 INFO Executor: Finished task 0.0 in stage 17.0 (TID 17). 1883 bytes result sent to driver
23/01/17 18:11:20 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
23/01/17 18:11:20 INFO MemoryStore: MemoryStore cleared
23/01/17 18:11:20 INFO BlockManager: BlockManager stopped
23/01/17 18:11:20 INFO ShutdownHookManager: Shutdown hook called
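
For longer logs, a small script can surface the lines that deserve attention. The following Python sketch assumes that each line follows the layout of the example above (a yy/MM/dd HH:mm:ss timestamp, a level, a component, and a message). The function names are hypothetical. The sketch lists WARN, ERROR, and FATAL entries and reports the longest pause between consecutive lines, which can hint at where an executor stalled.

```python
import re
from datetime import datetime

# Matches lines such as "23/01/17 18:11:20 ERROR Component: message".
LOG_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<component>[^:]+): (?P<message>.*)$"
)


def parse_line(line):
    """Parse one log line into a dictionary, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%y/%m/%d %H:%M:%S"),
        "level": match["level"],
        "component": match["component"],
        "message": match["message"],
    }


def problem_entries(lines):
    """Return parsed entries logged at WARN, ERROR, or FATAL level."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    return [e for e in entries if e["level"] in ("WARN", "ERROR", "FATAL")]


def longest_gap_seconds(lines):
    """Return the longest pause, in seconds, between consecutive parsed lines."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    gaps = [(b["time"] - a["time"]).total_seconds() for a, b in zip(entries, entries[1:])]
    return max(gaps, default=0.0)


# Lines copied from the example log above.
sample = [
    "23/01/17 18:10:17 INFO Executor: Finished task 0.0 in stage 17.0 (TID 17). 1883 bytes result sent to driver",
    "23/01/17 18:11:20 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM",
    "23/01/17 18:11:20 INFO MemoryStore: MemoryStore cleared",
]
for entry in problem_entries(sample):
    print(entry["time"], entry["component"], entry["message"])
print(longest_gap_seconds(sample))
```

A SIGTERM entry can also appear when an executor is shut down normally, so treat a match as a starting point for investigation and not as proof of a failure.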

For detailed information, see Jobs Tab in the Web UI section of the Apache Spark documentation.

Related information

Examine the log files

AWS OFFICIAL
Updated a year ago