How can I troubleshoot high or full disk usage with Amazon Redshift?

9 minute read
0

I'm experiencing high or full disk utilization on Amazon Redshift and want to troubleshoot this issue.

Resolution

High disk usage errors can depend on several factors, including:

  • Distribution and sort key
  • Query processing
  • Tables with VARCHAR(MAX) columns
  • High column compression
  • Maintenance operations
  • Cartesian products with cross-joins
  • Minimum table size
  • Tombstone blocks
  • Copying a large file

Distribution and sort key

Review the table's distribution style, distribution key, and sort key selection. Tables with distribution skew—where more data is located in one node than in the others—can cause a full disk node. If you have tables with skewed distribution styles, then change the distribution style to a more uniform distribution. Note that distribution and row skew can affect storage skew and intermediate rowset when a query is running. For more information about distribution keys and sort keys, see Amazon Redshift engineering’s advanced table design playbook: preamble, prerequisites, and prioritization.

To determine the cardinality of your distribution key, run the following query:

SELECT <distkey column>, COUNT(*) FROM <schema name>.<table with distribution skew> GROUP BY <distkey column> HAVING COUNT(*) > 1 ORDER BY 2 DESC;

Note: To avoid a sort step, use SORT KEY columns in your ORDER BY clause. A sort step can use excessive memory, causing a disk spill. For more information, see Working with sort keys.

In the filtered result set, choose a column with high cardinality to view its data distribution. For more information on the distribution style of your table, see Choose the best distribution style.

To see how database blocks in a distribution key are mapped to a cluster, use the Amazon Redshift table_inspector.sql utility.

Query processing

Review any memory allocated to a query. While a query is processing, intermediate query results can be stored in temporary blocks. If there isn't enough free memory, then the tables cause a disk spill. Intermediate result sets aren't compressed, which affects the available disk space. For more information, see Insufficient memory allocated to the query.

Amazon Redshift defaults to a table structure with even distribution and no column encoding for temporary tables. But if you are using SELECT...INTO syntax, use a CREATE statement. For more information, see Top 10 performance tuning techniques for Amazon Redshift. Follow the instructions under Tip #6: Address the inefficient use of temporary tables.

If insufficient memory is allocated to your query, you might see a step in SVL_QUERY_SUMMARY where is_diskbased shows the value "true". To resolve this issue, increase the number of query slots to allocate more memory to the query. For more information about how to temporarily increase the slots for a query, see wlm_query_slot_count or tune your WLM to run mixed workloads. You can also use WLM query monitoring rules to counter heavy processing loads and to identify I/O intensive queries.

Tables with VARCHAR(MAX) columns

Check VARCHAR or CHARACTER VARYING columns for trailing blanks that might be omitted when data is stored on the disk. During query processing, trailing blanks can occupy the full length in memory (the maximum value for VARCHAR is 65535). It's a best practice to use the smallest possible column size.

To generate a list of tables with maximum column widths, run the following query:

SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;

To identify and display the true widths of the wide VARCHAR table columns, run the following query:

SELECT max(octet_length (rtrim(column_name))) FROM table_name;

In the output from this query, validate if the length is appropriate for your use case. If the columns are at maximum length and exceed your needs, adjust their length to the minimum size needed.

For more information about table design, review the Amazon Redshift best practices for designing tables.

High column compression

Encode all columns (except sort key) by using the ANALYZE COMPRESSION or using the automatic table optimization feature in Amazon Redshift. Amazon Redshift provides column encoding. It's a best practice to use this feature, even though it increases read performance and reduces overall storage consumption.

Maintenance operations

Be sure that the database tables in your Amazon Redshift database are regularly analyzed and vacuumed. Identify any queries that running against tables that are missing statistics. Preventing queries from running against tables that are missing statistics keeps Amazon Redshift from scanning unnecessary table rows. This also helps optimize your query processing.

Note: Maintenance operations such as VACUUM and DEEP COPY use temporary storage space for their sort operations, so a spike in disk usage is expected.

For example, the following query helps you identify outdated stats in Amazon Redshift:

SELECT * FROM svv_table_info WHERE stats_off > 10 ORDER BY size DESC;

Additionally, use the ANALYZE command to view and analyze table statistics.

For more information on maintenance operations, see the Amazon Redshift Analyze & Vacuum schema utility.

Cartesian products with cross-joins

Use the EXPLAIN plan of the query to look for queries with Cartesian products. Cartesian products are cross-joins that are unrelated and can produce an increased number of blocks. These cross-joins can result in higher memory utilization and more tables spilled to disk. If cross-joins don't share a JOIN condition, then the joins produce a Cartesian product of two tables. Every row of one table is then joined to every row of the other table.

Cross-joins can also be run as nested loop joins, which take the longest time to process. Nested loop joins result in spikes in overall disk usage. For more information, see Identifying queries with nested loops.

Minimum table size

The same table can have different sizes in different clusters. The minimum table size is then determined by the number of columns and whether the table has a SORTKEY and number of slices populated. If you recently resized an Amazon Redshift cluster, you might see a change in your overall disk storage. This is caused by the change in number of slices. Amazon Redshift also counts the table segments that are used by each table. For more information, see Why does a table in an Amazon Redshift cluster consume more or less disk storage space than expected?

Tombstone blocks

Tombstone blocks are generated when a WRITE transaction to an Amazon Redshift table occurs and there is a concurrent Read. Amazon Redshift keeps the blocks before the write operation to keep a concurrent Read operation consistent. Amazon Redshift blocks can't be changed. Every Insert, Update, or Delete action creates a new set of blocks, marking the old blocks as tombstoned.

Sometimes tombstones fail to clear at the commit stage because of long-running table transactions. Tombstones can also fail to clear when there are too many ETL loads running at the same time. Because Amazon Redshift monitors the database from the time that the transaction starts, any table written to the database also retains the tombstone blocks. If long-running table transactions occur regularly and across several loads, enough tombstones can accumulate to result in a Disk Full error.

You can also force Amazon Redshift to perform the analysis regarding tombstone blocks by performing a commit command.

If there are long-running queries that are active, then terminate the queries (and release all subsequent blocks) using the commit command:

begin;
create table a (id int);
insert into a values(1);
commit;
drop table a;

Then, to confirm tombstone blocks, run the following query:

select trim(name) as tablename, count(case when tombstone > 0 then 1 else null end) as tombstones from svv_diskusage group by 1 having count(case when tombstone > 0 then 1 else null end) > 0 order by 2 desc;

Copying a large file

During a COPY operation, you might receive a Disk Full error even if there is enough storage available. This error occurs if the sorting operation spills to disk, creating temporary blocks.

If you encounter a Disk Full error message, then check the STL_DISK_FULL_DIAG table. Check which query ID caused error and the temporary blocks that were created:

select '2000-01-01'::timestamp + (currenttime/1000000.0)* interval '1 second' as currenttime,node_num,query_id,temp_blocks from pg_catalog.stl_disk_full_diag;

For more best practices, see Amazon Redshift best practices for loading data.

Additional troubleshooting

Check the percentage of disk space under the Performance tab in the Amazon Redshift console. For each cluster node, Amazon Redshift provides extra disk space, which is larger than the nominal disk capacity.

If you notice a sudden spike in utilization, use the STL_QUERY to identify the activities and jobs that are running. Note which queries are running at the time of a disk spill:

select * from stl_query where starttime between '2018-01-01 00:30:00' and '2018-01-01 00:40:00';

Note: Update the values with the time when the spike occurred.

To identify the top 20 disk spill queries, run the following query:

select A.userid, A.query, blocks_to_disk, trim(B.querytxt) text from stl_query_metrics A, stl_query B where A.query = B.query and segment=-1 and step = -1 and max_blocks_to_disk > 0 order by 3 desc limit 20;

View the column value blocks_to_disk to identify disk spilling. Terminate queries that are spilling too much, if needed. Then, allocate additional memory to the queries before running them again. For more details, refer to STL_QUERY_METRICS.

To determine if your queries are properly writing to a disk, run the following query:

SELECT q.query, trim(q.cat_text)
FROM (
SELECT query,
replace( listagg(text,' ') WITHIN GROUP (ORDER BY sequence), '\\n', ' ') AS cat_text
FROM stl_querytext
WHERE userid>1
GROUP BY query) q
JOIN (
SELECT distinct query
FROM svl_query_summary
WHERE is_diskbased='t' AND (LABEL ILIKE 'hash%' OR LABEL ILIKE 'sort%' OR LABEL ILIKE 'aggr%' OR LABEL ILIKE 'save%' OR LABEL ILIKE 'window%' OR LABEL ILIKE 'unique%')
AND userid > 1) qs
ON qs.query = q.query;

This command also identifies queries that are spilling to disk.


Related information

Performance

Amazon Redshift system overview

AWS OFFICIAL
AWS OFFICIALUpdated a year ago