Why is the point-in-time restore of my Amazon RDS DB instance taking a long time?

I am performing a point-in-time restore operation of my Amazon Relational Database Service (Amazon RDS) DB instance. This operation is taking a long time to complete.

Short description

With the Restore to point in time option, you can restore a DB instance to a specific point in time, provided that automated backups are turned on. The earliest restorable time specifies how far back within the backup retention period you can restore your instance. You can set the backup retention period to any value from 0 to 35 days (five calendar weeks) by modifying the instance. When you initiate a point-in-time restore, you can specify any date and time within your retention period, up to the latest restorable time, which is the most recent time that you can restore the instance to.
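
For example, the following AWS CLI command is a minimal sketch of changing the backup retention period; the instance identifier mydb and the 14-day value are placeholders for your own values:

# Set the backup retention period to 14 days (valid values are 0-35)
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --backup-retention-period 14 \
  --apply-immediately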

You can use the AWS Command Line Interface (AWS CLI) command describe-db-instances to return the latest restorable time for your instance. The latest restorable time is typically within the last five minutes. You can also find the latest restorable time by choosing Automated backups in the Amazon RDS console.
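
For example, the following AWS CLI command returns only the latest restorable time, assuming a placeholder DB instance named mydb:

# Return the LatestRestorableTime field for the instance
aws rds describe-db-instances \
  --db-instance-identifier mydb \
  --query 'DBInstances[0].LatestRestorableTime' \
  --output text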

When you initiate a point-in-time restore for your DB instance, Amazon RDS calls the RestoreDBInstanceToPointInTime API on your behalf. This API call is logged in AWS CloudTrail. For more information, see Viewing CloudTrail events in the CloudTrail console.
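
For example, the following commands sketch a point-in-time restore initiated from the AWS CLI, followed by a CloudTrail lookup for the resulting API event. The instance identifiers and restore time are placeholders:

# Restore mydb to a new instance named mydb-restored at a specific time (UTC)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2022-06-01T12:54:30Z

# Look up the corresponding RestoreDBInstanceToPointInTime events in CloudTrail
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RestoreDBInstanceToPointInTime \
  --max-items 5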

You can also check the progress of the point-in-time restore by reviewing the RDS log files. After the instance is restored from the backup, the log files show the progress of the recovery as the transaction logs are applied.
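
For example, you can list and download the log files for the restored instance with the AWS CLI. The instance identifier and log file name that follow are placeholders, and the exact file name depends on your engine and logging configuration:

# List the log files that are available for the restored instance
aws rds describe-db-log-files \
  --db-instance-identifier mydb-restored

# Download part of a specific log file to review the recovery progress
aws rds download-db-log-file-portion \
  --db-instance-identifier mydb-restored \
  --log-file-name error/postgresql.log.2022-06-01-13 \
  --output text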

Amazon RDS uploads transaction logs for DB instances to Amazon Simple Storage Service (Amazon S3) every five minutes. During a point-in-time restore, the automated backup that's closest to the target time is restored first. Then, the transaction logs are applied up to the time that you specified when you initiated the point-in-time restore. The more transaction logs that must be applied, the longer this operation takes.

For example, suppose that the automated backup of a DB instance was taken at 04:00 UTC today, and you perform a point-in-time restore to 06:15 UTC. In this case, the instance is first restored from the backup that was created at 04:00 UTC. Then, the transaction logs up to the target time of 06:15 UTC are applied to the restored instance to complete the point-in-time restore.

Resolution

Use the following best practices to reduce the time taken to restore your DB instance to a specific point in time:

  • If you turned on automated backups on your source instance, then it's a best practice to also take manual snapshots at regular intervals. This prevents a large number of delta changes from piling up and helps decrease your recovery point objective.
  • Be sure that no long-running transactions are running on the source database at the point in time that you restore to. Long-running transactions extend the recovery time, so the restored database can take longer to become available.
  • Use appropriately sized transactions whenever possible. A large transaction is written to the transaction log file in one piece and isn't split across multiple files. As a result, the transaction log file becomes larger than necessary, which increases the crash recovery time.
  • During the point-in-time restore, after the snapshot of your instance is restored, data from the snapshot that's stored in Amazon S3 is copied to the underlying Amazon Elastic Block Store (Amazon EBS) volume. This process is known as lazy loading. When you try to access a block that isn't copied yet, RDS pulls that block from Amazon S3, which results in I/O latency. To help mitigate the effects of lazy loading on tables that you need quick access to, perform operations that involve full-table scans, such as SELECT *, so that Amazon RDS downloads all the backed-up table data from Amazon S3 (see the example after this list).
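
For example, the following command is a minimal sketch of warming a restored table after the instance becomes available. The endpoint, user, database, and table name are placeholders for your own values:

# Force a full-table scan so that every table block is pulled from S3 into the EBS volume
# (the query output is discarded; only the read I/O matters here)
psql -h mydb-restored.abc123example.us-east-1.rds.amazonaws.com \
     -U myuser -d mydb \
     -c "SELECT * FROM orders;" > /dev/null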

Example:

You might see the following information in a log file when you restore your RDS for PostgreSQL DB instance to a specific point in time:

2022-06-01 13:16:19 UTC::@:[8613]:LOG: starting point-in-time recovery to 2022-06-01 12:54:30+00
2022-06-01 13:16:19 UTC::@:[8613]:LOG: redo starts at 0/48A3220
waiting for 000000010000000000000001 archive /rdsdbdata/log/restore/pg-wal-archive.1.* to be downloaded
2022-06-01 13:17:22 UTC:127.0.0.1(46110):rdsadmin@rdsadmin:[10322]:FATAL: the database system is starting up
2022-06-01 13:17:25 UTC::@:[8613]:LOG: restored log file "000000010000000000000001" from archive
recovering 000000010000000000000002
2022-06-01 13:17:26 UTC::@:[8613]:LOG: restored log file "000000010000000000000002" from archive
recovering 000000010000000000000003
2022-06-01 13:17:28 UTC::@:[8613]:LOG: restored log file "000000010000000000000003" from archive
recovering 000000010000000000000004
2022-06-01 13:18:54 UTC::@:[8613]:LOG: restored log file "000000010000000000000022" from archive
recovering 000000010000000000000023
.
.
2022-06-01 13:33:16 UTC::@:[8613]:LOG: restored log file "00000001000000060000000B" from archive
2022-06-01 13:33:16 UTC::@:[8613]:LOG: recovery stopping before commit of transaction 9266438, time 2022-06-01 12:56:14.648042+00
2022-06-01 13:33:16 UTC::@:[8613]:LOG: redo done at 6/2C0003C0
2022-06-01 13:33:16 UTC::@:[8613]:LOG: last completed transaction was at log time 2022-06-01 12:51:14.646151+00
recovering 00000002.history
2022-06-01 13:33:16 UTC::@:[8613]:LOG: selected new timeline ID: 2
2022-06-01 13:33:16 UTC::@:[8613]:LOG: archive recovery complete
recovering 00000001.history
2022-06-01 13:33:16 UTC::@:[8620]:LOG: checkpoint starting: end-of-recovery immediate wait
2022-06-01 13:33:16 UTC::@:[8620]:LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 8 recycled; write=0.002 s, sync=0.003 s, total=0.031 s; sync files=2, longest=0.003 s, average=0.002 s; distance=655360 kB, estimate=1611806 kB
2022-06-01 13:33:16 UTC::@:[8607]:LOG: database system is ready to accept connections
2022-06-01 13:37:18 UTC::@:[8607]:LOG: received fast shutdown request
2022-06-01 13:37:18 UTC::@:[8607]:LOG: aborting any active transactions
2022-06-01 13:37:18 UTC::@:[8607]:LOG: background worker "logical replication launcher" (PID 7394) exited with exit code 1
2022-06-01 13:37:18 UTC::@:[8620]:LOG: shutting down
2022-06-01 13:37:18 UTC::@:[8620]:LOG: checkpoint starting: shutdown immediate
2022-06-01 13:37:18 UTC::@:[8620]:LOG: checkpoint complete: wrote 9 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.003 s, sync=0.003 s, total=0.024 s; sync files=7, longest=0.003 s, average=0.001 s; distance=65535 kB, estimate=1457179 kB
2022-06-01 13:37:20 UTC::@:[8607]:LOG: database system is shut down
2022-06-01 13:37:24 UTC::@:[10870]:LOG: starting PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
2022-06-01 13:37:24 UTC::@:[10870]:LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-06-01 13:37:24 UTC::@:[10870]:LOG: listening on IPv6 address "::", port 5432
2022-06-01 13:37:24 UTC::@:[10870]:LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-06-01 13:37:24 UTC::@:[10875]:LOG: database system was shut down at 2022-06-01 13:37:18 UTC
2022-06-01 13:37:24 UTC::@:[10870]:LOG: database system is ready to accept connections

Related information

Restoring from a DB snapshot

Amazon EBS snapshots

Why is my Amazon Aurora DB cluster clone, snapshot restore, or point in time restore taking so long?

Implementing a disaster recovery strategy with Amazon RDS
