
TOC

  • Servers
  • The problem
  • Solutions
  • Additional notes and recommendations
  • Thales Hyderabad info

Servers

IP        Role
10.3.0.3  primary
10.6.1.3  backup

The problem

Simply opening the log gives us information about the error:

The network name cannot be found.
2023-10-19 09:45:56.140 BST [4600] LOG:  archive command failed with exit code 1
2023-10-19 09:45:56.140 BST [4600] DETAIL:  The failed archive command was: copy "pg_wal\000000010000007C00000019" "\\10.6.1.3\archivelog\000000010000007C00000019"
2023-10-19 09:45:56.140 BST [4600] WARNING:  archiving write-ahead log file "000000010000007C00000019" failed too many times, will try again later

The command executed by archive_command is failing.
The archive_command in the main PostgreSQL configuration file, postgresql.conf, is set as follows:

archive_command = 'copy "%p" "\\\\10.6.1.3\\archivelog\\%f"'

Additionally, we can see the following line:

The network name cannot be found.

Without connecting again, my assessment is that the IP address 10.6.1.3 (the backup server) is either not reachable as a Windows network share, or the shared drive \\10.6.1.3\archivelog is full (0% space available).

In any case, the result of archive_command is an error, and PostgreSQL's default behaviour in that situation is to not delete a WAL file until archive_command succeeds for it.
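
If you want to confirm this from inside PostgreSQL, the cumulative archiver failure counters are exposed in the standard pg_stat_archiver view (the connection parameters below are illustrative):

psql -U postgres -c "SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"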

The PostgreSQL documentation is clear on this point, and even includes an example:

While designing your archiving setup, consider what will happen if the archive command fails repeatedly because some aspect requires operator intervention or the archive runs out of space. For example, this could occur if you write to tape without an autochanger; when the tape fills, nothing further can be archived until the tape is swapped. You should ensure that any error condition or request to a human operator is reported appropriately so that the situation can be resolved reasonably quickly. The pg_wal/ directory will continue to fill with WAL segment files until the situation is resolved. (If the file system containing pg_wal/ fills up, PostgreSQL will do a PANIC shutdown. No committed transactions will be lost, but the database will remain offline until you free some space.)

That's the cause of the problems you're facing. Because archive_command keeps failing, WAL files are not being deleted automatically, and the disk will eventually run out of space. PostgreSQL is waiting for manual intervention: once archive_command succeeds again, it can resume deleting old WAL files.
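
To see how large the backlog has grown in the meantime, check the size of the pg_wal directory on the primary (the data directory path below is an assumption; adjust it to your installation):

dir "C:\APIS\Postgres12\data\pg_wal"

The summary line at the end of the output shows the total number of segment files and the bytes they occupy.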

Solutions

Solution 1: Solve the connectivity problems

This first solution should be fast and does not involve reconfiguring or restarting PostgreSQL.
You have to check the following (example commands after the list):

  • Connectivity from primary to the shared Network drive: \\10.6.1.3\archivelog
  • Disk space available on the backup server, on the drive backing the network share \\10.6.1.3\archivelog
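
A quick way to run these checks from the primary server (CMD; ideally run as the same account the PostgreSQL service uses, since share permissions are applied per user):

ping 10.6.1.3
dir \\10.6.1.3\archivelog
copy /y NUL \\10.6.1.3\archivelog\write_test.tmp
del \\10.6.1.3\archivelog\write_test.tmp

The last two commands verify that the share is writable by creating and then deleting an empty test file (write_test.tmp is an illustrative name).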

Additional actions for this solution

To avoid similar failures in the future, add checks to the monitoring tool for the following (a sketch of such a check follows the list):

  • Connectivity from primary server to the shared Network drive
  • Connectivity from backup server to the shared Network drive
  • Disk space available on the backup server, on the drive backing the network share
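
A minimal sketch of such a check, assuming the monitoring tool can run a batch script and alert on a non-zero exit code (the share path and drive letter are illustrative):

@echo off
rem On the primary or backup server: fail if the archive share is unreachable
dir \\10.6.1.3\archivelog >NUL 2>&1
if errorlevel 1 (
    echo CRITICAL: \\10.6.1.3\archivelog unreachable
    exit /b 2
)
rem On the backup server: report free space on the drive backing the share
rem (fsutil usually requires administrative privileges)
fsutil volume diskfree C:
exit /b 0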

Solution 2: Change archive_command

Changing archive_command does not require a full server restart; a configuration reload is enough to apply the change.

Modify archive_command to copy WAL files locally:

archive_command = 'copy "%p" "C:\\APIS\\Postgres12\\primary_archivelog\\%f"'

Create the destination directory:

mkdir C:\APIS\Postgres12\primary_archivelog

Then reload the PostgreSQL configuration.
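
The reload can be triggered from psql or with pg_ctl (the data directory path below is an assumption; adjust it to your installation):

psql -U postgres -c "SELECT pg_reload_conf();"
pg_ctl reload -D "C:\APIS\Postgres12\data"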

Note: This solution is not recommended because you lose cross-replication. That is, in our initial recommendations we suggested that the primary server copy its WAL files to the backup server so that a point-in-time recovery (PITR) can be executed, which is a best practice.

Additional notes and recommendations

PostgreSQL version

You're running version 12, which reaches end-of-life (EOL) on November 14, 2024.
Our recommendation is to upgrade to v16.
By the time the upgrade takes place, v16 will have been in production state for many months, so we suggest planning the version upgrade as soon as possible.
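
To confirm the exact minor version currently running before planning the upgrade:

psql -U postgres -c "SELECT version();"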

Archivelog location

Consider using external shared storage, accessible from both the primary and the backup server, for archivelog storage.

Thales Hyderabad info

You can safely ignore this.

Archive:  wetransfer_hyderabad-metro-apis-pg-logs_2023-10-19_0855.zip
Written using Zippy
  Length      Date    Time    Name
---------  ---------- -----   ----
  9315434  2023-10-19 10:56   10.3.0.3 primary server PG log.zip
  5903497  2023-10-19 10:56   10.6.1.3 backup server PG logs.zip
---------                     -------
 15218931                     2 files