One of the persistent challenges with PostgreSQL major version upgrades is maintaining logical replication during the process. The standard pg_upgrade utility doesn't preserve logical replication slots, which typically means tearing down and rebuilding replication configurations. For production environments running multi-cluster topologies, this has always been a significant operational hurdle.

I recently ran an experiment to test whether Spock logical replication could survive a CloudNativePG (CNPG) major version upgrade without manual intervention. Although pg_upgrade itself does not preserve logical replication slots, two components work in tandem to bridge that gap. First, Spock stores all of its replication metadata (nodes, subscriptions, and replication sets) in dedicated tables within the spock schema, and those tables survive the upgrade intact as ordinary user data. Second, the pgEdge Helm chart's init-spock-job.yaml reads this preserved metadata after the upgrade completes and automatically recreates the necessary logical replication slots. This combination of persistent metadata and post-upgrade automation is what makes the process seamless.

The Experiment

The test environment consisted of three PostgreSQL clusters running version 16, configured with Spock logical replication between them. The goal was straightforward: upgrade all three clusters to PostgreSQL 17 and verify that logical replication continued functioning without rebuilding subscriptions or replication slots.

Test Parameters:

  • Three CNPG clusters (pgedge-n1, pgedge-n2, pgedge-n3)

  • Initial version: PostgreSQL 16.10

  • Target version: PostgreSQL 17.6

  • Spock logical replication configured between all clusters

  • One cluster (pgedge-n1) configured with three instances for high availability by default

  • No manual scaling operations during upgrade

The Helm chart used for this demonstration is available at https://github.com/pgEdge/pgedge-helm.git
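
To follow along, clone the repository first; the helm commands that follow are run from the repository root, which is why the chart path is given as "./":

git clone https://github.com/pgEdge/pgedge-helm.git
cd pgedge-helm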

Initial State: Three Clusters with Active Replication

The test began with three PostgreSQL 16 clusters, each initially running a single instance, with Spock replication already established between them. Each cluster could both publish and subscribe to changes from the others.

helm install \
  --values examples/configs/single/values.yaml \
  --wait \
  pgedge ./
NAME        AGE   INSTANCES   READY   STATUS               PRIMARY
pgedge-n1   5s    1                   Setting up primary   
pgedge-n2   5s    1                   Setting up primary   
pgedge-n3   5s    1                   Setting up primary

After initialization completed, the pgedge-n1 cluster was already running with three instances (the chart's default configuration for high availability):

NAME        AGE     INSTANCES   READY   STATUS                     PRIMARY
pgedge-n1   3m17s   3           3       Cluster in healthy state   pgedge-n1-1
pgedge-n2   3m17s   1           1       Cluster in healthy state   pgedge-n2-1
pgedge-n3   3m17s   1           1       Cluster in healthy state   pgedge-n3-1
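
Before inserting any data, it is worth confirming that each node reports its subscriptions as replicating. A minimal sketch, assuming Spock's sub_show_status() helper function is available in the installed Spock release (output columns vary by version):

# Assumes spock.sub_show_status() exists in the installed Spock version
for NODE in pgedge-n1-1 pgedge-n2-1 pgedge-n3-1; do
  echo "=== Subscriptions on $NODE ==="
  kubectl exec $NODE -- psql -U postgres -d app \
    -c "SELECT * FROM spock.sub_show_status();"
done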

Creating a test table on pgedge-n1 and verifying replication:

app=# CREATE TABLE IF NOT EXISTS test_table (
  id SERIAL PRIMARY KEY,
  val TEXT
);
INFO:  DDL statement replicated.
CREATE TABLE
app=# INSERT INTO test_table VALUES(1,'A');
INSERT 0 1
app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
(1 row)

Within moments, the change appeared on pgedge-n2:

app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
(1 row)
app=# INSERT INTO test_table VALUES(2,'B');
INSERT 0 1

And on pgedge-n3:

app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
(2 rows)
app=# INSERT INTO test_table VALUES(3,'C');
INSERT 0 1
app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
(3 rows)

Verification back on pgedge-n2 confirmed multi-directional replication was working:

app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
(3 rows)

At this stage, all three clusters were synchronizing changes bidirectionally. The real test would come next.

Required Pre-Upgrade Check: WAL Lag Verification

Before starting any major version upgrade with CNPG and Spock, especially when upgrading to PostgreSQL 18, always ensure all Spock replication slots are fully caught up. PostgreSQL 18 introduces stricter pg_upgrade verification: any negative or high WAL lag in logical replication slots can cause the upgrade to fail.

Running the verification check on each cluster:

# Run this on each cluster
for NODE in pgedge-n1-1 pgedge-n2-1 pgedge-n3-1; do
  echo "========================================"
  echo "Checking: $NODE"
  echo "========================================"
  kubectl exec -it $NODE -- psql -U postgres -d app -c "
    SELECT
      slot_name,
      plugin,
      active,
      pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as wal_lag
    FROM pg_replication_slots
    WHERE plugin = 'spock_output'
    ORDER BY slot_name;
  "
  echo ""
done

The wal_lag column should show 0 bytes for all slots before proceeding with the upgrade. If you observe negative or high WAL lag values, these must be addressed—either through resyncing or repairing the affected slots—before attempting the upgrade.
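
If the lag is simply a matter of subscribers not yet having caught up, the same catalog query can be polled until it drains. A minimal sketch for one node (repeat per node; slots that stay stuck or report negative lag still need to be repaired as described above):

# Poll pgedge-n1-1 until no spock_output slot reports non-zero WAL lag
until [ -z "$(kubectl exec pgedge-n1-1 -- psql -U postgres -d app -At -c "
    SELECT slot_name
    FROM pg_replication_slots
    WHERE plugin = 'spock_output'
      AND pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) <> 0;")" ]; do
  echo "Spock slots on pgedge-n1-1 still have WAL lag, waiting..."
  sleep 5
done
echo "All spock_output slots on pgedge-n1-1 are fully caught up."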

Important considerations:

PostgreSQL 18 Requirement: This verification step is particularly critical for PostgreSQL 18 upgrades, as pg_upgrade now performs stricter checks on replication slot synchronization. Slots that aren't fully caught up will block the upgrade process.

Backup First: Always take a physical backup before initiating the upgrade. This provides a rollback path if issues are discovered during or after the upgrade process.

Zero Tolerance: Don't proceed with any non-zero WAL lag. Even small amounts of lag can indicate synchronization issues that should be resolved in a controlled manner before the upgrade.
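
For the backup step above, if the cnpg kubectl plugin is installed and a backup destination is already configured for the clusters, an on-demand backup can be requested per cluster; a sketch:

# Request an on-demand backup of each cluster before the upgrade
for CLUSTER in pgedge-n1 pgedge-n2 pgedge-n3; do
  kubectl cnpg backup "$CLUSTER"
done

# Confirm the Backup resources reach the completed phase
kubectl get backups.postgresql.cnpg.io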

The Upgrade: PostgreSQL 16 to 17

The upgrade was triggered by updating the container image in the CNPG cluster specification. CloudNativePG handles the rest—orchestrating the upgrade process, managing temporary upgrade pods, and ensuring minimal downtime.

helm upgrade \
  --values examples/configs/single/values.yaml \
  --wait \
  pgedge ./
NAME        AGE     INSTANCES   READY   STATUS                             PRIMARY
pgedge-n1   7m46s   3                   Upgrading Postgres major version   pgedge-n1-1
pgedge-n2   7m46s   1                   Upgrading Postgres major version   pgedge-n2-1
pgedge-n3   7m46s   1                   Upgrading Postgres major version   pgedge-n3-1
NAME                                  READY   STATUS     RESTARTS   AGE
pgedge-init-spock-9j2kk               1/1     Running    0          3s
pgedge-n1-1-major-upgrade-2lrz8       0/1     Init:0/2   0          2s
pgedge-n2-1-major-upgrade-nq9dn       0/1     Init:0/2   0          1s
pgedge-n3-1-major-upgrade-rn2qr       0/1     Init:0/2   0          1s

Notice the pgedge-init-spock job running alongside the upgrade pods. This initialization job is crucial—it recreates Spock replication slots after the upgrade completes, ensuring logical replication can resume immediately.
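
To confirm that slot recreation runs to completion, the job can be waited on and its logs inspected. The job name below is taken from the pod listing above and will differ if the Helm release is named differently:

# Wait for the Spock initialization job to finish, then review its output
kubectl wait --for=condition=complete job/pgedge-init-spock --timeout=10m
kubectl logs job/pgedge-init-spock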

As the upgrade progressed:

NAME        AGE    INSTANCES   READY   STATUS                                       PRIMARY
pgedge-n1   9m7s   2           1       Waiting for the instances to become active   pgedge-n1-1
pgedge-n2   9m7s   1           1       Cluster in healthy state                     pgedge-n2-1
pgedge-n3   9m7s   1           1       Cluster in healthy state                     pgedge-n3-1

And finally:

NAME        AGE   INSTANCES   READY   STATUS                     PRIMARY
pgedge-n1   17m   3           3       Cluster in healthy state   pgedge-n1-1
pgedge-n2   17m   1           1       Cluster in healthy state   pgedge-n2-1
pgedge-n3   17m   1           1       Cluster in healthy state   pgedge-n3-1

Post-Upgrade Verification

After the upgrade completed, all three clusters were running PostgreSQL 17.6. The critical question: did logical replication survive?

Testing on pgedge-n1:

app=# select version();
                                                    version                                                     
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.6 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit
(1 row)
app=# INSERT INTO test_table VALUES(4,'A');
INSERT 0 1
app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
  4 | A
(4 rows)

On pgedge-n2:

app=# select version();
                                                    version                                                     
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.6 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit
(1 row)
app=# INSERT INTO test_table VALUES(5,'B');
INSERT 0 1
app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
  4 | A
  5 | B
(5 rows)

And on pgedge-n3:

app=# select version();
                                                    version                                                     
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.6 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit
(1 row)
app=# INSERT INTO test_table VALUES(6,'C');
INSERT 0 1
app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
  4 | A
  5 | B
  6 | C
(6 rows)

Confirming replication on pgedge-n2:

app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
  4 | A
  5 | B
  6 | C
(6 rows)

And pgedge-n1:

app=# SELECT * FROM test_table;
 id | val 
----+-----
  1 | A
  2 | B
  3 | C
  4 | A
  5 | B
  6 | C
(6 rows)

Logical replication was fully operational. No manual intervention was required.
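
Beyond the row-level checks, the slot inventory can be re-verified with the same catalog query used before the upgrade, confirming that the init-spock job recreated active spock_output slots on every node:

# Confirm the Spock replication slots exist and are active after the upgrade
for NODE in pgedge-n1-1 pgedge-n2-1 pgedge-n3-1; do
  echo "=== $NODE ==="
  kubectl exec $NODE -- psql -U postgres -d app -c "
    SELECT slot_name, plugin, active
    FROM pg_replication_slots
    WHERE plugin = 'spock_output'
    ORDER BY slot_name;
  "
done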

How Spock Survives the Upgrade

The key to understanding why this works lies in how Spock manages replication metadata. Unlike native PostgreSQL logical replication, which relies entirely on system catalogs and replication slots, Spock stores its configuration in dedicated tables within the spock schema:

  • spock.node — cluster definitions

  • spock.subscription — replication subscriptions

  • spock.replication_set — publication configurations

  • Additional metadata tables for conflict resolution, progress tracking, and state management

During a pg_upgrade, PostgreSQL preserves user schemas and their data while replacing system binaries and catalogs. Since Spock's metadata lives in user tables, it survives the upgrade intact. The init-spock job that runs after the upgrade reads this metadata and recreates the necessary logical replication slots, allowing replication to resume immediately.
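
This preserved metadata is easy to inspect directly. As an illustration (output will vary by environment), the tables listed above can be queried on any node:

# Inspect the Spock metadata preserved through pg_upgrade
for NODE in pgedge-n1-1 pgedge-n2-1 pgedge-n3-1; do
  echo "=== Spock metadata on $NODE ==="
  kubectl exec $NODE -- psql -U postgres -d app \
    -c "SELECT * FROM spock.node;" \
    -c "SELECT * FROM spock.subscription;" \
    -c "SELECT * FROM spock.replication_set;"
done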

This is fundamentally different from trying to preserve native PostgreSQL logical replication through an upgrade, where the replication slot configuration itself is lost.

Practical Implications

This capability has significant implications for production PostgreSQL deployments:

Simplified Upgrade Workflows: The init-spock-job.yaml eliminates the need to manually disable replication, perform upgrades in isolation, and rebuild replication configurations afterward. The automation handles slot recreation transparently.

Zero-Configuration Slot Recreation: The automated initialization job handles slot recreation based on Spock's stored metadata. Operators don't need to track subscription configurations separately or rebuild them manually post-upgrade.

Operational Confidence: Knowing that replication configuration survives version upgrades reduces the risk profile of major version upgrades in complex multi-cluster environments.

Considerations

While this approach works reliably, there are important operational factors to consider:

Upgrade Downtime: CNPG's in-place upgrade using pg_upgrade requires downtime. Plan your maintenance windows accordingly and ensure your application can tolerate the interruption.

Backup Strategy: It's strongly recommended to take physical backups both before and after the upgrade completes. This provides a rollback path if issues are discovered post-upgrade.

Docker Image Compatibility: Make sure the base Docker images of the current version and the target version are compatible. Per the CNPG documentation, you cannot upgrade a bullseye-based image to a bookworm-based image.

Spock Extension Compatibility: The Spock extension itself must be compatible with both the source and target PostgreSQL versions. For 16 to 17 upgrades, this is well-supported.
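
One quick way to check this on a running cluster is to read the installed extension version from pg_extension on each node, before and after the upgrade:

# Report the installed Spock extension version on each node
for NODE in pgedge-n1-1 pgedge-n2-1 pgedge-n3-1; do
  kubectl exec $NODE -- psql -U postgres -d app -At \
    -c "SELECT '$NODE', extname, extversion FROM pg_extension WHERE extname = 'spock';"
done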

Initialization Job Dependency: The init-spock job must run successfully after the upgrade. Monitor this job to ensure slot recreation completes as expected.

Physical Replication Independence: CNPG's physical replication (for replicas within a cluster) operates independently of Spock's logical replication. Both can coexist without interference.

Testing Recommended: As with any upgrade strategy, thorough testing in a non-production environment remains essential.

Conclusion

The real breakthrough here isn't that logical replication can survive a major version upgrade—it's that the pgEdge Helm chart's init-spock-job.yaml makes it happen automatically. This simple but powerful automation removes what has traditionally been a significant operational barrier when upgrading PostgreSQL in multi-cluster environments.

By leveraging Spock's metadata persistence and coupling it with intelligent post-upgrade slot recreation, the pgEdge Helm chart transforms a typically manual, error-prone process into a repeatable, automated workflow. While the upgrade itself still requires downtime and careful planning (as any major version upgrade should), the replication configuration aspect is completely handled.

For teams managing multi-cluster PostgreSQL environments with logical replication, this approach eliminates an entire category of upgrade complexity. The combination of CloudNativePG's orchestration capabilities and pgEdge's automation delivers a practical, production-ready solution for maintaining logical replication across major version upgrades. No manual reconfiguration, no risk of subscription setup errors, and no need for complex runbooks—just reliable, automated replication slot recreation when you need it most.