pgEdge PostgreSQL Posts

Unleashing the Power of PostgreSQL with pgEdge Distributed Multi-Master Replication and Postgres Platform - Part 1

Wed, 07 May 2025 04:47:08 GMT

Before we delve into the main subject of this blog, it is essential to understand the benefits of PostgreSQL replication, and the difference between single-master replication (SMR) and multi-master replication (MMR). In every modern business application, the database is becoming a critical part of the architecture and the demand for making the database performant and highly available is growing tremendously.

Planning Ahead for Better Performance

Our goal when designing a system for high performance is to make the database more efficient at handling application requests, so that the database does not become a business bottleneck. If your database resides on a single host, the resources of that host can easily be exhausted; you need a system that supports scaling the database so it can respond more effectively to heavy application load.

With pgEdge Distributed Postgres and the power of PostgreSQL, you can perform both horizontal and vertical scaling:

The technique of replicating data across multiple PostgreSQL databases that are running on multiple servers can also be considered horizontal scaling. The data is not distributed, but database changes are replicated to each cluster node so the application load can be divided across multiple machines to achieve better performance.

Reliability and high availability are also crucial for a powerful and responsive system:

Reliability means that the database is able to respond to user/application requests at all times with consistency and without any server interruption.

High-availability is also a critical consideration that ensures that database operations are not interrupted and the database downtime is minimized.

Statistically, downtime per year reflects the ability of your database and application to handle failures and outages without user downtime. Often, downtime per year is negotiated into a service level agreement (SLA) for applications that require high availability; this clause specifies the cumulative length of time within that year that the database can be down. To minimize downtime, pgEdge can actively replicate the same data to each node in the cluster. Components that handle failover and query routing are also used to ensure that the database remains highly available under stress.

PostgreSQL provides two methods of replication: asynchronous and synchronous.

If you are using asynchronous replication, data is written to the primary server first and is then replicated to the other database nodes without waiting for confirmation from each replica that the data has been written.

If you are using synchronous replication, data is written to the primary and replica nodes together: a transaction is not acknowledged as committed until the synchronous replicas confirm that they have received the change.

There are tradeoffs between asynchronous and synchronous replication. Synchronous replication is safer for critical data or high-end transactional workloads that require resiliency. Asynchronous replication is suitable for most workloads, but failover might take longer when compared to a synchronous replication configuration, and there might be some risk of data loss if all the changes are not replicated to all nodes.

In this summary, we've defined the terms used to describe replication in a PostgreSQL database. Let's now delve into the two deployment models for PostgreSQL replication.
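To make the downtime-per-year figure discussed earlier in this section more concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the sample targets are illustrative assumptions, not values taken from any particular SLA.

```python
def downtime_budget_hours(availability_percent: float,
                          hours_per_year: float = 365 * 24) -> float:
    """Return the cumulative downtime (in hours) that a target allows per year."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return hours_per_year * (1 - availability_percent / 100)


# Sample targets chosen for illustration only.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {downtime_budget_hours(target):.2f} hours/year")
```

A target of 99.9% leaves a budget of roughly 8.76 hours per year, which is the kind of figure that typically ends up in an SLA clause.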

Single-Master Replication

A single-master replication model consists of one primary node and one or more secondary nodes. In this model, write transactions are only sent to the primary node while read transactions are sent to both primary and secondary nodes. The secondary nodes (read-only replicas) are used to handle query requests that don't modify data. This scenario employs middleware products (like HAProxy) that sort the write and read requests between the primary and secondary nodes. In the event of a failure, a secondary node is promoted to become the primary node, with automated failover handled by products like Patroni and Pgpool. When a failover completes, the middleware (HAProxy) is updated to ensure that writes are sent to the new primary node.
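The routing idea described above can be sketched as a toy model. This is not how HAProxy, Patroni, or Pgpool are implemented; the class name, node names, and the keyword-based classification are all assumptions made for illustration.

```python
from itertools import cycle

# Statements that modify data; everything else is treated as a read (a simplification).
WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "MERGE"}


class ReadWriteRouter:
    """Toy router: writes go to the primary, reads rotate over all nodes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._read_pool = cycle([self.primary, *self.replicas])

    def route(self, sql):
        words = sql.split()
        first = words[0].upper() if words else ""
        if first in WRITE_KEYWORDS:
            return self.primary
        return next(self._read_pool)

    def promote(self, replica):
        """Simulate failover: the failed primary is dropped and a replica takes over."""
        self.replicas.remove(replica)
        self.primary = replica
        self._read_pool = cycle([self.primary, *self.replicas])


router = ReadWriteRouter("n1", ["n2", "n3"])
print(router.route("UPDATE employee SET name = 'x' WHERE id = 1"))
print([router.route("SELECT 1") for _ in range(3)])
router.promote("n2")
print(router.route("INSERT INTO employee VALUES (2, 'y')"))
```

After `promote("n2")`, writes are routed to the new primary, which mirrors the middleware update step that follows a failover.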

Multi-Master Replication and Conflicts

The multi-master replication deployment model consists of multiple nodes that each act as a primary (or master) node. Each node performs active-active replication with every other node; in an MMR cluster, client applications can perform both write and read operations against any node in the cluster. This configuration employs a shared-nothing architecture without a coordinator node.

You can configure single-master replication using only native PostgreSQL tooling, but multi-master replication capabilities must be provided by companies like pgEdge. pgEdge provides a fully distributed and 100% PostgreSQL-based cluster with benefits like low latency for high performance, selective filtering for data residency, and conflict resolution. Once configured, a pgEdge MMR cluster enables a client application to send write commands to any of the nodes in the cluster. It's worth noting that multiple clients updating the same record concurrently can lead to conflicts that are handled by the conflict-resolution solution provided by pgEdge.

During active-active replication, synchronization of data between nodes can cause a conflict if changes are applied to the same row on multiple nodes concurrently by more than one client session. A conflict can occur even if the transactions causing the problem take place at different times; the conflict arises when the changes are replicated to synchronize the nodes.

Different types of transactions will cause different types of conflicts in an MMR replication scenario; this will help you get a better understanding of MMR conflicts:

Conflict Detection and Resolution

From the PostgreSQL documentation:

“Logical replication behaves similarly to normal DML operations in that the data will be updated even if it was changed locally on the subscriber node. If incoming data violates any constraints the replication will stop. This is referred to as a conflict. A conflict will produce an error and will stop the replication; it must be resolved manually by the user. Details about the conflict can be found in the subscriber's server log”

The MMR solution from pgEdge provides a solution for detecting and resolving conflicts without breaking replication between nodes. In the examples that follow, conflicts are used to demonstrate how pgEdge platform detects and resolves issues automatically without impacting replication. pgEdge platform utilizes an open source extension named Spock that provides MMR capabilities with automatic DDL updates, conflict detection/resolution, and more.

In our example, we are going to use a 3 node pgEdge cluster that is running on different ports. The spock.node table below displays the nodes in the cluster.

We have created the table shown below, and used automatic DDL replication functionality from the Spock extension to replicate it across our cluster.

The above command spawns three sessions in the background and tries to update the employee name for the same row on all three nodes; this could potentially result in a conflict. The conflict is resolved automatically by the pgEdge Spock extension, which applies the commit with the latest timestamp and thereby uses a last-update-wins strategy.
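The last-update-wins idea can be illustrated with a small simulation. This is only a sketch of the strategy, not Spock's implementation: the `RowVersion` type, the node names, the timestamps, and the tie-break on node name are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class RowVersion:
    value: str        # new column value written by the update
    commit_ts: float  # commit timestamp of the originating transaction
    origin: str       # name of the node where the update was committed


def resolve_last_update_wins(local: RowVersion, incoming: RowVersion) -> RowVersion:
    """Keep whichever version carries the later commit timestamp.

    Ties are broken by node name so every node picks the same winner;
    that tie-break rule is an assumption made for this sketch.
    """
    return max(local, incoming, key=lambda v: (v.commit_ts, v.origin))


def apply_in_order(updates):
    """Apply replicated updates one by one, resolving each conflict as it arrives."""
    current = updates[0]
    for incoming in updates[1:]:
        current = resolve_last_update_wins(current, incoming)
    return current


# Three sessions update the same row on three different nodes.
updates = [
    RowVersion("Alice", 100.0, "n1"),
    RowVersion("Alicia", 100.5, "n2"),
    RowVersion("Ally", 100.2, "n3"),
]

# Whatever order the changes arrive in, every node converges on the same row.
winners = {apply_in_order(list(order)) for order in permutations(updates)}
print(winners)
```

Because the winner depends only on the commit timestamp (plus the assumed tie-break), the arrival order of replicated changes does not affect the final state of the row.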

Exception Logging

The pgEdge distributed Postgres Spock extension provides exception logging that logs the errors that are encountered while trying to apply changes at the replication subscriber. Exception logging ensures that replication between nodes isn't broken due to the errors caused by applying the replication changes.

The examples below cause a conflict by inserting the same value in the primary key column from multiple psql clients. The duplicate key violation error is captured in the exception log, and replication continues to function without any interruptions.

The example that follows causes a conflict while deleting the same record from multiple psql clients. The error occurs because during synchronization of nodes, the row to be deleted is missing on some nodes; this error is captured in the Spock exception log table without causing any interruption to the replication between nodes.

Spock's exception logging ensures that replication between nodes doesn’t fail when a discrepancy is encountered while trying to replicate changes to a node. The above examples demonstrate how conflicts are captured in the exception log table without causing any interruption to the replication. This allows you to review issues at a time that is convenient for you.
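The behavior described above (record the failure, keep applying) can be modeled in a few lines. This is a toy model and not Spock's code; the change format, the `ApplyError` class, and the in-memory exception log are assumptions made for illustration.

```python
class ApplyError(Exception):
    """Raised when a replicated change cannot be applied locally."""


def apply_change(table: dict, change: dict) -> None:
    """Apply one replicated change to a table keyed by primary key."""
    op, key = change["op"], change["key"]
    if op == "insert":
        if key in table:
            raise ApplyError(f"duplicate key value: {key}")
        table[key] = change["row"]
    elif op == "delete":
        if key not in table:
            raise ApplyError(f"row to delete is missing: {key}")
        del table[key]
    else:
        raise ApplyError(f"unsupported operation: {op}")


def apply_stream(table: dict, changes: list, exception_log: list) -> None:
    """Apply every change; record failures instead of stopping the stream."""
    for change in changes:
        try:
            apply_change(table, change)
        except ApplyError as err:
            exception_log.append({"change": change, "error": str(err)})


table = {1: "Alice"}
exception_log = []
incoming = [
    {"op": "insert", "key": 1, "row": "Bob"},    # duplicate primary key
    {"op": "delete", "key": 7},                  # row is missing on this node
    {"op": "insert", "key": 2, "row": "Carol"},  # applies normally
]
apply_stream(table, incoming, exception_log)
print(table)
print(exception_log)
```

The third change is still applied even though the first two failed, and both failures remain available for later review.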

Preserving replication slots across major Postgres versions - PostgreSQL high availability for major upgrades

Mon, 27 Jan 2025 13:45:05 GMT

In this blog (the third in my series), I'd like to present yet another new feature in the PostgreSQL 17 release: an enhancement to logical replication functionality in PostgreSQL. The blog will also provide a small script that demonstrates how to use this feature when upgrading from Postgres 17 to a future version. In my prior blogs (also published on Planet PostgreSQL and DZone), I have written about other PG-17 features, which you can read about here:

PostgreSQL 17 - A Major Step Forward in Performance, Logical Replication and More

PostgreSQL 17 and its key improvements

PostgreSQL 17 is a really powerful major release from the PG community - with this new release, community focus continues to be on making PostgreSQL even more performant, scalable, secure, and enterprise ready. Postgres 17 also improves the developer experience by adding new features for compatibility, and making existing features more powerful and robust.

These features also help products that provide distributed Postgres improve their PostgreSQL high availability (HA) experience, especially related to system upgrades across major versions. pgEdge also provides a PostgreSQL-based distributed database platform for low latency, high availability, and data residency. The HA capabilities of pgEdge ensure that major PostgreSQL upgrades can be done with nearly zero downtime so your applications can continue to work without user interruption. There is work in progress that will provide a path to a zero downtime upgrade by adding and removing nodes from the cluster. This is not the main topic of this blog but stay tuned for more about this functionality.

The diagram below (from one of my older blogs, updated for PG-17) describes the evolution of the logical replication feature in PostgreSQL. The building blocks for logical replication were added in PostgreSQL 9.4, but the logical replication feature wasn't added until PostgreSQL 10. Since then, there have been a number of important improvements to logical replication.

Preserving Replication Slots

Now coming back to our topic, this new feature makes it possible to preserve replication slots while performing upgrades between major versions of Postgres, eliminating the requirement to resync the data between two nodes that were replicating the data using logical replication. Please note this feature is only available for use when performing upgrades from Postgres 17 to future major versions. Upgrades from versions prior to Postgres 17 still need to follow the process of recreating the replication slots and creating subscribers that resync the data between the replicating nodes.

This patch was authored by Hayato Kuroda and Hou Zhijie and committed by Amit Kapila. Here is the Postgres commit log entry for this feature:

commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
Author: Amit Kapila
Date: Thu Oct 26 06:54:16 2023 +0530

    Migrate logical slots to the new node during an upgrade.

    While reading information from the old cluster, a list of logical slots is fetched. At the later part of upgrading, pg_upgrade revisits the list and restores slots by executing pg_create_logical_replication_slot() on the new cluster. Migration of logical replication slots is only supported when the old cluster is version 17.0 or later.

    If the old node has invalid slots or slots with unconsumed WAL records, the pg_upgrade fails. These checks are needed to prevent data loss.

    The significant advantage of this commit is that it makes it easy to continue logical replication even after upgrading the publisher node. Previously, pg_upgrade allowed copying publications to a new node. With this patch, adjusting the connection string to the new publisher will cause the apply worker on the subscriber to connect to the new publisher automatically. This enables seamless continuation of logical replication, even after an upgrade.

Sample script

Now let's write a little script to test this feature; as I've mentioned before, this only works when you are upgrading from Postgres 17 to a future major release. Any replication slots on the old cluster that are invalid or have unconsumed WAL will need to be repaired prior to the upgrade or the upgrade will fail.

Please note that the script below uses Postgres 17.2 for both the old and new clusters to demonstrate the functionality; we'll have to wait for another major version to become available before we can actually show the functionality at its best. The results from the script are also listed below, showing the replication slot created in the old cluster has been copied over to the new cluster.

Results from running the script:

PostgreSQL 17 - A Major Step Forward in Performance, Logical Replication and More

Fri, 11 Oct 2024 07:01:00 GMT

After a successful 3rd beta in August 2024, the PostgreSQL development group released the GA version of Postgres 17 on September 26th. Recently, I blogged about some of the key logical replication features that you'll see in PostgreSQL 17 https://www.pgedge.com/blog/logical-replication-features-in-Postgres 17. In this blog I'll describe a couple of new performance features that you'll find in Postgres 17 as well as another important logical replication feature that I didn't cover in my earlier blog of this series.

PostgreSQL has grown remarkably over the years, and with each major release has become a more robust, reliable, and responsive database for both mission critical and non-mission critical enterprise applications. The global and vibrant PostgreSQL community is contributing to PostgreSQL success, diligently ensuring that all changes are carefully scrutinized and reviewed before they are added to the project source code. It is also very encouraging to see big technology names like Microsoft, Google, and others investing in Postgres by developing in-house expertise and giving back to the open source community.

Improvements to logical replication are making it even more robust and reliable for enterprise use, while providing core capabilities that vendors like pgEdge can build on to deliver fully distributed PostgreSQL. Distributed PostgreSQL refers to the implementation of PostgreSQL in a distributed architecture, allowing for enhanced scalability, fault tolerance, and improved performance across multiple nodes. A pgEdge fully distributed PostgreSQL cluster already provides essential enterprise features like improved performance with low latency, high availability, data residency, and fault tolerance.

Now, without further ado, let's discuss some PostgreSQL 17 performance features:

Improved Query Performance with Materialized CTEs

Common Table Expressions (CTEs) in PostgreSQL are temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. They enhance the readability and organization of complex queries and can be recursive, making them particularly useful for hierarchical data. The basic syntax of a CTE query is as follows:

Include the WITH keyword in a query to create the CTE; the parent query (that defines the result set) follows the AS clause after the CTE name. After defining the CTE, you can refer to the CTE by name to reference the result set of the CTE and carry out further operations on the result set within the same query.

PostgreSQL 17 continues to enhance performance and capabilities around CTEs, including improvements in query planning and execution. Older versions of Postgres treat CTEs as optimization fences, meaning the planner could not push down predicates into them. However, from PostgreSQL 12 onward, the planner can inline a CTE that is referenced only once into the outer query, which allows more efficient execution plans. You should always analyze your queries and consider the execution plans when performance is critical.

Performance tip: If you will be referring to the same result set multiple times, create the CTE with the MATERIALIZED keyword. When you create a materialized CTE, Postgres computes and stores the result of the parent query. Then, subsequent references to the CTE don't have to repeat complex computations.
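The basic WITH ... AS shape described above can be tried without a PostgreSQL server. The sketch below uses SQLite (through Python's built-in sqlite3 module) purely as a stand-in, since the basic CTE syntax is the same; the employees table and its data are hypothetical, and the MATERIALIZED keyword is left out because this is not PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ann', 'eng', 80), ('bob', 'eng', 70),
        ('cy', 'ops', 60), ('dee', 'sales', 120);
    """
)

query = """
WITH dept_totals AS (              -- WITH names the CTE, AS introduces its query
    SELECT dept, SUM(salary) AS total
    FROM employees
    GROUP BY dept
)
SELECT dept, total                 -- the outer query refers to the CTE by name
FROM dept_totals
WHERE total > 100
ORDER BY dept;
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('eng', 150), ('sales', 120)]
conn.close()
```

In PostgreSQL 12 and later, writing `AS MATERIALIZED (` in place of `AS (` is how you would request the materialized behavior mentioned in the performance tip.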

Extracting column statistics from CTE references; Postgres 17 improves materialized CTEs

A materialized CTE basically acts as an optimization fence, which means that the outer query won’t influence the plan of the sub-query once that plan is chosen. The outer query has visibility into the estimated width and row counts of the CTE result set, so it makes sense to propagate the column statistics from the sub-query to the planner for the outer query. The outer query can make use of whatever information is available, allowing the column statistical information to propagate up to the outer query plan but not down to the CTE plan.

This bug reported to the community contains a simple test case that demonstrates the effect of this improvement on the query planner:

https://www.postgresql.org/message-id/flat/18466-1d296028273322e2%40postgresql.org

Example - Comparing Postgres 16 behavior to Postgres 17

First, we create our work space in Postgres 16 (two tables and indexes) and run ANALYZE against it:

Then, we create our materialized CTE:

The query plan from our Postgres 16 code sample contains:

As you can see in the query plan, the estimate of 200 rows from the sub-query is wrong, which is impacting the overall plan.

Then, we test the same setup and query against PostgreSQL 17:

As you can see in the query plan for Postgres 17, the column statistics from the subquery are correctly propagating to the upper planner of the outer query. This helps PostgreSQL choose a better plan that improves the execution time of the query.

This is a simple query, but with bigger and more complex queries this change can result in a major performance difference.

Propagating pathkeys from a CTE to an Outer Query

Another interesting improvement to CTE functionality in Postgres 17 is the propagation of pathkeys from the sub-query to the outer query. In PostgreSQL, pathkeys are a part of the query execution planning process used primarily for sorting and ordering rows in queries that require ordered results, such as queries with an ORDER BY clause, or when sorting is needed for other operations like merge joins.

Prior to Postgres 17, the sort order of the materialized CTE sub-query was not shared with the outer query, even if sort order was guaranteed by either an index scan node or sort node. Without a guaranteed sort order, the PostgreSQL planner may choose a less optimized plan, whereas having a guaranteed sort order makes it more likely to choose an optimized plan.

With PostgreSQL 17, if a CTE is materialized and has a specific sort order, the planner can reuse that information in the outer query, improving performance by avoiding redundant sorting or enabling more efficient join methods. As noted in the commit comments by Tom Lane, "The code for hoisting pathkeys into the outer query already exists for regular subqueries, but it wasn't getting used for CTEs, possibly out of concern for maintaining an optimization fence between the CTE and the outer query."

This simple modification to the Postgres source code should result in performance improvements for queries involving complex CTEs, especially those where sorting or merge joins can be optimized based on the inherent order of CTE results.

Here is an example using the data in the PostgreSQL regression suite:

The query plan from our Postgres 16 code sample contains:

The query plan from our Postgres 17 code sample contains:

The query plans in Postgres 16 and Postgres 17 are significantly different due to this version 17 enhancement. This is a small example; the performance gain can be significant in larger queries. Please note that this improvement is only effective if the CTE subquery has an ORDER BY clause.

Fast B-Tree index scans for Scalar Array

In PostgreSQL, ScalarArrayOpExpr is a node type in the execution plan that handles queries involving operations like IN or ANY with arrays or lists of values. It's particularly useful for queries where you compare a column against a set of values, such as:

ScalarArrayOpExpr allows PostgreSQL to optimize queries that involve multiple comparisons that use IN or ANY. PostgreSQL 17 has introduced new performance enhancements to make these operations even faster.

In PostgreSQL 17, significant improvements have been made to B-tree index scans, which optimize performance, particularly for queries with large IN lists or ANY conditions. These enhancements reduce the number of index scans performed by the system, thereby decreasing CPU and buffer page contention, resulting in faster query execution.

One of the key improvements is in handling Scalar Array Operation Expressions (ScalarArrayOpExpr), which allows more efficient traversal of B-tree indexes, particularly for multidimensional queries. For example, when you have multiple index columns (each with its own IN list), PostgreSQL 17 can now process these operations more efficiently in a single index scan, rather than multiple scans as in earlier versions. This can lead to performance gains of 20-30% in CPU-bound workloads where page accesses were previously a bottleneck.

Additionally, PostgreSQL 17 introduces better management of internal locks, further enhancing performance for high-concurrency workloads, especially when scanning multiple dimensions within a B-tree index.

We can demonstrate this with a simple example. We'll use the same table and data that we used in the previous example from the Postgres regression suite.

Our example, first run on Postgres 16:

In the previous query you can see that the shared buffer hit for the query was 9 and that it took 3 index scans to get the results. In PostgreSQL, the term shared hit refers to a specific type of cache hit related to buffer management. A shared hit occurs when PostgreSQL accesses a data block or page from the shared buffer pool rather than from disk, improving query performance.

The same example, this time run on Postgres 17:

As you can see, with Postgres 17 the shared buffer hit is reduced to 5, and most importantly it is only doing one index scan (as opposed to 3 scans in the case of Postgres 16). With this improvement in Postgres 17, the performance of scalar array operations is greatly improved, and Postgres can choose from better optimized query plans.
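To give a feel for why one scan can be cheaper than several, here is a deliberately simplified model that treats an index as a sorted Python list. It is not how PostgreSQL's B-tree code works; the function names, the notion of a "descent", and the assumption of unique keys are all illustrative.

```python
from bisect import bisect_left


def lookup_per_value(keys, wanted):
    """One binary-search descent per list element (the multi-scan behavior)."""
    found, descents = [], 0
    for value in wanted:
        descents += 1
        i = bisect_left(keys, value)
        if i < len(keys) and keys[i] == value:
            found.append(value)
    return found, descents


def lookup_single_pass(keys, wanted):
    """One descent to the smallest element, then a forward walk over the sorted keys."""
    targets = sorted(set(wanted))
    if not targets:
        return [], 0
    found, descents = [], 1
    i, t = bisect_left(keys, targets[0]), 0
    while i < len(keys) and t < len(targets):
        if keys[i] < targets[t]:
            i += 1          # key is too small, keep walking forward
        elif keys[i] > targets[t]:
            t += 1          # target is not in the index, try the next one
        else:
            found.append(keys[i])
            i += 1
            t += 1
    return found, descents


keys = list(range(0, 100, 2))      # unique, sorted index keys (assumption)
wanted = [10, 42, 43, 98]          # plays the role of an IN list
print(lookup_per_value(keys, wanted))
print(lookup_single_pass(keys, wanted))
```

Both functions return the same matches, but the first performs one descent per list element while the second performs a single descent, which loosely mirrors the drop from 3 index scans to 1 described above.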

Retention of logical replication slots and subscriptions during upgrade

The retention of logical replication slots and migration of subscription dependencies during the major upgrade process is another logical replication feature added to PostgreSQL 17. Please note that this feature will only be useful when upgrading from PostgreSQL 17 to later versions; it is not supported for upgrades from versions prior to Postgres 17. The replication slots and replication origins are generated when building a logical replication environment. However, this information is specific to the node (it records replication status, application status, and WAL transmission status), so these objects aren’t upgraded as part of the upgrade process. Once the publisher node is upgraded, the user needs to manually reconstruct these objects.

The pg_upgrade process is improved in PostgreSQL 17 to reference and rebuild these internal objects; this functionality enables replication to automatically resume when upgrading a node that has logical replication. Previously, when performing a major version upgrade, users had to drop logical replication slots, requiring them to re-synchronize data with the subscribers after the upgrade. This added complexity and increased downtime during upgrades.

You need to follow these steps when upgrading the publisher cluster:

Ensure any subscriptions to the publisher are temporarily disabled by performing an ALTER SUBSCRIPTION ... DISABLE. These are enabled after the upgrade process has completed.

Set the new cluster's wal_level to logical.

The max_replication_slots on the new cluster must be set to a value greater than or equal to the number of replication slots on the old cluster.

Output plugins used by the slots must be installed in the new cluster.

All the changes from the old cluster must already be replicated to the target cluster prior to the upgrade.

All slots on the old cluster must be usable; you can ensure this by checking the conflicting column in the pg_replication_slots view. Conflicting should be false for all the slots on the old cluster.

No slots in the new cluster should have a value of false in the Temporary column of the pg_replication_slots view. There should be no permanent logical replication slots in the new cluster.

The pg_upgrade process of upgrading replication slots will result in an error if any of the above prerequisites aren’t met.
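The prerequisites listed above can be expressed as a simple pre-flight check. This is a hypothetical helper that works on plain dictionaries, not a wrapper around pg_upgrade; the keys slot_name, plugin, conflicting, and temporary mirror column names in the pg_replication_slots view, while the function name and the caught_up flag are assumptions made for this sketch.

```python
def check_slot_upgrade_prereqs(old_slots, new_slots, new_settings, new_plugins):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if new_settings.get("wal_level") != "logical":
        problems.append("wal_level on the new cluster must be 'logical'")
    if new_settings.get("max_replication_slots", 0) < len(old_slots):
        problems.append("max_replication_slots on the new cluster is too small")
    for slot in old_slots:
        name = slot["slot_name"]
        if slot["plugin"] not in new_plugins:
            problems.append(f"{name}: output plugin {slot['plugin']} is not installed")
        if slot["conflicting"]:
            problems.append(f"{name}: slot is marked conflicting")
        if not slot["caught_up"]:  # hypothetical flag: all changes already replicated
            problems.append(f"{name}: changes are not fully replicated")
    if any(not slot["temporary"] for slot in new_slots):
        problems.append("new cluster already has a permanent logical slot")
    return problems


old_slots = [
    {"slot_name": "sub1", "plugin": "pgoutput", "conflicting": False, "caught_up": True},
]
settings = {"wal_level": "logical", "max_replication_slots": 10}
print(check_slot_upgrade_prereqs(old_slots, [], settings, {"pgoutput"}))
```

In practice you would populate these dictionaries from queries against both clusters before running the upgrade; the check itself only encodes the rules from the list.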

PostgreSQL 17 and its key improvements now available for pgEdge Distributed PostgreSQL

Wed, 02 Oct 2024 11:00:00 GMT

The PostgreSQL community released PostgreSQL 17 to GA on September 26, 2024. With PostgreSQL 17, community focus continues to be on making PostgreSQL more performant, scalable, secure, and enterprise ready. Postgres 17 also improves the developer experience by adding new features for compatibility, and making existing features more powerful and robust. pgEdge, which provides a PostgreSQL-based distributed database platform for low latency, high availability, and data residency, this week made PostgreSQL 17 available as a supported Postgres version in pgEdge Platform, alongside PostgreSQL versions 15 and 16. Support for PostgreSQL 17 in pgEdge Cloud will come later in Q4.

pgEdge support for PostgreSQL 17 makes it available as part of a responsive multi-master cluster that offers enhanced replication capabilities like DDL replication, conflict management, conflict avoidance, and more. pgEdge supports clusters running on a mix of different PostgreSQL versions, permitting zero downtime major version upgrades.

Recently, I blogged about some of the key logical replication features that you'll see in PostgreSQL 17 https://www.pgedge.com/blog/logical-replication-features-in-Postgres 17.

In this blog, we'll pick up where I left off. The following sections detail the major improvements in Postgres 17 that enhance database behavior in a multi-master distributed cluster.

Logical Replication

The most notable improvements in PostgreSQL 17 are improvements to logical replication features:

Storage with Incremental Backup

Block-level incremental backup is a major feature added to pg_basebackup in PostgreSQL 17. The incremental backup feature allows you to back up only the changes since the last full backup. This feature will greatly improve the efficiency of backups and reduce the storage you need for storing backups. Instead of performing a full backup every time, you can instruct the server to back up the changes since the last full backup, significantly reducing the size of the backup and decreasing the time it takes to perform the backup.
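The idea of storing only changed blocks can be shown with a toy example. This sketch compares block contents directly, which is not how PostgreSQL tracks changes; the tiny block size, the function names, and the assumption that data never shrinks are all simplifications made for illustration.

```python
BLOCK_SIZE = 4  # unrealistically small so the example is easy to read


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def incremental_backup(full: bytes, current: bytes) -> dict:
    """Return only the blocks that changed or were added since the full backup."""
    old, new = split_blocks(full), split_blocks(current)
    return {i: blk for i, blk in enumerate(new) if i >= len(old) or old[i] != blk}


def restore(full: bytes, incremental: dict) -> bytes:
    """Rebuild the current data from the full backup plus the changed blocks."""
    blocks = split_blocks(full)
    for i in sorted(incremental):
        if i < len(blocks):
            blocks[i] = incremental[i]
        else:
            blocks.append(incremental[i])
    return b"".join(blocks)


full_backup = b"AAAABBBBCCCC"
current_data = b"AAAAXXXXCCCCDD"

delta = incremental_backup(full_backup, current_data)
print(delta)  # {1: b'XXXX', 3: b'DD'}
print(restore(full_backup, delta) == current_data)  # True
```

Only two of the four blocks need to be stored in the incremental backup, which is the source of the storage and time savings described above.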

Performance

Several enhancements have been made to Postgres 17 to improve performance:

Major improvements to common table expression (CTE) queries: By propagating information like pathkeys and column statistics to the upper level plan, PostgreSQL significantly improves query planning and executes CTE queries faster.

Better memory management of VACUUM: The vacuum process is optimized to reduce memory usage by up to 20 times by introducing a more efficient internal memory structure for use during vacuum operations. This leads to faster execution, especially on large tables, and frees up more shared memory resources for other operations.

Improved WAL throughput: Write ahead log handling is significantly improved in Postgres 17, allowing twice the WAL throughput in certain high concurrency workloads.

Compatibility

Key compatibility improvements were introduced, including MERGE command updates and better JSON support:

The MERGE command benefits from the following improvements in Postgres 17:

Allow the MERGE command to modify updatable views.

The use of the RETURNING clause is now supported in the MERGE command; the new function merge_action() reports on the DML that generated the row.

pgEdge Platform Support for Large Object Logical Replication

Wed, 07 Aug 2024 07:08:00 GMT

Replication of large objects isn't currently supported by the community version of PostgreSQL logical replication. If you try to replicate a large object with logical replication, PostgreSQL will return an error. It's a meaningful error (always nice), but not helpful if you have large objects that you need to replicate.

pgEdge has developed an extension named LargeObjectLOgicalReplication (LOLOR) that provides support for replicating large objects. The primary goal of LOLOR is to provide seamless replication of large objects with pgEdge Spock multi-master distributed replication.

You can access and manipulate large objects in a PostgreSQL database with the following client interface functions:

The pgEdge LOLOR extension supports the same large object functions put in place by PostgreSQL, so all of your existing applications that use the previously mentioned functions will continue to work seamlessly. The easiest way to install the LOLOR extension is with pgEdge Platform. After installing pgEdge Platform, you can use pgEdge Platform to install LOLOR, create the extension, and add it to the relevant configuration parameter by navigating into the installation directory and running the command:

In this blog, we are going to create a two node pgEdge cluster on the localhost to demonstrate how pgEdge Platform replicates large objects. We'll also share a native PSQL example of using the extension for replicating large objects, and a JDBC example that shows how we can use the extension from a Java program using a JDBC driver.

In any directory owned by your non-root user, use the following command to install pgEdge on all nodes of the cluster; you'll need to invoke this command on each replication node host:

Node 1 setup

Navigate into the installation directory on node 1 and perform the following steps:

Run the following command to set up the pgEdge platform; this command installs PostgreSQL version 16 and the pgEdge Spock and Snowflake extensions.

Then, run the following command to create a Spock node. Note that the user named in the command needs to be an OS user:

The next command creates the subscription between the two nodes. You should run this command after completing the initial pgEdge setup on the other node.

Then, use the following command to install the LOLOR extension:

Then, source your PostgreSQL installation, connect with PSQL, and run the CREATE EXTENSION statement to create the LOLOR extension:

You'll also need to set the extension's node configuration parameter before using the extension. Set the value to the number that corresponds to the node on which you're setting the parameter; the value can be from 1 to 2^28.

Please restart the server after adding the above configuration parameter to the postgresql.conf file. The postgresql.conf file is located in the data directory under your PostgreSQL installation.

Before using LOLOR functionality, you also need to add the large object catalog tables to the replication set. You can use the following commands:

The following commands are executed to enable automatic DDL replication:

Node 2 setup

Navigate into the installation directory on node 2 and perform the following steps to configure the LOLOR extension:

Run the following command to install pgEdge Platform; this will install PG-16 and the pgEdge Spock and Snowflake extensions.

Use the following command to create a Spock node. Please note that the user provided in the following command needs to be an OS user:

Then, use the following command to create the subscription between the two nodes:

Now we are ready to install the LOLOR extension with the command:

Then, log in with PSQL and invoke the CREATE EXTENSION statement:

You must set the node configuration parameter to a number that represents the node in the replication cluster before using LOLOR. Acceptable values range from 1 to 2^28.

Please restart the server after adding the above configuration parameter to the postgresql.conf file.

After setting the parameter, use the following commands to add the large object catalog tables to the replication set:

Then, execute the following commands to enable automatic DDL replication:

Example: Using the PSQL Command Line to Exercise LOLOR

In the sections that follow, we are going to do a short test that demonstrates large object replication using the PSQL client. PSQL is a secure, native PostgreSQL client that uses the libpq driver to negotiate connections.

First, we are going to perform the following SQL commands on node 1:

We have auto_ddl enabled, so the table is also replicated to the other nodes. We can query node 2 with the following statement to confirm that the large object was replicated:

Example: Using a JDBC Connection to Query a Large Object

The following program connects to a pgEdge node, loads a file into the database as a large object, and then performs retrieval operations.

To simplify connection management, you can specify connection information in the app.properties file, and then reference the file in your JDBC connection.

example.java

Logical Replication Features in PG-17

Thu, 23 May 2024 06:49:54 GMT

Introduction

About a year ago, I blogged about logical replication improvements in PostgreSQL version 16. PostgreSQL 16 was a really good release for logical replication improvements, with performance-critical features like parallel apply, replication origin filtering to support bi-directional replication, and allowing a standby server to be a publisher. Please refer to the old blog post for more details on version 16 replication-related features; you'll find that post at:

https://www.pgedge.com/blog/postgresql-16-logical-replication-improvements-in-action

PostgreSQL 17 also includes a number of significant improvements for logical replication. The enhancements are geared towards improving the usability of logical replication and meeting high-availability (HA) requirements. In this blog we are going to discuss some of the key logical replication features added to PostgreSQL 17; we won’t be covering all the new features in this blog, so there will likely be more than one blog in this series.

I want to thank my PostgreSQL community friends Amit Kapila, for introducing me to logical replication features in PostgreSQL 17, and Hayato Kuroda, for helping me to understand and test these features.

Synchronizing Slots from Primary to Standby (Failover Slot)

My top pick among the logical replication improvements in version 17 is failover slot synchronization; this is essentially a high availability feature that allows logical replication to continue working in the event of a primary failover. The feature keeps the designated logical replication slots on the standby server synchronized with the corresponding slots on the primary node. To meet this goal, the server starts a slotsync worker on the standby server that pings the primary server at regular intervals for logical slot information, and updates the local slot if there are changes.

There are two ways to use this feature:

The first approach is to enable the sync_replication_slots GUC on the standby node. In this approach, the slotsync worker periodically fetches slot information from the primary and updates the local slots. Note that if you take this approach, you should not call the pg_sync_replication_slots() function.

The other way to use this functionality is to call the pg_sync_replication_slots() function. If you use the function to update your slots, the backend process connects to the primary and performs the update operation once. Note that you cannot call the function if sync_replication_slots is turned on and the slotsync worker is already periodically refreshing the slots between the standby and primary.
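The two approaches are mutually exclusive, which can be sketched as a small helper that returns the SQL to run on the standby for each one. This is only an illustration: the helper name `slot_sync_statements` and the mode labels `"worker"` and `"manual"` are hypothetical, and the SQL strings reflect my understanding of PostgreSQL 17 (the `sync_replication_slots` setting and the `pg_sync_replication_slots()` function), so please check them against the documentation for your release.

```python
def slot_sync_statements(mode):
    """Return the SQL to run on the standby for the chosen approach.

    The mode names are hypothetical labels used only in this sketch.
    """
    if mode == "worker":
        # Continuous sync: the slotsync worker refreshes slots periodically.
        return [
            "ALTER SYSTEM SET sync_replication_slots = on;",
            "SELECT pg_reload_conf();",
        ]
    if mode == "manual":
        # One-off sync: a backend connects to the primary and updates once.
        # Only valid while sync_replication_slots is off.
        return ["SELECT pg_sync_replication_slots();"]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("worker", "manual"):
        print(mode)
        for statement in slot_sync_statements(mode):
            print("  " + statement)
```

Keeping the two paths in separate branches mirrors the restriction described above: a standby uses either the worker or the manual function call, never both at once.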

To enable this feature, you need to call the pg_create_logical_replication_slot() function or use the CREATE_REPLICATION_SLOT ... LOGICAL command on the primary node to configure a replication slot. When configuring the slot, set the property for the slot to .

You also need to set the following parameters to keep the physical standby synchronized with the primary server:

You can use the pg_replication_slots view to review the properties of a replication slot. Those slots with a synced value of in the pg_replication_slots view can resume logical replication after failover; these slots have been synchronized.

Another important step after failover to a synced slot is to update the connection information to the primary node for each subscriber. Connect to each subscriber, and use the ALTER SUBSCRIPTION command to update the connection information of the new primary.

Failover Slots in Action

In our example, we are going to spin up two instances of PostgreSQL; one instance will be our primary server, and the other will be our standby server. We will call the primary (publisher) instance node1, and the standby server node2 for the purposes of this example. We'll keep the replication slot on the standby server synchronized with the replication slot on the primary, so that in the event of a failover, the standby can be promoted to primary and logical replication can continue. After promoting the standby server to primary, any other standby server will need to be updated to connect to the new primary server.

pg_createsubscriber

pg_createsubscriber is an executable included in PostgreSQL 17 that converts a physical standby server into a logical replica. This utility creates a replication setup for each of the databases that are specified in the pg_createsubscriber command. If you specify multiple databases, the utility will create a publication and a subscription for each database, covering all the tables within the specified database(s).

When setting up replication, the initial data copy can be a slow process. When you use the pg_createsubscriber utility you can avoid the initial data synchronization, making this ideal for large database systems.

The source server wal_level needs to be set to , and max_replication_slots needs to be greater than the number of databases specified in the pg_createsubscriber command. You should review the complete list of Prerequisites and Warnings at the project page before using pg_createsubscriber.

The automated script that follows shows how to use the pg_createsubscriber utility to convert a physical standby server into a logical replication setup. The script will convert a primary and standby server into a logical replication setup with a publisher and subscriber for each database specified in the command. All the user tables that are part of the primary database will be added to the publication. In the example below, the pgbench tables are included in the publication.

Result of running the above scripts:

Conclusion

The demand for distributed PostgreSQL databases in the enterprise is growing rapidly, and replication is a vital and core part of any distributed system. Starting with PostgreSQL 10, the logical replication features in PostgreSQL have become more mature and feature-rich with every major release.

pgEdge builds on this strong foundation to provide fully distributed Postgres that delivers multi-master capability and the ability to go multi-region and multi-cloud. pgEdge adds essential features such as conflict management, conflict avoidance, automatic DDL replication and more to cater to the demands of always-on, always-available and always-responsive global applications.

PostgreSQL clustering solutions

Mon, 01 Apr 2024 12:12:00 GMT

Introduction

In my previous post, A Brief History of Logical Replication in Postgres - and Looking Ahead at its Likely Future Evolution, I provided a retrospective journey of the logical replication feature in PostgreSQL, starting from Postgres 9.4, where some of the building blocks were laid down. The blog also provides an insight into how a big feature like logical replication evolves and matures in the PostgreSQL community.

This is the second blog of a two-blog series. In this post, I will be talking about PostgreSQL cluster solutions that are based on logical replication and the pgEdge approach to creating a high availability cluster.

We have recently seen unprecedented growth in the user base for most enterprises; this in turn has led to exponential data growth. Scalability in a distributed PostgreSQL environment has become the most pressing need of a replication solution. In addition to scalability for better performance and low latency, enterprises need high availability. High availability means that there is near-zero downtime for users in the event of hardware/software/network issues or maintenance windows.

This is where distributed PostgreSQL comes into play. Before we go into specific Postgres cluster solutions, it is important to understand the concept of database clustering and its benefits. A Postgres cluster involves setting up a group of servers (nodes) to work together to provide a higher level of availability, reliability, and scalability than can be achieved with a single database server. In simpler terms, database clustering refers to the practice of linking several servers or instances together to work as a single system. This configuration enhances the performance, availability, and scalability of database systems. This is crucial for applications requiring high availability and performance, as it allows data to be replicated across multiple nodes and queries to be distributed among them, enhancing both fault tolerance and load distribution.

Now let's switch our attention to the main topic of this post: PostgreSQL clustering solutions that are based on logical replication. Our solution provides active-active multi-master capabilities; this means that all nodes in the cluster will have the same copy of the data, providing data redundancy. The nodes are configured with asynchronous multi-master replication, and application user traffic is distributed across the nodes to provide better performance and high availability.

pgEdge - Fully Distributed PostgreSQL

Applications these days have to be highly responsive and always available, even during maintenance windows. The user base for an application may be spread across a country or around the globe. Your application needs to be able to respond in real time, even during peak hours. The exponential growth in data seen in most businesses makes serving this data to users with a short turnaround time a challenging task. To achieve low latency and high availability, you need to deploy instances in data centers that are both close to your user and close to your business.

pgEdge has combined cutting-edge technology, unique solutions, and deep PostgreSQL expertise to provide a solution. pgEdge is a fully distributed PostgreSQL database, optimized for the network edge, and deployable across multiple cloud regions or data centers. The solution is a true multi-master (active-active) distributed database system that allows read and write operations at any node on the network. It seems almost magical, providing:

reduced data latency

high availability

targeted data residency

and most importantly, an improved customer experience.

The best part is you can get all of this, typically without any code changes. pgEdge allows both read and write operations to take place on any database node in a geographically distributed cluster. Each node runs standard PostgreSQL (version 14, 15 or 16), and a cluster can span multiple cloud regions or data centers. pgEdge nodes are loosely coupled, and are kept updated via asynchronous logical replication with conflict resolution.

pgEdge Solutions

Keeping the industry demand at the forefront, pgEdge offers fully-distributed multi-master PostgreSQL clustering solutions for both cloud (with pgEdge Cloud) and on-prem deployments (with pgEdge Platform).

pgEdge Cloud high availability clusters

pgEdge Cloud is fully-distributed PostgreSQL, deployable across multiple cloud regions or data centers. The pgEdge Cloud console harnesses the low latency, high availability, and data residency benefits of pgEdge distributed PostgreSQL in a fully managed cloud service running in multiple regions across AWS, Azure, or Google Cloud. pgEdge Cloud offers a free trial version that lets you experience a global, serverless PostgreSQL database in less than 90 seconds with powerful benefits and capabilities. You can deploy a highly-available three-node active-active multi-master cluster that handles read/write operations with built-in conflict resolution and:

Low latency - achieve high performance with low latency by deploying read/write nodes in regions close to the user.

Edge integration - integrate with Cloudflare Workers and other edge platforms.

Rapid deployment - One click provisioning for global clusters on a secure private network.

pgEdge Platform high availability clusters

pgEdge Platform is self-managed distributed PostgreSQL for developer evaluations or production use; use pgEdge Platform to self-host and self-manage pgEdge distributed PostgreSQL in your own data center or cloud account.

Database nodes running pgEdge Platform can participate in clusters that span data centers and any of the major cloud providers (AWS, Azure and Google Cloud). pgEdge Platform runs on a variety of common hardware and OS combinations, and enterprise-class support plans are available.

Installing pgEdge Platform

In any directory owned by your non-root user, install pgEdge on all nodes of the cluster:

On each node of the cluster, move into the directory and install pgedge, specifying a name for the database superuser, a password, and a database name. Note that the name cannot be the name of an OS user, pgedge, or any of the PostgreSQL reserved words. You can also use the --port option to install PostgreSQL on a port other than the default port (5432).

The command will download the required pgEdge components and verify the system prerequisites before installing the latest version of PostgreSQL 16 supported by pgEdge and configuring the server to support the pgEdge replication requirements. The server hosts a database (named ) with a database superuser (`admin`) that can log in to the database with the credentials specified (`mypassword1`).

The command will also install the spock and snowflake extensions. The spock extension provides multi-master replication with conflict resolution. The snowflake extension provides support for sequences for multi-node multi-master clusters; regular PostgreSQL sequences are single host only.

When executed, the command also creates a replication user with the same name as the OS user that invokes the command. This is the user that you will use in connection strings when you create nodes and subscriptions.

If you encounter a permissions error on EL9 running this command, you may need to update your SELINUX mode to or , reboot, and retry the operation.

Create Nodes

Next you will register each of the databases as a spock node. Using node names with a naming sequence like n1, n2, n3 (etc.) will automatically set the correct value for snowflake.node, enabling the use of snowflake sequences. The user named in the connection string is a replication user, and has to match the OS user that invoked the setup command; in this example that user is named rocky.

Node (IP address 10.1.2.5):

Node (IP address 10.2.2.5):

Create Subscriptions

Next we need to create the subscriptions between the nodes in your cluster to support bi-directional replication. The connection string for sub_n1n2 should specify the connection details for n2 in the create node command; the string specified for sub_n2n1 should specify the connection details for n1 in the create node command. Again, you'll include the identity of the replication user (rocky) in the connection string.

Node (IP address 10.1.2.5):

Node (IP address 10.2.2.5):

Our example is a simple two-node cluster; if you have a three-node cluster, the subscriptions should allow traffic between every pair of nodes in each direction. This means that for a three-node cluster you would create:

sub_n1n2 between node 1 and node 2

sub_n1n3 between node 1 and node 3

sub_n2n1 between node 2 and node 1

sub_n2n3 between node 2 and node 3

sub_n3n1 between node 3 and node 1

sub_n3n2 between node 3 and node 2

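As your cluster grows, the number of subscriptions required also grows: a cluster of N nodes needs N x (N - 1) subscriptions, one for each ordered pair of nodes.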
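To see how quickly the subscription count grows, here is a minimal sketch that enumerates the subscription names for a full mesh. The function name `mesh_subscriptions` is hypothetical; it only follows the sub_n1n2 naming pattern used in this example and does not call any pgEdge command.

```python
from itertools import permutations


def mesh_subscriptions(node_count):
    """Return one subscription name per ordered pair of distinct nodes."""
    nodes = range(1, node_count + 1)
    return [f"sub_n{a}n{b}" for a, b in permutations(nodes, 2)]


if __name__ == "__main__":
    for count in (2, 3, 4, 5):
        names = mesh_subscriptions(count)
        # A full mesh of N nodes needs N * (N - 1) subscriptions.
        print(count, len(names), names)
```

For three nodes this produces the same six names listed above; a five-node cluster already needs twenty subscriptions.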

Adding tables to the default Replication Set

The next step is to use spock commands to add tables to the default replication set and start replication. The default replication set is created when you install pgEdge; you have the option to create a custom replication set and add it to the subscription, but using the default replication set provided simplifies configuration for our example. You also have the option of using spock to add all the tables in a schema to the replication set. The power of logical replication that underpins the pgEdge multi-master platform allows you to configure extremely granular replication.

For this example, we'll use pgbench to add some tables. When you open pgbench or psql, specify your database name after the utility name.

On each node, source the PostgreSQL environment variables to add pgbench and psql to your OS PATH; this will make it easier to move between the nodes:

Then, use pgbench to set up a very simple four-table database. At the OS command line (on each node of your replication set), create the pgbench tables in your database (demo) with the pgbench command. You must create the tables on each node in your replication cluster:

Then, connect to each node with the psql client:

Once connected, alter the numeric columns, setting equal to . This will make these numeric fields conflict-free delta-apply columns, ensuring that the value replicated is the delta of the committed changes (the old value plus or minus any new value) to a given record:

Then, exit psql:

On the OS command line for each node, use the command to add the tables to the system-created replication set (named ); the command is followed by your database name:

The fourth table, , is excluded from the replication set because it does not have a primary key. The primary key is needed because the replication set is configured to replicate UPDATEs and/or DELETEs.
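The benefit of the delta-apply columns described above is easiest to see with a small worked example. The sketch below is a simplified model, not Spock's implementation: the function names `last_update_wins` and `delta_apply` are hypothetical, and it assumes each replicated change carries the old and new value of the column.

```python
def last_update_wins(local_value, incoming_old, incoming_new):
    """Overwrite the local value with the incoming one (one update is lost)."""
    return incoming_new


def delta_apply(local_value, incoming_old, incoming_new):
    """Apply only the difference made by the remote transaction."""
    return local_value + (incoming_new - incoming_old)


if __name__ == "__main__":
    start = 100
    n1_new = start + 20   # node 1 credits 20 -> 120
    n2_new = start - 10   # node 2 debits 10 at the same time -> 90

    # Each node now receives the other node's change.
    print("overwrite on n1:", last_update_wins(n1_new, start, n2_new))
    print("overwrite on n2:", last_update_wins(n2_new, start, n1_new))
    print("delta on n1:", delta_apply(n1_new, start, n2_new))
    print("delta on n2:", delta_apply(n2_new, start, n1_new))
```

With plain overwriting the two nodes end up with different values and one of the concurrent changes is lost; with delta apply both nodes converge on 110, which reflects both the credit and the debit.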

Adding a Custom Replication Set to a Subscription

Since we're using the default replication set (created by the pgEdge installer) we don't need to add the replication set to the subscription. If you are using a custom replication set, it needs to be added to the subscription. The following spock command adds a replication set to the subscription.

Please see the pgEdge documentation at https://docs.pgedge.com/platform/installing_pgedge for detailed information on creating custom replication sets and adding or removing replication sets from a subscription.

Useful Replication Status Views

You can use spock functions and tables to check the replication status of your tables. The pgEdge documentation also provides a list of functions and tables available for checking replication status and debugging issues.

To check available subscriptions:

To check tables and their assigned replication set:

To check subscription status:

Conclusion: Postgres High Availability Clusters

It is pretty clear that nearly every enterprise needs scalability to support its business needs and growing data requirements. PostgreSQL has done well in scaling upwards, but in most cases a single machine is not enough to meet application performance and high availability needs.

PostgreSQL has several clustering offerings, both open source and proprietary, based on physical streaming replication and on logical replication. pgEdge has a unique and robust product, and has proved itself as a leader in PostgreSQL distributed multi-master replication. pgEdge Cloud offers a state-of-the-art and user-friendly cloud console that simplifies cluster management. The pgEdge Platform provides a true and robust multi-master distributed PostgreSQL solution. Conflict management and conflict avoidance capabilities are truly unique to pgEdge, and are instrumental in a multi-master logical replication environment.

The product plans for pgEdge platform for 2024 are even more exciting. We are working on game-changing logical replication capabilities that are increasingly in demand by enterprise applications. The upcoming features in pgEdge platform will continue to improve ease of use and minimize the adjustments needed to adopt multi-master replication for real-world database applications. These features will include support for replication of DDL commands as well as working with large objects. On top of all of this, the pgEdge team is working on increasing replication throughput across nodes.

I will keep everyone posted on the above developments and will share information about our new features as they become available.

Stay tuned...

Logical replication evolution in chronological order & clustering solution built around logical replication

Wed, 17 Jan 2024 05:42:54 GMT

A brief history of PostgreSQL logical replication — and looking ahead at its likely future evolution

This blog is divided into two parts. In this section, we walk through how the logical replication feature has evolved over the years, what the recent improvements for Postgres logical replication are, and how the feature will likely change in the future. The second blog of the series will discuss the multi-master (active-active), multi-region, and highly available PostgreSQL cluster created by pgEdge that is built on top of logical replication and pglogical.

Postgres replication is the process of copying data between systems. PostgreSQL supports two main methods of replication: logical replication and physical replication. Physical replication copies the data exactly as it appears on the disk to each node in the cluster. Physical replication requires all nodes to use the same major version, because the on-disk format can change between major versions of PostgreSQL.

Logical replication, on the other hand, is the method of replicating data based on data changes. The building blocks of the logical replication feature were introduced in PostgreSQL 9.4; however, the feature was completed in PostgreSQL 10. Logical replication uses a publisher/subscriber model where multiple subscribers can subscribe to one or more publishers. Logical replication uses logical decoding plugins that format the data so it can be interpreted by other systems. This makes replication possible among heterogeneous systems and across major PostgreSQL releases, which in turn enables major version upgrades with near-zero downtime. Logical replication also provides fine-grained control over the replication set, so you can decide whether to replicate an entire table, only certain columns from a table, or all of the tables within a schema.

Postgres logical replication evolution in Chronological order

As mentioned above, the community began developing the underlying technology that made logical replication possible in PostgreSQL 9.4. These features are the core building blocks for the logical replication feature.

This section describes the main features for logical replication that were added in each release. To review a complete list of logical replication features for each release, please refer to the section of each version of the release notes.

This blog provides some context to the life cycle involved when building a major feature for PostgreSQL, and allows you to see how a feature matures over time. The basic logical replication feature was committed to PostgreSQL 10; however, it required important patches in subsequent releases to make the feature performant and usable. Logical replication is not finished yet; please read my thoughts in the final section on what might be on the roadmap for replication in the next set of releases.

PostgreSQL 9.4 - 2014

PostgreSQL 9.5 - 2016 Jan

PostgreSQL 9.6 - 2016 Sep

PostgreSQL 10 - 2017

PostgreSQL 11 - 2018

PostgreSQL 12 - 2019

PostgreSQL 13 - 2020

PostgreSQL 14 - 2021

PostgreSQL 15 - 2022

PostgreSQL 16 - 2023

PostgreSQL Logical Replication - Looking ahead

The building blocks for logical replication were added in PostgreSQL 9.4, but the logical replication feature itself was added in PostgreSQL 10. Since that release, there have been a number of important improvements to logical replication. The last two major releases of PostgreSQL have contributed to the performance and usability of logical replication with parallel apply on the subscriber, binary mode for the initial copy, row- and column-based filtering, and more.

Looking ahead at PostgreSQL 17 (and beyond) for logical replication, there is definitely a need for further performance improvement by increasing the replication rate and reducing the replication lag. I believe this can be achieved with parallelism support and worker optimization. There is also a need for better integration of logical replication with external tools for high availability and upgrades. Active-active (multi-master) replication is also within reach as part of the PostgreSQL core, but core is missing major features like conflict detection and resolution. Some of the missing but important features are provided by pgEdge's Spock extension. pgEdge provides a fully distributed PostgreSQL cluster that supports active-active replication with low latency, high availability, and data residency. Multi-master replication and the pgEdge clustering solution will be discussed in the next post of this series.

Embedding near the edge: pgEdge Distributed PostgreSQL with pgVector

Wed, 20 Sep 2023 06:14:00 GMT

Introduction

We are excited to announce that we now support the increasingly popular pgVector Postgres extension for storing and searching vector embeddings in AI-powered applications. Bringing pgVector and pgEdge’s distributed capabilities together makes for a powerful combination that greatly improves performance for users regardless of their geographic location.

In this blog we'll demonstrate how to configure pgVector with pgEdge to provide similarity search functionality across a pgEdge Distributed PostgreSQL cluster. I will start with a brief summary of the products mentioned in the title of this blog:

pgEdge is fully-distributed PostgreSQL, optimized for the network edge and deployable across multiple cloud regions or data centers. pgEdge is available as pgEdge Platform, self-hosted software available for download from [download link]; or as pgEdge Cloud, a fully managed service. This blog is applicable to both pgEdge Cloud and pgEdge Platform.

pgvector is an open source extension for PostgreSQL that enables efficient similarity search and other vector-based operations. It's often used for applications like recommendation systems and image search. The pgvector extension provides an indexable vector data type that stores vectors in a PostgreSQL database. pgvector supports the index, which implements the method of indexing.

Vector Database

A vector database stores data as high-dimensional vectors, which are mathematical representations of features or attributes. The number of dimensions in a vector ranges from tens to thousands, depending on the complexity and granularity of the data. The main advantage of a vector database is that it allows for fast and accurate similarity search and retrieval of data based on vector distance or similarity. So instead of searching data with predefined criteria, exact matches, or wildcards, you can use the vector database to find similar or relevant data based on semantic or contextual meaning.

Vector databases enable accurate and efficient search and analysis of large datasets by utilizing the characteristics of vectors. A vector database's capacity to locate comparable items is its key benefit. For example, two statements with comparable meanings will produce vectors that are close to one another. This allows you to use the vector database to locate all the vectors that are near to one another. For example, a vector database can be used to find:

images that are similar to a given image based on visual content and style.

documents that are similar to a given document based on topic and content.

products that are similar to a given product based on features and ratings.

Vector databases are currently a popular choice. With the rise of large language models (LLMs), efficiently managing and searching large-scale, high-dimensional data has become a tremendously important use case. The solution to this challenge lies in vector databases - a powerful and increasingly popular data storage technology that enables faster and more accurate searches.

With the addition of the open-source pgvector extension, PostgreSQL is being used as a vector database. There is a lot of excitement about using PostgreSQL as a vector database, but there is more innovation to come, and work to be done to make the vector workload more secure, performant, and scalable.

Vector Data

Before showing an example of how pgEdge works with the pgvector extension, it is important to understand the dynamics of vector data, and how it is stored in the database. Vector data refers to a type of data representation where each data point is described by a set of numerical values arranged in a specific order. These values are usually referred to as components or features, and they capture different aspects or attributes of a data point. Vectors are commonly used to represent a wide range of information in many fields: mathematics, computer science, data science, and machine learning.

Real-world applications utilize far more than just two dimensions; OpenAI embeddings may use more than a thousand dimensions to vectorize data. One method for converting high-dimensional data into a low-dimensional space is embedding. Embedding allows us to extract data from multiple dimensions and sources, including text, photos, audio, and video, and convert it into vectors. Embedding is a widely-used technique in machine learning and natural language processing (NLP) to represent sparse symbols or objects as continuous vectors.

For example, objects such as a car, truck, motorcycle, bicycle, helicopter, or hoverboard may all be converted into vectors using embeddings. A two-dimensional embedding is shown after each object in the following list:

car: embedding [2.0,2.3]

truck: embedding [3.4, 5.9]

motorcycle: embedding [0.5,1.2]

bicycle: embedding [0.2,0.8]

helicopter : embedding [13.2,19.8]

hoverboard: embedding [0.1,0.2]

A review of the list shows us that a bicycle and a motorcycle are similar and that their vectors (if charted) would be fairly close in distance. Vehicle characteristics can also be categorized along dimensions that include color, model, year, and manufacturer. The finer-grained your data is when describing an object, the more precise the resulting vehicle grouping will be.

Vector databases can efficiently find items that satisfy a query using vector representations. They use similarity metrics like Euclidean distance, cosine similarity, or Manhattan distance to determine data point proximity, resulting in relevant and similar results.
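The similarity metrics mentioned above can be sketched in a few lines of plain Python, using the two-dimensional vehicle embeddings from the list. This is an illustrative calculation only; the helper names (`euclidean`, `manhattan`, `cosine_similarity`, `nearest`) are hypothetical, and a real deployment would let pgvector compute the distances inside PostgreSQL.

```python
import math

EMBEDDINGS = {
    "car": [2.0, 2.3],
    "truck": [3.4, 5.9],
    "motorcycle": [0.5, 1.2],
    "bicycle": [0.2, 0.8],
    "helicopter": [13.2, 19.8],
    "hoverboard": [0.1, 0.2],
}


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(name, metric=euclidean):
    """Return the other object whose embedding is closest to `name`."""
    target = EMBEDDINGS[name]
    others = (key for key in EMBEDDINGS if key != name)
    return min(others, key=lambda key: metric(target, EMBEDDINGS[key]))


if __name__ == "__main__":
    print(nearest("bicycle"))
    print(round(euclidean(EMBEDDINGS["bicycle"], EMBEDDINGS["motorcycle"]), 3))
```

Using Euclidean distance, the closest neighbor of the bicycle is the motorcycle, which matches the observation made above.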

pgvector syntax

The pgvector extension introduces a vector data type that can be used as the column type in a PostgreSQL database. The simple examples that follow show how to use the vector data type in statements, and search the vector data. Invoke the following commands with the psql client:

Creating a Sample Table

Retrieving Data

Managing Data Operations

Querying Aggregates

Creating Indexes

PostgreSQL can create indexes for vectors that hold up to 2000 dimensions.

You can create embeddings using tools like the OpenAI API client. Similarity searches of vector embeddings have a variety of commercial uses, such as fraud detection, food industry applications, and security systems.

pgvector real world example

The following example is real-world sample code for an AI-based enquiry system that tries to automatically answer client queries. It has a limited knowledge base; if it doesn't know the answer, it replies appropriately.

This generates the following log:

The above sample code demonstrates the use of the pgvector extension in a real-world AI-based enquiry system that tries to automatically answer client queries. We can divide the application into four sections:

Questions mimic client queries to drive learning. Since this is an intelligent automatic-reply enquiry system, we feed all of the client queries into the QUERIES array.

The system has a knowledge base that contains all the information that we want the system to learn. The knowledge base grows.

We perform a similarity search that exercises pgvector/PostgreSQL capabilities. We iterate and get responses from the system for each query.

We generate a response from an AI model. Since we have a limited knowledge base, if our AI model doesn't know the answer, it replies accordingly. We expect that it will reply to all the related queries.
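The four sections above can be sketched in a few lines of Python. This is not the application from the post: the sample queries, knowledge base entries, fallback text, and the word-count "embedding" are all hypothetical stand-ins so the flow can run without PostgreSQL or OpenAI. In the real system, the embedding would come from a model and the similarity search would be a pgvector query.

```python
import math
import re
from collections import Counter

# 1. Questions that mimic client queries (hypothetical examples).
QUERIES = [
    "Does pgvector support similarity search?",
    "What is the capital of France?",
]

# 2. The knowledge base the system can answer from (hypothetical entries).
KNOWLEDGE_BASE = [
    "pgvector adds a vector data type and similarity search to PostgreSQL.",
    "pgEdge provides multi-master replication for PostgreSQL.",
]

FALLBACK = "Sorry, I don't have information about that."


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(query, knowledge, threshold=0.3):
    """3. Similarity search: return the closest entry, or None if too far."""
    q = embed(query)
    score, entry = max((cosine(q, embed(k)), k) for k in knowledge)
    return entry if score >= threshold else None


def answer(query, knowledge=KNOWLEDGE_BASE):
    """4. Build a reply; fall back when the knowledge base has nothing related."""
    context = best_match(query, knowledge)
    if context is None:
        return FALLBACK
    return f"Based on our records: {context}"


if __name__ == "__main__":
    for q in QUERIES:
        print(q, "->", answer(q))
```

The threshold is the design decision that matters here: it decides when the system admits it does not know rather than answering from an unrelated entry. The value 0.3 is arbitrary and would need tuning against real embeddings.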

Exercising the example

Query:

The knowledge base contains the following entry to educate the automatic system to answer correctly:

Enquiry System Response:

The system can perform a similarity search to correctly answer the query posted by the client. This is possible with the help of the PostgreSQL pgvector extension and the OpenAI embedding generation feature. PostgreSQL with pgvector provides not only vector search, but also storage and the other RDBMS features that help us develop a professional, industrial-quality application.

To generate a reasonable response to the client, we used an OpenAI model to generate an answer to the query. If the knowledge base provides no related knowledge, the system replies accordingly.

This application is written in basic Python code to demonstrate real-world use of the pgvector extension. It was tested with PostgreSQL 15 (with the pgvector extension installed), OpenAI (via online internet access), and Python 3.9.

PostgreSQL 16 Logical Replication Improvements in Action

Wed, 02 Aug 2023 13:02:55 GMT

In my previous blog, we started discussing this topic: https://www.pgedge.com/blog/postgresql-replication-and-upcoming-logical-replication-improvements-in-postgresql-16

I briefly discussed replication methods in PostgreSQL and provided a summary of some of the key logical replication features that made it into PostgreSQL 16. In this blog, I will dive deep into a couple of performance features for logical replication, demonstrate the steps for seeing the features in action, and share the results of performance benchmarking.

The blog will focus on the parallel apply and binary copy features in PostgreSQL 16. The parallel apply feature uses parallel background workers at the subscriber node to apply changes for large in-progress transactions. The number of parallel workers used to apply changes from the publisher is set with a configuration parameter. The second performance feature is binary copy. This feature allows logical replication to do the initial data copy in binary format, which provides a good performance boost when copying tables with binary columns.

Parallel Apply

Parallel apply is a performance feature that benefits the replication of large in-progress transactions. To achieve this, we start streaming the changes to the subscriber node, and then use parallel background workers at the subscriber node to apply the changes while they are being streamed from the publisher. You can configure the number of parallel workers used at the subscriber node for applying the changes with a configuration parameter.

The example below demonstrates how to use this exciting logical replication feature. We've also provided sample performance numbers taken while running a test with a couple of AWS instances in different regions.

For this example, I have the publisher running on AWS us-east-1 and the subscriber node running on AWS us-west-2.

Publisher

To configure the publisher node, connect to the node and:

1. Create a fresh PostgreSQL cluster and set the following configuration parameters. Specify values that work well with your server specification:

2. Create a table for publication; we've used the following command:

3. Create a publication; you can optionally create a publication for just the large_test table created in the previous step:

Subscriber

To configure the subscriber node, connect to the node and:

1. Create a fresh cluster and set the following configuration parameters. The parameters need to be set according to your server specification:

For our test server, I configured four parallel workers for applying changes to the subscriber node.

2. Create a table to receive the replication stream from the publisher:

3. Create a subscription with connection properties to the publisher:

Please note that, for the purposes of this test, the subscription is configured to stream the table changes instead of doing the initial data copy. We are also setting the streaming type to parallel; this enables the parallel apply feature and applies the changes to the subscriber node with the specified number of workers.
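For readers who want to see what the subscriber side might look like, here is a small Python sketch that renders the settings and the subscription statement. In PostgreSQL 16, the worker count for parallel apply is controlled by `max_parallel_apply_workers_per_subscription`, and `copy_data` and `streaming` are `CREATE SUBSCRIPTION` options. The subscription name, publication name, host, and the value of `max_logical_replication_workers` are assumptions for illustration, not the values used in the test.

```python
# Subscriber-side settings as they might appear in postgresql.conf.
# 4 matches the four workers mentioned above; the other value is an assumption.
SUBSCRIBER_SETTINGS = {
    "max_logical_replication_workers": 8,
    "max_parallel_apply_workers_per_subscription": 4,
}


def render_conf(settings):
    """Render settings as postgresql.conf lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


def subscription_sql(name, conninfo, publication, **options):
    """Render a CREATE SUBSCRIPTION statement with a WITH (...) option list."""

    def fmt(value):
        # SQL booleans are lowercase; everything else is rendered as-is.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    quoted = conninfo.replace("'", "''")  # escape quotes inside the literal
    sql = f"CREATE SUBSCRIPTION {name} CONNECTION '{quoted}' PUBLICATION {publication}"
    if options:
        opts = ", ".join(f"{key} = {fmt(value)}" for key, value in options.items())
        sql += f" WITH ({opts})"
    return sql + ";"


if __name__ == "__main__":
    print(render_conf(SUBSCRIBER_SETTINGS))
    # Hypothetical names and host; copy_data is off so only changes are streamed.
    print(
        subscription_sql(
            "test_sub",
            "host=publisher.example.com dbname=postgres",
            "test_pub",
            copy_data=False,
            streaming="parallel",
        )
    )
```

Parallel apply workers are drawn from the pool sized by `max_logical_replication_workers`, so that setting needs enough headroom for the per-subscription worker count.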

Publisher

To set up our test scenario, we connect to the publisher node and:

1. Set the relevant configuration parameter to the name of the subscriber. You don't need to do this to make use of the parallel apply feature; it was only done for the purpose of this test. Setting the parameter ensures that the backend waits for the changes to be applied on the subscriber node, so we can measure the timing:

2. Restart the PostgreSQL server.

3. Use psql to run the following command. The command starts and times a large transaction on the publisher node:

Results

With streaming set to parallel, it took 58887.540 ms (00:58.888) to complete the transaction and apply the changes at the subscriber node.

With streaming set to off, it took 106909.268 ms (01:46.909) to complete the transaction and apply the changes at the subscriber node.

With parallel apply, the elapsed time for this large in-progress transaction drops by about 45%, which makes the run roughly 1.8 times faster.
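A performance gain can be expressed either as a reduction in elapsed time or as a speedup factor, and the two give different numbers. The short Python sketch below works both out from the two timings reported above; the function name is just an illustrative choice.

```python
def compare(baseline_ms, improved_ms):
    """Return (percent reduction in elapsed time, speedup factor)."""
    reduction = (baseline_ms - improved_ms) / baseline_ms * 100
    speedup = baseline_ms / improved_ms
    return reduction, speedup


if __name__ == "__main__":
    # Timings from the results above: streaming off vs. streaming parallel.
    reduction, speedup = compare(106909.268, 58887.540)
    print(f"elapsed time reduced by {reduction:.1f}%")
    print(f"speedup factor: {speedup:.2f}x")
```

For these timings, the reduction comes out at roughly 45% and the speedup at roughly 1.8x. The same function can be applied to any pair of before and after timings.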

Binary Copy

Binary copy is another performance feature of logical replication added in PostgreSQL 16. The binary copy feature makes it possible to do the initial copy of table data in binary format. Streaming data in binary format was added in previous releases, but doing the initial table copy in binary mode wasn’t supported prior to PostgreSQL 16.

I've conducted a test using two AWS instances to demonstrate the performance benefit gained with this feature. The following example shows how to enable this feature and provides the performance numbers from testing the initial data load in binary and non-binary formats.

Publisher

To set up our binary copy test scenario, connect to the publisher node and:

1. Set the following configuration parameters to maximize your system performance:

2. Create a table that includes bytea columns:

3. Create a publication, specifying the FOR ALL TABLES clause:

4. Add records to the table:

5. Check the table size after the initial data load:

Subscriber

Connect to the subscriber node and:

1. Set the following configuration parameters appropriately for your system:

2. Create a table with the same bytea columns:

3. Create the subscription, setting the binary parameter for the run being measured (true or false) and enabling the initial data transfer.

4. Create the following function to time the initial data copy from publisher to subscriber:

5. Call the function to time the transfer:
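The timing function used in the test is not reproduced here, so the following Python sketch shows one possible way to time an initial copy: poll the subscriber's `pg_subscription_rel` catalog until every table reports the state `r` (ready). The function name, the polling interval, and the injected `fetch_states` callable are assumptions made so the sketch stays self-contained; this is not the function from the original test.

```python
import time

# Query to run on the subscriber. srsubstate is 'r' once a table is ready;
# earlier states include 'i' (initialize) and 'd' (data is being copied).
SYNC_STATE_SQL = "SELECT srsubstate FROM pg_subscription_rel"


def wait_for_initial_sync(
    fetch_states, poll_seconds=1.0, clock=time.monotonic, sleep=time.sleep
):
    """Block until all subscribed tables are ready; return elapsed seconds.

    fetch_states is a caller-supplied function that runs SYNC_STATE_SQL on
    the subscriber and returns the list of srsubstate values.
    """
    start = clock()
    while True:
        states = fetch_states()
        # An empty list means the subscription has not registered tables yet.
        if states and all(state == "r" for state in states):
            return clock() - start
        sleep(poll_seconds)
```

Start the timer immediately after creating the subscription so the measurement covers the whole copy. Polling adds up to one interval of error, which is negligible for a copy that runs for minutes.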

Results

Without binary load (binary set to false), it took 383884.913 ms (06:23.885) to complete the transaction and apply the changes at the subscriber node.

With binary load (binary set to true), it took 267149.655 ms (04:27.150) to complete the transaction and apply the changes at the subscriber node.

This is roughly a 30% reduction in elapsed time when performing the initial table copy in binary format.

Conclusion

The use of distributed PostgreSQL databases is growing rapidly, and replication is a vital and core part of any distributed system. Replication features in PostgreSQL are evolving to become more mature and feature rich with every major release. The groundwork for logical replication was laid prior to PostgreSQL 10, but the logical replication feature itself developed into a usable form in PostgreSQL 10. Since then, replication support has grown tremendously, and the major features added in each release warrant a separate blog post that I will cover in due course. This blog covers new logical replication performance features added in PostgreSQL 16; stay tuned for more blogs discussing the remaining PostgreSQL 16 logical replication features.

PostgreSQL Replication and upcoming Logical Replication Improvements in PostgreSQL 16

Tue, 02 May 2023 17:40:22 GMT

Replication is a process that reliably copies data from one database server to another database server in an automated fashion. Replication is a core part of an enterprise database solution that:

offers fault tolerance in case of data mishaps

enables high availability in the event of a node failure

allows incoming traffic to be distributed across replicas to provide better performance

… and more.

This blog is the first of a series discussing the future of logical replication. In this post, I’ll focus on the improvements the community has added to logical replication for PostgreSQL 16. The next post will describe the in-flight PostgreSQL 16 logical replication improvements (those changes that are in progress, but not yet committed). The last post in the series will delve into a new PostgreSQL extension for logical replication called Spock. Spock is a replication solution recently released by pgEdge that leverages both the pgLogical and BDR2 open-source projects as a solid foundation for this enterprise-class extension. Please visit our official site to learn more about pgEdge and Spock.

Spock provides multi-master (multi-active) PostgreSQL replication optimized for the network edge of cloud-based systems (with the cloud provider of your choice) or for databases hosted on-prem. With its logical replication foundation, Spock offers fine-grained control for your data replication and security needs.

PostgreSQL Replication Methods

PostgreSQL supports two native methods of replication: logical replication and physical replication (also called streaming replication).

Logical replication uses a publisher/subscriber model to replicate changes between PostgreSQL servers. The primary node (where the database lives) is called the publisher, and the stand-by node (which receives copies of database transactions) is called the subscriber. Database changes are copied from the publisher node to one or more subscriber nodes identified by the subscription.

When you set up logical replication, you take a snapshot of the data on the published database and copy it to the subscriber. When you start the subscription, changes on the publisher are sent to the subscriber as they occur. Logical replication uses a transactional model to apply changes to the subscriber in the same order that they are applied to the publisher. This guarantees transactional consistency.

The other native method of PostgreSQL replication is physical (or streaming) replication. Streaming replication passes the data from the primary node to the stand-by node in WAL (write-ahead log) files. You can configure streaming replication to be either synchronous or asynchronous; by default, streaming replication is asynchronous.

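Synchronous replication waits for the stand-by node to confirm that it has received the transaction's changes before the commit is acknowledged on the primary server, so a committed transaction is not lost if the primary server fails.

Asynchronous replication ships each log file to the stand-by node after the transaction is committed on the primary server. If something happens to the primary server before the transaction is written to the stand-by, you can potentially lose data.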

Both synchronous and asynchronous modes of streaming have their own pros and cons. As a rule, synchronous replication offers better data protection in the event of a server problem, while asynchronous replication is more cost effective in terms of required resources. Review the PostgreSQL documentation for more information about native replication methods.

Logical Replication Improvements in PostgreSQL 16

Let’s turn our attention to the main topic of this blog and summarize the key logical replication improvements that have been added to PostgreSQL 16 so far.

Applying changes to the subscriber with background workers

Currently, the changes for large, in-progress transactions are sent from the publisher to the subscriber in multiple streams, with the changes divided into chunks based on the value of the logical_decoding_work_mem parameter. PostgreSQL version 16 adds a feature that improves performance by using multiple background workers to apply changes to the subscriber node in parallel.

The parallel application to the stand-by node begins while the transaction is still in progress on the primary node. When the application starts, a single worker applies the top-level transaction, while parallel workers begin to apply the sub-transactions. If any of the parallel workers error out, the entire transaction is aborted. This functionality provides transactional consistency and ensures that a partially completed bulk insert does not remain in your database.

Performance benchmarking shows that the patch offers a 30 to 40% performance improvement for bulk inserts. You can review the benchmarking as part of the patch history at https://commitfest.postgresql.org/42/3621.

Creating a subscription in binary format

In PostgreSQL version 16, when you create a subscription, you have the option to use binary format for the initial data transfer. Prior to version 16, the initial sync was performed in text format; you could change the format to binary only after logical replication was started. This new functionality allows you to perform the initial sync in the same format that you plan to use for replication.

The COPY command is used behind the scenes of the CREATE SUBSCRIPTION command to copy the data for the initial sync. Since the COPY command supports both binary and text formats, it makes perfect sense to support both. You can use the following clauses to specify the data transfer mode:

When you set binary=false (the default), data is sent in text format.

When you set binary=true, data is sent in binary format.

If your column type supports binary, copying tables in binary format may reduce your initial sync time.

Note that this feature is supported only when both the publisher and subscriber are version 16 or later. Please review the commit fest entry for more details at https://commitfest.postgresql.org/42/3840/.

Improving performance by using indexes on the subscription node

The REPLICA IDENTITY attribute helps the server identify the correct row on the subscriber node to UPDATE or DELETE when a change occurs on the primary node. If your table does not have a key, specifying REPLICA IDENTITY FULL tells the server to use a combination of all of the columns in a row to identify the correct row on the subscriber to modify.

Specifying REPLICA IDENTITY FULL on the publication node can trigger a full table scan on the subscriber node in the event of an UPDATE or DELETE to ensure that the correct row is updated. A full table scan can be time-consuming, and uses more resources than an index scan.

This commit improves performance by allowing the subscriber to use an index when applying UPDATEs and DELETEs. The index must be:

a btree index

a non-partial index

an index that includes at least one column reference (it cannot consist solely of expressions)

If multiple indexes meet these requirements, the server will select the first valid index, instead of using a smart approach to select the best index. If you specify a REPLICA IDENTITY other than FULL, the subscriber must have a similar replica identity.

The functionality provided by this feature is only enabled when REPLICA IDENTITY FULL is specified. The functionality is skipped when the remote relation doesn’t contain the leftmost column of the index, primarily because a sequential scan provides better performance in such cases. Please see the commit fest entry for more details at https://commitfest.postgresql.org/42/3765/.
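The selection rules described above can be modelled in a few lines of Python. This is a simplified model of the rules as stated in this post, not PostgreSQL's implementation; the `Index` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Index:
    name: str
    method: str  # access method, e.g. "btree" or "hash"
    is_partial: bool  # True if the index has a WHERE clause
    columns: List[Optional[str]]  # column name, or None for an expression


def usable(index: Index, remote_columns: Set[str]) -> bool:
    """Apply the requirements listed above to one subscriber index."""
    if index.method != "btree" or index.is_partial:
        return False
    # At least one entry must be a plain column, not an expression.
    if not any(col is not None for col in index.columns):
        return False
    # Skip the index when the publisher does not send its leftmost column.
    return index.columns[0] in remote_columns


def pick_index(indexes: List[Index], remote_columns: Set[str]) -> Optional[Index]:
    """Return the first usable index, mirroring the 'first valid index' rule."""
    for index in indexes:
        if usable(index, remote_columns):
            return index
    return None  # no usable index: fall back to a sequential scan


if __name__ == "__main__":
    candidates = [
        Index("by_hash", "hash", False, ["id"]),
        Index("by_id_partial", "btree", True, ["id"]),
        Index("by_id", "btree", False, ["id", "val"]),
        Index("by_val", "btree", False, ["val"]),
    ]
    chosen = pick_index(candidates, {"id", "val"})
    print(chosen.name if chosen else "sequential scan")
```

Because the first valid index wins, the result depends on the order in which candidate indexes are considered, not on which index is most selective.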

Allow logical decoding on stand-by

Prior to PostgreSQL 16, logical decoding was supported only on the primary node; this commit allows minimal logical decoding on the stand-by node as well. To make use of this functionality, you need to set wal_level higher than replica (the default) on the primary node; that is, set it to logical.

This feature allows you to:

create a logical replication slot on a stand-by node

create a subscription to a stand-by node

perform logical decoding on the stand-by node

Prior to this commit, those actions would result in the following error:

logical decoding cannot be used while in recovery

This commit also introduces the pg_log_standby_snapshot() function. The function takes a snapshot of running transactions and writes it to the WAL without requiring a checkpoint. This makes the process of creating logical replication slots on a stand-by much faster; the function helps create the replication slot on the stand-by when the primary node is idle.

For more information, please see the commit fest entry at: https://commitfest.postgresql.org/42/3740/.

Conclusion

PostgreSQL logical replication continues to improve and become more robust. Some of the features added in this release also lay the groundwork for more great features in future releases. This post summarizes some of the key logical replication features added to PostgreSQL 16. My next post will go over the improvements that are in progress and discuss the likelihood of those making it into the release.