<?xml version="1.0" encoding="UTF-8" ?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
        <channel>
            <title>pgEdge Posts from Shaun Thomas</title>
            <link>https://www.pgedge.com/blog</link>
            <description>The latest pgEdge Posts from Shaun Thomas</description>
            <atom:link href="https://www.pgedge.com/feeds/rss/user/shaun-thomas/postgresql.xml" rel="self" type="application/rss+xml" />
            <language>en-us</language>         
            
            <item>
            <category>PostgreSQL,PostgreSQL High Availability</category>
            <title><![CDATA[No Compiler Required: Writing SQL-Only Postgres Extensions]]></title>
            <link>https://www.pgedge.com/blog/no-compiler-required-writing-sql-only-postgres-extensions</link>
            <pubDate>Fri, 08 May 2026 12:11:40 GMT</pubDate>
            <description><![CDATA[ <p>Recently at <a href="https://postgresconf.org/conferences/postgresconf_2026"><u>Postgres Conference 2026</u></a> in San Jose, I presented a talk called <a href="https://postgresconf.org/conferences/postgresconf_2026/program/proposals/let-s-build-a-postgres-extension"><u>Let's Build a Postgres Extension!</u></a> Since that presentation focused primarily on writing a C extension while exploring the Postgres source code, I only mentioned pure SQL extensions as an aside. But what's more likely in the Postgres community in general: C devs, or people who know SQL?</p><p>It turns out that you can do a lot with functions, triggers, views, tables, and various other Postgres-native capabilities. The <a href="https://www.postgresql.org/docs/current/extend-extensions.html"><u>extension system</u></a> doesn't care whether the contents are compiled C or plain SQL. It just wants a control file, a SQL script, and an optional Makefile to help with installation.</p><p>So let's build a relatively trivial extension entirely in SQL.</p><h2>What Do We Want?</h2><p>First things first: we need a plan. What should this extension actually do? 
I wrote about <a href="https://www.pgedge.com/blog/introduction-to-postgres-extension-development"><u>blocking DDL</u></a> a while back with a C extension, so why not revisit that example with SQL?</p><p>This being pure SQL, there are other handy elements we can add with very little effort, so how about:</p><ul><li>A setting to enable or disable the extension.</li></ul><ul><li>A setting to allow or block superusers from executing DDL.</li></ul><ul><li>A role that allows members to bypass the DDL restriction.</li></ul><ul><li>A function to add users to the bypass role.</li></ul><ul><li>A function to remove users from the bypass role.</li></ul><ul><li>A view to see which users are in the bypass role.</li></ul><ul><li>An event trigger to actually block DDL attempts.</li></ul><p>Rather than a simple event trigger to prevent DDL execution, we are building a kind of DDL execution management suite. That should hopefully demonstrate just how capable a pure SQL implementation can be.</p><h2>Three Files and a Dream</h2><p>Every Postgres extension, regardless of complexity, boils down to the same basic structure:</p><ul><li>A control file to describe the extension.</li></ul><ul><li>A SQL script to create the tables, views, functions, etc.</li></ul><ul><li>An optional Makefile to copy the SQL script and control file to the right place. Unlike a C project, there's no build step for a SQL-only extension because there's nothing to compile.</li></ul><p>Here's what our project directory looks like:</p><p>Let's start with the control file. It tells Postgres the extension's name, version, description, and the settings of a handful of behavioral flags. Ours looks like this:</p><p>The  is what shows up in  and the  catalog view. The  tells Postgres which SQL script to load when someone runs  without specifying a version. The  flag means only superusers can install or update this extension. That's the default, but it's better to be explicit.</p><p>The  flag deserves a quick explanation. 
A relocatable extension can be moved between schemas after installation with . Ours can't because the SQL script references a specific schema internally using the  substitution token. It's still possible (and recommended) to define the schema during installation, but not afterwards.Next comes the . For a C extension, the Makefile orchestrates compilation and linking. For a SQL-only extension, it just copies the control and SQL file to the library folder where Postgres keeps extensions. Here’s the whole thing:There's normally also a  line to specify C source files for compilation. Without it,  simply copies the control file and SQL script to the correct directories. The <a href="https://www.postgresql.org/docs/current/extend-pgxs.html"><u>PGXS</u></a> build infrastructure handles the rest.With the boilerplate out of the way, it's time to have some fun.<h2>A Bit of Bookkeeping</h2>Before we really get going, the extension needs to live in a schema. Some of the objects in that schema need to be publicly accessible. So the first thing in our file needs to look like this:Usage just means the schema objects are visible. Users won't be able to create objects or even select from tables unless granted specifically.After that, we need to account for the configuration settings. You may think the first choice for this is to use <a href="https://www.pgedge.com/blog/it-depends-using-session-variables-in-postgres"><u>session variables</u></a>, but that's a subtle trap. The problem here is that SQL-only extensions don't have access to the finer points of system variables, such as limiting them to superusers, system start, service reload, etc. That means there's nothing preventing users from overriding them with a simple  statement.The next option is a configuration table. The extension documentation says we can register these such that dumping and restoring a database retains values, and it's trivial to control table updates. 
So let's start our extension with this:Now only superusers can configure the extension! Regular users still need to be able to read the configuration table because the event trigger runs as that user. In any case, we now have a convenient configuration interface.<h2>On a Role</h2>The next step is to allow certain users to bypass the DDL restriction. The easiest way to do this is to create a role that a superuser can grant to these allowed users. We can also take care of our helpful grant/revoke functions here:The reason for the long  role name which incorporates the extension name is to prevent name collisions. This role likely isn't already in use, and its purpose is obvious. The functions mean admins don't need to remember the role name itself, but they're not required either.The final thing to add is the view which lists bypass users:Is that a query you knew off the top of your head? Probably not. Now the extension takes care of it so you don't have to.<h2>You Shall Not Pass</h2>The core of our extension is a DDL blocker: an event trigger that fires on  and raises an exception unless the session user is a superuser. The C version of this blocking routine was quite a bit more complex than what we're building here.Here's our DDL blocking function:The  declaration makes this function eligible for use with . It's a special return type that signals to Postgres how to invoke the function.The superuser check queries the  for the . This allows superusers to masquerade as other users for testing purposes, and potentially to prevent accidental DDL executions, provided they  first. The final check is against the  view we created. It may be tempting to use the  <a href="https://www.postgresql.org/docs/current/functions-info.html"><u>information function</u></a> for this, but that shows effective privileges, not actual membership. 
Superusers have all privileges, so they would automatically pass this check if we didn't explicitly validate against role membership.</p><p>With the function in place, creating the event trigger is a one-liner to call the function:</p><p>The  event fires before any DDL command executes. If our function raises an exception at this point, the command never runs. Easy peasy.</p><p>What qualifies as "DDL" in the eyes of Postgres? Quite a lot, actually. The  event fires for , and . It does not fire for commands targeting databases, roles, tablespaces, or, ironically, event triggers themselves.</p><p>We could also filter to specific command tags using a  clause:</p><p>But where's the fun in that?</p><h2>Kicking the Tires</h2><p>Time to see if this thing actually works. First, install the extension files:</p><p>That copies  and  to the extension directory. Now connect to a database and create the extension:</p><p>The extension is installed. Let's verify the event trigger is in place:</p><p>The  in  means "origin," which is the default enabled state (fires in all contexts except replication). It's go time!</p><h2>Testing the Blocker</h2><p>Blocking is off by default. Let's confirm by creating a throwaway table:</p><p>No complaints. Now let's enable the blocker:</p><p>Then another test:</p><p>Still works. Superusers get a free pass by default. Let's fix that loophole:</p><p>DDL commands are now stopped cold. This should work for any potential DDL:</p><p>Does our bypass system work?</p><p>The explicit bypass now allows DDL. What about regular users? Let's create one and test again:</p><p>Works exactly as advertised!</p><h2>The Fine Print</h2><p>SQL-only extensions are powerful, but they're not a complete replacement for C. A few tradeoffs are worth understanding before you commit to one approach or the other.</p><p>The GUC security gap. In the C version of this extension, the GUC is registered with  context, meaning only superusers can change it. In our SQL-only version,  would be a custom parameter that any session can modify. 
We had to devise a somewhat convoluted workaround for this by using a configuration table. This wouldn't be necessary if there was some kind of SQL interface to true variable registration for extensions.Event trigger blind spots. Some DDL commands don't fire event triggers at all. Operations on databases, roles, tablespaces, and event triggers themselves are exempt. Operations such as  or  are entirely exempt. That's where Postgres's built-in privilege system (or  restrictions) should do the heavy lifting. Once again, a C extension has access to capabilities our SQL version can only dream about.No background workers or hooks. C extensions can register background workers, intercept query planning, hook into the executor, and modify server behavior at a fundamental level. SQL-only extensions operate entirely within the SQL layer. If your use case involves any of those deeper capabilities, C is the only option.For everything else? Functions, triggers, event triggers, views, types, domains, operators, aggregates, tables, and more can all live inside a SQL-only extension. That covers a remarkable amount of ground.<h2>Wrapping Up</h2>The Postgres extension system is often perceived as something that requires C expertise, a compiler toolchain, and a deep understanding of the server internals. That’s only really the case if you need the deep internals. If you've ever written a collection of utility functions and wished you could install them with a single command, you're already thinking in extensions. The packaging is the point.Our  extension demonstrates custom configuration tables, roles, functions, views, and event triggers. All of these are standard SQL primitives that any Postgres user already knows. The only new pieces are a minimal control file and . 
That's a few lines of overhead to gain clean installation and removal, version management, and dependency tracking.If you've got a handful of functions, views, or triggers that you deploy to every database in your environment, consider taking an afternoon to wrap them in an extension. Your future self, and anyone else who inherits those databases, may thank you for it.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/no-compiler-required-writing-sql-only-postgres-extensions</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>PostgreSQL</category>
            <title><![CDATA[It Depends: Using Session Variables in Postgres]]></title>
            <link>https://www.pgedge.com/blog/it-depends-using-session-variables-in-postgres</link>
            <pubDate>Fri, 01 May 2026 05:36:14 GMT</pubDate>
            <description><![CDATA[ <p>There's been a kind of persistent myth regarding Postgres since I first started using it seriously over 20 years ago: "Postgres doesn't support user variables." This hasn't really been true since version 8.0 way back in 2005. Part of this stems from the fact it doesn't do things the same way as other common database engines.Why don't we spend a little time exploring the functionality that time forgot?<h2>What Everyone Else Is Doing</h2>Before I delve into the Postgres approach, let's take a look at the competition. If anyone wants to switch to Postgres (as they should), they'll bring along plenty of assumptions.Let's start with MySQL, the formerly undisputed database king of the <a href="https://en.wikipedia.org/wiki/LAMP_(software_bundle)"><u>LAMP stack</u></a>. MySQL session variables merely prefix any name with  to assign a value:Simple, right? It's even possible to use them directly in queries:We don't have to get into the finer minutiae here, as the MySQL documentation on <a href="https://dev.mysql.com/doc/refman/8.4/en/user-variables.html"><u>user-defined variables</u></a> does that job splendidly. The point is that some users expect this level of compatibility and balk when it's missing.When it comes to SQL Server, things are very similar to MySQL, though perhaps a bit more structured:Once again, the SQL Server documentation on <a href="https://learn.microsoft.com/en-us/sql/t-sql/language-elements/variables-transact-sql?view=sql-server-ver17"><u>variables</u></a> is pretty clear about how these work. The primary caveat here is that these are limited to the current batch, making them somewhat tedious to work with in some cases.The picture for Oracle is a bit different. 
Oracle calls them <a href="https://docs.oracle.com/en/database/oracle/oracle-database/23/sqpug/using-substitution-variables.html"><u>substitution variables</u></a>, and prefixes using  rather than : This is also closer to a macro system than a true variable; the SQL*Plus or SQLcl clients substitute the values prior to sending statements to the server. It's not something other drivers or clients can use unless they added it themselves for compatibility purposes.<h2>Postgres Has Entered the Chat</h2>So where does Postgres fit into all of this?If Oracle's  substitution is what you're accustomed to, Postgres actually has a direct equivalent. The <a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-SET"><u>psql client</u></a> supports  for defining client-side variables: The  tool has supported these practically since the beginning, but some users find them insufficient. As with Oracle client-side substitutions, they only work specifically in the  client. This can be a major limitation when interacting with Postgres in any other manner.The real answer lives server-side, and it's been hiding in plain sight since at least version 9.2 in 2012. It doesn't require any special syntax, extensions, or package declarations, and uses the same configuration parameter system that controls options like  and . That's right, it's just the regular <a href="https://www.postgresql.org/docs/current/sql-set.html"><u>SET statement</u></a>.The trick is in the usage. Postgres treats any name containing a period as a custom parameter:That's it. Those values now exist for the duration of the current session. Reading them back is just as straightforward with the <a href="https://www.postgresql.org/docs/current/sql-show.html"><u>SHOW statement</u></a>: And clearing them:The dot in the name is mandatory for custom parameters. The part before the dot acts as a namespace (like , , or ). 
Postgres uses this to distinguish custom settings from its own built-in configuration parameters.But  and  are SQL statements, which means it's not conventionally possible to embed them in expressions or subqueries. So how do we use these handy new parameters?<h2>Getting and Setting</h2>Postgres provides two <a href="https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-SET"><u>system administration functions</u></a> that make manipulating user variables easy and convenient.Let's take a look at  first. It's the functional analog of the  statement:The third parameter just means: "should this only be a local parameter?" Setting it to  allows the value to persist for the entire session, which is what most users may expect. Because  is a regular function, it works everywhere a SQL expression is valid.Then we can use  to retrieve that value:As the functional equivalent of , it returns the current value of any configuration parameter as text.The optional second parameter () is a small but important detail. By default, requesting an unset parameter raises an error:Rather than wrapping everything in an exception handler, it's possible to suppress the error and return  by setting the second parameter to : Because  and  are plain functions, they integrate naturally with any SQL statement. Perhaps we need to convert a username to a tenant ID:Need to filter by the current tenant?Need to set context at connection time and have it flow through to every query automatically?Need to capture the current user in an audit trigger?Allowing unset variables in the trigger is a nice safety net. If the application forgot to set the variable, the audit column gets  instead of aborting the insert. Whether that's the right behavior depends on the specific use case, but it's great to have options.<h2>Properly Scoped</h2>Remember that third parameter in ? When  is , the value applies only for the current transaction. 
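</p><p>A minimal sketch of that transaction-local behavior, using the standard set_config and current_setting functions. The parameter name myapp.request_user and the value 'alice' are hypothetical, chosen only for illustration; any dotted name would behave the same way.</p><pre><code>BEGIN;
-- Third argument true: the value is local to this transaction.
SELECT set_config('myapp.request_user', 'alice', true);
-- Visible to every statement inside the transaction.
SELECT current_setting('myapp.request_user', true) AS inside_txn;
COMMIT;
-- The transaction-local value no longer applies once the transaction ends.
SELECT current_setting('myapp.request_user', true) AS after_commit;</code></pre><p>Passing true as the second argument to current_setting keeps the final lookup from raising an error if the parameter was never set at the session level.</p><p>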
Once the transaction commits or rolls back, the value reverts to whatever the session-level setting was before. This is equivalent to the  SQL statement:</p><p>The same behavior works with : </p><p>Why is this useful? Consider a web application where each HTTP request maps to a single database transaction. The application sets the request context at the start of the transaction, and every query within that transaction can see it. After the transaction commits, the context is automatically cleaned up. No manual  needed, and no risk of context leaks between transactions.</p><p>This is especially valuable with connection poolers like PgBouncer, where a single database session handles requests from many different application users. Transaction-scoped variables guarantee that context from user A never bleeds into a query for user B, because each user's transaction carries its own isolated state.</p><p>Rollbacks work as expected, too:</p><p>Local scope also applies when transactions roll back to a savepoint:</p><h2>A Practical Exercise</h2><p>Let's build something a bit more operational. We'll set up a multi-tenant orders table with automatic audit stamping and row-level security, all driven by session variables.</p><p>First, the table:</p><p>Next, a trigger that automatically stamps every new row with the application user:</p><p>Now add a Row-Level Security policy that filters rows by the current tenant:</p><p>With this in place, the application sets context when checking a connection out of the pool:</p><p>And then every operation automatically respects the tenant boundary:</p><p>If we switch tenants, the view changes automatically:</p><p>No explicit  in the query or risk of forgetting the filter. The session variable drives the RLS policy, and the RLS policy enforces the boundary. This is the same pattern SQL Server achieves with RLS plus , and what Oracle targets with Virtual Private Database plus  (though Oracle requires considerably more setup to get there). 
And Postgres achieves it with far less ceremony than either.For transaction-scoped isolation (ideal with connection poolers), swap  for : <h2>A Few Caveats</h2>Custom GUC parameters are powerful, but they come with some characteristics that deserve full disclosure.Everything is a string. Postgres stores all custom parameter values as text. There is no type enforcement whatsoever. If you , Postgres will utter nary a single complaint. You may have noticed the manual casting through all of the examples with .There is no access control. Any session can  any custom parameter. RLS policies that rely on  are only as secure as the application's control over SQL execution. Users with direct access can trivially  and bypass the policy. A more reasonable RLS-based security system might leverage  or  and a mapping table to prevent such escape attempts.History lesson: . Before PostgreSQL 9.2, custom parameters required pre-registration. It was necessary to list extra namespaces in : Without this declaration,  would raise an error. PostgreSQL 9.2 removed this restriction, and thus the freestyle dot syntax was born. If you encounter old documentation or Stack Overflow answers referencing , they're describing behavior from over a decade ago. Sadly, that's more common than it should be. They integrate with the full GUC machinery. Custom parameters aren't some bolted-on feature. They participate in everything the built-in parameters do: These defaults then act as the baseline that  and  override at the session level, and  reverts to. It's the same layered configuration model that governs , , and other GUCs. Custom parameters are first-class citizens of the Postgres configuration system, especially in recent releases.<h2>Reduce, Reuse, Recycle</h2>Postgres didn't forget to implement session variables. Why invent a separate subsystem for functionality that was there from the beginning? 
The configuration parameter system, originally designed for things like  and  turned out to be a natural fit for arbitrary session state once the custom namespace restriction was lifted in 9.2.Does it lack the syntactic sugar of MySQL or SQL Server's  approach? There's no denying that. But it also avoids MySQL's type ambiguity, Oracle's client-side-only limitation, and SQL Server's batch-scoping limitation. There's always a compromise and no approach is truly perfect; Postgres straddles the line between power and convenience.Either way, the next time someone claims Postgres can't do session variables, point them at  and . They've been around for a long time, but recent versions have made them much more approachable. And if you haven't given them a try yet, there's no time like the present!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/it-depends-using-session-variables-in-postgres</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,pgEdge,PostgreSQL,postgres,PostgreSQL</category>
            <title><![CDATA[The Scaling Ceiling: When One Postgres Instance Tries to Be Everything]]></title>
            <link>https://www.pgedge.com/blog/the-scaling-ceiling-when-one-postgres-instance-tries-to-be-everything</link>
            <pubDate>Fri, 24 Apr 2026 11:36:30 GMT</pubDate>
            <description><![CDATA[ <p>There's a persistent belief in the database world that vertical scaling solves all problems. Need more throughput? Add CPUs. Running out of cache? More RAM. Queries hitting disk? Higher IOPS. It's a comforting philosophy because it's simple, and for a surprisingly long time, it works. A single beefy Postgres instance can handle an enormous amount of punishment before collapsing under the strain.But there's a ceiling up there, and it's not made of hardware. Postgres was designed as a single-instance database engine, and many of its internal structures are shared across every database the instance contains. These shared resources are rarely concerning in a single modest instance. But with twenty databases running a mixture of heavy OLTP workloads, analytical queries, or even mostly idle, the shared nature of these internals becomes very relevant.Let’s talk about the barriers these over-provisioned instances eventually hit, with references to the Postgres source code itself for good measure. Some of these are well known, while others are the kind of thing that strikes suddenly at 2 AM when all the monitoring dashboards turn red simultaneously.<h2>One Pool to Rule Them All</h2>The  parameter is probably the first tunable every Postgres administrator encounters. It controls the size of Postgres's own buffer cache, the region of shared memory where frequently accessed disk pages live so they don't need to be fetched from storage on every read. The <a href="https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS"><u>documentation</u></a> suggests starting at 25% of system RAM, and that's reasonable advice for a single-database instance. Most experts in the subject agree.It’s easy to forget that this allocation is instance-wide. 
The contents of  bear this out, as the buffer pool gets allocated once at startup as a flat array of pages in shared memory:There is no per-database partitioning, no priority system, no reservation mechanism. Every database on the instance competes for the same pages in the same pool. An analytics query scanning a 500GB table in one database will happily evict cached pages that belong to a latency-sensitive OLTP workload in another. The buffer replacement algorithm (a clock-sweep <a href="https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_Recently_Used_(LRU)"><u>LRU</u></a> variant) has no concept of "this page belongs to an important database."The same applies at the operating system level. The kernel's filesystem cache, often called the "double buffer" in Postgres circles because  accounts for it, is also shared across all processes on the machine. Two databases with fundamentally different access patterns, one doing sequential scans and the other doing random index lookups, will thrash each other's cached pages with no way to intervene.Will throwing more RAM at the problem help? Only until the largest working sets collide. At that point, it becomes the worst example of the Noisy Neighbor problem.<h2>The 32-Bit Treadmill</h2>The 32-bit nature of the Postgres transaction ID (XID) is practically venerated as something of an old joke by this point. Blogs warning about the dreaded "XID wraparound" terror are easy to find. The Postgres fix for this is <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND"><u>VACUUM</u></a>, specifically the  operation. Most tuples have an associated XID, but since there are a limited number of those, tuples past a certain horizon get "frozen". Frozen tuples still have an XID, but Postgres ignores it and treats the data as if it has always existed. 
And thus by magic, that 4-billion transaction window only cares about "recent" transactions (for varying definitions of recent).Unfortunately, this counter persists across the entire instance. In , the function  draws from a single global well:Read that error message carefully. The instance refuses all new transactions to protect a specific database. A single neglected database out of dozens can accumulate enough XID age and cause the entire instance to grind to a halt. Every tenant suffers because one database didn't get vacuumed in time, or some resource artificially held onto a visible tuple so long it couldn’t be cleaned up.The  function in the same file makes this even more explicit. It computes the wraparound danger threshold based on "the oldest XID that might exist in any database of our cluster." One database's frozen-XID age becomes the constraint for every other database sharing that instance.<h2>Multixact: The Other Wraparound</h2>If XID wraparound is Postgres's well-publicized villain, <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND"><u>multixact</u></a> wraparound is the esoteric threat. Multixacts exist to track shared row-level locks; when multiple transactions hold locks on the same row, Postgres records them as a "multixact" group rather than storing each lock individually. Like XIDs, multixact IDs are 32-bit counters that wrap around, and like XIDs, they're instance-wide.But the member storage, the actual record of which transactions participate in each multixact, has its own nasty limit. The source code in  spells out the on-disk layout with typical Postgres clarity:The math is straightforward but the implications are severe. With 409 groups per 8KB page and 4 XIDs per group, we can work out the total SLRU address space: 2^32 member offsets divided by 1,636 members per page, multiplied by 8KB per page. 
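</p><p>As a quick sanity check, that arithmetic can be reproduced in SQL. This is only a sketch: the constants come from the figures quoted here (4 members per group, 409 groups per 8KB page) rather than from a live server, and the column aliases are purely illustrative.</p><pre><code>-- Back-of-the-envelope math for the multixact member address space.
-- 4294967296 is 2^32, the number of addressable member offsets.
SELECT 409 * 4                       AS members_per_page,
       4294967296 / (409 * 4)        AS addressable_pages,
       4294967296 / (409 * 4) * 8192 AS total_bytes,
       pg_size_pretty(4294967296 / (409 * 4) * 8192) AS total_size;</code></pre><p>Note that pg_size_pretty reports binary units, so its figure will read slightly lower than the same byte count expressed in decimal gigabytes.</p><p>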
That comes out to roughly 21GB of multixact member storage for the entire instance.That 21GB ceiling might sound generous until you consider a multi-tenant setup with aggressive row-level locking. A workload that performs  across many rows, or any application pattern that causes multiple transactions to hold shared locks on the same tuples, burns through multixact members quickly. Once exhausted, the instance starts refusing operations just as it does for XID wraparound, except the monitoring for multixact usage is far less mature in most environments.Worse, the same "slowest database wins" dynamic applies. The global minimum across all databases governs when the SLRU can be truncated. One database with inadequate vacuuming of multixact-heavy tables can pin that minimum in place for the entire instance. Similarly, a single database can greedily monopolize that precious resource simply due to unusual or aggressive locking behavior.<h2>The One-Lane Highway of WAL Replay</h2>Postgres streaming replication works by shipping <a href="https://www.postgresql.org/docs/current/wal-intro.html"><u>Write-Ahead Log</u></a> (WAL) records from the primary to replicas, which then replay them to stay current. It's a utilitarian and reliable workhorse, but there's a fundamental constraint: replay is single-threaded.In , the main redo loop that processes WAL on a replica is exactly what it looks like:One record at a time, sequentially, in a single process. The startup process (which handles WAL recovery) is the sole consumer of WAL data on a replica. There is no parallel apply. Even an over-provisioned 128-core machine acting as a replica can only leverage a single core for processing WAL data.The <a href="https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-RECOVERY-PREFETCH"><u>recovery_prefetch</u></a> parameter (defaulting to  since Postgres 15) helps when the bottleneck is IO. 
It looks ahead in the WAL stream and issues asynchronous reads for pages that will be needed soon, reducing stalls caused by cold cache hits. The prefetcher documentation in  describes it as a "drop-in replacement for an XLogReader that tries to minimize IO stalls by looking ahead in the WAL."But if a primary generates WAL faster than a single core can process it, prefetching won't help. The bottleneck shifts from IO to CPU, and there's nowhere to go. A write-heavy primary with many concurrent backends can produce WAL at a rate that structurally outpaces what a single replay process can consume. The replica falls behind, and the gap only widens under sustained load. I've personally witnessed a replica where this process is pinned at 100% CPU for hours while replication lag continues to accumulate.This is especially painful in a multi-database instance. Every database's WAL goes through that same single-threaded funnel. A batch import into one database generates a torrent of WAL that delays replay of another database's critical transaction. On separate instances, each database has its own replica and independent replay process—no more cascading latency from a single busy database.<h2>The Singleton Bottleneck Brigade</h2>Beyond the big-ticket items, Postgres runs several background processes that are each a single worker serving the entire instance. Individually, they're rarely a problem. Collectively, they form a convoy of potential bottlenecks.Autovacuum gets a shared pool of workers, defaulting to a maximum of 3 (controlled by <a href="https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-MAX-WORKERS"><u>autovacuum_max_workers</u></a>). The launcher process in  schedules these workers across all databases in the instance. In an instance with ten databases and three workers, a couple of databases with heavy churn can monopolize the pool while others accumulate dead tuples and XID age. 
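</p><p>One way to spot that kind of neglect is to compare wraparound age across every database in the instance. A rough sketch against the standard pg_database catalog; the column aliases are illustrative, and what counts as "too old" depends on local vacuum settings.</p><pre><code>-- List databases oldest-first by transaction ID age.
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS multixact_age
  FROM pg_database
 ORDER BY age(datfrozenxid) DESC;</code></pre><p>A database whose ages keep climbing while its neighbors stay flat is a likely sign that it isn't getting its share of autovacuum attention.</p><p>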
This kind of autovacuum starvation feeds directly into the XID and multixact wraparound risks discussed earlier.It's possible to raise  of course, but those workers draw from the same CPU budget as application backends. How many workers will we need to accommodate all databases? It's not possible to assign workers to specific databases, so the problem never really goes away, it just becomes less likely. Separate instances would ensure that each database gets its own full complement of autovacuum workers without competing.The checkpointer is a single process responsible for flushing dirty buffers to disk at <a href="https://www.postgresql.org/docs/current/wal-configuration.html"><u>checkpoint</u></a> intervals. A checkpoint triggered by one database's heavy write activity forces a flush of all dirty pages across the instance, including pages dirtied by other databases. The IO storm from a large checkpoint can cause latency spikes for every tenant, not just the one that triggered it.The background writer is also a single process that continuously writes dirty shared buffers to disk to keep a supply of clean pages available. It manages the entire shared buffer pool, and its pace is governed by instance-wide settings like  and . There's no way to prioritize one database's dirty pages over another.<h2>Splash Damage</h2>Maybe the most straightforward argument against cramming everything into one instance is the blast radius of failure. When a Postgres instance goes down, whether from a crash, an OOM kill, a kernel panic, or just planned maintenance, every database on that instance goes with it.The postmaster treats many failure modes as potentially corrupting shared memory. A single backend crash triggers a full restart cycle and termination of all user sessions. 
This comment in the checkpointer code captures the philosophy:</p><p><i>"If the checkpointer exits unexpectedly, the postmaster treats that the same as a backend crash: shared memory may be corrupted, so remaining backends should be killed."</i></p><p>Maintenance windows compound the problem. A Postgres major version upgrade, an extension update, or even a configuration change requiring a restart affects all tenants simultaneously. Coordinating downtime across multiple teams with different SLAs, different peak hours, and different tolerance for interruption is an organizational headache that grows geometrically (or worse) with the number of databases.</p><p>And then there's the dreaded emergency vacuum. If one database approaches XID wraparound, Postgres will refuse transactions for all databases (as we saw in ). An urgent maintenance task on one database is now a high-severity outage incident for everyone. The blast radius of a forgotten cron job or a stuck long-running transaction just expanded to encompass the entire data tier.</p><h2>Splitting the Atom</h2><p>The solution to most of these problems is, perhaps counter-intuitively, not beefier hardware but more instances. Take the same physical machine, carve it into virtual environments (VMs, containers, or even just multiple Postgres installations on different ports), and run one database per instance.</p><p>What changes? Let's see...<ul><li>Each instance gets its own shared_buffers, sized appropriately for its workload. An OLTP database can have a large, hot buffer pool while an analytics database gets a smaller one tuned for filesystem cache access. No more buffer thrashing between incompatible access patterns.</li><li>Transaction IDs become per-instance. One database's vacuum debt can't drag others into wraparound territory. The same applies to multixact members; that 21GB ceiling now applies to a single workload rather than the sum of all tenants.</li><li>WAL replay is per-instance. 
A write-heavy database generates WAL that only its own replica needs to replay. A latency-sensitive OLTP replica isn't waiting behind a batch import's WAL records destined for a completely different database.</li></ul><ul><li>Autovacuum workers, the checkpointer, and the background writer each serve a single database. No more starvation, no more shared checkpoint storms, no more one-size-fits-all background writer pacing.</li></ul><ul><li>Failures become isolated. A crash in one instance is invisible to the others. Maintenance windows can be scheduled independently. Emergency vacuums don't trigger cross-tenant incidents.</li></ul>The trade-off is operational complexity. More instances means more configuration to manage, more backup schedules to maintain, more monitoring dashboards to watch. But with modern infrastructure tooling (Ansible, Terraform, Kubernetes operators), the marginal cost of an additional Postgres instance is low compared to the cost of debugging an emergency multi-tenant resource exhaustion event.<h2>Knowing When to Quit</h2>Vertical scaling is a perfectly valid strategy, and there's a reason so many Postgres installations run happily on a single large instance. For moderate workloads, the shared nature of Postgres internals is not just acceptable but efficient. Shared memory, shared processes, shared caches: they all reduce overhead when the workloads are playing nicely together.The trouble starts when "playing nicely" is no longer a given. Databases with fundamentally different I/O profiles, vacuum requirements, availability SLAs, activity patterns, and other concerns, don't always mix well. Resources become contested rather than efficient. No amount of RAM, CPU, or storage can counteract that because the constraints are architectural.The signals are usually subtle at first. Autovacuum can't keep up across all databases. Replica lag increases during batch jobs in an unrelated database. Checkpoint duration creeps up. 
Multixact warnings appear in the logs that nobody configured alerts for. By the time XID wraparound threatens to lock the whole instance, there have usually been many other signs that simply went unseen. There's a reason many in the community consider multiple-database instances a type of anti-pattern; shared resources are also a shared throttle.So if you're staring at a single Postgres instance that hosts a growing number of databases, or a shrinking number of exceptionally large ones, take a hard look at the shared internals. Read the source. Do the math on your multixact headroom. Check whether your autovacuum workers are keeping pace across every database, not just the ones you're watching. And if the numbers start looking uncomfortable, consider splitting before it becomes absolutely necessary.It's a lot easier to plan a migration than to execute one during an incident.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/the-scaling-ceiling-when-one-postgres-instance-tries-to-be-everything</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,pgEdge,PostgreSQL,postgres,PostgreSQL</category>
            <title><![CDATA[Enforcing Constraints Across Postgres Partitions]]></title>
            <link>https://www.pgedge.com/blog/enforcing-constraints-across-postgres-partitions</link>
            <pubDate>Fri, 17 Apr 2026 05:48:50 GMT</pubDate>
            <description><![CDATA[ <p>Postgres table partitioning is one of those features that feels like a superpower right up until it isn't. Just define a partition key, carve up data into manageable chunks, and everything hums along beautifully. And what's not to love? Partition pruning in query plans, smaller tables, faster maintenance, easy archiving of old data; it's a smorgasbord of convenience.Then you try to enforce a unique constraint without including the partition key, and Postgres behaves as if you just asked it to divide by zero. Well... about that.<h2>The Rule Nobody Reads Until It's Too Late</h2>The <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html"><u>Postgres documentation on partitioning</u></a> spells it out pretty clearly in the limitations section:Read that again. The constraint's columns must include all of the partition key columns. Not "should." Not "it would be nice if." Must. And the reasoning is maddeningly justified: each partition maintains its own index, and a local index can only enforce uniqueness within its own partition. Postgres has no concept of a global index that spans all partitions simultaneously, so it has no mechanism to check whether some value in partition A already exists in partition B.Other database engines (Oracle, for instance) have global indexes that solve this at the storage layer. Postgres does not, and there's been no serious movement on the mailing lists to add them. So we're left to our own devices.<h2>When Theory Meets the Event Pipeline</h2>Consider a fairly common (if somewhat contrived) scenario: an  table partitioned by range on an identity column. The table includes a  that the application uses to prevent duplicate event processing. Naturally, that should be unique across all partitions.Now try adding  to that table without an error. The partition key is , and  doesn't include it, so Postgres rejects the constraint. 
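</p><p>The rejection can be reproduced with a small sketch. The table, column, and constraint names below are hypothetical stand-ins chosen for this illustration and are not taken from the original example; only the shape of the problem (a range partition key on an identity column, plus a separate column that should be unique) follows the scenario described above.</p><pre><code class="language-sql">-- Hypothetical names, for illustration only.
CREATE TABLE events_demo (
    event_id  BIGINT GENERATED ALWAYS AS IDENTITY,
    dedup_key TEXT NOT NULL,
    payload   TEXT
) PARTITION BY RANGE (event_id);

-- Rejected: the constraint columns do not include the partition key (event_id).
ALTER TABLE events_demo
  ADD CONSTRAINT events_demo_dedup_key_uq UNIQUE (dedup_key);
</code></pre><p>The statement fails at DDL time, before any partitions or rows exist. 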
You could make a composite unique constraint on , but that's effectively useless for deduplication since  is already unique. Every row would satisfy the constraint regardless of duplicate  values.</p><p>This is especially painful for date-range partitions, which are probably the most popular partitioning strategy in the wild. It's common to partition by month or week, but there's no universal uniqueness strategy there, just distinct intervals. The partition key is there for data management, not for data integrity, and Postgres can't separate those concepts without assistance.</p><p>So what can we do to help it?</p><h2>Brute Force to the Rescue</h2><p>If Postgres won't enforce global unique constraints for us, can we do it ourselves? This is Postgres after all, so there are many tools at our disposal. <a href="https://www.postgresql.org/docs/current/plpgsql-trigger.html"><u>Triggers</u></a>, for example, exist for exactly this kind of scenario.</p><p>The simplest approach is a  trigger that scans the entire partitioned table set for duplicates:</p><p>The index on  is critical here; without it, every insert triggers a catastrophically slow sequential scan across all partitions. With the index in place, each check becomes a far more efficient index scan on every partition. This is still more overhead than ideal, but it could certainly be worse.</p><p>Let's throw two million rows at it and see what happens:</p><p>Without the trigger, this insert completes in about 7 seconds. With the trigger? 25 seconds. That's more than three times as long, and we only have three partitions. The trigger must probe every partition's index for each row to confirm no duplicate exists. As the partition count grows, so does the probe time, because each check has to touch more and more partition indexes. Fifty partitions, a hundred partitions, three hundred partitions... each one adds another index lookup to every single insert.</p><p>Does our duplicate check work, though? Absolutely:</p><p>That's pretty satisfying, but performance at scale leaves much to be desired. 
Is that a problem we can solve?<h2>Fake it 'till You Make it</h2>What if instead of scanning the partitioned table, we maintained a separate, unpartitioned table whose sole job is tracking which  values already exist? Then we can leverage the primary key of that table to do the uniqueness check for us, and Postgres handles all the heavy lifting with a single B-tree lookup.Now the trigger changes from a cross-partition scan to a simple insert into the  (dedup) table. If the insert violates the primary key, Postgres immediately catches the duplicate. No scanning, no partition probing, no existential dread:But we also need to keep the dedup table honest. If rows get deleted from , the corresponding dedup entries should be cleaned up. Otherwise we'd reject future inserts for values that no longer exist:Now let's run the same two-million-row insert:This time the inserts completed in 14 seconds. Roughly double the baseline of 7 seconds without any trigger. This makes sense, as every insert now performs one additional B-tree insert into the dedup table. On the other hand, that overhead doesn’t change whether there are three partitions or three hundred. The dedup table is a single unpartitioned table with a single index. The cost is constant regardless of how many partitions exist on the  table.That trade-off is fairly compelling: pay a fixed 2x overhead per insert in exchange for partition-count independence. For a system that might grow to dozens or hundreds of partitions over its lifetime, the dedup table approach is clearly the more sustainable choice.And yes, the dupe check still works:Even better, we get Postgres's own constraint violation message now, complete with the offending key value. No custom error formatting necessary.<h2>Feedback Loop</h2>There's a natural concern with this approach: the dedup table itself could grow enormous. If the  table accumulates billions of rows over time, the dedup table will have billions of entries as well. 
That's a lot of B-tree to maintain within a monolithic index.The solution is almost comically recursive: partition the dedup table:Hash partitioning is well suited for this because it distributes values evenly and Postgres can prune directly to the correct partition for any given . The primary key constraint works here because, well, the partition key is the column we're constraining. No uniqueness problem on the uniqueness-enforcement table. Very convenient.The trigger code doesn't change at all. Postgres handles the partition routing transparently. We've effectively built a scalable uniqueness enforcement layer using nothing but declarative partitioning, a trigger function, and a table that exists purely as an optimization.<h2>Going Virtual</h2>What if the deduplication ID is something that we can derive from the row's content rather than storing in the  table? That's worth yet another tweak to our approach. Say the uniqueness guarantee is based on the  content. We could use a hash function to generate the dedup key on the fly:The  function is one of those perilous corners of Postgres that doesn't show up in casual documentation browsing. It produces a 64-bit hash of the input text, which delivers a collision space of about 18 quintillion values. For most practical purposes, that's unique enough. For deduplication requirements that are truly ironclad, stick with a deterministic identifier rather than a hash.The beauty of this approach is that the  table stays lean. The dedup table absorbs the storage cost, and since it's just a single  column, it's remarkably compact.<h2>The Fine Print</h2>No solution is without caveats, and intellectual honesty demands we acknowledge a few.First, triggers add complexity. Every trigger is a piece of business logic that lives in the database rather than the application. Some teams are fine with that; others treat it like putting ketchup on a steak. 
Organizations with strong opinions about keeping logic in the application layer may balk at using a trigger in this situation regardless of its technical merits.Second, concurrency deserves attention. Both trigger approaches rely on seeing committed data (or data visible within the current transaction). Under high concurrency, two transactions could simultaneously check for the same , both find nothing, and both proceed with the insert.The dedup table approach handles this more gracefully because the  will block on the primary key's underlying unique index, causing the second transaction to wait. Once the first commits, the second will get the constraint violation.The scan-based trigger is more vulnerable here because  might not see uncommitted rows from other transactions. The trigger would need to use more sophisticated logic including <a href="https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS"><u>advisory locks</u></a> to prevent that.Finally, there's the matter of updates. If it's possible to update , there should be additional trigger logic to handle that use case. This is territory where implementation details tied to business logic proliferate and escape the simple confines of our demonstration case. But you get the idea.<h2>Closing Thoughts</h2>So where does this leave us? The documentation is refreshingly honest about the constraint limitation, but honesty doesn't enforce uniqueness across partitions.The brute force trigger works without much fanfare, but it scales poorly with partition count. The dedup table approach trades a small, constant overhead for partition-count independence, which is almost always the right trade-off in production systems. And for those feeling adventurous, the algorithmic variant omits the extra column entirely by computing dedup keys on the fly.Sadly, none of these techniques are as clean as a native  constraint. 
They all involve triggers, which means more moving parts, more things to test, and more things to document. But they work. Sometimes the best engineering isn't about finding the perfect solution, it's about finding the most tolerable compromise.Keep these tricks in your back pocket. Partitioning is still the best approach for horizontally scaling large data sets, and the uniqueness problem isn't going anywhere until Postgres adds global indexes. Until then, triggers and dedup tables will do just fine.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/enforcing-constraints-across-postgres-partitions</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>pglogical,pgEdge,PostgreSQL,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[Checkpoints, Write Storms, and You]]></title>
            <link>https://www.pgedge.com/blog/checkpoints-write-storms-and-you</link>
            <pubDate>Fri, 10 Apr 2026 06:06:37 GMT</pubDate>
            <description><![CDATA[ <p>Every database has to reconcile two uncomfortable truths: memory is fast but volatile, and disk is slow but durable. Postgres handles this tension through its Write-Ahead Log (WAL), which records every change before it happens. But the WAL can't grow forever. At some point, Postgres needs to flush all those accumulated dirty pages to disk and declare a clean starting point. That process is called a <a href="https://www.postgresql.org/docs/current/wal-configuration.html"><u>checkpoint</u></a>, and when it goes wrong, it can bring throughput to its knees.<h2>A Bit About Checkpoints</h2>Under normal operation, Postgres is remarkably polite about checkpoints. The  parameter (default 5 minutes) tells Postgres how often to perform a scheduled checkpoint, and  (default 0.9) tells it to spread the resulting writes over 90% of that interval. So a checkpoint timeout of 5 minutes means Postgres trickles dirty pages to disk over roughly 4.5 minutes, keeping IO impact to a minimum.This only applies to timed checkpoint behavior.The  parameter sets a soft limit on how much WAL can accumulate between checkpoints. When the WAL approaches that threshold (1GB by default), Postgres doesn't wait for the next scheduled checkpoint. Instead, it forces one immediately.These forced (or requested) checkpoints do not honor . Postgres needs to reclaim WAL space, so it flushes every dirty buffer to disk as fast as the IO subsystem will allow. 
On a busy system with a large  pool full of modified pages, this can completely saturate disk IO in seconds.</p><p>It's like trying to drink from a firehose.</p><h2>Rubber Meets the Road</h2><p>To see this in action, we set up a modest test environment:</p><ul><li>Hypervisor: Proxmox</li><li>CPU: 4x AMD EPYC 9454 cores</li><li>RAM: 4GB</li><li>DB Storage: 100GB @ 2,000 IOPS</li><li>WAL Storage: 100GB @ 2,000 IOPS</li><li>OS: Debian 12 Bookworm</li></ul><p>We initialized the database with  at a scale factor of 800, producing roughly 12GB of data (3x available RAM to reduce cache hits). We also followed the traditional advice of setting  to 25% of RAM, or 1GB in this case. All other settings remained at their defaults.</p><p>Each test followed the same pattern: issue a manual  to start clean, then run pgbench for 60 seconds with per-second progress reporting and 16 concurrent clients to keep all of the CPU cores busy:</p><p>We started with the default  of 1GB to see how the system behaves. This setting is frequently overlooked during optimization, so it serves as a good example of baseline operation.</p><p>Throughput held steady between 1,000 and 1,100 TPS for the first 41 seconds of the test. Buffers began to warm, the IO subsystem kept pace, and latency remained low. At the 42-second mark, WAL output reached 1GB and Postgres forced a checkpoint. TPS immediately cratered to roughly 620, a drop of nearly 40%! It never recovered for the remainder of the benchmark run.</p><p>We increased  to 4GB for the second test. It's a modest bump, but should be sufficient for the purposes of this demonstration. Throughput started around 1,000 TPS this time around and gradually climbed as shared buffers warmed up, reaching 1,200 TPS by the end of the test. 
One minute of pgbench activity isn’t enough to produce 4GB of WAL on this hardware, meaning no forced checkpoint.The results basically speak for themselves:<img src="https://a.storyblok.com/f/187930/1782x805/5192ce01b9/pgbench-tps-forced-checkpoints.png" >Ouch! Both tests tracked nearly identically for the first 40 seconds. Then the 1GB configuration hit a wall while the 4GB configuration kept climbing.<h2>The Cost of Forced Checkpoints</h2>Postgres normally uses  during a timed checkpoint to calculate a write budget. If it has 5 minutes between checkpoints and the target is 0.9, it can spread dirty page writes over 270 seconds. That's a lot of time to trickle data to disk, and the IO impact per second is minimal.A forced checkpoint has no such luxury. The WAL is full (or nearly so), and Postgres needs to reclaim space now. It writes dirty buffers as fast as it can, competing directly with active queries for disk IO. This competition is fierce on a system limited to 2,000 IOPS. Every IOPS spent flushing checkpoint data is essentially stolen from user queries.The severity is largely hardware dependent. Systems with fast NVMe storage and tens (or hundreds) of thousands of IOPS may barely notice. But cloud instances, virtualized environments, or anything with IO throttling (which is extremely common) will feel the pain. We provisioned our test system at 2,000 IOPS per volume, which is relatively generous by cloud standards, and still experienced a marked impact.The benchmark itself is only half of the story. Prior to that, we had to initialize the 12GB test database with . While pgbench generated the sample data with the default 1GB , Postgres triggered 18 forced checkpoints. Trying again with  set to 20GB brought that number to zero.So what? It's just initialization, right? Consider that this same pattern applies to any bulk data operation:  imports, large  statements,  on big tables, , or even hefty  batches. 
If any of these operations are running alongside a production OLTP workload, that's 18 IO storms competing with application queries.An ETL job that loads a few gigabytes of data every night could trigger a string of forced checkpoints that spike latency for every other query on the system. The bulk operation itself will also slow down since it's fighting its own checkpoint IO for disk bandwidth.Everyone loses when checkpoints can't spread write activity.<h2>Spotting the Problem</h2>Postgres tracks checkpoint statistics, and checking them should be part of any regular health assessment. The system catalog you should use depends on the Postgres version.In Postgres 17 and later, use <a href="https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-CHECKPOINTER-VIEW"><u>pg_stat_checkpointer</u></a>:In older versions, the same information lives in :The critical ratio here is: timed versus requested checkpoints. In a well-tuned system,  (or ) should be close to zero relative to . If requested checkpoints are a significant proportion of the total,  is too small for the current write workload and performance is likely sub-optimal.It's also worth keeping an eye on  and . If  is consistently high, the storage subsystem is struggling to keep up with checkpoint flushes, which further confirms an IO bottleneck during checkpoints.As for logging, we highly recommend setting <a href="https://www.postgresql.org/docs/current/runtime-config-logging.html"><u>log_checkpoints</u></a> to  to capture checkpoint activity:This causes Postgres to log detailed information about every checkpoint, such as the number of buffers written, how long it took including sync time, and other useful metrics. When enabled, the Postgres log should show checkpoint activity like this:That  line is the smoking gun. It means this checkpoint was forced because WAL hit the limit, not because the timeout expired. A timed checkpoint would say  instead.This is free forensic information. 
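</p><p>For clusters where the setting is still off, one way to turn it on without a restart is sketched below. It assumes a role with permission to run ALTER SYSTEM, which writes the value to postgresql.auto.conf; editing postgresql.conf by hand and reloading works just as well.</p><pre><code class="language-sql">-- Persist the setting and reload the configuration.
-- log_checkpoints takes effect on reload; no restart is needed.
ALTER SYSTEM SET log_checkpoints = 'on';
SELECT pg_reload_conf();

-- Confirm the running value.
SHOW log_checkpoints;
</code></pre><p>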
The logging overhead is negligible and provides a clear trail of checkpoint behavior. This is another one of those settings that should be enabled by default, and has been since Postgres 15. Those with older clusters will have to enable it manually.<h2>Finding the Right max_wal_size</h2>So should everyone just crank  to some enormous value and forget about it? Not exactly. There are trade-offs.A larger  allows more WAL to accumulate between checkpoints, which means more data must be replayed during crash recovery. If Postgres crashes with 20GB of WAL to replay, startup necessarily takes longer than it would with 1GB. This difference in recovery time is usually only a matter of seconds, but it's worth acknowledging.The other consideration is disk space. WAL files consume storage, and  is a soft limit. Under heavy write loads, WAL can temporarily exceed it. There should be enough headroom on the WAL volume to accommodate bursts without running out of space entirely. That would be a much worse problem than slow checkpoints.A reasonable starting point for write-heavy OLTP workloads is 10GB to 20GB. Systems with aggressive bulk loading or large batch operations might benefit from 50GB or more. The goal is to make forced checkpoints rare enough that essentially all checkpoints are timed and spread gracefully over .We recommend validating the setting by monitoring  (or ) over time. Let the system run under typical load for a day or a week, then check the ratio. If requested checkpoints have increased, bump  higher and repeat.If you'd rather not do this yourself, I actually wrote a Postgres extension called <a href="https://github.com/pgEdge/pg_walsizer"><u>pg_walsizer</u></a>. It launches a background worker that monitors checkpoint activity and automatically increases  based on how many checkpoints occur within the configured . 
Just set it and forget it!</p><h2>Wrapping Up</h2><p>Checkpoints are one of those Postgres internals that most people never think about until something goes wrong. Periodic latency spikes can have any number of causes after all. Not every DBA thinks to check WAL activity or realizes how closely it relates to disk flushes; most people blame vacuum at first.</p><p>As is tradition, the default value of 1GB for  is a conservative one. It minimizes crash recovery time and works fine for light workloads without using a lot of storage. Unfortunately, busy systems will quickly exceed this default and begin to suffer. Our test showed a TPS plunge of nearly 40% on modest hardware; production systems with heavier loads and tighter IO budgets will likely fare worse.</p><p>For most production environments, we suggest starting  with something more appropriate. If  isn't already enabled, prioritize enabling that as well. And finally, either  or  should feature prominently in a monitoring dashboard. Look upon any increases in requested checkpoints with suspicion.</p><p>In the end,  is one of those rare cases where a single parameter can confer a substantial improvement with virtually no downside. So go check your checkpoint stats. You might be surprised by what you find!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/checkpoints-write-storms-and-you</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,PostgreSQL,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[What is a Collation, and Why is My Data Corrupt]]></title>
            <link>https://www.pgedge.com/blog/what-is-a-collation-and-why-is-my-data-corrupt</link>
            <pubDate>Fri, 03 Apr 2026 05:36:03 GMT</pubDate>
            <description><![CDATA[ <p>The GNU C Library (glibc) version 2.28 entered the world on August 1st, 2018 and Postgres hasn't been the same since. Among its many changes was a massive update to locale collation data, bringing it in line with the 2016 Edition 4 release of the ISO 14651 standard and Unicode 9.0.0. This was <i>not</i> a subtle tweak. It was the culmination of roughly 18 years of accumulated locale modifications, all merged in a single release.<i>Nobody</i> threw a party.What followed was one of the most significant and insidious data integrity incidents in the history of Postgres. Indexes silently became corrupt, query results changed without warning, and unique constraints were no longer trustworthy. The worst part? You had to <i>know</i> to look for it. Postgres didn't complain. The operating system didn't complain. Everything appeared normal, right up until it wasn't.This is the story of how a library upgrade quietly corrupted databases around the world, what the Postgres community did about it, and how to make sure it never happens to you again.<h2>What even is a Collation?</h2>Before we can understand what broke, we need to understand what a collation actually <i>does</i>. At its core, a collation defines how text is compared and sorted. That sounds simple enough, but collation rules become much more turbulent outside of the English alphabet.Consider the German letter ß. Does it sort the same as "ss"? Usually. What about accented characters like é and è? Should they be treated as equivalent to "e" for sorting purposes, or should they have their own distinct positions? What about the Swedish alphabet, where ä and ö come <i>after</i> z rather than being treated as variants of a and o?Every language has its own answer to these questions, and a collation encodes those answers into a set of rules for a database to follow. 
When Postgres needs to sort a column of text, enforce a unique constraint, or build a B-tree index, it asks the collation: "Which of these two strings comes first?" The collation's answer determines everything from query results to whether an index lookup finds any data at all.Historically, Postgres delegated that question to the operating system's C library. Postgres doesn't have its own implementation of American English sorting rules baked in, so databases created with the  locale rely on external libraries. More specifically, Postgres used to simply invoke  from glibc (on Linux systems) and trusted whatever answer came back.That trust worked fine for years. And then one day it didn't.<h2>The Day the World Changed</h2>So what <i>actually</i> changed in glibc 2.28? Consider a simple example using the  locale. Before the update, strings containing special characters sorted like this:After glibc 2.28, those same strings sorted like this:That's not a minor adjustment. The relative positions of strings containing punctuation, mixed case, and special characters shifted <i>dramatically</i>. Another well documented example: the sort order of  and  simply flipped between the old and new versions. Data containing strings with hyphens, underscores, or currency symbols now experienced seemingly inconsistent sorting rules.This wasn't a <i>bug</i> in glibc. The glibc developers were correcting years of accumulated divergence from the Unicode standard. The new sort orders were arguably more <i>correct</i>. But correctness is cold comfort for indexes founded upon faulty assumptions.<h2>Anatomy of a Silent Catastrophe</h2>Why did this cause so much trouble, though? Let's examine how Postgres B-tree indexes work with text. When Postgres builds an index on a text column, it sorts the values according to the active collation and stores them in that order. Postgres builds the literal on-disk tree structure based on these results. 
Later, Postgres navigates the B-tree by comparing search terms against stored keys, following the tree left or right based on which string "comes first" according to the collation.</p><p>Imagine the library underneath changes its mind about which string comes first. The physical layout of the index reflects the <i>old</i> sort order, but every new comparison uses the <i>new</i> sort order. Postgres navigates right when it should go left, or left when it should go right, and previously valid data becomes invisible. The row is still in the table, a sequential scan still finds it, but the index lookup misses it entirely.</p><p>The consequences cascaded outward from there:<ul><li> Queries using index scans could silently skip existing rows. A query might return 999 rows when 1,000 actually matched the predicate.</li></ul><ul><li> Unique constraints backed by B-tree indexes could fail to detect actual duplicates because the index traversal couldn't find the existing entry. Alternatively, they might wrongly reject valid entries because the traversal landed on the wrong node.</li></ul><ul><li> Any query relying on sorting by a text column would produce different results before and after the upgrade. Merge joins, which depend on both inputs being sorted identically, could produce silently incorrect results.</li></ul><ul><li> If a primary ran one glibc version and a replica ran another, identical queries against identical data produced different results. This exact scenario was demonstrated with streaming replicas, though few understood the full implications at the time.</li></ul>All of this happened <i>silently</i>. Nobody knew (or <i>could</i> know) anything was wrong until it was already too late.</p><h2>The Domino Effect</h2><p>The glibc 2.28 release didn't hit every Linux distribution simultaneously. 
Instead, it rippled outward over the course of about a year as each distribution adopted it on its own schedule:<ul><li> Fedora 29 and Ubuntu 18.10 shipped with glibc 2.28.</li></ul><ul><li> Christoph Berg, the Debian PostgreSQL maintainer, raised the alarm, describing the situation as critical. He proposed automated warnings for Debian users with PostgreSQL clusters.</li></ul><ul><li> RHEL 8 and CentOS 8 arrived with glibc 2.28, making the leap from glibc 2.17 in RHEL 7. That's an 11-version jump in a single upgrade cycle.</li></ul><ul><li> Debian 10 (Buster) followed suit.</li></ul>The RHEL jump was particularly brutal. Many enterprise shops run on CentOS or RHEL, and an OS upgrade from version 7 to 8 was simply an eventuality. Nobody expected that a routine distribution upgrade would quietly corrupt their database indexes. Arch Linux users, running the bleeding edge as always, were among the first canaries in the coal mine.</p><p>Daniel Verite published "<a href="https://postgresql.verite.pro/blog/2018/08/27/glibc-upgrade.html"><u>Beware of your next glibc upgrade</u></a>" in August 2018, one of the earliest public warnings. The Postgres Wiki created dedicated pages for <a href="https://wiki.postgresql.org/wiki/Locale_data_changes"><u>Locale data changes</u></a> and <a href="https://wiki.postgresql.org/wiki/Collations"><u>Collations</u></a> to track the evolving situation. 
Blog posts from <a href="https://www.crunchydata.com/blog/glibc-collations-and-data-corruption"><u>Crunchy Data</u></a>, <a href="https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/"><u>Citus Data</u></a>, and <a href="https://www.cybertec-postgresql.com/en/icu-collations-against-postgresql-data-corruption/"><u>CYBERTEC</u></a> followed, each emphasizing the same uncomfortable truth: if you upgraded glibc and didn't rebuild your indexes, your data might <i>already</i> be corrupt.<h2>Finding the Damage</h2>The first step for any affected system was identifying at-risk indexes. Any B-tree index on a text, varchar, char, or citext column using a locale-dependent collation (anything other than  or ) was potentially corrupt. The Postgres community settled on a diagnostic query that looked something like this:Every index returned by that query needed to be rebuilt. For Postgres 12 and later, that meant:The  option proved to be a lifesaver for production systems, as it allows the rebuild to happen without locking the table for the entire rebuild duration. For those stuck on Postgres 11 or earlier, the workaround was uglier: create a replacement index concurrently, drop the old one, and rename the new one. Primary key indexes and unique constraints made this particularly painful.The scale of the problem was staggering. A database with hundreds of tables and thousands of text indexes needed <i>every single one</i> rebuilt. And this wasn't a one-time fix. Any future glibc upgrade that changed collation rules would require a repeat performance.<h2>Déjà vu</h2>Ironically, the glibc 2.28 incident wasn't even the first time glibc caused index corruption in Postgres. 
Several glibc versions in 2015 shipped with a buggy  implementation that produced results inconsistent with , violating both ISO C90 and POSIX standards.Postgres 9.5 had introduced "abbreviated keys" to speed up text index builds, and the glibc bugs caused these to produce corrupt indexes. The fix in Postgres 9.5.2 was to <a href="https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue"><u>disable abbreviated keys for non-C locales entirely</u></a>, a performance regression that persists to this day for libc-based collations. Users had to  then, too.Two major incidents in three years, both caused by the same fundamental problem: Postgres trusted an external library for a critical operation, and that library's behavior was neither stable nor guaranteed. The writing was on the wall.<h2>The Long March Away from glibc</h2>The Postgres community's response to these incidents was measured but determined, playing out over nearly a decade of incremental progress. introduced initial support for ICU (International Components for Unicode) as an alternative collation provider. Peter Eisentraut's work here was prescient, landing a full year before glibc 2.28 shipped. For the first time, it was possible to create a collation backed by ICU instead of libc:ICU maintains its own collation data independent of the operating system. ICU updates its rules through a strict versioning system, meaning Postgres can detect the change and emit a warning. added <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d5ac14f9c"><u>collation version tracking</u></a> for glibc. Postgres began recording the collation version during index creation, and issuing warnings when the underlying version changed. This was the first real "early warning system" for the problem. It couldn't prevent corruption, but at least the logs told the full story. 
was an event many anticipated: ICU could now be used as the <a href="https://www.postgresql.org/docs/15/app-initdb.html"><u>default collation provider</u></a> for an entire database cluster:Before this, ICU was only available for individual collation objects, which was both inconvenient and prone to mistakes. delivered what many consider the real solution: the <a href="https://www.postgresql.org/docs/current/collation.html"><u>builtin collation provider</u></a>. This provider compiles collation logic directly into Postgres itself, with no external glibc or ICU dependency. The builtin provider ships with two primary collations:<ul><li>: Unicode code point sorting with POSIX-compatible pattern matching and simple case mapping.</li></ul><ul><li>: Unicode code point sorting with full Unicode case mapping and standard pattern matching behavior.</li></ul>Both are guaranteed to be immutable within a major Postgres version. The entire class of "the OS changed my sort order" bugs simply cannot happen with these collations.<h2>The Right Way to Initialize a Cluster (Today)</h2>Despite all of this history,  as of Postgres 18. Newly initialized clusters will use glibc libraries unless told otherwise. That means every new database is potentially vulnerable to the same class of problem that glibc 2.28 caused, just waiting for the next major library update to trigger it.As a result, I recommend <i>always</i> specifying the builtin provider when creating a new cluster:Two flags to eliminate an entire category of data corruption risk. 
The  locale provides proper UTF-8 character handling while sorting by Unicode code point order without any lingering surprises.For those creating databases within an existing cluster, the same principle applies:It’s necessary to specify  as the template database when doing this, as it’s normally impossible to use a collation different from the source database.Until the Postgres project changes the default (and there are ongoing discussions about exactly that), every DBA needs to make this a conscious choice for every new cluster or database.<h2>Never a Free Lunch</h2>If the new internally-provided collations are so great, why isn't everyone using them?The first reason is that few stop to consider the topic at all. They may trust that the problem is being handled and addressed by future versions, and it'll happen as if by magic. Nobody wants to face the ugly reality of a data migration. Perhaps they’re simply new users who missed the uproar.The other reason is more subtle. Both  and  sort by Unicode code point value. This is essentially a byte-order sort for UTF-8 encoded text. It's deterministic, fast, and perfect for indexes. But it doesn't match what a human would expect for linguistically correct sorting in most languages.Consider German names:Yet a native German speaker would expect this instead:For most application workloads, this doesn't actually matter. APIs return JSON, frontends sort data client-side, and search operations care about matching rather than ordering. But linguistic sort order is still relevant for applications that display sorted lists directly to users, such as a directory, a catalog, a report, and so on.<h2>Applying Linguistic Sorting</h2>This is where ICU earns its keep. Clusters can run on the builtin provider for safety and performance, and then apply ICU collations <i>precisely where necessary</i>. There are two approaches. 
A column-level collation is ideal when a particular column always needs linguistic sorting. A column declared with a German ICU collation always sorts according to German linguistic rules, while every other text column in the database uses the safe, deterministic builtin collation. Indexes on that column will use the ICU collation, so each one becomes dependent on ICU's versioning. ICU tends to exercise more discipline about version management than glibc, so this isn't quite as risky.</p><p>A query-level collation is better for occasional linguistic sorting. It applies the German ICU collation just for one specific sort operation. The underlying column and its indexes remain on the builtin collation. This results in linguistic sorting without altering the storage or index behavior. This is the <i>safest</i> approach as a result, though admittedly more inconvenient due to the additional syntax.</p><h2>The Ghost of Clusters Past</h2><p>New clusters are easy. Specify the builtin provider at init time and move on with your life. But what about the millions of existing Postgres clusters already running on libc collations? Those don't just go away, and it's not possible to change a database's default collation after creation. That said, there are a few options:<ul><li> Use a dump and restore or logical replication to move data into a fresh cluster initialized with the builtin provider. This is the cleanest approach, but requires planning and potential downtime.</li></ul><ul><li> Individual columns can use a different collation within a libc-based cluster. Perform a staged rollout by moving tables or affected columns to a safer collation, and always create new columns with that collation. A future cluster migration is still necessary for a permanent fix, but this process provides a path to safety.</li></ul><ul><li> Postgres 13+ will log warnings when a collation's underlying version changes. Pay attention to those warnings; they're screaming that a REINDEX is necessary. 
Look for messages reporting a collation version mismatch.</li></ul><ul><li> The safest approach for glibc-based clusters is to rebuild every text index following a major distribution upgrade. No exceptions. Use the diagnostic query from earlier to identify affected indexes, and reindex the identified candidates.</li></ul><h2>Moving On</h2>The glibc 2.28 incident changed how the entire Postgres community thinks about external dependencies. Before 2018, the idea that an OS library update could result in database corruption was something only a handful of people worried about. That it did so <i>silently</i>, allowing the corruption to fester for weeks, months, or even <i>years</i>, just poured extra salt in the wound.</p><p>The Postgres community responded in its typical heroic fashion. Collation version tracking, ICU provider support, and ultimately builtin collations show just how far the Postgres devs are willing to go to solve a problem. It's not a complete rebellion against trusting OS-provided libraries, but decoupling from external collation resources remains a prudent reaction given the circumstances.</p><p>And yet we shouldn't rest on our laurels; glibc remains the default even now, suggesting the lesson hasn't fully sunk in for everyone. Every cluster initialized without the builtin provider is another cluster carrying the same risk that nearly spelled disaster for many. I've personally encountered clusters with this kind of corruption as recently as 2025, fully seven years since everything went wrong. Why propagate that mistake?</p><p>So the next time you spin up a Postgres cluster, do yourself a favor and use the builtin locale provider. Your future self, the one who just upgraded to the latest Ubuntu LTS without thinking twice about it, will thank you.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/what-is-a-collation-and-why-is-my-data-corrupt</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>postgres,PostgreSQL,PostgreSQL</category>
            <title><![CDATA[PG Phriday: Absorbing the Load]]></title>
            <link>https://www.pgedge.com/blog/pg-phriday-absorbing-the-load</link>
            <pubDate>Fri, 27 Mar 2026 06:53:43 GMT</pubDate>
            <description><![CDATA[ <p>Recently on the <a href="https://www.postgresql.org/list/pgsql-performance/">pgsql-performance</a> mailing list, a <a href="https://www.postgresql.org/message-id/CAEzWdqd0SPkZMYNaAbERdgczkfQqLmNV5JBMmF-F9s7KjxJ0gw%40mail.gmail.com">question popped up</a> regarding a Top-N query gone wrong. On the surface, the query merely fetched the latest 1000 rows through a join involving a few CTEs across a dozen tables with a few million rows distributed among them. A daunting query with an equally <a href="https://gist.github.com/databasetech0073/6688701431dc4bf4eaab8d345c1dc65f">daunting plan</a> that required about 2-3 seconds of execution time. Not ideal, but not exactly a show-stopper either. But high concurrency compounded the problem, with 40 sessions enough to completely saturate the available CPU threads.</p><p>In the end all it took was a <a href="https://www.postgresql.org/message-id/CAEzWdqcAbi0GYp_K64oZTeUeN3YN7-eFQ2m2fZDRvmnJx5Lb5w%40mail.gmail.com">new index</a> to solve the conundrum, as is usually the case with queries like this. Problem solved, right?</p><p>Well... maybe. But let's play a different game.</p><p>What if the query was already optimal? What if every join was fully indexed, the planner chose the best available plan, and yet the sheer volume of data meant two seconds was the fastest possible time? Now imagine not 40 concurrent users, but 4000. What happens then?</p><h2>It's Just Math</h2><p>Suppose a query takes two seconds after being fully optimized and there are 4000 users that each need that result once per page load, spread across a 60-second window. That's roughly 67 new executions every second, and since each one runs for two seconds, about 133 are in flight at any given moment. A 2-vCPU machine can only handle two of these at once; what about the other 131?</p><p>That's an irreducible capacity mismatch. 
If we can't reduce the execution time of the original query, we must provide some kind of substitution instead, one that reduces the number of times Postgres runs that query at all.</p><p>The good news is that Postgres has built-in tools for this. The better news is that there are external tools that extend those capabilities substantially. Sometimes, though a Postgres expert is loath to admit it, the right answer rests outside of Postgres.</p><p>Let's explore the available tools.</p><h2>It's a Material World</h2><p>Postgres has supported <a href="https://www.postgresql.org/docs/current/rules-materializedviews.html">materialized views</a> since version 9.3, and they are exactly what they sound like: a view whose results are physically stored on disk, rather than recomputed on demand. Creating one is straightforward.</p><p>Now instead of hammering the source tables on every request, thousands of concurrent users query a single pre-materialized result set. The read path becomes trivially fast. You can even add indexes to the materialized view itself.</p><p>The catch is that the data is frozen at the moment of last refresh. To update it, you must refresh the view.</p><p>By default, this takes an AccessExclusiveLock on the view during the refresh, meaning all reads are blocked until the operation completes. It's possible to refresh a materialized view concurrently, but this requires a unique index and is considerably slower.</p><p>It's typical to schedule such refreshes using a tool like <a href="https://github.com/citusdata/pg_cron">pg_cron</a> or an external scheduler.</p><p>What do we actually get here? Zero external dependencies, native Postgres, indexable, and queries like any other table. A multi-second query transforms into a sub-millisecond scan to clients. Unfortunately, the data is stale between refreshes, the concurrent refresh is slower and requires a unique index, a full refresh blocks all readers, and the refresh time scales with the base table size rather than the size of what actually changed. 
It's a blunt instrument, but an effective one.Materialized views excel most when you can tolerate stale data on a predictable schedule. Think dashboards, reporting aggregates, and leaderboards where "as of 5 minutes ago" is perfectly acceptable. For real-time transactional Top-N results? Less so.Can we improve on this?<h2>Incrementally Yours</h2>The <a href="https://github.com/sraoss/pg_ivm">pg_ivm</a> extension addresses the core weakness of standard materialized views: the full recomputation on every refresh. IVM stands for Incremental View Maintenance, and the premise is straightforward. When a base table changes, only apply the delta to the materialized view, rather than rebuilding the entire thing from scratch.The performance difference is striking; a full refresh of a large materialized view might take 20 seconds. An incremental update after a single-row insert may only require a few milliseconds. It's not uncommon to experience one or more orders of magnitude of improvement.Creating an incrementally maintained materialized view (IMMV) looks like this:Behind the scenes,  attaches  triggers to every base table referenced in the view definition. When a row is inserted, updated, or deleted in , , or , those triggers fire and update  in the same transaction. The view is always current without any need for a job scheduler or update process.Beautiful. But there are some cracks in the facade.Trigger overhead is real. Every write to every base table now carries a few microseconds of extra work. If those tables receive thousands of writes per second, every single one of those must update the IMMV. We've essentially traded faster reads for slightly slower writes. Whether that trade is favorable depends entirely on the database's read-to-write ratio and write latency thresholds.There's also the matter of query restrictions.  does not support , , , window functions, , or subqueries in the view definition. 
Notice that the  from the original view definition is missing and left as an exercise on the read side. Base tables must be plain tables too, not views, partitions, or foreign tables. And finally, data types need to support btree indexing, which rules out JSON, XML, and similar types.The extension is best suited for aggregate-style views (sums, counts, averages) where the base data is relatively stable. For the specific Top-N use case from the mailing list, these limitations narrow its applicable scope considerably. But if your scenario fits within those constraints, the always-fresh guarantee is hard to ignore.What other options are there?<h2>Out of the Loop</h2>Neither materialized views nor pg_ivm fully address highly dynamic real-time query results accessed by thousands of concurrent readers. The real insight is this:<i>The TTL of your cache only needs to be greater than or equal to the query execution time.</i>Think about that for a moment. A query with a two-second run-time has a minimum resolution of two seconds. By widening that window slightly to five seconds, cache refreshes can absorb slow executions while still remaining current. Every concurrent request in between just reads the cached result instantly, condensing thousands of database roundtrips into one.This is the Shared Cache pattern, and it works at the application layer. The key design decision is to separate the writing of the cache from the reading of it. Rather than having client requests trigger the expensive query on a cache miss, you dedicate a background worker whose entire job is to keep the cache warm. Clients never touch the database directly, they only read from the cache.Here's what something like that may look like in Python:This separation is what makes the pattern so effective. The background worker runs on its own schedule, executing the expensive query once per TTL cycle regardless of whether 1 user or 4000 users are waiting for results. 
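The background worker pattern described above can be sketched roughly as follows. This is a minimal, in-process illustration rather than the original listing: `run_expensive_query` is a hypothetical stand-in for the slow Top-N query, and the cache is a plain dictionary guarded by a lock.

```python
import threading
import time

# In-process cache shared by the worker and every reader thread.
_cache = {"rows": None, "refreshed_at": None}
_cache_lock = threading.Lock()


def run_expensive_query():
    """Hypothetical stand-in for the slow Top-N query against Postgres."""
    time.sleep(0.01)  # simulate query latency
    return [("item-1", 100), ("item-2", 95)]


def refresh_worker(query, ttl_seconds, stop_event):
    """Run the query once per TTL cycle and publish the result."""
    while not stop_event.is_set():
        try:
            rows = query()
            with _cache_lock:
                _cache["rows"] = rows
                _cache["refreshed_at"] = time.monotonic()
        except Exception:
            # On failure, keep serving the previous result.
            pass
        # Sleep for one TTL cycle, waking early if asked to stop.
        stop_event.wait(ttl_seconds)


def read_cached_rows():
    """Client read path: only reads the cache, never runs the query."""
    with _cache_lock:
        return _cache["rows"]
```

A caller would start `refresh_worker` in a daemon thread with something like a five-second TTL and have request handlers call `read_cached_rows()`. The query then runs once per cycle no matter how many readers there are.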
The clients are completely decoupled from the database. There's no risk of a sudden surge of users all triggering the same expensive query at once, because clients never trigger it in the first place. The worker owns the query. The clients own the read. Those two concerns never intersect.The dance is entirely agnostic to what Postgres is doing. No schema changes, no extensions, no triggers. And with a TTL measured in seconds rather than minutes, the data is effectively real-time from a user's perspective. Unless it's a high-frequency trading platform, few users will notice data that's a few seconds stale.The obvious question is: where does this cache live? If it lives in-process memory, each application server maintains its own cache, and you still get N database queries for N application servers. Which brings us to the next stop on our tour.<h2>The Shared Nothing Problem</h2><a href="https://redis.io/">Redis</a> (and variants like <a href="https://docs.keydb.dev/">KeyDB</a>) is the most common answer to the shared cache problem in production environments. It's fast, simple, and provides a single cache shared by all application servers. That last part is important because it instantly solves the multi-instance cache coherence problem.The same background worker pattern applies, but now the worker writes to Redis instead of a local dictionary, and every application server reads from the same shared store.Let's see that with some more pseudo-python:Redis handles the TTL natively via , so even if the worker hiccups, stale entries expire on their own. The result: one database round-trip per refresh cycle, regardless of how many application servers or concurrent users there are. The worker is the only thing that ever talks to Postgres for these queries, and it does so at a calm, predictable cadence.There are some things worth considering though. JSON serialization of large result sets has a cost. 
For very large payloads, MessagePack or Protocol Buffers will do the job with less overhead. Cache granularity matters too; if every user has a highly unique set of filter predicates, cache hit rate plummets and essentially negates any previous benefit.Then of course there's cache invalidation. We can actually ignore that for the purposes of this discussion. Why? It's often better to have some data than no data. If the worker hasn't refreshed the cache, either the database is unreachable, or the refresh took longer than expected. In that event, the front end can display the existing cache with a tooltip or some other disclaimer disclosing the information may be slightly stale. While not ideal, the alternative is redirecting traffic back to the database, inevitably overwhelming its resources and resulting in no data for anyone.Tools like <a href="https://memcached.org/">Memcached</a> deserve a brief mention as well. It's a reasonable alternative when you don't need persistence or data structures beyond simple key-value storage. It's slightly faster at raw cache operations but lacks many of Redis's operational niceties. It does the job in a pinch.<h2>Pushing It to the Edge</h2>If your Top-N query results are not user-specific (think a public leaderboard, a global trending list, or a site-wide feed) then there's one more powerful option that lives even further from Postgres: a Content Delivery Network (CDN).CDNs like <a href="https://www.cloudflare.com/">Cloudflare</a>, <a href="https://www.fastly.com/">Fastly</a>, and <a href="https://aws.amazon.com/cloudfront/">AWS CloudFront</a> can cache HTTP responses at edge nodes geographically close to your users. A CDN will cache the partial or fully-rendered results from a properly configured API endpoint with an appropriate  HTTP header:This instructs the CDN to serve the cached response for 5 seconds, and then for a further 10 seconds serve the stale result while asynchronously revalidating in the background. 
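As a rough sketch of the header behavior just described, the following WSGI endpoint marks its response as fresh for 5 seconds and usable while stale for a further 10. The exact header from the original post is not preserved here, so the `max-age` and `stale-while-revalidate` directives are an assumption based on that description; the `top_n_app` name and its payload are hypothetical, and support for `stale-while-revalidate` varies by CDN.

```python
import json


def cache_control_header(fresh_seconds, stale_seconds):
    """Build a Cache-Control value for a publicly cacheable response."""
    return (
        f"public, max-age={fresh_seconds}, "
        f"stale-while-revalidate={stale_seconds}"
    )


def top_n_app(environ, start_response):
    """Minimal WSGI endpoint returning a cacheable JSON payload."""
    # Placeholder data; a real endpoint would return the Top-N result.
    body = json.dumps({"top": ["alice", "bob", "carol"]}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        ("Cache-Control", cache_control_header(5, 10)),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any WSGI server can host this, including the standard library's `wsgiref.simple_server`. Some CDNs prefer `s-maxage` for the shared-cache lifetime, so the directives may need adjusting for a particular provider.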
Your origin server, and by extension Postgres, sees one request per cache expiration cycle per CDN edge node. That could collapse millions of user requests into dozens of origin requests for global deployments.The CDN approach follows the same philosophy as the background worker pattern, just at a different layer. The CDN itself handles revalidation, fetching a fresh copy from the origin on its own schedule rather than letting every user request pass through. It's elegant, requires zero application code changes for HTTP-based APIs, and the cost is typically a fraction of your dedicated compute resources.The limitation, of course, is that CDN caching only applies to public data. Anything that varies by authenticated user (personalized feeds, account-specific results, user-filtered Top-N queries) can't be cached at the CDN layer without significant complexity involving headers, cache partitioning, or edge authentication. For those cases, intermediate caches like Redis remain a simpler solution.There's also <a href="https://www.varnish.org/index.html">Varnish</a>, which deserves a mention for teams that want CDN-like caching on self-hosted infrastructure. Varnish sits in front of an application server and caches responses based on configurable rules. It's extremely fast, delivers fine-grained control over cache behavior, and avoids CDN cost overhead.That makes three different tiers of caching!<h2>Know Your Terrain</h2>There's no universal answer, but the decision isn't all that complicated:<img src="https://a.storyblok.com/f/187930/604x904/bbe14edfe3/caching-decision-flowchart.png" >In practice, most high-scale systems use multiple layers simultaneously. Postgres materialized views pre-aggregate data that rarely changes. Redis caches user-specific query results with short TTLs. A CDN handles public-facing endpoints. 
Each layer absorbs load that doesn't need to reach the next one, and Postgres ends up only seeing the requests that genuinely require fresh database access.<h2>Sometimes the Question Isn't the Question</h2>Ostensibly, Yudhi asked about query tuning. Most of the early replies focused on missing indexes, and they were ultimately proven correct. A well-placed  moderately improved that specific query. But that's not how every story ends.What about the other questions Yudhi asked? The ones about thousands of concurrent users, CPU spikes, and capacity planning? Those are really architectural questions in query-tuning suits. Answering those questions evades any amount of database or query performance tuning.When you've done everything right in Postgres (optimal queries, correct indexes, well-chosen configuration) and you still have more demand than your instance can absorb, the solution isn't to make the query faster. It's to make the query happen less often. Materialized views bestow that capability natively. The pg_ivm extension adds a few more bells and whistles in exchange for minor write throughput. Redis brings a shared cache that spans the entire application fleet. CDNs push the cache to the edge of the internet itself.The magic, if we can call it that, is the layering. Each level absorbs the load that doesn't need to penetrate further. By the time a request actually reaches Postgres and runs a query, it genuinely needs to. Everything else was handled or deflected by an upstream layer.I suspect most performance problems that look unsolvable at the query level are really just missing a layer. People don't usually think like databases, and really, who can blame them? But the next time you find yourself staring at a perfectly optimized query that still can't keep up with demand, consider: you may not have a query problem, but an architectural one. 
And the good news is that the solutions are well-understood, battle-tested, and surprisingly straightforward to implement.Now go add that index. Then think about the cache. Your database will thank you.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/pg-phriday-absorbing-the-load</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,PostgreSQL,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[Using Patroni to Build a Highly Available Postgres Cluster—Part 3: HAProxy]]></title>
            <link>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-3-haproxy</link>
            <pubDate>Fri, 20 Mar 2026 06:47:06 GMT</pubDate>
            <description><![CDATA[ <p>Welcome to part three of our series for building a High Availability Postgres cluster using Patroni! <a href="/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-1-etcd">Part one</a> focused entirely on using etcd to provide the critical Distributed Configuration Store (DCS) backbone for the cluster, and <a href="/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-2-postgres-and-patroni">part two</a> added Patroni and Postgres to the software stack. While it's entirely possible to stop at that point and use the cluster as-is, there's one more piece that will make it far more functional overall.</p><p>New connections need a way to reach the primary node easily and consistently. Patroni provides a REST interface for interrogating each node for its status, making it a perfect match for any software or load-balancer layer compatible with HTTP checks. Part three focuses on adding HAProxy to fill that role, completing the cluster with a routing layer.</p><p>Hopefully you still have the three VMs where you installed etcd, Postgres, and Patroni. We will need those VMs for the final stage, so if you haven't already gone through the steps in parts one and two, come back when you're ready.</p><p>Otherwise, let's complete the cluster!</p><h2>What HAProxy adds</h2><p><a href="https://www.haproxy.org/">HAProxy</a> is one of the most common HTTP proxies available, but it also has a hidden superpower: it can transparently redirect raw TCP connections as well. This means it can also act as a proxy for any kind of service such as Postgres. Here's how it works:<ul><li>HAProxy connects to the Patroni REST interface and gets the status for the "/" URL.</li></ul><ul><li>Patroni will only respond with a "200 OK" status on the primary node. 
All other nodes will respond with a "503" error.</li><li>HAProxy marks nodes that respond with errors as unhealthy.</li><li>All connections get routed to the only "healthy" node: the primary for the cluster.</li></ul>Of course that's not the end of it; the Patroni <a href="https://patroni.readthedocs.io/en/latest/rest_api.html">REST API</a> is incredibly powerful, as it provides multiple additional endpoints. For example, a check against:<ul><li>/replica will succeed if the node is a healthy streaming replica of the primary, a good match for offloading intensive read queries from the primary node.</li><li>/read-only works on any healthy node in the cluster—perfect for connections that don't care how they interact with the database.</li><li>/synchronous only succeeds on healthy synchronous streaming replicas, for operations where read durability is important.</li></ul>There's also an HTTP parameter (lag) that limits success to replicas within a specified maximum replication lag. Want to target only replicas with less than 1MB of replication lag? Simply add that parameter to the HTTP check operation in HAProxy. This enables creation of multiple proxy definitions for each dedicated requirement, and HAProxy maintains everything automatically based on Patroni status codes—no manual intervention necessary.<h2>Installing HAProxy</h2>Thankfully installing HAProxy is pretty easy because it's so ubiquitous. It should be available in the default repositories of every major Linux platform without any special steps. In the case of Debian, one command should do it:<h2>Building a useful HAProxy configuration</h2>Unlike configuring Patroni, configuring HAProxy is a much simpler affair. The default  configuration file should already exist in the  directory following installation. 
Since the defaults depend entirely on the version of HAProxy and the target distribution, let's replace it with something that should work everywhere and is specific to the Patroni cluster being built.</p><p>Begin with the preamble of all the service defaults. These are global settings applied to all defined listeners, and default values for standard operation. For example:</p><p>In this case, HAProxy will stop allowing connections after 100—a perfect amount for a demo cluster, but you may want to increase it in a production system.</p><p>The remainder of the parameters define log output, set the connection type to TCP rather than HTTP, ensure checks are repeated twice to prevent false positives, and establish a few basic timeouts to avoid stale connections or server states.</p><p>The next step is to define a  block. This is what actually binds a port to a group of servers and defines the health check against Patroni's REST API. Based on the VMs we've built so far, it should look something like this:</p><p>The Postgres service port of 5432 is already in use in the VMs we created, so the proxy must use a different port. It's probably not necessary to explicitly set the mode to TCP again in the listen block, but it's a good practice just in case. Checks can be either TCP or HTTP, and despite the fact that connections should be TCP in nature, the health check itself is HTTP thanks to Patroni's convenient REST interface. And of course we only want to acknowledge a 200 status as success.</p><p>The line has a special importance here. Translated, it means:<ul><li>Perform the health check every three seconds.</li><li>Require three failures before marking the host as down.</li><li>Set a failed host as healthy after two successful checks.</li><li>When a node is marked down, disconnect any established sessions.</li></ul>This is necessary because HAProxy has many operation modes, many of which are permissive or otherwise optimistic. 
Just because new connections shouldn't be sent to a failed server doesn't mean old ones are suddenly invalid. In the case of Patroni however, it means exactly that!</p><p>If Patroni fails, the REST interface also disappears, and HAProxy will interpret that as a node failure. But in the event Patroni fails before it can properly stop Postgres, Postgres will remain online, possibly accepting writes in the case of a primary node. We need to terminate all connections just in case to prevent split-brain scenarios.</p><p>This is one situation where simply relying on Postgres connection protocol parameters like  reveals an underlying weakness. That parameter only ensures connections reach a writable node when set to , but it does not guarantee that only one node in the specified list of hosts is writable! If two nodes are promoted to read-write status, that's simply too bad, and cleaning up afterwards is up to you.</p><p>Every safeguard matters, and so the safest course of action is to immediately terminate connections to hosts that fail three consecutive health checks.</p><p>The next block simply consists of one line for each server, including the port to direct connections to, and the check port itself. Now connections to port 6543 on this server will be redirected to port 5432 on whichever node is the current primary. By running HAProxy on each VM, connections to any node will go to the primary without any additional configuration.<h2>Starting and testing HAProxy</h2>Once the configuration file is complete, start HAProxy to activate the new routing layer:</p><p>That's really all there is to it. But we also want to test the proxy to verify that it's working as expected. The easiest way to do that is to connect from any node aside from the primary and then check to see where the connection went.</p><p>Use  to find the current primary:</p><p>This output indicates node 1 is still the cluster primary. In that case, connect from node 2 or 3 on the proxy port and check the server IP address:</p><p>That's definitely node 1. 
Success!<h2>Adding endpoints</h2>Remember how we mentioned the ability to add additional endpoints? Here's an example where connections will only go to replicas with less than 1MB of replication lag:</p><p>Once HAProxy restarts, it will also listen on port 6544 and only send connections to one of the replicas. These servers are idle, so there should be no lag, allowing us to test the connection this way:</p><p>Go ahead and execute that command as many times as you want; it will never run on node 1. How's that for convenience?<h2>Finishing up</h2>We've now completed the proverbial "HAProxy and DCS sandwich" Postgres cluster, made possible by Patroni. Here it is, in all of its glory:<img src="https://a.storyblok.com/f/187930/878x564/a63db7aa73/haproxy.png" >Both HAProxy and etcd (the DCS) are running on the same nodes as Postgres and Patroni for the purposes of this demonstration, but this is hardly a standard configuration. We only did it this way to simplify the example and require a minimum of virtual machines. A more typical cluster would likely decouple the HAProxy layer onto a separate system, allowing it to act as a dedicated endpoint. It's usually easier to connect to "postgres-proxy.company.net" than arbitrarily assign Postgres VMs to various applications, or use multi-host connection strings. Putting HAProxy on its own host also allows it to revert to the default 5432 Postgres port and simply masquerade as Postgres.</p><p>Another interesting variant includes running HAProxy at the application layer itself. Applications connect to the local server or server group and transparently route to the current Postgres primary based on HAProxy checks.</p><p>The DCS (etcd here) is also likely to exist on a separate set of hosts. This allows multiple Patroni clusters to share the same consensus layer, and having a separate DCS layer makes two-node Postgres clusters a valid solution. 
For clusters that need to conserve storage or compute, reducing excess Postgres deployments is a great way to go. Larger or more established organizations may already have a consensus system (such as ZooKeeper or Consul) in place, so it makes sense to reuse these resources.</p><p>Regardless of what your deployment model resembles, Patroni acts as the glue binding everything together: all Postgres nodes, the DCS, and the routing system. Everything acts as a single coherent cluster, perhaps in spite of how each component may act in isolation. In the end, you'll have the best HA solution available for Postgres short of a full Kubernetes solution, but in a much simpler package.</p><p>We at pgEdge use Patroni in our own clusters, and now you know why. Let us know if you found this series useful, educational, or at least entertaining. Power to Postgres!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-3-haproxy</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>pgEdge,PostgreSQL,PostgreSQL,postgres</category>
            <title><![CDATA[Using Patroni to Build a Highly Available Postgres Cluster—Part 2: Postgres and Patroni]]></title>
            <link>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-2-postgres-and-patroni</link>
            <pubDate>Fri, 13 Mar 2026 06:12:14 GMT</pubDate>
            <description><![CDATA[ <p>Welcome to part two of our series about building a High Availability Postgres cluster using Patroni! <a href="/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-1-etcd">Part one</a> focused entirely on establishing the DCS using etcd, providing the critical layer that Patroni uses to store metadata and guarantee the uniqueness of its leadership token across the cluster.</p><p>With this solid foundation, it's now time to build the next layer in our stack: Patroni itself. Patroni does the job of managing the Postgres service and provides a command interface for node administration and monitoring. Technically the Patroni cluster is complete at the end of this article, but stick around for part three where we add the routing layer that brings everything together.</p><p>Hopefully you still have the three VMs where you installed etcd. Everything else happens on those same VMs, so if you haven’t already gone through the steps in part one, come back when you’re ready.</p><p>Otherwise, let’s get started!<h2>Installing Postgres</h2>The Postgres community site has an incredibly thorough page dedicated to <a href="https://www.postgresql.org/download/"><u>installation on various platforms</u></a>. For the sake of convenience, this guide includes a simplified version of the Debian instructions. Perform these steps on all three servers.</p><p>Start by setting up the PGDG repository:</p><p>Then install your favorite version of Postgres. For the purposes of this guide, we’re also going to stop Postgres and drop the initial cluster the Postgres package creates. Patroni will recreate all of this anyway, and it should be in control.</p><p>It’s also important to completely disable the default Postgres service since Patroni will be in charge:</p><p>Finally, install the version of Patroni included in the PGDG repositories. 
This should be available on supported platforms like Debian and RedHat variants, but if it isn’t, you may have to resort to the <a href="https://patroni.readthedocs.io/en/master/installation.html"><u>official installation instructions</u></a>.</p><p>Once that command completes, we should have three fresh VMs ready for configuration.<h2>Configuring Patroni the easy way</h2>The Debian Patroni package provides a tool called  that transforms a Patroni template into a configuration file customized specifically for Debian systems. Before using it, it’s necessary to modify part of that template to use etcd, as ZooKeeper is the default. Perform these steps on all three servers.</p><p>Note that the YAML header shows “etcd3” rather than simply “etcd”. Patroni uses the etcd version 2 API by default for backward compatibility, and version 3 requires a much different communication protocol.</p><p>Then create the rest of the config with a single command:</p><p>This creates a file named  in the  configuration directory, which systemd uses when managing this specific cluster. We’ll also need this for invoking .<h2>Understanding Patroni configuration</h2>Despite the fact that the configuration file is already complete, it’s important to actually understand the purpose of each section and what it does. This will enable users of other platforms to manually configure Patroni if necessary.</p><p>Let’s start with the topmost section dedicated to the DCS:</p><p>When Patroni writes to the DCS, all keys start at the path specified by the  parameter. Similarly, since one DCS may host multiple clusters, keys for this cluster must include  in the key path. The  indicates how Patroni should refer to this individual node. The configuration tool actually uses the DCS to see which names are already reserved so each VM will be uniquely identified. 
Go ahead and check all three to make sure they’re correct.</p><p>The next section, labeled , determines how Patroni should create the initial Postgres cluster, the parameters to use, and other important information. It’s also pretty long, so let’s look at each individual portion:</p><p>Normally Patroni uses  when creating a new cluster, but for full compatibility with Debian’s organizational quirks regarding Postgres, the configuration specifies an alternative command. This short section will likely only appear on a Debian system.</p><p>Next comes the  section under . All of these parameters should be covered in the <a href="https://patroni.readthedocs.io/en/master/dynamic_configuration.html"><u>Dynamic Configuration Settings</u></a> documentation, but we’ll explain the important ones. It’s important to note that any settings defined here actually persist in the DCS layer and apply to Patroni on all nodes. After initialization, the only way to change these parameters is through the  utility. It’s a good idea to make sure all of these are set properly, as changing them later is somewhat inconvenient.</p><p>These parameters define how Patroni interacts with the DCS layer and how it should manage certain Postgres features. Remember that the leadership token determines which node is the primary, so  defines how long that lease should last,  controls how long to wait between lease renewals, and  says how long to wait for a response from the DCS.</p><p>We’ve included  in this output because the leader race isn’t quite absolute. If Patroni promotes a node to primary, or determines Postgres has failed, it has up to this timeout before it forces a failover. This provides a grace period for crash recovery to complete, but you may find the default of five minutes is much too long. Another important parameter here is , which tells Patroni it should manage the  Postgres configuration setting by automatically using names of other nodes in the cluster. 
This is how you would enable synchronous replication in Patroni.</p><p>Next is the  section under :</p><p>This section defines how Patroni should operate the Postgres service. The first few parameters control how Patroni recycles old primary nodes, such as using the  utility when possible, and whether it should erase the data directory as a last resort. Patroni also uses replication slots for replicas by default to prevent unnecessary replica rebuilds in failure scenarios.</p><p>You can also pass GUC settings directly to Postgres on all nodes through the  section. This is useful for providing important cluster-wide settings that may not be hardware dependent, such as , , or .</p><p>The final section under the  heading is :</p><p>You’ll want to customize this section before starting Patroni; it uses this to build the <a href="https://www.postgresql.org/docs/current/auth-pg-hba-conf.html"><u>pg_hba.conf</u></a> file that controls incoming connection access. The default will allow connections on the server’s subnet if you uncomment the disabled line; otherwise it’s local access only.</p><p>Next is another  section, but this is a top-level header meant to tell Patroni how it should handle Postgres on this specific server. These sections are explained in more detail in the Patroni <a href="https://patroni.readthedocs.io/en/master/yaml_configuration.html"><u>YAML Configuration Settings</u></a> documentation.</p><p>This example starts with some Debian-specific content:</p><p>As before, this is so Debian can integrate with the other packaged Postgres tooling, so it’s safe to skip on other platforms. After that come a few pertinent parameters for handling connections:</p><p>This sample effectively tells Patroni how it should connect to the local Postgres service for administrative actions. 
Patroni uses Unix sockets when possible using these settings, which makes sense as Patroni runs as the postgres OS user and has direct socket access.</p><p>Then comes a fun section that defines several paths:</p><p>Patroni knows it will be installed in several different environments where Postgres and configuration directories may be in completely arbitrary locations. These are the defaults for Postgres 18 running on a Debian system.</p><p>Lastly there’s a second parameters section, meant for parameters that should only apply to this specific Postgres server:</p><p>Nothing here should be surprising; it’s mostly just log storage for the local instance and where the Unix socket directory is located. These are likely to be universal across the cluster, but it’s safer to leave them out of the DCS section. If there is ever any variance caused by a hardware or OS distribution migration, you’ll want to have the ability to change these locally.</p><p>In any case, take some time to examine the  file on each node to spot-check it for any mistakes.<h2>Starting and validating Patroni</h2>The Patroni package provides a standard systemd service file; simply enable and start the service on all VMs.</p><p>One of the three nodes will “win” the leader race and become the primary for the cluster. Patroni then invokes the  command on that system to create the data and configuration directories before starting Postgres. On the other nodes, Patroni calls  instead to create new streaming replicas. If you want a specific node to start as the primary, simply start Patroni on that node and wait for it to establish a cluster before starting the service on the other two.</p><p>The end result on all three systems should be a new “demo” database visible to :</p><p>The next step is to check the status of the Patroni cluster itself. You should be able to run this command from any node as the  OS user. 
It will also work as , but now that Patroni is installed and managing the cluster, it’s best to avoid relying on the root user.</p><p>This output tells us the cluster is healthy and operational, node 1 is the current primary, both replicas are streaming, and there’s no replication lag. Success!<h2>Editing the cluster configuration</h2>The last step that might be necessary is to modify the cluster configuration stored in the DCS layer. These are the Postgres parameters and pg_hba.conf entries used to bootstrap the initial state of the cluster, and it’s easy to make mistakes early on.</p><p>Once again,  comes to the rescue:</p><p>Patroni loads the current DCS config into the default editor, and in our case it looks like this:</p><p>Use this as an opportunity to fix any missing HBA lines, or add any Postgres parameters that should apply to all nodes. For example, add  under  to enable logical replication:</p><p>Since changing the  parameter requires a Postgres restart, use  to restart the nodes in the cluster:</p><p>Then check with Postgres to verify that the setting changed as expected. This is the output from node 3, even though I modified the DCS and restarted the cluster from node 1:<h2>Finishing up</h2>Now you know why this series was broken into three parts! Setting up Patroni isn’t too difficult by itself, but getting the configuration right, knowing how and why each section works the way it does, and continuing to modify the cluster after deployment is a complex process. But if you followed along, you should have a fully operational Patroni cluster at this very moment.</p><p>Technically you can even stop here and skip the third and final installment of this series. Postgres supports <a href="https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING"><u>multi-host connection strings</u></a>, and specifying  for the  restricts connections to the primary node. 
Connecting with psql might look like this:</p><p>But what if, in some distant future, we change server names, or add more nodes to the cluster, or want other connection restrictions? That’s where the routing layer comes in, and what fully completes a Patroni deployment.</p><p>So come back next week to learn about HAProxy and how it provides that critical and final component!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-2-postgres-and-patroni</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,PostgreSQL,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[Using Patroni to Build a Highly Available Postgres Cluster—Part 1: etcd]]></title>
            <link>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-1-etcd</link>
            <pubDate>Fri, 06 Mar 2026 07:48:55 GMT</pubDate>
            <description><![CDATA[ <p>The last PG Phriday article focused on the architecture of a Patroni cluster—the how and why of the design. This time around, it’s all about actually building one. I’ve often heard that operating Postgres can be intimidating, and Patroni is on a level above that. Well, I won’t argue on the second count, but I can try to at least ease some of the pain.</p><p>To avoid an overwhelming deluge consisting of twenty pages of instructions, I’ve split this article into a series of three along these lines:<ul><li>Etcd</li><li>Postgres and Patroni</li><li>HAProxy</li></ul>This establishes each of the three layers that represent the full Patroni stack, and provides a convenient reference for later regarding each.</p><p>With that out of the way, let’s get started!<h2>Why etcd?</h2>The last article should have made it abundantly clear that the DCS is the nexus of communication and status for the whole cluster. As a result, it’s important to install it first and certify that it’s operational. Etcd is the default choice and the DCS most often deployed in Patroni clusters. It’s also the key/value storage system Kubernetes uses as a default, so it should be reliable enough for our needs.</p><p>Don’t forget to keep a browser tab open to the <a href="https://etcd.io/docs/v3.6/"><u>etcd documentation</u></a>.<h2>What you’ll need</h2>If you want to follow along with this demonstration, you’ll need:<ul><li>The ability to create three VMs. Whether it’s Amazon EC2 instances, Microsoft Hyper-V, Xen, QEMU, Proxmox, Oracle VirtualBox, or even VMware Fusion, make sure you have a hypervisor and know how to use it.</li><li>Three VMs running Debian Stable version 13. 
At the time of writing, this should be the Trixie release.</li><li>SSH access as a root-capable user on each VM.</li><li>An internet connection. If you have the first three, it’s likely you have this as well.</li></ul>Believe it or not, that should actually be all that’s necessary. While these instructions focus on Debian packaging when possible, feel free to substitute RedHat equivalents if you’d rather be adventurous. Most of these instructions should work on any Linux system if you’re familiar with your platform of choice and know how to improvise.</p><p>If you want to make your life easier, add some lines to  on the VMs to give each a name. IP addresses are great, but they’re not as convenient as “pg1”. Here’s an example:</p><p>Unless otherwise noted, execute commands described in this guide on each of the VMs.<h2>Preparing each VM</h2>Prior to installing etcd, let’s create a user named “etcd” to own the service and related data using a quick  command:</p><p>It’s important to create the user as a “system” user, as these are often treated differently by systemd.<h2>Installing etcd</h2>The first lesson is that most of these tools are not “properly” packaged. By that, I mean that there are no official .deb or .rpm packages that should be considered recent. The etcd software maintainers do not provide anything aside from Zip files, tarballs, or source code. That means the first step is to visit the etcd <a href="https://github.com/etcd-io/etcd/releases"><u>GitHub release page</u></a> and find the URL for the latest release.</p><p>With that URL, install with these commands:</p><p>Then we want to invoke the Debian alternatives system to make the binaries easier to use:</p><p>Finally, create a systemd service file to control the etcd service:</p><p>Don’t start the service; we haven’t configured it yet.<h2>Configuring etcd</h2>For etcd, the “hard” part is getting the configuration right. Take note of all bolded parameters, including  in the following block. 
These should all reflect the name or IP address of the server you’re configuring! The easiest way to do this is to use environment variables as shown in the example.</p><p>The listen URLs are IP addresses because etcd attempts to resolve hostnames when specified for these parameters.</p><p>Once the configuration file exists on each of the servers, enable and start the etcd service itself.</p><p>Since the configuration file states that this is a “new” cluster, the cluster won’t consider itself bootstrapped until all three servers are online and connected to each other. Give it a minute or two before continuing.<h2>Validating the service</h2>Once etcd starts on all nodes, it’s a good idea to verify that it’s working as expected before handing it over to Patroni. Start by launching the etcdctl tool to view the cluster member list, which should include all three nodes:</p><p>We can see here that all nodes are accounted for, but this doesn’t actually show the state of each, just that they have joined the cluster. For that, we need a different command:</p><p>And this is what we see if a node is offline:</p><p>Finally, write a sample value to the DCS, retrieve it from a different node, and delete it from a third. This ensures that all nodes can write based on the consensus between them.</p><p>This full lifecycle proves etcd is operating as expected, all nodes are fully operational, and this cluster is ready for Patroni.<h2>Finishing up</h2>You should have three VMs equipped with an etcd service at this stage, and that provides a convenient stopping point before the next article. If you were wondering why we’re installing everything on three nodes, this is because it’s the minimum viable HA cluster with any real meaning.</p><p>While it’s possible to run a two-node cluster, quorum requires a majority to guarantee consensus. Any node that disagrees is simply treated as incorrect, and should re-synchronize with the majority. So a two-node cluster must always keep both nodes online or it can’t be trusted. 
A three-node cluster has a spare, as it becomes possible to stop a single node while the other two maintain the consensus.</p><p>As a result, no real cluster has fewer than three nodes. Please note that this only applies to the etcd layer! Consider this cluster design:<img src="https://a.storyblok.com/f/187930/878x564/41aba6c15b/picture1.png" >This is the same as our original diagram, but with only two Postgres / Patroni elements. This is perfectly valid because the DCS layer itself maintains the quorum, so we don’t have to enforce that same constraint on Postgres or Patroni. This means we could theoretically operate two Postgres nodes in different regions under the assumption that there’s an externally managed DCS layer.</p><p>In the case of this demonstration however, we don’t have that luxury. To decouple Patroni from etcd that way requires a five-node cluster: three for etcd, and two for Patroni and Postgres. That’s actually the superior approach for more sophisticated architectures since multiple Patroni clusters can share a single etcd resource.</p><p>We may explore that kind of advanced use case in the future, but for now, experiment with your new etcd cluster and we’ll see you next week!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/using-patroni-to-build-a-highly-available-postgres-clusterpart-1-etcd</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>Distributed Postgres,PostgreSQL,PostgreSQL,postgres,PostgreSQL High Availability</category>
            <title><![CDATA[How Patroni Brings High Availability to Postgres]]></title>
            <link>https://www.pgedge.com/blog/how-patroni-brings-high-availability-to-postgres</link>
            <pubDate>Fri, 27 Feb 2026 05:33:52 GMT</pubDate>
            <description><![CDATA[ <p>Let’s face it, there are a multitude of High Availability tools for managing Postgres clusters. This landscape evolved over a period of decades to reach its current state, and there’s a lot of confusion in the community as a result. Whether it’s <a href="https://www.reddit.com/r/PostgreSQL/"><u>Reddit</u></a>, the Postgres <a href="https://www.postgresql.org/list/"><u>mailing lists</u></a>, Slack, Discord, IRC, conference talks, or any number of venues, one of the most frequent questions I encounter is: How do I make Postgres HA?</p><p>My answer has been a steadfast “Just use Patroni,” since about 2017. Unless something miraculous happens in the Postgres ecosystem, that answer is very unlikely to change. But why? What makes Patroni the “final answer” when it comes to Postgres and high availability? It has a lot to do with how Patroni does its job, and that’s what we’ll be exploring in this article.<h2>The elephant in the room</h2>By itself, Postgres is not a cluster in the sense most people visualize. They may envision a sophisticated mass of interconnected servers, furiously blinking their lights at each other, aware of each computation the others make, ready to take over should one fail. In reality, the “official” use of the word “cluster” in the Postgres world is just one or more databases associated with a single Postgres instance. It’s right in the documentation for <a href="https://www.postgresql.org/docs/current/creating-cluster.html"><u>Creating a Database Cluster</u></a>.</p><p>“A database cluster is a collection of databases that is managed by a single instance of a running database server.”</p><p>The concept of multiple such instances interacting is so alien to Postgres that it didn’t even exist until version 9.0 introduced <a href="https://www.postgresql.org/docs/9.0/hot-standby.html"><u>Hot Standbys and streaming replication</u></a> back in 2010. And how do hot standby instances work? 
The same way as the primary node: they apply WAL pages to the backend heap files. Those WAL pages may be supplied from archived WAL files or by streaming them from the primary itself, but it’s still just continuous crash recovery by another name.</p><p>This matters because each Postgres node still knows little to nothing about other nodes in this makeshift cluster over 15 years later. This isn’t necessarily a problem in itself, but it betrays a certain amount of willful ignorance on the part of each node. Why doesn’t each node care about the other nodes, or even really acknowledge that they exist at all?</p><p>Of course each node should care that the other nodes exist! And that’s how every Postgres HA tool was born. <a href="https://slony.info"><u>Slony</u></a> and <a href="https://www.pgpool.net/docs/latest/en/html/"><u>PgPool-II</u></a> were probably the first of these. Using <a href="https://clusterlabs.org/pacemaker/"><u>Pacemaker</u></a> and <a href="https://corosync.github.io/corosync/"><u>Corosync</u></a> was always popular in the early days. Then came <a href="https://bucardo.org/Bucardo/"><u>Bucardo</u></a>, <a href="https://www.repmgr.org"><u>repmgr</u></a>, and <a href="https://www.enterprisedb.com/docs/efm/latest/"><u>EFM</u></a>. Those are just the more noteworthy examples known by most of the community.</p><p>But a funny thing happened after Patroni’s initial release: the relentless torrent of Postgres HA tools suddenly ceased. Everyone immediately understood something made it fundamentally different from its predecessors. Let’s talk about why.<h2>What Patroni does</h2>Patroni does something Postgres still doesn’t do: it builds a cluster of Postgres nodes. It does this by facilitating what I like to call a “HAProxy and DCS Sandwich” that looks something like this:<img src="https://a.storyblok.com/f/187930/878x564/008eecd0f1/haproxy.png" >Think of it like a Postgres BLT where Patroni acts as the lettuce that brings everything together. 
It’s the missing communications nexus that records the composition of the cluster and the status of its members, and routes connections where they need to go.</p><p>Let’s dive a bit deeper into how it does all of this, and why that matters to everyone from hobbyists to Fortune 500 corporations.</p><h2>Quorum</h2><p>The first and most important aspect of Patroni’s operational role is that of maintaining quorum. Here’s a handy definition for a quorum:</p><p>“The minimal number of officers and members of a committee or organization, usually a majority, who must be present for valid transactions of business.”</p><p>The critical aspect here is the voting majority, otherwise known as a consensus. The standard formula for this for some number of nodes N is: N/2 + 1, with the division rounded down. While a two-node cluster would need both nodes to remain online to maintain a majority, a three-node cluster would also require two nodes to maintain a majority. It’s this “extra” node that creates resilience in a network cluster. Should one node become isolated from the others, either through failure or a <a href="https://en.wikipedia.org/wiki/Network_partition"><u>network partition</u></a>, the quorum remains and the cluster stays operational. More nodes usually also confer better protection; three is best out of five, after all. Due to communication overhead caused by node topology, most consensus layers suggest staying below a “handful” of nodes, which tends to mean “fewer than ten”.</p><p>Ironically, Patroni handles quorum by delegating that responsibility to another piece of software entirely. Patroni reports compatibility with four different key/value or Distributed Configuration Store (DCS) services, including <a href="https://etcd.io/"><u>etcd</u></a>, <a href="https://developer.hashicorp.com/consul"><u>Consul</u></a>, <a href="https://zookeeper.apache.org"><u>ZooKeeper</u></a>, and even Kubernetes. 
In reality, Patroni doesn’t really care where the DCS layer lives or what it’s composed of, just so long as it responds to read and write requests.</p><p>That’s why the “DCS” layer in the diagram is a flat plane supporting all of the Postgres nodes. The DCS could be anywhere, using any number of nodes, and Patroni doesn’t have to manage it.</p><h2>Orchestration</h2><p>Patroni is a specialized high-availability tool designed specifically for Postgres. As a result, it knows how to manage anything associated with a cluster of Postgres instances, including but not limited to:</p><ul><li>starting and stopping the Postgres service.</li><li>promoting replicas.</li><li>bootstrapping new replicas.</li><li>demoting primary nodes.</li><li>Log Sequence Numbers (LSNs).</li><li>replication slots.</li></ul><p>Patroni stores all metadata in the DCS layer, updating it regularly for every node. The “cluster” always knows the status of all nodes, including any replication lag. The magic of how Patroni works so well is how it knows which node is the Primary for the cluster: the leadership token. Here’s how it works:</p><ul><li>Patroni checks to see if the current node owns the leadership token.</li><li>If yes, refresh the token and restart the loop.</li><li>If no, can this node take the leadership token?</li><li>If yes, take the token, promote this node, and restart the loop.</li><li>If no, act as a normal replica, reconfiguring for the current primary if needed.</li></ul><p>There are other steps involved of course, but since the consensus layer is distributed, there can only ever be one leadership token. Once a node has the token, no other node can claim to be the primary node. Based on which node has the token, Patroni will reconfigure all other nodes to use it as the primary. 
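The leadership loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Patroni’s actual code: FakeDCS, ha_cycle, and the 30-second TTL are hypothetical stand-ins, and a real DCS such as etcd enforces the lease expiry and the atomic “take it only if it is free” check itself.

```python
from typing import Optional


class FakeDCS:
    """Hypothetical in-memory stand-in for a DCS holding one leader key with a TTL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.leader: Optional[str] = None
        self.expires_at = 0.0

    def owner(self, now: float) -> Optional[str]:
        """Return the current token holder, or None if the lease has expired."""
        return self.leader if now < self.expires_at else None

    def refresh(self, node: str, now: float) -> None:
        """Extend the lease; only meaningful when `node` already owns the token."""
        if self.owner(now) == node:
            self.expires_at = now + self.ttl

    def try_acquire(self, node: str, now: float) -> bool:
        """Take the token only if nobody currently holds a valid lease."""
        if self.owner(now) is None:
            self.leader = node
            self.expires_at = now + self.ttl
            return True
        return False


def ha_cycle(dcs: FakeDCS, node: str, now: float) -> str:
    """One pass of the loop: refresh, try to take the token, or stay a replica."""
    if dcs.owner(now) == node:
        dcs.refresh(node, now)      # we own the token: keep it alive
        return "primary"
    if dcs.try_acquire(node, now):  # token is free: take it and promote
        return "promote"
    return "replica"                # someone else leads: follow them


dcs = FakeDCS(ttl=30)
print(ha_cycle(dcs, "pg1", now=0))   # pg1 finds the token free and takes it
print(ha_cycle(dcs, "pg2", now=5))   # pg2 sees a valid lease held by pg1
print(ha_cycle(dcs, "pg2", now=41))  # pg1 stopped refreshing, so pg2 takes over
```

In this model only the holder can refresh the key, and everyone else can take it only after the lease has expired, so at most one node holds the token at any instant.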
If a replica encounters replay errors, or was a previous primary, Patroni will use <a href="https://www.postgresql.org/docs/current/app-pgrewind.html"><u>pg_rewind</u></a>, <a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html"><u>pg_basebackup</u></a> from the current primary, or even recover from a stored backup to rebuild the node.</p><p>That’s something almost none of the other HA tools do. Not only will Patroni promote a replacement primary, but it will rebuild the failed one if it can. If you add a new node to the cluster, it creates the data directory on your behalf based on the cluster configuration. The DCS is the single source of truth Patroni operates from, and in a very real sense, the DCS itself is the cluster.</p><p>Things really start to get interesting when the DCS layer itself experiences failures.</p><h2>Fencing</h2><p>The idea behind <a href="https://en.wikipedia.org/wiki/Fencing_(computing)"><u>fencing</u></a> is that a misbehaving node should be decommissioned. The reasoning for this is deceptively simple: in the absence of consensus, you can’t trust any written data. There are many reasons a node could lose contact with the DCS, or the DCS could refuse to respond, and none of them matter at all. The safest course of action is to stop Postgres.</p><p>If the primary node can’t maintain its ownership of the leadership token, another node seizes it. The Patroni process on that node promotes it to leader, the cluster reconfigures itself around that new primary, and the beat goes on. Isolated replicas don’t have to worry about writes, but they also can’t participate in the leadership race.</p><p>An isolated primary must assume another node has been promoted in its absence, that it should reject writes to prevent <a href="https://en.wikipedia.org/wiki/Split-brain_(computing)"><u>split brain</u></a> risk, and that it should no longer accept new connections. Similarly, a replica cut off from the DCS can’t be monitored, is likely accumulating replication lag, and is otherwise suspect. 
As a result, Patroni stops the Postgres service on that node.</p><p>Believe it or not, most Postgres HA solutions omit this critical factor. Almost all of them will detect a primary failure and promote a standby, but almost none consider what happens if the failure is the network and not the node or Postgres itself. In these systems, an isolated node keeps accepting writes from colocated systems or established connections, keeps operating normally, and doesn’t know or care that a promotion happened elsewhere.</p><p>The Postgres service on isolated nodes absolutely must self-terminate, and Patroni ensures that outcome by its very design. If a node loses contact with the DCS, or if the DCS refuses requests for any reason, shut down. Easy.</p><p>Note: One failure scenario for the cluster is that the DCS itself loses quorum. In a five node cluster, a network error could split two nodes from the other three. In such a situation, the two nodes have lost the majority and will refuse to operate in that state. The Patroni service for any affected nodes doesn’t know this, and indeed, it doesn’t matter. The end result is always to stop Postgres.</p><h2>Routing</h2><p>The final thing Patroni does to establish a Postgres cluster is manage connection routing. It does this by tracking the ownership of the leadership token and providing an HTTP <a href="https://patroni.readthedocs.io/en/latest/rest_api.html"><u>REST status interface</u></a>. Any front-end routing system can interrogate a Patroni node for its current state: whether or not Postgres is online, whether the node should be considered writable, whether there’s too much replication lag, and so on.</p><p>The usual choice for this routing layer is <a href="https://www.haproxy.org/"><u>HAProxy</u></a> as reflected in the architecture diagram, but it could easily be an <a href="https://www.f5.com/products/load-balancing"><u>F5 load balancer</u></a>, an <a href="https://aws.amazon.com/elasticloadbalancing/"><u>Amazon ELB</u></a>, and so on. 
This determines which connections reach what node—or whether a node should allow connections at all. Users who wish to connect to the Postgres primary simply need to connect to the routing layer. Is it important to connect to a replica that has less than 5MB of replication lag? Routing layer. Patroni evaluates criteria encoded in the health check request and responds accordingly.</p><p>Fencing is one half of the equation, and routing control is the other. If Patroni determines a node should not be routable for any reason, it simply returns a failure on the REST interface. A properly configured routing component will then immediately cut any established connections and refuse future routing for that node until it is healthy again.</p><p>More importantly, users and applications don’t need to know anything about the node they’re connecting to. In a very real sense, they’re not connecting to a node at all, but the cluster itself. Now, finally, it’s possible to accurately describe Postgres as a cluster of nodes. Each individual node doesn’t actually operate any differently, but Patroni and the underlying DCS establish a fabric that binds everything together.</p><h2>What about Kubernetes?</h2><p>Kubernetes solves the problem of truly establishing a Postgres cluster in a similar fashion. Kubernetes operators like <a href="https://cloudnative-pg.io"><u>CloudNativePG</u></a> either take the same role as Patroni, or literally use Patroni under the hood like the <a href="https://access.crunchydata.com/documentation/postgres-operator/4.6.1/"><u>Crunchy operator</u></a>. But Kubernetes is a rich ecosystem of inter-operating components with its own design ethos. A Postgres cluster emerges from this design as a consequence of its underlying fundamentals.</p><p>Kubernetes isn’t specific to Postgres, and a Postgres HA solution in Kubernetes cannot exist outside of that environment. 
It is a perfectly valid way to transform Postgres into a cluster, but software outside of a Kubernetes context can’t leverage those capabilities. At least for now, the vast majority of Postgres installations still exist on bare metal, VMs, and other ad-hoc manual deployments.</p><h2>Final thoughts</h2><p>Let’s look at MongoDB for a second. The following is a diagram taken directly from their architecture documentation:</p><img src="https://a.storyblok.com/f/187930/1124x669/9bedf0248a/picture2.png" ><p>We’re not here to discuss the merits of Postgres versus MongoDB or NoSQL in general. However, look very carefully at this structure. The design manual describes shard architecture in further detail: “If the primary replica for a shard fails, secondary replicas together determine which replica should be elected as the new primary using an extended implementation of the Raft consensus algorithm.”</p><p>Does any of this sound familiar? Ignore the sharding aspect for a moment and consider what’s happening here. The cluster has a routing layer and nodes coordinate by consensus through <a href="https://raft.github.io"><u>Raft</u></a>. MongoDB was designed from the very beginning to be a cluster, while Postgres treats node interaction as an afterthought. Outside of extensions like <a href="https://docs.pgedge.com/spock-v5/v5-0-4/"><u>Spock</u></a> from <a href="/home-2026"><u>pgEdge</u></a>, <a href="https://www.enterprisedb.com/docs/pgd/latest/"><u>BDR</u></a> from <a href="https://www.enterprisedb.com"><u>EnterpriseDB</u></a>, or forks like <a href="https://www.yugabyte.com/"><u>YugabyteDB</u></a>, every Postgres node is an island. Even <a href="https://www.citusdata.com/product"><u>Citus</u></a>, an extension that is known for using coordinator nodes and data nodes and that should arguably be thought of as a cluster, needs Patroni to handle failover between data node replicas.</p><p>Postgres simply isn’t a self-organizing cluster without some external orchestration layer. For now, Patroni is the best of these. 
There’s a reason we use it at pgEdge as part of our Ultra HA architecture. It’s impossible to say what the future might hold, but for now, just use Patroni.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/how-patroni-brings-high-availability-to-postgres</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>PostgreSQL,postgres,pgEdge</category>
            <title><![CDATA[Returning Multiple Rows with Postgres Extensions]]></title>
            <link>https://www.pgedge.com/blog/returning-multiple-rows-with-postgres-extensions</link>
            <pubDate>Mon, 27 Oct 2025 05:09:23 GMT</pubDate>
            <description><![CDATA[ <p>Creating an extension for Postgres is an immensely satisfying experience. You get to contribute to the extension ecosystem while providing valuable functionality to other Postgres users. It’s also an incredibly challenging exercise in many ways, so we’re glad you’ve returned to learn a bit more about building Postgres extensions.</p><p>In the previous article in this series, we discussed <a href="https://www.pgedge.com/blog/introduction-to-postgres-extension-development"><u>creating an extension to block DDL</u></a>. That sample extension was admittedly fairly trivial, in that it only added a single configuration parameter and utilized one callback hook. A more complete extension would provide a function or view to Postgres so users could interact with the extension itself. So let’s do just that!</p><h1>Deciding on Data</h1><p>Once again we’re faced with choosing a topic for the extension. Users will sometimes ask the question: how much memory is Postgres using? Using various tools like top or ps will show a fairly inaccurate picture of this, limiting results to opaque fields like VIRT, RES, SHR, PSS, RSS, and others. Some are aggregates, others include Postgres shared buffers, and none really describe how memory is being used.</p><p>Luckily on Linux systems, there’s an incredibly handy <a href="https://www.kernel.org/doc/html/latest/filesystems/proc.html"><u>/proc filesystem</u></a> that provides a plethora of information about several process metrics, including memory. The <a href="https://www.man7.org/linux/man-pages/man5/proc_pid_smaps.5.html"><u>smaps</u></a> file in particular reports several process-specific memory categories, and does so on a per-allocation basis. What if we could parse that file for every Postgres backend process and return output in a table? 
Admins could then see exactly which user sessions or worker processes are using the most memory and why, rather than an imprecise virtual or resident memory summary.</p><p>Sounds interesting!</p><h1>Starting the Extension</h1><p>As with our previous extension, we need to bootstrap the project with a few files. Start with creating the project folder:</p><p>And create a  file with these contents:</p><p>As before, we just need to name the extension, give it a version, and provide an installation path for the resulting library file. Nothing surprising so far.</p><p>Next we need a makefile; let’s start with this:</p><p>Look familiar? Simple extensions likely won’t need to customize the makefile very much, and in this case we’re just describing the extension and its contents. More sophisticated extensions require more complicated makefiles, but we’re still covering easier ground.</p><p>Don’t forget your copy of the Postgres source code; it’s going to be incredibly valuable in this exercise.</p><h1>Functions and Macros</h1><p>Now it’s time to work on the body of our extension. Before that, we need to learn a bit more about some of the macros and API functions Postgres provides for this kind of extension. We already know about the  macro which prepares the extension for Postgres, but there are several more.</p><p>The <a href="https://www.postgresql.org/docs/current/xfunc-c.html"><u>C-Language Functions</u></a> documentation says that any function exposed to users must be declared with the macro, and the function should use  rather than an actual argument list. The function we’re building won’t take any arguments, but the macro is required anyway.</p><p>The “Returning Sets” section in particular explains that there are two types of functions which can return a set: value per call, or materialized. Given we’re parsing potentially ephemeral kernel memory mappings, we probably want to return the entire data-set at once. 
We also have to consider the fact it’s necessary to obtain the list of running processes, and if we used value-per-call, we’d have to effectively cache that output so the function could use it for each call. It’s simply easier to materialize for now.</p><p>Since we’re writing a function, the <a href="https://github.com/postgres/postgres/blob/master/src/backend/utils/fmgr/README"><u>src/backend/utils/fmgr/README</u></a> file also presents a few pertinent structures we’ll need to know:</p><ul></ul><ul></ul><ul></ul><p>The documentation also says a set-returning function will need to interact with tuplestores. That means we should examine <a href="https://github.com/postgres/postgres/blob/master/src/include/utils/tuplestore.h"><u>src/include/utils/tuplestore.h</u></a> to understand that API. The most interesting function here is , which takes a , a , an array of  values, and an array of booleans defining which are NULL. This will allow us to pass an entire array of values representing the row, and another array specifying which (if any) are null.</p><p>The tricky part is figuring out what  and  are. It turns out that  in this context is a  struct, and if we examine <a href="https://github.com/postgres/postgres/blob/master/src/include/nodes/execnodes.h"><u>src/include/nodes/execnodes.h</u></a> where it’s defined, we’ll see that the  field is a  which is intended to contain our result set! Similarly,  is a  which describes the tuples we’re returning. That should be all we need for storing the return values.</p><p>We don’t quite have everything, however. Postgres expects result sets to be composed of  structures, which encapsulate supported data types like , , and so on. 
In order to transform “raw” C types into these, we need to use another set of functions defined in <a href="https://github.com/postgres/postgres/blob/master/src/include/postgres.h"><u>src/include/postgres.h</u></a> and <a href="https://github.com/postgres/postgres/blob/master/src/include/utils/builtins.h"><u>src/include/utils/builtins.h</u></a>. Primarily we are looking for these functions:</p><ul></ul><ul></ul><p>Why is  defined in a completely different file than ? Who knows. It might be tempting to use , but that would be a mistake; Postgres Text and CString types are treated quite differently and this would make it difficult to manipulate the fields in queries.</p><p>Next we need the ability to capture the contents of  so we can use the PID of each process to obtain all of the memory mapping information. Finding that is a bit of an adventure. First we start at <a href="https://github.com/postgres/postgres/blob/master/src/backend/catalog/system_views.sql"><u>src/backend/catalog/system_views.sql</u></a> where it’s defined, to find that it’s a view which calls the function. That function is defined in <a href="https://github.com/postgres/postgres/blob/f727b63e810724c7187f38b2580b2915bdbc3c9c/src/backend/utils/adt/pgstatfuncs.c#L331"><u>src/backend/utils/adt/pgstatfuncs.c</u></a> and provides a great example of how to make internal system calls. This function is meant to be used by Postgres user sessions, so we want to effectively “mine” it for internal routines. Here are the important ones:</p><ul></ul><ul></ul><p>These are all defined in <a href="https://github.com/postgres/postgres/blob/master/src/include/utils/backend_status.h"><u>src/include/utils/backend_status.h</u></a>, which is fairly convenient for us.</p><p>Finally, we can’t forget about security. 
This function returns a lot of potentially sensitive information, so we want to restrict it to users who are either superusers or who have access to the <a href="https://www.postgresql.org/docs/current/predefined-roles.html"><u>pg_read_all_stats</u></a> predefined role. If we check out <a href="https://github.com/postgres/postgres/blob/master/src/include/miscadmin.h"><u>src/include/miscadmin.h</u></a>, the  function retrieves the UID of the current user.</p><p>Nailing down the ability to check the  role is a bit trickier. Searching the source basically only shows this string in source comments, but the usage is clear: use to check a user’s roles against . That means we also need the <a href="https://github.com/postgres/postgres/blob/master/src/include/utils/acl.h"><u>src/include/utils/acl.h</u></a> header.</p><p>There’s one final piece that’s slightly non-obvious. The  definition doesn’t actually exist yet. That header file is generated during the build process and placed in a file included by <a href="https://github.com/postgres/postgres/blob/master/src/include/catalog/pg_authid.h"><u>src/include/catalog/pg_authid.h</u></a>. This file is then included by <a href="https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/acl.c"><u>src/backend/utils/adt/acl.c</u></a> and not the previously mentioned  This happens frequently when working with catalog data since these are defined as files ending with a  extension that are converted into actual headers at build time.</p><p>It’s admittedly a bit confusing, but forewarned is forearmed!</p><h1>Ins and Outs</h1><p>Isn’t it interesting how much we had to search the code for various Postgres internal system calls? 
The published documentation really only scratches the surface of everything necessary to write even a fairly basic extension, so always have a copy of the actual source handy!</p><p>Given our discussion so far, we’ll need several includes:</p><p>It’s possible to reduce this list by almost half by accounting for the dependency graph, but in the interest of being complete, it’s best to simply include every header which declares a referenced symbol. Also note that as with the previous extension, the header always comes first.</p><p>Then start the module, declare the function we plan to expose to Postgres, and begin with the function definition:</p><p>These are all things we’ve discussed thus far, and should present no surprise. The function takes  which expands to , a variable we’ll be using frequently. The function must return a , even if we’re returning a set.</p><p>For the function body, we need to do a bit of housekeeping. As discussed, the first step is to limit execution only to allowed roles. That means we need some variant of this code in the body:</p><p>The only thing we haven’t discussed here is the  error code. 
These are described in <a href="https://github.com/postgres/postgres/blob/master/src/backend/utils/errcodes.txt"><u>src/backend/utils/errcodes.txt</u></a>, so always check this to see if there is a more applicable error when rejecting user actions.</p><p>Next we call the indispensable helper function that makes all of this possible:</p><p>This single function call prepares our function environment for returning a set, performing all of these actions:</p><ul><li>Ensuring the caller can receive a result set.</li><li>Asserting materialize mode is allowed.</li><li>Creating a tuple store and description in the correct context to store the result set.</li><li>Setting several result-set attributes to enable returning a result set.</li></ul><p>It’s about forty lines of code we no longer have to worry about.</p><p>Next would come the actual program body, which we’ll discuss at greater depth in the following section. The function should always end with this:</p><p>This indicates to Postgres that row production is complete.</p><h1>The Main Loop</h1><p>Now we’re ready to build the central loop for the set results. Recall how we have a function to retrieve the number of existing backend processes, and another to fetch the contents of each. Here’s how we could use those to locate the appropriate smap file for parsing:</p><p>It’s quite beneficial that we can retrieve all backend status simply using the index of the backend. In our case it’s only necessary to retrieve the PID. Once we have that, we just need to loop through each of the smap files using . Perhaps it would be better to include some error-handling here in case the files are unavailable or can’t be opened, but this is fine for demonstration purposes.</p><p>The parsing loop is a bit more complicated. Since this is a blog, we’re using a magical function called  which will return a  struct of all known smap fields. 
For the actual implementation, we’ve put together a <a href="https://github.com/pgEdge/blog-meminfo"><u>GitHub repository</u></a> you can use for a full understanding of how we built this extension.</p><p>The inside of the parsing loop needs a few things for the tuple functions we discussed earlier: an array of Datums, an array of nulls, and a call to the  function. Given our magic function and struct, here’s how that might work:</p><p>Note from this example that we never set any values in the  array under the assumption that zero is the default and preferred value. Aside from the metadata rows about the memory segment itself, most of the other fields are some amount of kilobytes. Rather than showing all 33 possible fields, we’ve elected to omit them for the sake of brevity.</p><p>So long as the parsing function returns records from the memory mapping, the materialized result will continue accumulating rows. This will repeat for each Postgres backend until every one has been processed, and then we exit the loop. Once the function returns, the accumulated tuples are handed to the caller.</p><h1>Becoming a Postgres Extension</h1><p>Writing the function is only half of the story; it’s also necessary to create a SQL to C binding. This accomplishes a couple of things:</p><ul><li>Provides a SQL wrapper so the function is usable within Postgres sessions.</li><li>Defines all of the types for each column in the result set. This is how Postgres builds the setDesc portion of the result set struct.</li></ul><p>The binding file has a naming convention based on the <a href="https://www.postgresql.org/docs/current/extend-extensions.html"><u>Packaging Related Objects into an Extension</u></a> documentation. For our extension, this would be: . The contents should be the function declaration including the entirety of the result set. 
For this function and all fields in the memory mappings, it would look like this:</p><p>The  portion will automatically be substituted for the installed location of the shared library in Postgres. Running this SQL after the extension shared object files are installed will create an SQL function named  that would call our C function.</p><p>This would be the minimum to get a working Postgres extension. Assuming all of the files are complete, installing consists of a few steps. First is to build and install the actual library:</p><p>Next, restart Postgres using your favorite method. And finally, install the extension itself:</p><p>This actually executes the SQL file we prepared above, ensuring that the function mapping exists. Once that happens, all subsequent Postgres sessions should be able to utilize the function like this:</p><p>It sure was a lot of work getting to that point, but it was all worth it!</p><h1>Odds and Ends</h1><p>We definitely glossed over a few pertinent details here, the most obvious of which is the implementation of the memory map parser itself. Unless all intermediate structs and functions are defined in the same file as the extension function itself, these are probably in a separate C file for organization purposes. (Again, full sources for this blog are available in a <a href="https://github.com/pgEdge/blog-meminfo"><u>dedicated GitHub repository</u></a>.)</p><p>Imagine these are in a parser.c source file that should be part of the extension build procedure. The Makefile example we used doesn’t really account for building multiple sources, or binding them together into a single module. It’s easy to remedy, however, by using dedicated build variables in the Makefile itself:</p><p>The assumption is now that  and exist, produce the resulting object files during the build step, and are then combined into a single module. Everything else is the same as before and the extension will work as described once installed; we simply have more organized source files as a result.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/returning-multiple-rows-with-postgres-extensions</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>PostgreSQL,postgres,pgEdge</category>
            <title><![CDATA[What's Our Vector, Victor? Building AI Apps with Postgres]]></title>
            <link>https://www.pgedge.com/blog/what-s-our-vector-victor-building-ai-apps-with-postgres</link>
            <pubDate>Wed, 15 Oct 2025 04:14:58 GMT</pubDate>
            <description><![CDATA[ <p>Something I’ve presented about recently (a couple times, now!) is how we can make AI actually useful with Postgres. Not the mystical robot overlord kind of AI that people worry about, but the practical, math-heavy kind that can actually help solve real problems with the data you already have.</p><p>Let me be clear up front: AI isn't magic. It's math. Lots and lots of math. And if you can write SQL queries (which, let's be honest, is why you're here), you can build AI applications with Postgres. Let me show you how.</p><h2>Breaking Down the AI Buzzwords</h2><p>Before we get too deep into the technical stuff, let's clarify some terms that get thrown around way too much these days:</p><p>LLM (Large Language Model): This is what people are actually talking to when they chat with AI systems like ChatGPT or Claude. It's the thing that generates responses that sound surprisingly human.</p><p>RAG (Retrieval Augmented Generation): We're all tired of AIs making stuff up, right? RAG forces them to use reference material - your actual data - instead of just hallucinating answers. It's basically saying "here's some context, now answer based on this instead of whatever random training data you remember."</p><p>Tokens: Text gets broken into chunks, and those chunks get meaning associated with them. Think of it as building up from word parts to bigger concepts, piece by piece.</p><p>Embeddings: Here's where it gets interesting, and where Postgres really starts to shine. An embedding is a vector, essentially a set of coordinates for those tokens. You know how a Cartesian plot has X and Y coordinates? Vectors are like that, except they're hundreds or even thousands of coordinates long. That's where AI concepts live and get their granularity.</p><h2>Understanding Vectors: The School of Fish</h2><p>I love using this analogy because it actually captures what's happening under the hood. Picture a school of fish in a coral reef. 
A vector isn't just representing the species of fish, it's capturing everything about that fish in context. The position of its fins, what other fish it's swimming near, its place in the family tree, the rocks around it, what it eats, how it behaves.</p><p>That's exactly what a vector does for text or concepts. It's not just the word itself, it's all the surrounding context and meaning that gives it substance. It's like having a 3D hologram where you capture not just the position of the fish, but everything that makes that fish what it is at that moment.</p><p>This is why when you talk to an LLM, you get all that surrounding context. All the related concepts, the metadata, the contextual relationships - they're all encoded in those coordinate arrays.</p><h2>The Postgres AI Stack That Actually Works</h2><p>Here's what makes this all work with Postgres, and why I think this approach is superior to the usual Python-heavy AI stacks:</p><p>pgvector: This is the foundation. It adds vector similarity searches to Postgres, along with different vector types (binary, sparse, various array types), operators for comparing vectors (cosine, euclidean, dot product), and crucially, new index types like HNSW that make searching through thousands of coordinates actually feasible instead of impossibly slow.</p><p>pg_vectorize: This is where things get really interesting, and it's the extension I personally prefer for real-world work. Written in Rust by Adam Hendel (@chuckhend on GitHub), it builds on pgvector but adds the functionality that makes everything actually usable in production. It uses pgmq for background queuing and can integrate with pg_cron for scheduled operations.</p><p>The beauty of pg_vectorize is that it handles all the tedious orchestration work that usually requires a bunch of Python code and multiple API calls to different services.</p><h2>What pg_vectorize Actually Does</h2><p>Here's where the magic happens. pg_vectorize gives you several key capabilities:<h3>1. 
Text Encoding</h3>Using vectorize.encode, you can transform individual prompts to search terms using whatever compatible transformer you prefer. The output is compatible with pgvector vector search.</p><p>Obviously, the answer is yes, but let's keep going.<h3>2. Create and Maintain Embeddings</h3>This is where pg_vectorize really shines:</p><p>When you specify realtime, something clever happens: any time you insert or update a row, it generates an embedding in the background using pgmq. Your insert completes immediately; the embedding generation happens asynchronously. This is absolutely crucial because generating embeddings can be slow (it's either a local CPU operation or a remote GPU call), and you definitely don't want that latency directly impacting your write performance.</p><p>I saw a Microsoft demo at a conference where they were generating embeddings synchronously on every insert. That's not a great approach if you value database performance.</p><p>Embeddings are maintained either by a pg_cron job or by pgmq live updates.<h3>3. Vector Search</h3>This searches through all your vectorized content and returns the most semantically similar matches. It's not just keyword matching, it's understanding the intent and context behind your query.<h3>4. Complete RAG in One Query</h3>Here's the real magic trick:</p><p>This single function takes your question, compares it against your vectorized content, pulls back the best matches, combines that with a prompt, sends it to the LLM, and returns a complete response as a JSON object that includes context if we need it. All the orchestration that used to require dozens of lines of Python code - gone.<h2>The Architecture Simplification</h2>The traditional RAG architecture is a nightmare of moving parts: User → App Layer → Transformer API → Database lookup → Combine context → LLM API → Parse response → App Layer → User. 
That's a lot of potential failure points and a lot of code to maintain.</p><p>With pg_vectorize, your stack becomes: App Layer makes SQL query → Postgres handles everything → Response back.</p><p>I rebuilt a RAG application I had originally written in Python using this approach, and I'm not exaggerating when I say it cut my codebase by at least half. Much easier to maintain, much easier to reason about, and far fewer things that can break.<h2>Practical Implementation: The Chunking Strategy</h2>Here's something important that doesn't get talked about enough: embeddings are less granular than LLMs themselves. Embeddings are usually “fuzzy” (often only 384 coordinates), while LLMs have billions of parameters. This means you need to think strategically about chunking your content.</p><p>Don't try to create embeddings for entire 10-page articles. Break them into logical chunks: paragraphs, sections, whatever makes sense for your content. Each chunk gets its own embedding. When someone asks a question, you're matching against the specific chunk that's most relevant, not trying to capture the essence of an entire document in a single vector. This will provide sharper context.<br>-- Example chunked table</p><p>Then you vectorize the chunks table instead of the articles table. Much better semantic matching.<h2>Advanced Technique: Query Rephrasing</h2>Here's a trick I learned from Adam that's incredibly useful: users don't always phrase questions in ways that match well with embeddings. If someone asks "how do window functions work?", a naive embedding search might just return content that contains the words "window" and "function" frequently, not necessarily content that explains what window functions actually do.</p><p>However, there’s an elegant solution. 
Use the LLM to rephrase the query first:</p><p>-- Step 1: Get a semantically better query</p><p>This might return something like "SQL window function syntax row partitions unbounded practical examples" which is much richer semantic content.</p><p>-- Step 2: Use that improved query for your RAG search</p><p>This gives you much better semantic matching instead of just literal word matches. The difference in result quality is dramatic.<h2>Configuration and Setup</h2>You can configure pg_vectorize to work with external services like OpenAI, or you can run everything locally. Want to keep your data completely internal?</p><p>-- Search using Ollama or vLLM instead:</p><p>-- Or, use a custom transformer service:</p><p>With distributed Postgres systems, you can co-locate your AI processing with your data nodes, reducing latency and keeping everything within your security perimeter. Using this approach, no data leaves your infrastructure.<h2>Why This Matters</h2>The bottom line is this: if you can write SQL queries, you can build sophisticated AI applications. You don't need to become an expert in transformer architectures, manage API tokens across multiple services, handle complex JSON parsing from different AI providers, or write orchestration code to coordinate between embedding services and LLMs.</p><p>Your data is already in Postgres. Your application logic can stay close to your data. The AI functionality becomes just another database capability rather than a separate system to manage.</p><p>And here's the real advantage: when ChatGPT is having an outage or when your OpenAI bill is getting out of hand, your local setup keeps running. When you need to meet compliance requirements about data not leaving your infrastructure, you've got that covered. When you want to fine-tune models on your specific domain, you control the entire stack.<h2>Getting Started</h2>If you can write queries, you can build AI apps using Postgres. 
All you have to do is load your content (which you'd have to do anyway), vectorize your tables with a single function call, and start querying. It doesn’t matter whether you're building a customer service system that knows your documentation, a search interface for internal knowledge bases, or any other AI-powered application; it can all be contained within a few SQL functions.</p><p>That's the real power of keeping AI close to your data. Not to mention, chances are if you're here, your data is already in Postgres.</p><p>Thinking about watching the webinar in full? It’s recorded - find it <a href="https://pages.pgedge.com/postgres-live-monthly-series-session-1-whats-our-vector-victor?_gl=1*fewz2m*_gcl_au*MjEyNzYyNDQ4MC4xNzU0NTA0ODEyLjIwODMzODcxNjMuMTc1ODExNDczMi4xNzU4MTE1MTUy"><u>here</u></a>.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/what-s-our-vector-victor-building-ai-apps-with-postgres</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>postgres,PostgreSQL</category>
            <title><![CDATA[Introduction to Postgres Extension Development]]></title>
            <link>https://www.pgedge.com/blog/introduction-to-postgres-extension-development</link>
            <pubDate>Thu, 11 Sep 2025 05:37:00 GMT</pubDate>
            <description><![CDATA[ <p>Possibly the best-known benefit of running a Postgres cluster is access to the <a href="https://pgxn.org"><u>vast ecosystem of extensions</u></a>, including procedural languages, foreign data wrappers, index types and storage systems, handy utility functions and much more. Extensions run the gamut from <a href="https://github.com/hydradatabase/columnar"><u>columnar table storage</u></a>, <a href="https://github.com/pgvector/pgvector"><u>efficient vector functions</u></a>, or <a href="https://github.com/paradedb/paradedb"><u>BM25</u></a> fulltext search analytics, to more trivial augmentations like <a href="https://github.com/fboulnois/pg_uuidv7"><u>UUIDv7</u></a>.</p><p>But what’s involved in actually developing an extension? What arcane Postgres APIs need to be invoked? How does installation work? Does it take a genius? This article will attempt to answer all of those questions and a few more besides, and hopefully by the end, you too can achieve immortality by contributing a useful feature to Postgres.<h2>Choosing a Topic</h2>I’ve written two relatively simple SQL-only extensions in the past. This isn’t uncommon, as packaging up a few <a href="https://www.postgresql.org/docs/current/plpgsql.html"><u>PL/pgSQL</u></a> functions into a utility library is a convenient way to distribute them. Since the possible subject matter is so vast, what might be an interesting piece of functionality to add using an extension? Some DBAs might want to prevent users from executing DDL for instance, so would that be something we could do?</p><p>More astute readers probably already know that it’s possible to reject DDL using a Postgres <a href="https://www.postgresql.org/docs/current/event-triggers.html"><u>Event Trigger</u></a> like this:</p><p>Ironically however, the  command itself will not invoke this trigger. 
What if we wanted a more tightly coupled method, directly embedded in Postgres itself through an extension written in C? How might that change things?<h2>Getting Things Under Control</h2>According to the Postgres <a href="https://www.postgresql.org/docs/current/extend-extensions.html"><u>extension documentation</u></a>, all extensions require a control file to define various bits of metadata about the extension itself.</p><p>So let’s create a folder for the extension:</p><p>And create a  file with these contents:</p><p>This is essentially the bare minimum necessary for a control file. It provides the name of the extension, its version, and the path where the library itself may be found. In this case, the library path is just the default  location for a library named “noddl”.</p><p>Now Postgres knows what to do with our fancy new extension when it gets installed.<h2>Preparing the Environment</h2>Since this will be a C extension, we need to prepare the development environment a bit more. A Postgres build environment requires a minimum of gcc, make, and the Postgres development headers. This is all completely operating system and distribution dependent, so doing this is beyond the scope of this guide.</p><p>However, the <a href="https://www.postgresql.org/download/"><u>Postgres Download</u></a> repository usually includes development packages for all supported operating systems. 
If this were a Debian, Ubuntu, or Mint system for example, adding Postgres development packages is one convenient command away:</p><p>Red Hat, Fedora, or Rocky systems may do this instead:</p><p>Once the Postgres development headers are installed, double-check the availability of the  command, as this will guide the build and install process for our extension.</p><p>The  flag should display the location of the extension Makefile that will do most of the hard work of setting header paths, installing to the correct library directory, and so on.</p><p>It’s also a good idea to have a local copy of the Postgres source itself. Postgres C extension development often depends on invoking otherwise undocumented internal functions, referencing multiple structs, and calling several interface macros. The best way to do this is to examine the source code where these things are defined, as there is often an incredibly helpful comment heading or even a README which explains things in greater detail.</p><p>So go to GitHub and obtain a copy:</p><p>Trust me, you’ll be very glad you did.<h2>Making Everything Possible</h2>Speaking of makefiles, it’s time to create one for the noddl extension. The <a href="https://www.postgresql.org/docs/current/extend-pgxs.html"><u>PGXS documentation</u></a> provides a sample Makefile along with an explanation of all available makefile parameters. Using that as a guide, our  should look something like this:</p><p>Similarly to the control file, we’re really just describing the extension. Note that the last three lines in the file are there to include the Postgres extension makefile which does all of the real work. There’s nothing better than delegating responsibilities when you don’t really know what you’re doing.<h2>Start Your Engines</h2>There are two critical components to a Postgres <a href="https://www.postgresql.org/docs/17/xfunc-c.html"><u>C extension</u></a>.<ul><li>The PG_MODULE_MAGIC macro. Don’t worry about this; it’s magic!</li></ul><ul><li>A _PG_init function to act as an entrypoint into the extension for setting hooks, creating GUCs, launching worker processes, and so on.</li></ul>
Given this module will be blocking DDL, there should be a convenient way to enable and disable it so superusers can execute DDL when necessary. That means we need at least one GUC. If we take this into account, an early skeleton for the extension will look something like this:</p><p>This is actually a functional extension already. The only thing it does right now is set the GUC, but this will successfully compile and install as a functional Postgres extension.</p><p>The next thing we should do is make our extension useful.<h2>Intercepting DDL Commands</h2>Understanding how to trap DDL statements is easier said than done. We know that all statements must be parsed before they’re dispatched to whatever part of the engine will execute them. That means we need to add some kind of function to the list of hooks Postgres calls during that process.</p><p>It turns out that one of the categories the Postgres dispatcher uses is reserved for utility commands. If we check out the Postgres source, there’s even a test module which specifically leverages this functionality: <a href="https://github.com/postgres/postgres/tree/master/src/test/modules/test_oat_hooks"><u>test_oat_hooks</u></a>. If we examine <a href="https://github.com/postgres/postgres/blob/master/src/test/modules/test_oat_hooks/test_oat_hooks.c"><u>test_oat_hooks.c</u></a> a bit, we’ll find this code in </p><p>This code essentially checks a custom GUC to decide whether or not to block statements at all, and will only block commands for non-superusers. If both of these criteria match, it throws an error, and that’s really the extent of the function. This example demonstrates how to indiscriminately reject all utility commands, but it’s the first step to isolating DDL events.</p><p>How do we isolate DDL specifically? 
If we look for the string “DDL” in the Postgres source code, <a href="https://github.com/postgres/postgres/blob/master/src/backend/tcop/utility.c"><u>src/backend/tcop/utility.c</u></a> stands out dramatically. The function categorizes all command tags into various categories for the log_statement GUC. The signature includes a  parameter which includes a  field. If we pass that into the  function, we only want statements identified as .</p><p>If we take that into consideration, blocking DDL looks something like this:</p><p>The particular hook we want to set is named . Assuming our callback function is named , we just need to add this to the end of our  function:</p><p>If we stopped here, our hook function would be fully operational and our work would be done. Unfortunately it’s not quite that simple.<h2>Being a Good Neighbor</h2>Unlike some extensible software, Postgres does not provide utility functions to register hooks in some kind of dedicated callback stack. Instead, it defines several global variables which extensions may choose to override, one of which we set in the previous section. However, these global variables may be manipulated by any extension loaded into the process environment, which could represent a user session, background worker, or some other kind of mixed context.</p><p>What if  is set like this:</p><p>Now what happens if the noddl extension sets the  hook? In this case, it would essentially deactivate the pg_stat_statements extension! What we really need to do is preserve any previously set hook and propagate the call in a kind of simulated stack. Every well-behaved extension does this, so in theory every extension’s hooks are eventually honored. 
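The save-and-chain pattern just described can be modeled outside of Postgres. What follows is only a toy Python sketch, not extension code: the names process_utility_hook, standard_process_utility, install_hook, and both sample hooks are hypothetical stand-ins for a global hook pointer, the default handler, and the functions two different extensions might register.

```python
# Toy model of one global hook slot shared by several extensions.
# Every name here is hypothetical; a real extension does this in C.

calls = []


def standard_process_utility(statement):
    """Stand-in for the default handler that actually runs a command."""
    calls.append(f"standard: {statement}")


# The single global slot; None means no extension has claimed it yet.
process_utility_hook = None


def install_hook(make_hook):
    """Save whatever hook is already set, then install a wrapper around it."""
    global process_utility_hook
    previous = process_utility_hook  # preserve the earlier hook
    process_utility_hook = make_hook(previous)


def make_stats_hook(previous):
    def hook(statement):
        calls.append(f"stats: {statement}")
        # Pass control down the chain, or to the default handler.
        (previous or standard_process_utility)(statement)
    return hook


def make_noddl_hook(previous):
    def hook(statement):
        # Crude DDL check, for illustration only.
        if statement.upper().startswith(("CREATE", "ALTER", "DROP")):
            raise PermissionError("DDL is disabled")
        (previous or standard_process_utility)(statement)
    return hook


install_hook(make_stats_hook)   # first extension loaded
install_hook(make_noddl_hook)   # second extension wraps the first

process_utility_hook("VACUUM my_table")
print(calls)
```

Because the second hook remembers the first, the statistics hook still sees the statement before the default handler runs. If make_noddl_hook ignored previous, the first extension would silently stop working.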
That means we need another global variable:</p><p>Then we need to preserve any previously defined ProcessUtility hook at the end of  before overriding it:</p><p>And finally, the full body of our own ProcessUtility function hook needs to explicitly invoke either the default Postgres ProcessUtility function, or a previously defined hook:</p><p>In an ideal world, Postgres would handle hook stacks directly, if only to prevent one misbehaving extension from ruining it for everybody. It’s a bit ironic that our extension has less of its own code than calls to maintain an unbroken hook chain.<h2>Some Assembly Required</h2>Believe it or not, that is the entire extension. Who knew writing a C extension for Postgres would be so simple? Since we already have a makefile, the final steps are to make, install, and activate the extension. Let’s start with building it:</p><p>That will deposit the extension in the appropriate library directory, and copy the control file to the directory Postgres uses to track available extensions. Next we activate the extension by adding it to shared_preload_libraries:</p><p>And finally, restart Postgres. Debian-based systems would use a command like this:</p><p>Red Hat family systems would use a command like this:</p><p>Or you can do it the old-fashioned way:</p><p>Once Postgres restarts, the noddl extension is available and active. But remember, our GUC is disabled by default, so we need to enable it. Connect to Postgres as a superuser and issue these commands:</p><p>Then attempt to execute literally any DDL statement:</p><p>How rude! Fortunately we used  for this GUC, meaning superusers can override it at the session level:</p><p>Unlike an event trigger, it’s not possible to circumvent this extension by disabling the <a href="https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-EVENT-TRIGGERS"><u>event_triggers</u></a> GUC. 
If we remove the GUC, the only way to enable DDL again is to unload the extension itself. There are rare environments that require structurally static deployments, and this would be one way to enforce that restriction.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/introduction-to-postgres-extension-development</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>PostgreSQL</category>
            <title><![CDATA[The Joy of Deploying pgEdge Using Ansible]]></title>
            <link>https://www.pgedge.com/blog/the-joy-of-deploying-pgedge-using-ansible</link>
            <pubDate>Thu, 17 Jul 2025 04:19:00 GMT</pubDate>
            <description><![CDATA[ <p>Once upon a time, the Perl programming language earned the nickname “the Swiss Army chainsaw of scripting languages” from its users. In many ways, Ansible has attained a similarly lofty reputation for orchestrating infrastructure. Many experienced Postgres DBAs figured this out a long time ago, jealously hoarding collections, roles, and playbooks and yearning for a day when they too could sit back and watch their new cluster “<a href="https://xkcd.com/303/"><u>compile</u></a>”. We at pgEdge understand this compulsion all too well, extensively leveraging Ansible for internal testing purposes. It’s hard to beat the convenience of whipping up a few VMs and building a full active-active pgEdge <a href="https://www.pgedge.com/products/what-is-pgedge"><u>distributed Postgres</u></a> cluster within a few minutes, bombarding it with millions of transactions to reveal operational edge cases, and then lather, rinse, and repeat. Sometime during this unsanctioned VM abuse, we realized everyone else might want to join in on the fun, and so <a href="https://github.com/pgEdge/pgedge-ansible"><u>pgedge-ansible</u></a> was born.<h2>What is pgedge-ansible?</h2>Frankly, pgedge-ansible <a href="https://en.wikipedia.org/wiki/Does_exactly_what_it_says_on_the_tin"><u>does exactly what it says on the tin</u></a>: it’s an Ansible collection built specifically to deploy pgEdge clusters. 
It provides configurable roles for spinning up a variety of pgEdge architecture stacks, from a simple two-node cluster, all the way to our 10-node Ultra HA design featuring six Postgres nodes, independent <a href="https://patroni.readthedocs.io/en/latest/"><u>Patroni</u></a> failover management, <a href="https://www.haproxy.org/"><u>HAProxy</u></a> traffic control, and <a href="https://pgbackrest.org"><u>pgBackRest</u></a> backup management.</p><p>While the pgEdge <a href="https://docs.pgedge.com/platform"><u>CLI platform tool</u></a> has been around for a while and fulfills a similar role when combined with a <a href="https://docs.pgedge.com/platform/installing_pgedge/cluster_deploy"><u>JSON cluster definition file</u></a>, there are certain advantages to including Ansible. Principal among them is the ability to provide more direct control over the full hardware and software stack. Ansible is designed to be flexible, either by using roles in playbooks exactly as intended, or by integrating various customizations the collection authors may not have anticipated.</p><p>In true Ansible fashion, the pgedge-ansible role provides a minimum reference implementation and most parts may be readily swapped out to better match existing infrastructure. Don’t need HAProxy and have a role for configuring a common <a href="https://www.f5.com/glossary/load-balancer"><u>F5 load balancer</u></a> or an <a href="https://aws.amazon.com/elasticloadbalancing/"><u>AWS elastic load balancer</u></a> (ELB)? Be our guest! Want to use <a href="https://pgbarman.org"><u>Barman</u></a> instead of pgBackRest for backup management? Please feel free. We want to make using pgEdge easy and convenient, and that means meeting the customer where they are.<h2>How does it work?</h2>We hope you’ll love this part, because it’s incredibly easy to use. 
We’ve included playbooks for two sample architecture designs in the collection:<ul><li>A simple cluster consisting only of pgEdge distributed Postgres nodes.</li></ul><ul><li>A “batteries included” cluster with all the trimmings we like to call “Ultra HA”.</li></ul>The collection operates by assuming every pgEdge node resides in its own “zone”. Something like this:<img src="https://a.storyblok.com/f/187930/564x326/06bec2fa8e/sophisticated_clusters.png" >This design makes it possible to extend the zone designation to far more sophisticated clusters like this:<img src="https://a.storyblok.com/f/187930/964x824/795831713d/four_roles.png" >The simple cluster relies on only four roles:<ul><li>init_server - Get the server ready to run a pgEdge cluster.</li></ul><ul><li>install_pgedge - Install the pgedge CLI platform software.</li></ul><ul><li>setup_postgres - Create a Postgres instance capable of running pgEdge.</li></ul><ul><li>setup_pgedge - Initialize each pgEdge node and wire the zones together for Active-Active logical replication.</li></ul>The Ultra HA cluster requires a substantially longer list since we also have to orchestrate Patroni, etcd, HAProxy, and pgBackRest:<ul><li>init_server - Get the server ready to run a pgEdge cluster.</li></ul><ul><li>install_pgedge - Install the pgedge CLI platform software.</li></ul><ul><li>setup_postgres - Create a Postgres instance capable of running pgEdge.</li></ul><ul><li>install_etcd - Retrieve and install etcd to act as a DCS for Patroni.</li></ul><ul><li>install_patroni - Retrieve and install Patroni to manage replicas and failover per zone.</li></ul><ul><li>install_backrest - Retrieve and install pgBackRest for backup management.</li></ul><ul><li>setup_etcd - Configure etcd to create a quorum within the node zone.</li></ul><ul><li>setup_patroni - Configure Patroni for its local etcd and pgEdge Postgres instance. This will also bootstrap the replicas.</li></ul><ul><li>setup_haproxy - Retrieve, install, and configure HAProxy based on the Patroni-managed nodes in the zone.</li></ul><ul><li>setup_pgedge - Initialize each pgEdge node and wire the zones together for Active-Active logical replication. This has to come after setup_haproxy because each zone needs to communicate through the proxy to make sure it always reaches the current pgEdge primary node in that zone.</li></ul><ul><li>setup_backrest - Configure pgBackRest to back up nodes in its zone, and take an initial backup to bootstrap everything. It will only back up the current Primary node.</li></ul>
That’s a lot of work to do to set up a cluster! But the beauty of Ansible is that it handles all of that. The roles even recognize three group names to keep inventory files simple:<ul><li>pgedge - Nodes where pgedge is being installed or managed. These are automatically sub-grouped by zone.</li></ul><ul><li>haproxy - Nodes where HAProxy is running. These are also sub-grouped automatically by zone because each HAProxy will only perform node health checks for routing within that zone.</li></ul><ul><li>backup - Dedicated backup servers. These are also sub-grouped automatically by zone so each zone has its own set of backups.</li></ul><h2>Installation and use</h2>Using the pgedge-ansible collection is pretty straightforward; we’re doing our best to keep everything compatible with Ansible best practices and intend to eventually submit everything to <a href="https://galaxy.ansible.com/"><u>Ansible Galaxy</u></a>.</p><p>Installation is about what you’d expect:</p><p>Once the collection is installed, using it only requires activating the collection and using the roles. Consider this inventory file:</p><p>That’s two nodes, each with its own logical availability zone. 
Then we just need a playbook to deploy the cluster:</p><p>And then we just need to run the playbook:</p><p>Assuming everything went as expected, we should end up with an active-active pgEdge distributed Postgres cluster with DDL replication enabled by default. But don’t take our word for it, try it yourself!</p><p>The work of a DBA is never truly done, and we hope this makes your life just a bit easier. Fully active-active Postgres clusters are hardly trivial to deploy and administer, but a good Ansible collection can certainly improve the situation considerably. There’s an exciting new world of running writable Postgres nodes “on the edge”, and now diving in is easier than ever before.</p><p>Now’s your <a href="/download/enterprise-postgres"><u>chance to experiment</u></a>. Let us know if pgedge-ansible helped you bring your Postgres cluster to the next level.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/the-joy-of-deploying-pgedge-using-ansible</guid>
            <author><name>Shaun Thomas</name></author>
            </item>
            <item>
            <category>PostgreSQL</category>
            <title><![CDATA[Living on the Edge]]></title>
            <link>https://www.pgedge.com/blog/living-on-the-edge</link>
            <pubDate>Mon, 02 Jun 2025 15:00:00 GMT</pubDate>
            <description><![CDATA[ <p>The <a href="https://2025.pgdaychicago.org"><u>PGDay Chicago 2025</u></a> Postgres conference has come and gone, and so too has my presentation on conflict management in Postgres Multi-Master clusters. For those who couldn’t attend this year, there’s no need to fret, as this blog post will present all of the relevant details. By the end, you'll be well armed with the knowledge necessary to identify, mitigate, and avoid the worst scenarios which may arise in a distributed Postgres cluster.<h2>Is Bi-Directional Logical Replication Implicitly Safe?</h2>The <a href="https://www.postgresql.org/about/news/postgresql-16-released-2715/"><u>release announcement for Postgres 16</u></a> included this cryptic note:</p><p>Despite how that may sound on the surface, that shouldn’t be interpreted as tacit permission or a resounding recommendation to immediately spin up an Active-Active logical replication cluster using native Postgres logical replication. Among other things, there are novel sequence rules, data merging concerns, conflict management, and the dreaded potential of node divergence.</p><p>None of these situations are managed by default, and a problem related to any one of these issues could result in implicit or explicit data loss. Consider a cluster distributed to three distinct geographic regions:<img src="https://a.storyblok.com/f/187930/1312x912/50856a509a/living-on-the-edge-global-and-local.png" ></p><p>What might happen if Bob is in Ashburn, Virginia, and opens an account at ACME using his email address bob@smith.com? Probably nothing. Unless perhaps his wife is on vacation in Frankfurt, Germany and happens to be touring the ACME global office and decides to sign Bob up for an account because she knows he’s been meaning to do so. Through some providence of fate, this happens within the round-trip latency between Frankfurt and Ashburn, and now we have a problem. 
Which set of account credentials and related information do we retain?</p><p>The answer to that question is why we’re here, and why solutions like pgEdge are necessary to successfully navigate the perilous seas of Multi-Master Postgres clusters. Let’s discuss the various types of complications we may encounter while operating a distributed Postgres cluster so we can do so safely.<h2>In Theory</h2>Those familiar with database cluster architectures may already be well acquainted with <a href="https://en.wikipedia.org/wiki/CAP_theorem"><u>CAP theorem</u></a>. Unlike <a href="https://en.wikipedia.org/wiki/ACID"><u>ACID</u></a>, which describes how databases like Postgres guarantee data validity, CAP is more focused on distributed operation within clusters. It goes like this:</p><p>Partitioned (distributed) clusters have either availability or consistency, never both. What happens if two nodes are disconnected? Either data may not be consistent between the two, or writes must be rejected or paused to avoid introducing inconsistencies.</p><p>But this isn’t quite right because it doesn’t account for network latency, which is a kind of built-in partition dictated by the speed of network traffic. The <a href="https://en.wikipedia.org/wiki/PACELC_design_principle"><u>PACELC design principle</u></a> aims to solve this conundrum. Rather than focusing purely on availability and consistency, it augments CAP by adding an extra logic branch:<ul><li>Choose between Latency or</li></ul><ul><li>Loss of Consistency</li></ul>If you need a mnemonic for this, consider the humble Pack Elk:<img src="https://a.storyblok.com/f/187930/720x540/de97e0a7a8/latency-and-consistency.png" ></p><p>Most Multi-Master cluster solutions choose the latter (lost consistency) as the default. 
Essentially Postgres nodes in such a cluster accept local writes and transmit those changes to other participating nodes via logical replication. Until that transmission is complete and other nodes integrate that information, there is a small window where the cluster is not fully consistent. This is where problems begin for the unprepared.</p><p>A single Postgres instance doesn’t suffer from this indignity. There’s a single source of truth, much like a chess board, where the state of the game is dictated by the board itself.<img src="https://a.storyblok.com/f/187930/720x470/b7cade6e4c/a-single-postgres-instance.png" ></p><p>In an Active-Active Postgres cluster, it’s more like a flock of starlings, where each node is generally going in the same direction most of the time. It’s a beautiful orchestration of form where no single node truly dictates the fate of the cluster.<img src="https://a.storyblok.com/f/187930/720x491/62e14dd0ab/distributed-postgres-cluster.png" ></p><p>This is ultimately what we’re trying to achieve when deploying a distributed Postgres cluster.<h2>Disagreeable Outcomes</h2>In a fully distributed Postgres cluster, the primary source of problems is the list of potential data conflicts, usually caused by giving preference to fast local writes rather than 100% consistency across the cluster. Such asynchronous operations lead to simultaneous changes which could be incompatible. The resulting conflicts make everyone sad, including the software components mediating the dispute.</p><p>When it comes to Postgres, there are basically four categories of conflict that may arise:<ul><li>Naturally convergent conflicts</li></ul><ul><li>Resolvable conflicts</li></ul><ul><li>Divergent conflicts</li></ul><ul><li>(Bonus) Phantom conflicts</li></ul>The first step to avoiding a situation is recognizing it - let’s tackle them one by one.<h3>Convergent Conflicts</h3>A convergent conflict is what happens when the order of operations doesn’t matter. 
It’s an idempotent result where, even if nodes apply changes with total disregard for ordering, they’ll all reach the same end state anyway. And remember, a “conflict” just means two different nodes performed some operation on the same row(s).<br>The first of these is an UPDATE / DELETE conflict:<ul><li>Update happened first? It gets deleted</li></ul><ul><li>Delete happened first? Nothing to update</li></ul><ul><li>End state: the row is deleted</li></ul>Next is a DELETE / DELETE conflict:<ul><li>End state: the row is deleted</li></ul>Then there’s the UPDATE / TRUNCATE conflict:<ul><li>Update happened first? The table gets truncated</li></ul><ul><li>Truncate happened first? Nothing to update</li></ul><ul><li>End state: the table is truncated</li></ul>Similarly, the DELETE / TRUNCATE conflict:<ul><li>Delete happened first? The table gets truncated</li></ul><ul><li>Truncate happened first? Nothing to delete</li></ul><ul><li>End state: the table is truncated</li></ul>And finally comes the TRUNCATE / TRUNCATE conflict:<ul><li>End state: the table is truncated</li></ul>All in all, these conflicts aren’t very interesting; they’re all dependent on either a single row being deleted, or the whole table being emptied. It doesn’t take much consideration to realize what will happen in the end.<h3>Resolvable Conflicts</h3>Then there are conflicts that can be resolved automatically by some default operation such as “Last-Write-Wins”, which assumes the most recent write should “win” the exchange on all nodes. This allows all nodes to remain consistent, but may result in data loss. Whether that overwritten row matters or not tends to depend on the application.<br>Let’s start with INSERT / INSERT conflicts:<ul><li>Node A: INSERT ... (id, col1) VALUES (2, 10)</li></ul><ul><li>Node B: INSERT ...
(id, col1) VALUES (2, 100)</li></ul><ul><li>Last "update" wins (default resolution method)</li></ul><ul><li>Result: one INSERT is discarded / overwritten</li></ul>The conflict on the “id” column must be resolved, meaning we must discard either the value of 10 or 100, depending on the timestamp of the committing transaction. Will the transaction selected for discard make a difference?<br>This also applies to tables which rely on multiple unique constraints, such as a primary key and a unique index on an email column. Consider this scenario:<ul><li>Node A: INSERT ... (id, email) VALUES (1, 'bob@smith.com')</li></ul><ul><li>Node B: INSERT ... (id, email) VALUES (2, 'bob@smith.com')</li></ul><ul><li>Last "update" wins (default resolution method)</li></ul><ul><li>Results:</li></ul>The row will be consistent across the cluster, but which row? Will it be the email address with an ID of 1, or 2? Did the same transaction also insert dependent rows in another table with a foreign key referencing this ID? Should this even be handled automatically, or should there be some kind of custom resolution method to account for such scenarios? If Jim and Jen both entered a forest at the same time and only Jen returned, what happened to Jim?<br>The other types of resolvable conflict are both UPDATE / UPDATE variants. Consider this first example:<ul><li>Node A: UPDATE t SET col1=100 WHERE id=4</li></ul><ul><li>Node B: UPDATE t SET col1=500 WHERE id=4</li></ul><ul><li>Last "update" wins (default resolution method)</li></ul><ul><li>Result: contents of col1 are discarded / overwritten</li></ul>We can only retain a single update, but which one? If you imagine this is a bank app setting your total balance, it suddenly becomes a lot more relevant.
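<br>As a rough illustration of the Last-Write-Wins rule applied to the two updates above, here is a minimal Python sketch. The Change class, the numeric commit timestamps, and the tie-break on node name are assumptions made for the example; they are not how any particular extension represents or orders changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    node: str          # node where the write originated (hypothetical field)
    commit_ts: float   # commit timestamp of the writing transaction
    row: dict          # full row contents after the write


def last_write_wins(local: Change, incoming: Change) -> Change:
    """Keep whichever change has the later commit timestamp.

    Ties are broken by node name so that every node picks the same winner.
    """
    return max(local, incoming, key=lambda c: (c.commit_ts, c.node))


a = Change("A", 10.0, {"id": 4, "col1": 100})
b = Change("B", 10.5, {"id": 4, "col1": 500})

# Each node sees the same pair of changes in opposite roles.
winner_on_a = last_write_wins(local=a, incoming=b)
winner_on_b = last_write_wins(local=b, incoming=a)
print(winner_on_a.row)
```

Both nodes settle on the same row, which is the point of the rule, but the value written by the losing node is simply gone.<br>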
But this can also go wrong in a more subtle way:<ul><li>Node A: UPDATE t SET col1=100 WHERE id=4</li></ul><ul><li>Node B: UPDATE t SET col2='stuff' WHERE id=4</li></ul><ul><li>Last "update" wins (default resolution method)</li></ul><ul><li>Result: contents of col1 or col2 are discarded / overwritten</li></ul>This happens because logical replication doesn’t just transmit the specific modification; it sends the entire resulting tuple. This is necessary because there could be triggers or other after-effects which further mutate the change being applied, so the only way to ensure an accurate representation is to capture the entire row contents at the end of the process.<br>Let’s examine the entire process:<ul><li>Old tuple on both nodes: {id: 4, col1: 50, col2: 'wow'}</li></ul><ul><li>Node A: UPDATE t SET col1=100 WHERE id=4</li></ul><ul><li>Logical replication sees: {id: 4, col1: 100, col2: 'wow'}</li></ul><ul><li>Node B: UPDATE t SET col2='stuff' WHERE id=4</li></ul><ul><li>Logical replication sees: {id: 4, col1: 50, col2: 'stuff'}</li></ul>The receiving node must pick one or the other, not both; there’s no automatic “merge” possible here. Using a Last Write Wins strategy would ensure a consistent picture across the cluster, but what about the column values we lost? Again, two people entered the forest and one vanished without a trace.<br>What happened to Jim?<h3>Divergent Conflicts</h3>Then there are conflicts that actually cause data to diverge on one or more nodes in the cluster on a permanent basis. Luckily, this is usually caused by some kind of race condition between at least three nodes, so they’re much less likely to occur in the wild.
Still, it’s important to know how they might occur so we can avoid triggering the surrounding circumstances.<br>Let’s start with an INSERT / UPDATE conflict between three nodes:<ul><li>Node A: INSERT -> Node B</li></ul><ul><li>Node B: UPDATE -> Node C</li></ul><ul><li>Node C: ignores UPDATE</li></ul><ul><li>Node A: INSERT -> Node C</li></ul>In this case, Node B received the INSERT before Node C, and subsequently updated the value. If that UPDATE reaches Node C before the INSERT from Node A, we have a problem. If Node C discards that phantom update for a row it doesn’t have, and later processes the INSERT, that means Node A and B have the updated value, and Node C does not. This is a node divergence. Once the cluster reaches this state, the only way to fix it is through manual intervention using some kind of data comparison tool like the <a href="https://docs.pgedge.com/platform/ace"><u>Active Consistency Engine</u></a> (ACE) from pgEdge.<br>This can also happen in an INSERT / DELETE scenario in a 3-node cluster:<ul><li>Node A: INSERT -> Node B</li></ul><ul><li>Node B: DELETE -> Node C</li></ul><ul><li>Node C: ignores DELETE</li></ul><ul><li>Node A: INSERT -> Node C</li></ul>Now Node C has a row that is missing on both Node A and Node B.<br>This situation is actually slightly worse than an INSERT / UPDATE conflict. A custom conflict management routine could convert the UPDATE to an INSERT using its original timestamp on nodes where the value is missing. Then when the original INSERT arrived, it would be skipped as being an older transaction. In the case of a DELETE, this isn’t possible because there is no way to modify a DELETE so it only affects the incoming INSERT.<br>Then there’s an INSERT / TRUNCATE conflict in a 3-node cluster:<ul><li>Node A: INSERT -> Node B</li></ul><ul><li>Node B: TRUNCATE -> Node C</li></ul><ul><li>Node C: applies TRUNCATE</li></ul><ul><li>Node A: INSERT -> Node C</li></ul>This time both Node A and Node B have no rows in the affected table, and Node C has diverged by retaining one rogue INSERT.
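<br>The ordering problem in the INSERT / TRUNCATE case above can be replayed with a toy model. This is only a sketch: each node's table is a plain Python set of row ids, and the apply() helper is a hypothetical stand-in for a replication apply worker, not real extension code.

```python
def apply(table, op, row_id=None):
    """Apply one replicated operation to a node's table (a set of row ids)."""
    if op == "INSERT":
        table.add(row_id)
    elif op == "DELETE":
        table.discard(row_id)   # deleting a missing row is silently ignored
    elif op == "TRUNCATE":
        table.clear()


node_a, node_b, node_c = set(), set(), set()

# Node A inserts row 1; Node B receives it and then truncates the table.
apply(node_a, "INSERT", 1)
apply(node_b, "INSERT", 1)
apply(node_b, "TRUNCATE")
apply(node_a, "TRUNCATE")       # Node A receives the TRUNCATE from Node B

# Node C hears from Node B first, and only later from Node A.
apply(node_c, "TRUNCATE")
apply(node_c, "INSERT", 1)

print(node_a, node_b, node_c)
```

Every node applied exactly the same operations; only the arrival order on Node C differed, and that alone is enough to leave it with a row the other two no longer have.<br>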
There’s no easy fix for this situation either, as Node C has no way to know there’s a pending INSERT when the TRUNCATE arrives. Network latency makes fools of us all.<h3>Phantom Conflicts</h3>The last type of conflict is technically not a conflict at all, but an application may treat it like one. It won’t show up in logs, it’s not due to latency, and it can actually happen on a single Postgres node outside of replication of any kind. So what is it?<br>Consider this scenario:<ul><li>App: INSERT; COMMIT -> Node A</li></ul><ul><li>Node A: COMMIT -> WAL -> Node B</li></ul><ul><li>Node A: Confirm !-> App</li></ul><ul><li>Node A: Crash</li></ul><ul><li>Node B: Becomes new write target</li></ul><ul><li>App: Connection aborted? Retry insert!</li></ul><ul><li>App: INSERT -> Node B</li></ul>Postgres normally sends a confirmation message to the client after every transaction commit. But what happens if the node crashes during the confirmation and the client never receives the message? A good application would notice the error and try again. But thanks to that second attempt, a duplicate record now exists in the database.<br>How is that possible? Postgres always writes transactions to the WAL before applying them to backend data files. These WAL files are also used for crash recovery. The client never received an acknowledgement that the transaction committed successfully, but it succeeded nonetheless. Since the transaction is in the WAL, it will be retained following crash recovery of Node A and any replicas in the cluster. Yet the application layer thinks it failed and will probably try again.<br>The easiest way to avoid this is to rely on <a href="https://en.wikipedia.org/wiki/Natural_key"><u>natural keys</u></a>, or in the absence of those, a <a href="https://en.wikipedia.org/wiki/Surrogate_key"><u>surrogate key</u></a> that the application itself controls.
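<br>To make the application-controlled key idea concrete, here is a small Python sketch of a retry loop that cannot create a duplicate. FakeTable and save_order() are hypothetical names invented for this example; the in-memory dictionary stands in for a table with a unique key, and insert_if_absent() mimics the effect of an INSERT ... ON CONFLICT DO NOTHING statement.

```python
import uuid


class FakeTable:
    """In-memory stand-in for a table with a unique key (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def insert_if_absent(self, key, payload):
        # Mirrors the effect of INSERT ... ON CONFLICT DO NOTHING.
        if key in self.rows:
            return False
        self.rows[key] = payload
        return True


def save_order(table, payload, lose_first_ack=False):
    key = str(uuid.uuid4())   # chosen once by the app, reused on every retry
    for attempt in range(3):
        table.insert_if_absent(key, payload)
        if lose_first_ack and attempt == 0:
            continue          # commit succeeded, but the acknowledgement was lost
        return key
    raise RuntimeError("gave up after 3 attempts")


table = FakeTable()
save_order(table, {"item": "elk"}, lose_first_ack=True)
print(len(table.rows))
```

Because the key is generated before the first attempt, the retry targets the same row instead of asking the database for a fresh identifier.<br>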
Consider calling nextval() directly and supplying sequence values manually, or even relying on an external ID generator. Alternatively, the application retry loop should confirm whether the transaction failed with a SELECT prior to simply retrying.<br>Remember, this can affect even a non-distributed Postgres cluster. Trust, but verify.<h2>Preventing the Inevitable</h2>How do we mitigate or even avoid these scary conflict situations? Luckily, there is a plethora of ways to address conflict potential in a distributed Postgres cluster. Let’s start with the easiest one of all!<h3>Don’t Do That</h3>An ounce of prevention is worth a pound of cure, after all.<br>Don’t do what? Consider the primary cause of conflicts: simultaneous operations occurring on the same data on different nodes. That happens because data may not have distinct structural or regional boundaries, or because a single application instance is interacting with multiple nodes simultaneously without regard for transmission latency.<br>Thus the simplest way to avoid conflicts is to control write targets:<ul><li>Use “sticky” sessions. Applications should only interact with a single write target at a time, and never “roam” within the cluster.</li></ul><ul><li>Assign app servers to specific (regional) nodes. App servers in Mumbai shouldn’t write to databases in Chicago, and vice versa. It’s faster to write locally anyway.</li></ul><ul><li>Interact with specific (regional) data. Again, an account in Mumbai may physically exist in a globally distributed database, but multiple accessors increase the potential for conflicts.</li></ul><ul><li>Avoid unnecessary cross-node activity. Regional interaction also applies on a more local scale.
If applications can silo or prefer certain data segments on specific database nodes, they should.</li></ul>There’s also a solution for updates on different database nodes modifying the same rows: use a ledger instead.<img src="https://a.storyblok.com/f/187930/765x499/2e640f0363/updates-on-different-database.png" >This is probably a much more invasive application-level change than most, but some systems already do this and as a result, are automatically a great fit for distributed Postgres clusters.<h3>Solving Conflicts with CRDTs</h3>The official meaning of the CRDT acronym is “<a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type"><u>Conflict-free Replicated Data Type</u></a>”, but that’s cheating, so I usually refer to them as Conflict Resistant Data Types. These data types were specifically designed for distributed clusters or databases with constant asynchronous write conflicts. The latter occurs when two different writers modify a single row, but each started with an “old” copy of the row. Think of a SELECT -> UPDATE pattern.<br>As a result, CRDTs can actually be useful even in non-distributed Postgres clusters. In the context of a distributed Postgres cluster, there are basically two types of commonly available CRDT provided by the Active-Active extension:<ul><li>Apply a diff between the incoming and existing values</li></ul><ul><li>Use a custom data type with per-node “hidden” fields</li></ul>Additionally, these two approaches only work for numeric columns.
And of those two techniques, there are also two methods for providing them to Postgres: by patching how Postgres handles logical replication for numeric columns, or providing dedicated data types.<br>To understand how this works, let’s use a CRDT in the pgEdge <a href="https://docs.pgedge.com/spock_ext"><u>Spock extension</u></a> (a component of pgEdge Distributed Postgres):<br>In this case, we’ve told Postgres that we need to track the previous value of the column, and that there’s a special callback function it should use when modifying values in that column. That function will ensure updates from any node in the cluster will add the difference between the old and new values rather than applying the absolute value, resulting in a numeric merge operation.<br>If we used the EnterpriseDB BDR extension (a component of EDB Postgres Distributed) instead, it would look something like this:<br>This use case may appear simpler, but the BDR extension actually provides <a href="https://www.enterprisedb.com/docs/pgd/latest/conflict-management/crdt/"><u>multiple CRDTs</u></a>, each with its own unique rules, behaviors, and best practices.
This also causes a certain amount of vendor lock-in, as it becomes impossible to uninstall the BDR extension so long as these column types are in use.<br>In either case, a delta-based CRDT works like this:<ul><li>Both nodes start with a tuple: {id: 5, total: 100}</li></ul><ul><li>Either node: UPDATE account SET total = total + 100 WHERE id = 5</li></ul><ul><li>New resulting tuple: {id: 5, total: 200}</li></ul><ul><li>Old and new values are sent to the remote node</li></ul><ul><li>The delta function or type calculates: local value + (incoming new value - incoming old value)</li></ul>Here’s a scenario where that procedure specifically avoids a data conflict and also preserves the intended value across the cluster:<ul><li>Initial tuple on both nodes: {id: 5, total: 100}</li></ul><ul><li>Node A adds 100, sends: {id: 5, total: 200}</li></ul><ul><li>Node B adds 400, sends: {id: 5, total: 500}</li></ul><ul><li>Node A: total = 200 + (500 - 100) = 600</li></ul><ul><li>Node B: total = 500 + (200 - 100) = 600</li></ul>And indeed, an initial balance of $100, increased by $100 and then $400 would be a total of $600. This is exactly what we want, and the overhead from managing the deltas is generally minimal.<br>There’s also an alternative technique of maintaining separate hidden fields within the column for each node in the cluster. The benefit here is that each node only interacts with its own assigned sub-record, and the data type handles merging on display or retrieval of the stored value.
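<br>A minimal Python sketch of this per-node sub-field idea follows. PerNodeCounter and its methods are hypothetical names for illustration only; real CRDT data types store and replicate their state differently, so treat this as a model of the behavior rather than of any implementation.

```python
class PerNodeCounter:
    """Toy counter where each node only ever writes its own slot."""

    def __init__(self, slots):
        self.slots = dict(slots)

    def add(self, node, amount):
        # A local write touches only the writing node's slot.
        self.slots[node] = self.slots.get(node, 0) + amount

    def receive(self, origin, remote):
        # Only the origin's own slot is taken from the remote copy.
        self.slots[origin] = remote.slots.get(origin, 0)

    def value(self):
        # The displayed value is the sum of every node's slot.
        return sum(self.slots.values())


on_a = PerNodeCounter({"A": 100, "B": 0})
on_b = PerNodeCounter({"A": 100, "B": 0})

on_a.add("A", 100)       # Node A adds 100
on_b.add("B", -50)       # Node B subtracts 50

on_a.receive("B", on_b)  # replication in both directions
on_b.receive("A", on_a)

print(on_a.slots, on_a.value())
```

Since no slot ever has more than one writer, the order in which the two nodes exchange their changes does not affect the result.<br>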
There’s never a risk of conflict or unexpected merge behavior because nodes are always interacting with separate values.<br>Here’s how that works:<ul><li>Initial tuple on both nodes: {id: 5, total: (100, 0)}</li></ul><ul><li>Total displayed as 100</li></ul><ul><li>Node A adds 100</li></ul><ul><li>Node B subtracts 50</li></ul><ul><li>New tuple: {id: 5, total: (200, -50)}</li></ul><ul><li>Total displayed as 150</li></ul>If a third node is added to the cluster later, the CRDT will simply add another hidden field to the column when rows are updated by the new node. However, there's an important limitation to this technique: values can never be set to zero without using a special function. Consider this scenario:<ul><li>Initial tuple on both nodes: {id: 5, total: (200, -50)}</li></ul><ul><li>Total displayed as 150</li></ul><ul><li>Node A sets total to 0</li></ul><ul><li>Node B does nothing</li></ul><ul><li>New tuple: {id: 5, total: (0, -50)}</li></ul><ul><li>Total displayed as -50</li></ul>Since each node only interacts with its own dedicated sub-field, it’s impossible to reset the column value to zero without having every node also reset its data. The only real way to set such CRDT columns to zero is to use a special utility function that will essentially revert the column to an initial empty state. Once that action enters the logical replication stream, every other node will also empty that column. Applications which intend to use these types of CRDT would definitely need to account for this behavior.<h3>Key Management</h3>Surrogate keys are probably the most commonly leveraged technique for assigning unique values to primary keys in Postgres tables. However, this approach often assumes a single source of key values, or fails to account for the presence of other nodes.
If two nodes rely on a sequence for assigning values, this sequence will frequently yield the same value on both nodes, causing constant conflicts and resulting in data loss during INSERT statements.<br>There are effectively four methods for safely generating surrogate keys which are guaranteed to be unique across every node in distributed Postgres clusters:<ul><li>Sequence offsets</li></ul><ul><li>Globally unique keys</li></ul><ul><li>Global allocations</li></ul><ul><li>External key generator</li></ul>We’ll explore the first three options here, since it’s incredibly uncommon to leverage an external service to generate IDs rather than doing it locally or algorithmically.<h3>Sequence Offsets</h3>Sequence offsets are the most backward compatible approach because they rely on no outside technology or extension. They work like this: for every sequence on Node 1, set the increment to 10 and the next value to 1001. For every sequence on Node 2, use the same increment with a next value of 1002.<br>Afterwards, the sequence on Node 1 will generate values such as 1001, 1011, 1021, and so on, while Node 2 will not conflict, as its values will be 1002, 1012, 1022, etc. The primary difficulty here is remembering to apply these modifications to all sequences on a node when it joins the cluster. New sequences will also require this treatment on every node. Further, the width of the increment must account for the maximum number of expected cluster nodes, and can never really be expanded afterwards. The ongoing amount of maintenance here is incredibly high.<h3>Globally Unique Keys</h3>It’s also possible to algorithmically generate surrogate keys in such a way that they’re guaranteed to be unique across the cluster. The most common approach already in use in many applications is through leveraging UUIDs.
Postgres has a built-in function for this: gen_random_uuid().<br>Better yet, Postgres 18 is expected to have native <a href="https://commitfest.postgresql.org/patch/4388/"><u>support for UUID v7</u></a>, which should improve compatibility with BTREE indexes.<br>Another approach is to use a <a href="https://en.wikipedia.org/wiki/Snowflake_ID"><u>Snowflake ID</u></a>, an algorithmic technique that combines a timestamp, node identifier, and a millisecond-scale sequence, popularized by sites like Twitter and Instagram. The pgEdge Distributed Postgres stack includes the <a href="https://github.com/pgEdge/snowflake"><u>Snowflake extension</u></a>, which provides alternative nextval() and currval() functions for generating these kinds of values.<br>The primary drawback when using these values is that they’re always 64-bit integers. This is fully compatible with the Postgres BIGINT type, but applications may require explicit modification to support such large 64-bit values, such as JavaScript and its <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt"><u>BigInt type</u></a>.<h3>Global Allocations</h3>The last technique we’ll discuss is the use of globally allocated sequence ranges. In this approach, each node requests a contiguous block of values from the other nodes in the cluster, and they all agree on block assignments. As a result, Node 1 may have two 1-million row chunks from 1-2,000,000, Node 2 uses 2,000,001 to 4,000,000, and Node 3 has 4,000,001 to 6,000,000.<br>Nodes retain two blocks so they can request another block once they start consuming the second. INTs may default to smaller chunks such as 1-million, while BIGINTs can reasonably assign larger 1-billion row chunks.<br>This process is something a distributed cluster extension can do by using internal Postgres extension hooks. But it also means there’s a maintenance process and a consensus aspect for distributing and using assigned blocks.
On the other hand, unlike Snowflake IDs, there’s no potential for number length incompatibility, and unlike sequence offsets, there’s no need to manually modify each sequence on every node.<h3>Sequence Safety</h3>While we’re on the topic of sequences, we need to look at how Postgres uses them internally. We’ve all been told not to do this:<br>Why? Because there’s a lot of hidden behind-the-scenes magic that doesn’t mesh well with current SQL standards. This is actually what happens when using SERIAL or BIGSERIAL in Postgres:<br>Postgres creates an implicit sequence and associates it with the table. Usage grants to the table do not cascade to the sequence, and the value is supplied by a weak “default” clause. As a result, many new applications and schemas use IDENTITY syntax instead:<br>Now there are no more magical nextval shenanigans, the word ALWAYS means what it says, and it’s standard SQL. There’s even extra syntax to set the starting point, increment size, and other attributes. It’s great, right? Well, maybe.<br>What happens in Active-Active clusters? We can no longer explicitly substitute our own nextval function, so we may be stuck with the monotonically advancing sequential values provided by the identity. It means we can’t use an alternative ID generator such as snowflake, timeshard, galloc, or anything else unless the distributed cluster extension can override the appropriate identity calls. The application definitely can’t do it with something as simple as the Snowflake extension.<br>Ironically, the new and recommended approach which is perfect for a replicated Postgres cluster is actually too inflexible for a distributed cluster. It may even be necessary for application schemas using IDENTITY to reverse migrate to SERIAL, BIGSERIAL, or using nextval() explicitly.<h3>Fixing Divergent INSERT / UPDATE Conflicts</h3>When we discussed the topic of INSERT / UPDATE conflicts, we noted that Node C would ignore an UPDATE if it arrived prior to the dependent INSERT.
One way to fix that would be for the distributed Postgres extension to convert the UPDATE to an INSERT. Then when the original INSERT arrives, its earlier timestamp means it loses the Last Write Wins battle, and the risk of divergence vanishes.<br>The only caveat here is that updates don’t normally include <a href="https://www.postgresql.org/docs/current/storage-toast.html"><u>TOAST</u></a> data unless the toasted value itself is modified. This means the distributed Postgres extension must account for this and manually detoast the column value while performing the logical decoding step. If the extension doesn’t perform this translation, the only other option is to declare the entire row as a replica identity like this:<br>This forces Postgres to retrieve the toasted value and add it to the logical replication stream automatically. This may affect logical replication efficiency if there are a lot of large text fields, but it facilitates the UPDATE to INSERT conversion necessary to avoid this type of conflict.<h3>Historical Relevance</h3>Finally, there’s a technique for addressing divergent conflicts involving DELETE statements which no current distributed Postgres solution provides. Postgres <a href="https://www.postgresql.org/docs/current/mvcc.html"><u>Multi Version Concurrency Control</u></a> works by retaining old row records in the table heap. We ran into a divergent conflict earlier in this article because a node has no way to reject an INSERT for a previously deleted row. But what if it did?<br>Consider how the <a href="https://postgresqlco.nf/doc/en/param/hot_standby_feedback/"><u>hot_standby_feedback</u></a> configuration parameter works. It allows a physical standby node to inform the primary of the lowest visible XID for any active transactions.
The primary will then refrain from performing destructive maintenance on these visible tuples. A distributed Postgres cluster should know the timestamp from the oldest transaction from any other node in the cluster. What if it used that value to prevent similar cleanup across the cluster?<br>This “tombstone” record would follow the same rules as Postgres transaction visibility normally would, but would allow the distributed Postgres extension to physically examine the older record if present. If a DELETE removes a record, but there are pending transactions which might need to see that delete, the tombstone would make that possible.<br>What if the DELETE arrived before any INSERT? Rather than discarding the event, the distributed Postgres extension could insert it purely as a tombstone record. Once all transactions older than the DELETE have replayed, the tombstone snapshot would move to the next outstanding XID the same way as hot_standby_feedback.<br>Barring this kind of advanced automatic handling, the easiest way to avoid divergent conflicts caused by DELETE or TRUNCATE is to avoid executing either. It’s actually fairly common for databases to never delete data, electing instead to use a technique called a “soft” delete. A soft delete is simply something like this:<br>In the event of a system audit, the record is still there, its old contents are still available, and there’s no risk of a divergent conflict because the existing record also acts as a tombstone. Some DBAs and data scientists actually strongly recommend always doing this even in non-distributed Postgres clusters. Having a reliable audit trail is sometimes more important than having a pristine database with no “unnecessary” records.<h2>Final Words</h2>Managing a <a href="https://www.pgedge.com/products/what-is-pgedge"><u>distributed Postgres</u></a> cluster is hardly a trivial commitment.
But it’s certainly less daunting when armed with the right product such as <a href="http://www.pgedge.com"><u>pgEdge</u></a> Distributed PostgreSQL, knowledge of the possible conflicts and related edge cases, and a clear understanding of the best mitigation techniques. Any application stack can excel in a distributed cluster environment by following certain precautions.<br>As always, the power of Postgres makes it possible!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/living-on-the-edge</guid>
            <author><name>Shaun Thomas</name></author>
            </item>    
    
        </channel>
    </rss>