The Scaling Ceiling: When One Postgres Instance Tries to Be Everything
There's a persistent belief in the database world that vertical scaling solves all problems. Need more throughput? Add CPUs. Running out of cache? More RAM. Queries hitting disk? Higher IOPS. It's a comforting philosophy because it's simple, and for a surprisingly long time, it works. A single beefy Postgres instance can handle an enormous amount of punishment before collapsing under the strain.
But there's a ceiling up there, and it's not made of hardware. Postgres was designed as a single-instance database engine, and many of its internal structures are shared across every database the instance contains. In a single modest instance, these shared resources are rarely a concern. But with twenty databases on one instance, some running heavy OLTP workloads, some serving analytical queries, and some sitting mostly idle, the shared nature of these internals becomes very relevant.
Let’s talk about the barriers these over-provisioned instances eventually hit, with references to the Postgres source code itself for good measure. Some of these are well known, while others are the kind of thing that strikes suddenly at 2 AM when all the monitoring dashboards turn red simultaneously.
One Pool to Rule Them All
The shared_buffers parameter is probably the first tunable every Postgres administrator encounters. It controls the size of Postgres's own buffer cache, the region of shared memory where frequently accessed disk pages live so they don't need to be fetched from storage on every read. The documentation suggests starting at 25% of system RAM, and for a single-database instance that's reasonable, widely followed advice.
It’s easy to forget that this allocation is instance-wide. The contents of src/backend/storage/buffer/buf_init.c bear this out, as the buffer pool gets allocated once at startup as a flat array of pages in shared memory:
BufferBlocks = (char *)
    TYPEALIGN(PG_IO_ALIGN_SIZE,
              ShmemInitStruct("Buffer Blocks",
                              NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
                              &foundBufs));

There is no per-database partitioning, no priority system, no reservation mechanism. Every database on the instance competes for the same pages in the same pool. An analytics query scanning a 500GB table in one database will happily evict cached pages that belong to a latency-sensitive OLTP workload in another. The buffer replacement algorithm (a clock-sweep approximation of LRU) has no concept of "this page belongs to an important database."
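To make that concrete, here is a deliberately simplified model of clock-sweep eviction over a shared pool (the real implementation in src/backend/storage/buffer/freelist.c is considerably more involved; the database-ownership field below is purely illustrative). The point is structural: the victim search never consults which database a page belongs to.

```python
# A minimal clock-sweep sketch: one shared pool, no database awareness.

class Buffer:
    def __init__(self, db, page):
        self.db = db          # owning database (for illustration only)
        self.page = page
        self.usage_count = 1  # bumped on access, decayed by the sweep

class ClockSweepPool:
    def __init__(self, nbuffers):
        self.slots = [None] * nbuffers
        self.hand = 0

    def evict_one(self):
        """Advance the clock hand until a slot with usage_count 0 is found."""
        while True:
            victim = self.hand
            buf = self.slots[victim]
            self.hand = (self.hand + 1) % len(self.slots)
            if buf is None or buf.usage_count == 0:
                return victim          # evicted -- the owner is irrelevant
            buf.usage_count -= 1       # decay and keep sweeping

    def read_page(self, db, page):
        slot = self.evict_one()
        self.slots[slot] = Buffer(db, page)

pool = ClockSweepPool(nbuffers=8)
for page in range(8):                  # OLTP database fills the pool
    pool.read_page("oltp", page)
for page in range(100):                # big scan in another database
    pool.read_page("analytics", page)

survivors = sum(1 for b in pool.slots if b and b.db == "oltp")
print(survivors)  # 0 -- the scan evicted every OLTP page
```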
The same applies at the operating system level. The kernel's filesystem cache, often called the "double buffer" in Postgres circles because effective_cache_size accounts for it, is also shared across all processes on the machine. Two databases with fundamentally different access patterns, one doing sequential scans and the other doing random index lookups, will thrash each other's cached pages with no way to intervene.
Will throwing more RAM at the problem help? Only until the largest working sets collide. At that point, you have a textbook noisy-neighbor problem, with no isolation mechanism to reach for.
The 32-Bit Treadmill
The 32-bit nature of the Postgres transaction ID (XID) is an old joke in the community by this point; blog posts warning about the dreaded "XID wraparound" are easy to find. The Postgres fix for this is VACUUM, specifically the VACUUM FREEZE operation. Most tuples carry an XID, but since there are only so many of those, tuples past a certain horizon get "frozen". A frozen tuple still has an XID, but Postgres ignores it and treats the data as if it has always existed. And thus, by a bit of sleight of hand, the 4-billion transaction window only needs to cover "recent" transactions (for varying definitions of recent).
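The "window" works because XID comparisons are circular. A sketch of the idea behind TransactionIdPrecedes() in src/backend/access/transam/transam.c (ignoring its special-case handling of permanent XIDs): subtract modulo 2^32 and interpret the result as a signed 32-bit value, so every XID sees roughly 2 billion XIDs behind it and 2 billion ahead.

```python
def xid_precedes(a: int, b: int) -> bool:
    """Wraparound-aware: is XID a logically older than XID b?"""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000      # signed-negative difference => a is older

# Normal case: an older XID precedes a newer one, even across the wrap.
print(xid_precedes(100, 200))                # True
print(xid_precedes(2**32 - 10, 25))          # True: wrapped, still "older"

# The danger: once b races more than 2^31 ahead of an unfrozen a, the
# comparison inverts and the old tuple suddenly looks like the future.
print(xid_precedes(100, 100 + 2**31 + 1))    # False
```

This is exactly why unfrozen tuples older than about 2 billion transactions are a data-loss hazard, and why Postgres slams the brakes long before that point.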
Unfortunately, this counter persists across the entire instance. In src/backend/access/transam/varsup.c, the function GetNewTransactionId() draws from a single global well:
if (TransactionIdFollowsOrEquals(xid, TransamVariables->xidVacLimit))
{
    /* ... */
    if (IsUnderPostmaster &&
        TransactionIdFollowsOrEquals(xid, xidStopLimit))
    {
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("database is not accepting commands that assign "
                        "new transaction IDs to avoid wraparound data "
                        "loss in database \"%s\"",
                        oldest_datname),
                 errhint("Execute a database-wide VACUUM in that database.")));
    }
}

Read that error message carefully. The instance refuses all new transactions to protect a specific database. A single neglected database out of dozens can accumulate enough XID age to grind the entire instance to a halt. Every tenant suffers because one database didn't get vacuumed in time, or because something (a long-running transaction, an abandoned replication slot) held tuples visible so long they couldn't be cleaned up.
The SetTransactionIdLimit() function in the same file makes this even more explicit. It computes the wraparound danger threshold based on "the oldest XID that might exist in any database of our cluster." One database's frozen-XID age becomes the constraint for every other database sharing that instance.
Multixact: The Other Wraparound
If XID wraparound is Postgres's well-publicized villain, multixact wraparound is the esoteric threat. Multixacts exist to track shared row-level locks; when multiple transactions hold locks on the same row, Postgres records them as a "multixact" group rather than storing each lock individually. Like XIDs, multixact IDs are 32-bit counters that wrap around, and like XIDs, they're instance-wide.
But the member storage, the actual record of which transactions participate in each multixact, has its own nasty limit. The source code in src/backend/access/transam/multixact.c spells out the on-disk layout with typical Postgres clarity:
/*
 * ...we store four bytes of flags, and then the
 * corresponding 4 Xids. Each such 5-word (20-byte) set we call a
 * "group", and are stored as a whole in pages. Thus, with 8kB BLCKSZ,
 * we keep 409 groups per page. This wastes 12 bytes per page, but
 * that's OK -- simplicity (and performance) trumps space efficiency here.
 */
#define MULTIXACT_FLAGBYTES_PER_GROUP 4
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP \
    (MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)
#define MULTIXACT_MEMBERGROUP_SIZE \
    (sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP \
     + MULTIXACT_FLAGBYTES_PER_GROUP)
#define MULTIXACT_MEMBERGROUPS_PER_PAGE \
    (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)

The math is straightforward but the implications are severe. With 409 groups per 8KB page and 4 XIDs per group, we can work out the total SLRU address space: 2^32 member offsets divided by 1,636 members per page, multiplied by 8KB per page. That comes out to roughly 21GB of multixact member storage for the entire instance.
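That arithmetic is easy to sanity-check. A quick sketch, with the constants taken from the source excerpt above:

```python
# Back-of-the-envelope multixact member storage ceiling.
BLCKSZ = 8192
FLAGBYTES_PER_GROUP = 4
MEMBERS_PER_GROUP = 4                                    # XIDs per group
GROUP_SIZE = MEMBERS_PER_GROUP * 4 + FLAGBYTES_PER_GROUP # 20 bytes
GROUPS_PER_PAGE = BLCKSZ // GROUP_SIZE                   # 409
MEMBERS_PER_PAGE = GROUPS_PER_PAGE * MEMBERS_PER_GROUP   # 1636

pages = 2**32 / MEMBERS_PER_PAGE      # pages needed to address every offset
total_bytes = pages * BLCKSZ
print(f"{total_bytes / 1e9:.1f} GB")  # prints "21.5 GB"
```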
That 21GB ceiling might sound generous until you consider a multi-tenant setup with aggressive row-level locking. A workload that performs SELECT ... FOR UPDATE across many rows, or any application pattern that causes multiple transactions to hold shared locks on the same tuples, burns through multixact members quickly. Once exhausted, the instance starts refusing operations just as it does for XID wraparound, except the monitoring for multixact usage is far less mature in most environments.
Worse, the same "slowest database wins" dynamic applies. The global minimum across all databases governs when the SLRU can be truncated. One database with inadequate vacuuming of multixact-heavy tables can pin that minimum in place for the entire instance. Similarly, a single database can greedily monopolize that precious resource simply due to unusual or aggressive locking behavior.
The One-Lane Highway of WAL Replay
Postgres streaming replication works by shipping Write-Ahead Log (WAL) records from the primary to replicas, which then replay them to stay current. It's a utilitarian and reliable workhorse, but there's a fundamental constraint: replay is single-threaded.
In src/backend/access/transam/xlogrecovery.c, the main redo loop that processes WAL on a replica is exactly what it looks like:
/*
 * main redo apply loop
 */
do
{
    ProcessStartupProcInterrupts();

    /* ... pause checks, recovery target checks ... */

    /*
     * Apply the record
     */
    ApplyWalRecord(xlogreader, record, &replayTLI);

    /* Else, try to fetch the next WAL record */
    record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);

/*
 * end of main redo apply loop
 */

One record at a time, sequentially, in a single process. The startup process (which handles WAL recovery) is the sole consumer of WAL data on a replica. There is no parallel apply. Even an over-provisioned 128-core machine acting as a replica can only leverage a single core for processing WAL data.
The recovery_prefetch parameter (defaulting to try since Postgres 15) helps when the bottleneck is IO. It looks ahead in the WAL stream and issues asynchronous reads for pages that will be needed soon, reducing stalls caused by cache misses. The prefetcher documentation in src/backend/access/transam/xlogprefetcher.c describes it as a "drop-in replacement for an XLogReader that tries to minimize IO stalls by looking ahead in the WAL."
But if a primary generates WAL faster than a single core can process it, prefetching won't help. The bottleneck shifts from IO to CPU, and there's nowhere to go. A write-heavy primary with many concurrent backends can produce WAL at a rate that structurally outpaces what a single replay process can consume. The replica falls behind, and the gap only widens under sustained load. I've personally witnessed a replica where this process is pinned at 100% CPU for hours while replication lag continues to accumulate.
This is especially painful in a multi-database instance. Every database's WAL goes through that same single-threaded funnel. A batch import into one database generates a torrent of WAL that delays replay of another database's critical transaction. On separate instances, each database has its own replica and independent replay process—no more cascading latency from a single busy database.
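The arithmetic of falling behind is blunt. A toy model (the rates below are invented illustration numbers, not measurements): once generation exceeds the single-core replay ceiling, lag grows linearly and never recovers under sustained load.

```python
# Toy model of the single-threaded replay funnel.
def replica_lag_mb(gen_rate_mb_s: float, replay_rate_mb_s: float,
                   seconds: int) -> float:
    """Accumulated replication lag in MB after sustained load."""
    return max(0.0, gen_rate_mb_s - replay_rate_mb_s) * seconds

# 16 busy backends each writing ~10 MB/s of WAL on the primary, vs.
# a hypothetical single replay process that tops out at ~120 MB/s:
print(replica_lag_mb(16 * 10, 120, seconds=3600))  # 144000.0 MB behind after an hour
```

Sixteen cores can generate WAL in parallel; one core has to apply it. That asymmetry is the whole problem.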
The Singleton Bottleneck Brigade
Beyond the big-ticket items, Postgres runs several background processes that are each a single worker serving the entire instance. Individually, they're rarely a problem. Collectively, they form a convoy of potential bottlenecks.
Autovacuum gets a shared pool of workers, defaulting to a maximum of 3 (controlled by autovacuum_max_workers). The launcher process in src/backend/postmaster/autovacuum.c schedules these workers across all databases in the instance. In an instance with ten databases and three workers, a couple of databases with heavy churn can monopolize the pool while others accumulate dead tuples and XID age. This kind of autovacuum starvation feeds directly into the XID and multixact wraparound risks discussed earlier.
It's possible to raise autovacuum_max_workers, of course, but those workers draw from the same CPU budget as application backends, and how many would be enough to cover every database? It's not possible to assign workers to specific databases, so the problem never really goes away; it just becomes less likely. Separate instances would ensure that each database gets its own full complement of autovacuum workers without competing.
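The starvation dynamic is easy to see in a deliberately simplified scheduling sketch (the real launcher in autovacuum.c uses per-database timing and thresholds, not this greedy rule; the tenant names and per-round numbers are made up). Three workers per round go to the databases with the biggest backlogs, the heavy-churn databases refill their backlogs every round, and the quiet ones keep waiting:

```python
# Greedy worker scheduling over ten databases, three workers per round.
CHURN_PER_ROUND = {"tenant_a": 900, "tenant_b": 800, "tenant_c": 700}
QUIET_PER_ROUND = 50
WORKERS = 3

dead = {f"tenant_{x}": 0 for x in "abcdefghij"}
vacuumed = set()

for _ in range(10):
    for db in dead:                     # dead tuples accumulate everywhere
        dead[db] += CHURN_PER_ROUND.get(db, QUIET_PER_ROUND)
    busiest = sorted(dead, key=dead.get, reverse=True)[:WORKERS]
    for db in busiest:                  # workers clear the top backlogs
        dead[db] = 0
        vacuumed.add(db)

print(sorted(vacuumed))  # only the three churny tenants ever got a worker
```

The seven quiet tenants never see a worker, and their XID age keeps climbing in the background.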
The checkpointer is a single process responsible for flushing dirty buffers to disk at checkpoint intervals. A checkpoint triggered by one database's heavy write activity forces a flush of all dirty pages across the instance, including pages dirtied by other databases. The IO storm from a large checkpoint can cause latency spikes for every tenant, not just the one that triggered it.
The background writer is also a single process that continuously writes dirty shared buffers to disk to keep a supply of clean pages available. It manages the entire shared buffer pool, and its pace is governed by instance-wide settings like bgwriter_lru_maxpages and bgwriter_delay. There's no way to prioritize one database's dirty pages over another.
Splash Damage
Maybe the most straightforward argument against cramming everything into one instance is the blast radius of failure. When a Postgres instance goes down, whether from a crash, an OOM kill, a kernel panic, or just planned maintenance, every database on that instance goes with it.
The postmaster treats many failure modes as potentially corrupting shared memory. A single backend crash triggers a full restart cycle and termination of all user sessions. This comment in the checkpointer code captures the philosophy:
"If the checkpointer exits unexpectedly, the postmaster treats that the same as a backend crash: shared memory may be corrupted, so remaining backends should be killed."
Maintenance windows compound the problem. A Postgres major version upgrade, an extension update, or even a configuration change requiring a restart affects all tenants simultaneously. Coordinating downtime across multiple teams with different SLAs, different peak hours, and different tolerance for interruption is an organizational headache that grows geometrically (or worse) with the number of databases.
And then there's the dreaded emergency vacuum. If one database approaches XID wraparound, Postgres will refuse transactions for all databases (as we saw in varsup.c). An urgent maintenance task on one database is now a high-severity outage incident for everyone. The blast radius of a forgotten cron job or a stuck long-running transaction just expanded to encompass the entire data tier.
Splitting the Atom
The solution to most of these problems is, perhaps counter-intuitively, not beefier hardware but more instances. Take the same physical machine, carve it into virtual environments (VMs, containers, or even just multiple Postgres installations on different ports), and run one database per instance.
What changes? Let's see...
- Each instance gets its own shared_buffers, sized appropriately for its workload. An OLTP database can have a large, hot buffer pool while an analytics database gets a smaller one tuned for filesystem cache access. No more buffer thrashing between incompatible access patterns.
- Transaction IDs become per-instance. One database's vacuum debt can't drag others into wraparound territory. The same applies to multixact members; that 21GB ceiling now applies to a single workload rather than the sum of all tenants.
- WAL replay is per-instance. A write-heavy database generates WAL that only its own replica needs to replay. A latency-sensitive OLTP replica isn't waiting behind a batch import's WAL records destined for a completely different database.
- Autovacuum workers, the checkpointer, and the background writer each serve a single database. No more starvation, no more shared checkpoint storms, no more one-size-fits-all background writer pacing.
- Failures become isolated. A crash in one instance is invisible to the others. Maintenance windows can be scheduled independently. Emergency vacuums don't trigger cross-tenant incidents.
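As a sketch of what per-workload tuning can look like, here are two illustrative configuration fragments, one per instance. The parameters are real Postgres settings, but the values are invented examples, not recommendations; size them against your own hardware and workload.

```ini
# oltp instance: keep the hot working set in Postgres's own cache
shared_buffers = 16GB
autovacuum_max_workers = 6
checkpoint_timeout = 5min

# analytics instance: lean on the OS filesystem cache instead
shared_buffers = 4GB
effective_cache_size = 48GB
max_parallel_workers_per_gather = 8
```

Neither configuration would make sense applied instance-wide to both workloads at once, which is exactly the point.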
The trade-off is operational complexity. More instances means more configuration to manage, more backup schedules to maintain, more monitoring dashboards to watch. But with modern infrastructure tooling (Ansible, Terraform, Kubernetes operators), the marginal cost of an additional Postgres instance is low compared to the cost of debugging an emergency multi-tenant resource exhaustion event.
Knowing When to Quit
Vertical scaling is a perfectly valid strategy, and there's a reason so many Postgres installations run happily on a single large instance. For moderate workloads, the shared nature of Postgres internals is not just acceptable but efficient. Shared memory, shared processes, shared caches: they all reduce overhead when the workloads are playing nicely together.
The trouble starts when "playing nicely" is no longer a given. Databases with fundamentally different IO profiles, vacuum requirements, availability SLAs, and activity patterns don't always mix well. Resources become contested rather than efficiently shared, and no amount of RAM, CPU, or storage can fix that, because the constraints are architectural.
The signals are usually subtle at first. Autovacuum can't keep up across all databases. Replica lag increases during batch jobs in an unrelated database. Checkpoint duration creeps up. Multixact warnings appear in the logs that nobody configured alerts for. By the time XID wraparound threatens to lock the whole instance, there have usually been many other signs that simply went unseen. There's a reason many in the community consider multiple-database instances a type of anti-pattern; shared resources are also a shared throttle.
So if you're staring at a single Postgres instance that hosts a growing number of databases, or a shrinking number of exceptionally large ones, take a hard look at the shared internals. Read the source. Do the math on your multixact headroom. Check whether your autovacuum workers are keeping pace across every database, not just the ones you're watching. And if the numbers start looking uncomfortable, consider splitting before it becomes absolutely necessary.
It's a lot easier to plan a migration than to execute one during an incident.