The Long Road to Bottomless Postgres

Every database eventually runs into the same wall: storage costs money, and the data nobody queries anymore costs exactly as much as the data everyone does. A five-year-old row occupies the same expensive block storage as the order that came in thirty seconds ago. Postgres doesn't know the difference, and why would it? That's honestly a common refrain among most database engines.

As a result, many have dreamed of fixing this by decoupling compute from storage. Push the cold, ancient data down to cheap object storage like S3, keep the hot data local and fast, and let a single query span both. Better yet, store that cold data in an open columnar format for optimal analytics. It's a vision that launched an entire cottage industry of extensions, forks, and re-architectures, each one tackling the problem in its own particular way.

But there are so many approaches now, and none have "the" definitive solution. Each contender seems to demand a different sacrifice, whether that's a forked binary, an extra daemon, a duplicate copy of table data, or eschewing writes entirely. Let's explore the external storage ecosystem and see what we find.

Perchance to Dream

So what does "bottomless" actually mean, and how do we get there?

Postgres declarative partitioning offers the first potential tool in our arsenal. The ability to split data into definitive chunks means the planner can leverage pruning to focus on the most relevant data based on our predicates. From there, we can consider purposefully tiering data based on storage cost and demands. It's not a huge stretch of the imagination to use tablespaces to further federate "hot" and "cold" storage this way. No longer would our critical transaction-sensitive data cost the same as accumulated metrics useful only to analytics.

It's possible to take that idea even further. In the most extreme cases, innovative users utilize Postgres Foreign Data Wrappers to present entirely remote content as local data. Some wrappers are better than others, but this method of decoupling further taps into the potential of "limitless" storage. Indeed, Postgres doesn't even really care where that data lives at all, only that the FDW reacts as instructed. Postgres Table Access Methods can even obfuscate the foreign nature of the content entirely.

And perhaps finally, there's the storage format itself. The Postgres data heap is great, but it's understandably tied to the engine, with all that implies. Storage isn't columnar for better analytic access, it's encumbered with reasonable atomic locking that actively resists multiple write sources, and has very real limitations. Apache Parquet or Iceberg on the other hand, have quietly become the lingua franca for data lakes of all description. Stored externally in those formats, suddenly multiple engines can interact with the "cold" data along with Postgres, in a format better suited for analytics.

In that dream, Postgres is a familiar, operational query shell over literally any amount of data in any location. If that's not "bottomless", I don't know what is. And yet, every project that has chased that dream has missed the mark in one way or another. How? Well, let's take a look at some of the existing projects and see for ourselves.

Neon to The Extreme

The most thorough way to decouple storage from compute is to supplant the storage engine with an entirely new one. Neon decided that was the most direct way to attain "the dream"; just put all data in S3 where "bottomless" is in the product description.

Neon replaces the Postgres storage manager so that pages no longer come from local disk at all. Compute nodes become stateless Postgres processes that fetch pages on demand from a separate service called the pageserver, while a redundant set of safekeepers durably store the write-ahead log before it's processed and uploaded to cloud storage. In a way, the WAL effectively becomes the database. There's no local heap to fsync, and the log sequence number doubles as a temporal key, so the pageserver can reconstruct any page at any point in time by replaying WAL forward from a base image.

It's an elegant design that's entirely transparent to SQL clients. Applications have no idea that pages are being served over the network from layered files in S3. But that flexibility comes with a couple catches.

The first is that this requires a patched Postgres. The storage substitution is woven through the codebase. Neon must start a node without reading a checkpoint record, track the last-written page LSN, and include copious prefetching to address S3 access latency. It's not possible to drop Neon onto a stock community build; it's barely even possible to run it outside of a cloud context. It's more of an ecosystem than the Postgres most users would recognize.

The second catch is latency. Neon's own documentation is refreshingly candid that it "doesn't benefit from OS readahead, and the latency to pageservers is quite high compared to local disk." A cache miss is a network round trip, and a cold miss adds an S3 fetch plus page reconstruction on top. Aggressive prefetching / rehydration softens the blow, but the trade-off is inescapable.

Neon is a remarkable piece of engineering and a fully managed platform. It's also the nuclear option: an entire reimplementation of how Postgres stores data. Perhaps there's a... less invasive way of reaching the same goal?

DuckDB Inside

We already mentioned Parquet and Iceberg storage formats, and they offer a convenient capability. Provided there's an access method, Postgres can treat it like any other externally available data. But something has to read those columnar files on S3 on its behalf. These days, that "something" is usually DuckDB.

DuckDB is an analytical engine with first-class Parquet support and extensions for Iceberg and S3-style object storage. It's fast, columnar, and designed to be dropped into another process. For a Postgres extension author who needs to scan a pile of Parquet files on S3, DuckDB is the obvious shortcut. There's a reason it quietly powers essentially every project in this space.

Practically every "data lake" Postgres extension embeds or leverages DuckDB in some manner, allowing a Postgres client to reach cold data on S3 through the power of SQL. DuckDB is the secret sauce for an all-you-can-eat buffet of storage if used correctly. It turns out, "using it correctly" is highly subjective, and has become its own handful of distinct approaches.

Magic Mirror on the Wall

Sometimes the easiest approach is to copy existing data and call it a day. The goal of the pg_mooncake extension is real-time analytics rather than cheap archival.

Instead of moving data elsewhere, pg_mooncake mirrors it. The extension maintains a synchronized copy of a table in Iceberg alongside it. The approach is non-invasive and leverages the strengths of columnar storage. A Rust ingestion engine called moonlink streams changes through logical decoding, so the mirror stays current with sub-second resolution. Then DuckDB provides the analytical access to the mirrored data.

It's trivial to create a mirror with a single procedure call:

CREATE EXTENSION pg_mooncake CASCADE;

CALL mooncake.create_table('trades_iceberg', 'trades');

SELECT avg(price) FROM trades_iceberg WHERE symbol = 'AMZN';

Notice the naming in that last query. We're selecting from trades_iceberg, not trades. The mirror gets its own name, so the transparency stops at the table boundary; applications must know which copy they want. There are also two copies of the data and the additional continuous CDC to maintain alignment. It's an acceptable trade for fast analytics, but the cold storage isn't being offloaded in this case.

Luckily, pg_mooncake is actively developed and has a healthy following. The architecture has even changed underneath it, moving from a Delta Lake columnstore in its earliest releases to the Iceberg and moonlink design on display today. It's a lakehouse for Postgres, and a good one. But it raises a couple of important questions. If those rows are cold enough to mirror, why keep paying full price to keep the originals warm? Why further compound it with the overhead necessary to continuously synchronize the data?

Moving Day

Extensions like pg_mooncake provide an important lesson: it's almost easy to access Parquet and Iceberg through Postgres. Taken a bit further, we can cut out the middle-man and consider the external data a true archive. If the data is important enough to remain local, it's a separate category of content. Data tiering is nothing new, but using an extension to manage that capability is.

The pg_tier extension sought to provide that interface. Rather than mirror, it prioritized decoupling "old" data from the database entirely. The model is straightforward.

1. Configure an S3 bucket.

2. Point pg_tier at a table.

3. It copies the data to the bucket as a Parquet file.

4. Then it swaps the local table for a foreign table.

The tiering itself runs through a function call:

CREATE EXTENSION pg_tier CASCADE;

SELECT tier.set_tier_config('my-bucket', 'AWS_KEY', 'AWS_SECRET', 'us-east-1');

SELECT tier.table('people');

After that, SELECT * FROM people still works, because the name now resolves to the foreign table. The most common way to do this is with partitions, allowing the root table to access local and remote partitions transparently. Tiering is a one-shot function call rather than an automatic lifecycle. The "real" table becomes an FDW stub, and the whole thing leans on the parquet_s3_fdw FDW from PGSpider. (Not DuckDB?! Gasp!)

Unfortunately, pg_tier is effectively abandoned. Tembo, the original maintainers of the extension, pivoted to AI and eventually deleted the repository. What survives are legacy forks, unmaintained since mid-2024, carrying a beta warning and no published release. It was a good idea that lost its sponsor, and it's a useful reminder that an extension is only as durable as the people willing to maintain it.

Despite that, it offers a valuable window into what's possible. And as with such value, another contender becomes inevitable.

Ready Player Two

While pg_tier provided an interesting prototype, the pg_lake "suite" (a bit more than an extension) was our first glimpse into a full-scale implementation with serious engineering behind it. It began its life as Crunchy Data Warehouse before arriving as open source after the company's acquisition by Snowflake.

The stack of tools enables querying Iceberg and Parquet tables directly from Postgres, with the heavy lifting delegated to DuckDB for vectorized execution. The syntax is clean and feels native thanks to use of Postgres Table Access Methods:

CREATE TABLE measurements (
    station_name text NOT NULL,
    measurement  double precision NOT NULL
)
USING iceberg WITH (location = 's3://mybucket/measurements/');

INSERT INTO measurements VALUES ('Istanbul', 18.5);

The architectural decision that defines pg_lake is how it runs DuckDB. Instead of embedding it in the Postgres backend, pg_lake relies upon a separate daemon called pgduck_server. It's a standalone executable that speaks the Postgres wire protocol over a local socket and runs DuckDB underneath. The project suggests it made this decision because embedding DuckDB directly would collide with Postgres's process-per-connection model, since DuckDB is multi-threaded and the Postgres backend is not. They believe that a separate daemon neatly sidesteps the threading and memory-safety headaches.

While that decision buys performance and stability, there's an operational cost. A pg_lake deployment consists of two components to run, configure, and maintain: Postgres and pgduck_server. For some teams that's a non-issue. For others, an extra daemon is exactly the kind of moving part they adopted Postgres to avoid.

A Cooler Forecast

Which brings us to ColdFront, pgEdge's entry, and the reason for this whole survey. ColdFront tries to occupy the space the others leave open, and it makes a different set of trade-offs at nearly every decision point we've covered.

Like pg_mooncake and pg_lake, it uses DuckDB for the heavy lifting. Unlike pg_lake, it embeds it in-process through pg_duckdb, so there's no separate daemon or sidecar process. Like pg_tier, it moves cold data out to object storage rather than mirroring it, so there's only one copy and Iceberg becomes the system of record for archived rows. Unlike pg_tier, it doesn't leave us with a foreign-table stub or a one-shot manual move. Instead, a small Go archiver runs on a schedule and ages partitions from hot to cold automatically.

The part that ties it together is the query path. ColdFront installs a post_parse_analyze_hook in C that rewrites SQL at parse time. The table name is actually a view spanning a UNION ALL of the hot Postgres partitions and the cold Iceberg data. The hook inspects the WHERE clause against an archive watermark to decide which tier to include. Hot-only queries never go near S3, cold-only queries use DuckDB's Iceberg pruning, and a query straddling the boundary transparently hits both.

Setup looks like ordinary Postgres:

CREATE EXTENSION pg_duckdb;
CREATE EXTENSION coldfront;

SELECT coldfront.set_storage_secret('admin', 'adminsecret', 'seaweedfs:8333');

CREATE TABLE events (
    id   bigint GENERATED ALWAYS AS IDENTITY,
    ts   timestamptz NOT NULL,
    data jsonb,
    PRIMARY KEY (id, ts)
) PARTITION BY RANGE (ts);

From there, we can query events like any other table, and old partitions quietly migrate to Iceberg behind the scenes when the archiver runs. What's genuinely different here is that the cold tier stays writable through the same name. An UPDATE or DELETE against archived rows just works, no rehydration and no special path. Postgres really does treat it just like any other table.

ColdFront can also run in a fully decoupled mode where a table lives in Iceberg from its very first row, making Postgres a stateless compute front-end, or in a partition-only mode that does lifecycle management with no Iceberg at all. And because pgEdge specializes in distributed Postgres, there's a mutual-exclusion scheme modeled on Lamport's bakery algorithm. This allows multiple nodes in a cluster to write the same Iceberg table without stepping on each other. That makes it possible for nodes in an active-active cluster to share the same archival target (and write to it) without locking contention.

For all that, ColdFront is still a brand new extension, complete with a "BETA PRE-RELEASE, NOT FOR PRODUCTION" warning in the repository. Both operating modes work end to end and the test matrix is green, but there are still a few usage caveats. While still not truly "battle tested", it also threads the needle to cover scenarios the other approaches don't.

Finding The Sweet Spot

What do we see in this short external data survey?

Neon paid the steepest price for the most comprehensive result. Rewriting the storage manager buys perfectly transparent, effectively infinite storage. It costs a forked binary plus a network round trip on every page that misses the cache. For anyone who wants a managed platform and never wants to think about disks again, that's an acceptable trade. For anyone who flatly refuses to run a non-standard Postgres, it's a non-negotiable deal-breaker.

The remaining solutions opted to remain as extensions on stock Postgres, but used radically different approaches. pg_mooncake barely moves the needle at all, keeping a live columnar mirror in sync for the sake of analytics. It proves the external data concept, but has limited utility otherwise. The remaining extensions go further and rely on external storage, keeping a single copy and letting object storage become the true home for Parquet or Iceberg data. Whether that access is in-process or an external daemon is the decision between pg_lake and ColdFront.

So where does that leave someone who actually has to choose?

I'll be honest that my own preference here isn't exactly neutral. What wins me over is the one thing none of the others quite manage together: coherence. ColdFront tries to be as transparent as possible, passing everything through a single table name. There's only ever one copy of the data, and there's no sophisticated cache refresh layer or annoying daemon to set up. The extension handles read and write routing, and that's the end of it.

I'll admit the Go archival daemon struck me as an odd choice given Postgres background workers are available, but maybe that'll change in the future. Still, everything else about the extension is just easier to use and makes interacting with tiered storage a no-brainer.

Final Thoughts

The dream of bottomless Postgres is quietly converging. Open table formats won; nearly everyone now stores cold data as Parquet and Iceberg, which means archived rows stay portable no matter which extension wrote them. DuckDB won the engine war by default, sitting under almost every contender whether embedded or sidecar. The remaining arguments are about packaging and trade-offs, not fundamentals, and that's a healthy sign for an ecosystem.

ColdFront is the newest answer with the simplest architecture. Whether the straightforward design survives contact with production is the question only time and a few brave early adopters can answer. It's beta software, and it says so loudly, so treat it as a glimpse of where this is heading for now. We at pgEdge hope to change that in the near future, but we want to stay honest.

A default Postgres install still has no concept of data tiering aside from tablespaces, and that barely counts. But there are now several credible approaches to allowing Postgres to play to its primary strength: query execution. It's now possible to leverage Postgres as a compute layer for result aggregation regardless of data location. Foreign data wrappers are one way to do that, but now that Parquet and Iceberg are becoming the data lake standard, perhaps a more tightly-coupled design makes more sense.

Either way, it's an exciting time in the Postgres ecosystem. Extensions like ColdFront finally allow Postgres to operate based on data temperature, even if that's a concept I made up just now. It's a common enough scenario that I think ColdFront will make a genuine difference. So please, put it through its paces, and let us know how it goes.