A couple of months back, the CEO challenged product and marketing to revamp the developer experience on our website in three weeks. I vibe-coded a proof of concept full of "try it now" buttons and interactive guides, the CEO loved it, and then I had to face the fact that almost every one of those interactive guides was a placeholder card. Engineering was fully booked, and the Control Plane product I needed to write guides for was one I knew inside out at the architecture level but had never personally operated end-to-end through the API.

So I sat down and learned the pgEdge Control Plane the hard way: by using it. What follows is what I found, organized as the guide I wish I'd had when I started. If you're evaluating Control Plane, deploying it for the first time, or trying to understand what Day 2 operations actually look like, this is for you.

What Is the Control Plane?

pgEdge Control Plane is a lightweight orchestrator for PostgreSQL. It manages the full database lifecycle (creation, replication, failover, backup, restore, scaling) through a declarative REST API. You describe the database you want in a JSON spec, POST it, and Control Plane handles the rest: configuration, networking, Spock multi-master replication, Patroni for high availability, pgBackRest for backups. All of it.

The important thing to understand is that setup is only half the story. There are enough tools out there that can get you a running cluster if you know what you're doing. The hard part, the part where most tools leave you on your own, is Day 2. Modifying a running HA cluster. Adding a node to a live distributed database. Performing a rolling upgrade without downtime. Restoring from backup while keeping replication intact across the remaining nodes. That's where the complexity lives, and that's where Control Plane earns its keep.

Getting Started: Zero to Multi-Master in Five Minutes

Caveat

I'm not promising that every line of code in here will run as-is. It's all real, but as I learned the hard way while building the interactive guides, you can't control everything, so consider it illustrative.

If you'd rather skip all of that, the Codespaces environment at the end of this post has everything pre-installed and ready to go.

Prerequisites

You'll need Docker, curl, jq, and psql (the PostgreSQL client) installed on your machine. On macOS, you can get psql via brew install postgresql@17 (or whichever major version you prefer). On Linux, your distribution's postgresql-client package will do the job.

If you're using Docker Desktop, there's one gotcha that will bite you if you skip it: you need to enable host networking manually. Go to Docker Desktop > Settings > Resources > Network, check "Enable host networking," and restart Docker. Without this, Control Plane won't be accessible via localhost.

Start the Control Plane

Control Plane uses Docker Swarm to manage the database containers, so the first step is initializing Swarm mode if it isn't already active. Then you pull the image, start the container, and initialize the cluster.

# Initialize Docker Swarm (skip if already active)
docker swarm init

docker run --detach \
    --env PGEDGE_HOST_ID=host-1 \
    --env PGEDGE_DATA_DIR=/tmp/pgedge-cp-demo \
    --volume /tmp/pgedge-cp-demo:/tmp/pgedge-cp-demo \
    --volume /var/run/docker.sock:/var/run/docker.sock \
    --network host \
    --name host-1 \
    ghcr.io/pgedge/control-plane \
    run

# Wait for the API to come up
until curl -sf http://localhost:3000/v1/version >/dev/null 2>&1; do sleep 2; done

# Initialize the cluster
curl http://localhost:3000/v1/cluster/init

The --network host flag is required because Control Plane needs stable IP addresses for both inter-machine communication (between Control Plane instances on different hosts) and intra-machine communication with Patroni and Postgres. The Docker socket mount (/var/run/docker.sock) is how Control Plane creates and manages the Postgres containers it orchestrates. And if docker swarm init fails because you have multiple network interfaces, you may need to specify which IP to advertise: docker swarm init --advertise-addr <your-ip>. In a production cluster, this should be an IP address that's accessible from all other machines in your cluster.

Once curl http://localhost:3000/v1/cluster/init returns, the API is listening on port 3000 and the cluster is ready to accept database specs.

Create a Distributed Database

A single POST request with one JSON payload gives you three nodes with multi-master replication.

curl -s -X POST http://localhost:3000/v1/databases \
    -H "Content-Type: application/json" \
    --data '{
        "id": "example",
        "spec": {
            "database_name": "example",
            "database_users": [
                {
                    "username": "admin",
                    "password": "password",
                    "db_owner": true,
                    "attributes": ["SUPERUSER", "LOGIN"]
                }
            ],
            "nodes": [
                { "name": "n1", "port": 5432, "host_ids": ["host-1"] },
                { "name": "n2", "port": 5433, "host_ids": ["host-1"] },
                { "name": "n3", "port": 5434, "host_ids": ["host-1"] }
            ]
        }
    }' | jq .task

If ports 5432-5434 are already in use on your machine (maybe you have a local Postgres running), just change the port numbers in the spec. Any available ports will work.

That JSON spec is the entire declaration. Three nodes, each one a full Postgres primary, with Spock replication configured bidirectionally between all of them. No replication slot configuration, no publication/subscription SQL, no manual wiring of logical replication channels. You describe what you want, and Control Plane builds it.
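Hand-written JSON payloads like the one above are easy to break with shell quoting. One option, purely my own convenience sketch and not a Control Plane requirement, is to assemble the same payload with jq so variables are escaped safely:

```shell
# Build the database spec with jq instead of an inline heredoc.
# Field names match the spec shown above; the variables are mine.
DB_ID=example
ADMIN_PASSWORD=password
SPEC=$(jq -n --arg id "$DB_ID" --arg pw "$ADMIN_PASSWORD" '{
    id: $id,
    spec: {
        database_name: $id,
        database_users: [
            { username: "admin", password: $pw, db_owner: true,
              attributes: ["SUPERUSER", "LOGIN"] }
        ],
        nodes: [
            { name: "n1", port: 5432, host_ids: ["host-1"] },
            { name: "n2", port: 5433, host_ids: ["host-1"] },
            { name: "n3", port: 5434, host_ids: ["host-1"] }
        ]
    }
}')

# POST it exactly as before:
# curl -s -X POST http://localhost:3000/v1/databases \
#     -H "Content-Type: application/json" --data "$SPEC" | jq .task
```

The payload is identical to the inline version; the only difference is that a password containing quotes or dollar signs can't corrupt the JSON.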

Database creation is asynchronous. The API returns a task ID immediately, and Control Plane works in the background to pull images, start containers, configure Postgres, and wire up Spock replication. On the first run this takes a couple of minutes because it's pulling container images. Poll the database endpoint until the state flips to available:

# Poll until the database is ready
until [ "$(curl -s http://localhost:3000/v1/databases/example | jq -r .state)" = "available" ]; do
    sleep 5
done
echo "Database is ready."
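That loop will spin forever if provisioning gets stuck. A small generic helper (my own sketch, not part of Control Plane) makes a stuck provision fail loudly instead:

```shell
# Retry a command every 2 seconds until it succeeds or the timeout
# (in seconds) is exceeded. Returns 1 on timeout.
wait_until() {
    timeout=$1; shift
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 2
        waited=$((waited + 2))
    done
}

# Usage against the API shown above, with a 5-minute ceiling:
# wait_until 300 sh -c \
#     '[ "$(curl -s http://localhost:3000/v1/databases/example | jq -r .state)" = "available" ]'
```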

Prove It Works

Create a table on node 1:

PGPASSWORD=password psql -h localhost -p 5432 -U admin example \
    -c "CREATE TABLE example (id int primary key, data text);"

Insert a row on node 2:

PGPASSWORD=password psql -h localhost -p 5433 -U admin example \
    -c "INSERT INTO example (id, data) VALUES (1, 'Hello from n2!');"

Read it back from node 1:

PGPASSWORD=password psql -h localhost -p 5432 -U admin example \
    -c "SELECT * FROM example;"

Row written on n2, readable on n1. Spock replicated it in milliseconds. Every node accepts reads and writes, and every change propagates to every other node automatically.

Node Failure and Recovery

This is the part that sold me when I was running through this myself. Take node 2 offline:

N2_SERVICE=$(docker service ls \
    --filter label=pgedge.component=postgres \
    --filter label=pgedge.node.name=n2 \
    --format '{{ .Name }}')

docker service scale "$N2_SERVICE"=0

Node 2 is dead, gone from the cluster. Now write data while it's missing:

PGPASSWORD=password psql -h localhost -p 5432 -U admin example \
    -c "INSERT INTO example (id, data) VALUES (3, 'Written while n2 is down!');"

Read from n3, which doesn't care that a node is missing:

PGPASSWORD=password psql -h localhost -p 5434 -U admin example \
    -c "SELECT * FROM example;"

All three rows. Now bring n2 back:

docker service scale "$N2_SERVICE"=1

Wait for it to come up, read from n2, and all three rows are there. Including the one written while n2 was down. Spock caught it up automatically, with no manual intervention, no replication conflict resolution, no panicked DBA at 3am running pg_rewind. The recovery, not the initial replication, is the demo that sells itself.

Day 2 Operations: The Hard Part, Made Simple

Initial setup is the easy part. Plenty of tools can get you a running cluster on Day 1. Where Control Plane separates itself is Day 2: modifying, scaling, and maintaining a running cluster without requiring a PhD in distributed systems administration.

High Availability with Read Replicas

Each node in a distributed database can have its own read replicas. You configure this by adding more host IDs to the host_ids array. The first host gets the primary, the rest become replicas managed by Patroni. Our examples so far run on a single machine, but here's what a production spec might look like with hosts across AWS regions:

"nodes": [
    { "name": "n1", "host_ids": ["us-east-1a", "us-east-1c"] },
    { "name": "n2", "host_ids": ["eu-central-1a", "eu-central-1b"] },
    { "name": "n3", "host_ids": ["ap-south-2a", "ap-south-2c"] }
]

That's a 3-node multi-master database spanning three AWS regions, each node with a read replica in a different availability zone. Six Postgres instances, bidirectional replication between the primaries, streaming replication to the replicas, automatic failover via Patroni. All from a single JSON spec.

Switchover vs. Failover

Control Plane exposes two distinct tools for handling primary transitions, both built on Patroni under the hood. The value isn't in reinventing what Patroni already does well, it's in wrapping those operations in the same declarative API you use for everything else, so you can trigger and observe them from anywhere in the cluster.

Switchover is for planned maintenance. It's a graceful transition from a primary to a replica, and you can even schedule it for a specific time:

curl -X POST http://localhost:3000/v1/databases/example/nodes/n1/switchover \
    -H 'Content-Type:application/json' \
    --data '{
        "candidate_instance_id": "example-n1-b",
        "scheduled_at": "2026-04-05T02:00:00Z"
    }'

Control Plane validates cluster health before proceeding, promotes the specified replica, and demotes the old primary to replica status. You can run this during a maintenance window and go to bed knowing it will handle the transition at 2am without you.
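If you'd rather not hand-write that timestamp, here's a quick way to generate one in the format the scheduled_at examples above use. This is my convenience sketch and assumes GNU date; on macOS, substitute date -u -v+1H:

```shell
# Produce a UTC RFC 3339 timestamp one hour from now (GNU date syntax).
SCHEDULED_AT=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
echo "$SCHEDULED_AT"

# Then interpolate it into the switchover payload:
# --data "{\"candidate_instance_id\": \"example-n1-b\", \"scheduled_at\": \"$SCHEDULED_AT\"}"
```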

Failover is for when things have already gone wrong. A primary is unreachable and you need to promote a replica immediately:

curl -X POST http://localhost:3000/v1/databases/example/nodes/n1/failover \
    -H 'Content-Type:application/json' \
    --data '{
        "candidate_instance_id": "example-n1-b"
    }'

It even has a skip_validation flag for disaster recovery scenarios where the cluster is already in a degraded state and you just need to get a new primary up. The control is there when you need it, and the automation handles the rest when you don't.

Scaling: Adding and Removing Nodes

This is one of the things I didn't appreciate until I dug into the codebase. You can scale a running distributed database by updating the spec and POSTing it back. Want to add a fourth node? Update the nodes array:

curl -s -X POST http://localhost:3000/v1/databases/example \
    -H "Content-Type: application/json" \
    --data '{
        "spec": {
            "nodes": [
                { "name": "n1", "port": 5432, "host_ids": ["host-1"] },
                { "name": "n2", "port": 5433, "host_ids": ["host-1"] },
                { "name": "n3", "port": 5434, "host_ids": ["host-1"] },
                { "name": "n4", "port": 5435, "host_ids": ["host-1"] }
            ]
        }
    }'

Control Plane figures out the delta between the old spec and the new one, provisions the new node, configures Spock replication to and from the existing nodes, and syncs the data. The same declarative model you used on Day 1 works on Day 200. You don't need to learn a different set of commands for modifying a running cluster than the ones you used to create it in the first place.
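To make the idea concrete, here is the node-name delta between the old and new specs computed with jq. This is illustrative only; it mirrors, but does not reproduce, the diff Control Plane performs internally:

```shell
# Two toy specs: the new one adds node n4.
OLD='{"nodes":[{"name":"n1"},{"name":"n2"},{"name":"n3"}]}'
NEW='{"nodes":[{"name":"n1"},{"name":"n2"},{"name":"n3"},{"name":"n4"}]}'

# jq array subtraction: names present in NEW but not in OLD
# are the nodes that would need provisioning.
ADDED=$(jq -n --argjson old "$OLD" --argjson new "$NEW" \
    '[$new.nodes[].name] - [$old.nodes[].name]')
echo "$ADDED"
```

Removing a node from the array works the same way in reverse: the name disappears from the spec, and the orchestrator tears that node down.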

Instance-Level Control

Sometimes you don't need to operate on a whole node. You need to bounce a specific Postgres instance, maybe to pick up a configuration change or clear shared buffers. Control Plane lets you do that at the instance level:

# Restart a specific instance
curl -X POST http://localhost:3000/v1/databases/example/instances/example-n1-a/restart \
    -H 'Content-Type:application/json' \
    --data '{
        "scheduled_at": "2026-04-05T03:00:00Z"
    }'

You can restart, stop, and start individual instances, and each operation supports scheduling for a future time. The force parameter lets you operate on instances even when the database is in a degraded state, which is exactly when you tend to need fine-grained control the most.

PostgreSQL Configuration

You can pass PostgreSQL configuration parameters directly through the database spec, at both the database level and per-node:

{
    "spec": {
        "postgresql_conf": {
            "max_connections": 1000,
            "shared_buffers": "256MB",
            "effective_cache_size": "1GB",
            "work_mem": "64MB"
        }
    }
}

Set it globally in the spec and it applies to every node. Override it at the node level for nodes that serve different workloads (maybe n1 handles your OLTP traffic and needs more connections, while n3 handles analytics with larger work_mem). Control Plane applies the changes and handles the restarts where needed.
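For illustration, a per-node override might look like the fragment below. The exact nesting of postgresql_conf inside a node object is my assumption here; check the API reference for the authoritative shape:

```json
"nodes": [
    { "name": "n1", "port": 5432, "host_ids": ["host-1"],
      "postgresql_conf": { "max_connections": 2000 } },
    { "name": "n3", "port": 5434, "host_ids": ["host-1"],
      "postgresql_conf": { "work_mem": "256MB" } }
]
```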

Backups and Restore

Control Plane integrates pgBackRest and makes backup configuration declarative, following the same philosophy as everything else. Add a backup_config to your database spec:

"backup_config": {
    "repositories": [
        {
            "type": "s3",
            "s3_bucket": "my-database-backups",
            "s3_region": "us-east-1"
        }
    ],
    "schedules": [
        {
            "id": "nightly-full",
            "type": "full",
            "cron_expression": "0 0 * * *"
        },
        {
            "id": "hourly-incremental",
            "type": "incr",
            "cron_expression": "0 * * * *"
        }
    ]
}

Nightly full backup, hourly incrementals, to S3. The repositories array supports S3, GCS, Azure Blob Storage, and POSIX/CIFS mounts, and you can configure multiple repositories for geographic redundancy. You can also trigger manual backups through the API when you want one outside the schedule:

curl -X POST http://localhost:3000/v1/databases/example/nodes/n1/backups \
    -H 'Content-Type:application/json' \
    --data '{ "type": "full" }'

Point-in-Time Recovery

Restore is a single API call, and it supports three different targeting modes depending on how precise you need to be:

curl -X POST http://localhost:3000/v1/databases/example/restore \
    -H 'Content-Type:application/json' \
    --data '{
        "restore_config": {
            "source_database_id": "example",
            "source_node_name": "n1",
            "source_database_name": "example",
            "repository": {
                "type": "s3",
                "s3_bucket": "my-database-backups",
                "s3_region": "us-east-1"
            },
            "restore_options": {
                "type": "time",
                "target": "2026-04-01 09:38:52-04"
            }
        }
    }'

That restores to a specific timestamp. You can also target a specific WAL LSN ("type": "lsn") or a transaction ID ("type": "xid") if you need byte-level precision about exactly how far to roll back. Control Plane orchestrates pgBackRest to handle the whole restore sequence: tears down replication subscriptions, stops the instance, runs the restore, brings it back up, reconnects replication. A whole chain of operations that have to happen in the right order, with the right error handling, or you end up with a split-brain cluster and a very bad morning. Automated into a single POST.
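If you want to generate a target timestamp rather than type one, the style shown above ("2026-04-01 09:38:52-04") can be produced like this. My sketch assumes GNU date; %:::z emits the short numeric UTC offset:

```shell
# Format a recovery target 15 minutes in the past, in the
# "YYYY-MM-DD HH:MM:SS-04"-style format used above.
TARGET=$(date -d '15 minutes ago' +'%Y-%m-%d %H:%M:%S%:::z')
echo "$TARGET"
```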

You can also create entirely new databases from backups, or seed new nodes from existing backup data. Adding a fourth node to your cluster without copying data over the wire from a running primary? Just point it at the backup repository.

Deploy Supported Services Alongside Your Database

This capability is in beta, so it's not in any of the marketing materials yet. Control Plane can deploy and manage services alongside your databases, using the same declarative spec model. Right now, that means MCP servers, RAG servers, and PostgREST instances, with more on the way:

"services": [
    {
        "service_id": "my-mcp",
        "service_type": "mcp",
        "version": "latest",
        "host_ids": ["host-1"],
        "port": 8080,
        "cpus": "0.5",
        "memory": "512M",
        "config": {
            "llm_provider": "anthropic",
            "llm_model": "claude-sonnet-4-5"
        },
        "database_connection": {
            "target_nodes": ["n1"],
            "target_session_attrs": "primary"
        }
    }
]

You declare the service type, resource limits, and which database node it should connect to, and Control Plane handles deployment, health checking, and lifecycle management. The database_connection block lets you control session attributes, so you can point read-heavy services at standby replicas and write-heavy services at the primary.

This is particularly interesting for the AI use case. You can spin up a distributed Postgres database with pgVector, configure Spock replication across regions, and deploy an MCP server on top of it, all from a single JSON spec. The database and the services that consume it are managed as a unit.

Monitoring Operations with Tasks

Every mutating operation in Control Plane (create, update, delete, backup, restore, switchover, failover, instance restart) produces a task that you can track through the API. You already saw this earlier when we polled for the database creation to complete, but you can dig much deeper:

# List all tasks for a database
curl -s http://localhost:3000/v1/databases/example/tasks | jq .

# Get detailed logs for a specific task
curl -s "http://localhost:3000/v1/databases/example/tasks/{task_id}/log?limit=100" | jq .

Tasks have states (pending, running, completed, failed, canceled), and the log endpoint supports streaming with pagination so you can follow long-running operations in real time. If something goes wrong during a restore or a scale-out, the task log tells you exactly where it failed and why.
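A filter I found handy: pulling out just the failed tasks with jq. The sample below uses a hypothetical, simplified response shape (real task objects carry more fields, and the endpoint may wrap results; adjust the path accordingly):

```shell
# Simplified, illustrative task list.
TASKS='[{"id":"t-1","state":"completed"},{"id":"t-2","state":"failed"}]'

# Print the IDs of any failed tasks.
echo "$TASKS" | jq -r '.[] | select(.state == "failed") | .id'

# The same filter applied to the live endpoint:
# curl -s http://localhost:3000/v1/databases/example/tasks | \
#     jq -r '.[] | select(.state == "failed") | .id'
```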

You can also query tasks at the cluster level with scope filters, so a "show me everything that happened to database X in the last 24 hours" query is straightforward. This is the kind of observability that makes the difference between "something went wrong" and "here's exactly what went wrong, and here's the log entry that tells us why."

The Full Spec: What You Can Declare

Here's a quick reference of everything you can configure in a database spec. The declarative model means all of these fields work the same way: set the value you want, POST the spec, and Control Plane makes it so.

| Field | What It Controls |
| --- | --- |
| database_name | The Postgres database name |
| postgres_version | Specific PG version (e.g., 17.6) |
| spock_version | Spock extension version |
| nodes | Node topology and host assignments |
| database_users | Users, passwords, roles, and attributes |
| postgresql_conf | Any Postgres parameter, globally or per-node |
| backup_config | Repositories, schedules, retention policies |
| restore_config | Source, target, and recovery options |
| services | MCP, PostgREST, and other co-located services |
| orchestrator_opts | Extra volumes, networks, and labels for Docker |
| cpus / memory | Resource limits per node |

The key insight is that the spec is your single source of truth, the same infrastructure-as-code pattern you'd recognize from Terraform or Kubernetes manifests. You don't learn one set of commands for creation and a different set for modifications. Change the spec, POST it, and Control Plane calculates the diff and applies it.

What's Coming Next

The team isn't slowing down.

systemd support is in the works, which means Control Plane won't require Docker on every host. The ability to deploy supporting services alongside your databases is expanding (think connection poolers, monitoring agents, and AI/ML tooling managed through the same declarative API). Extensions like pgVector, PostGIS, pgAudit, and the pgEdge Vectorizer are already supported and configurable through the spec.

The capability surface keeps expanding, and the interaction model stays the same: describe what you want, POST it, let Control Plane figure out the how.

How I Built This Knowledge (And Why It Matters)

I want to circle back to how this blog came to exist, because there's a lesson in it for anyone building developer products.

I'd been the PM for Control Plane for months. I'd read the architecture docs, attended the design reviews, written the user stories, built the roadmap. And yet my understanding of the product shifted when I sat down and ran every API call myself. There's a difference between knowing what your product does and feeling it respond to your inputs through a terminal. Both matter, but the second one is what lets you write a guide that actually helps someone.

That three-week sprint to build the interactive guides forced me through every operation in this blog post. Along the way, I made every rookie mistake you'd expect: I originally had setup scripts that automatically installed Docker and curl and jq on people's machines (engineering rightly pulled me back from that cliff). I built too many delivery formats before figuring out which ones developers actually wanted. I went spectacularly, enthusiastically overboard.

The final result was tighter for all that trimming. But the real output wasn't the guides. It was the deeper understanding of the product that only comes from using it fully. If you're a PM and you haven't REALLY used your own product end-to-end recently, go do it. You'll find things that surprise you, things you had forgotten, things that frustrate you, and things that make you think "damn, this is actually good." All are worth knowing about.

Cleaning Up

When you're done experimenting, tear everything down in three steps:

# Remove the database and its services
docker service rm $(docker service ls --filter label=pgedge.database.id=example -q) 2>/dev/null
 
# Remove the Control Plane container
docker rm -f host-1
 
# Remove the data directory (needs sudo because Docker creates files as root)
sudo rm -rf /tmp/pgedge-cp-demo

Try It Yourself

Everything I built during those three weeks is open source and ready to run.

The fastest path: open it in GitHub Codespaces and you'll have a working environment in under a minute:

Open in Codespaces

On your own machine, one command bootstraps everything:

curl -fsSL https://raw.githubusercontent.com/pgEdge/control-plane/main/examples/walkthrough/install.sh | bash

Prefer VS Code? Install the Runme extension, open the walkthrough, and click Execute Cell on each block.

How long to go from zero to a running distributed Postgres database? About five minutes. I timed it.

Control Plane is open source from pgEdge. And I’m unashamedly proud of what we’ve built. Documentation | API Reference | GitHub | Enterprise Postgres Downloads