<?xml version="1.0" encoding="UTF-8" ?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
        <channel>
            <title>pgEdge Posts from Antony Pegg</title>
            <link>https://www.pgedge.com/blog</link>
            <description>The latest pgEdge Posts from Antony Pegg</description>
            <atom:link href="https://www.pgedge.com/feeds/rss/user/antony-pegg/all.xml" rel="self" type="application/rss+xml" />
            <language>en-us</language>         
            
            <item>
            <category>pgEdge,PostgreSQL</category>
            <title><![CDATA[pgEdge Control Plane Adds Supporting Services and a Preview of systemd Support]]></title>
            <link>https://www.pgedge.com/blog/pgedge-control-plane-adds-supporting-services-and-a-preview-of-systemd-support</link>
            <pubDate>Thu, 07 May 2026 17:20:17 GMT</pubDate>
            <description><![CDATA[ <p>Most Postgres management tools ask you to pick a lane. You can manage databases, or you can manage the services around them. You can run in containers, or you can run on bare metal. You get one deployment model, one operational surface, one set of assumptions about how your infrastructure works.The pgEdge Control Plane just added two features that refuse to pick a lane: Supporting Services and systemd Support. Together, they push the Control Plane into territory that, as far as we can tell, nobody else in the Postgres world is covering. Supporting Services is fully available, while the systemd support is currently a Preview feature.<h2>Supporting Services: More Than Just Postgres</h2>Here's the thing about enterprise Postgres in 2026: the database is only part of the story. Your AI agents need an MCP server to talk to the data, your applications need a REST API to query it, and your knowledge base needs a RAG server to index and retrieve from it. These services aren't optional extras, they're what make the database useful in production.Until now, you managed those services separately: different deployment pipelines, different configuration, different credentials, different monitoring. The database lived in one world and the services that depended on it lived in another, even though they're fundamentally coupled. When the database moves, the services need to follow, and when credentials rotate, every connected service needs to know about it. When you scale out, everything needs to come along for the ride.Supporting Services in the Control Plane fixes this by treating the database and its surrounding services as a single declarative unit. You add a array to the same JSON spec you already use for your database, and the Control Plane handles deployment, credential provisioning, health checking, and lifecycle management for everything together.That's a two-node distributed database with Spock multi-master replication, an MCP server for AI agent access on the US node, and PostgREST instances on both nodes for REST API coverage. One spec, one POST, and the Control Plane builds the whole thing. Each service instance gets its own automatically provisioned database credentials (read-only by default, read-write when the service needs it), and those credentials are scoped, rotated, and revoked by the Control Plane without you touching them.<h3>What You Can Deploy Today</h3>The beta launches with three service types, and the framework is designed to expand. These are the first three, not the last three.pgEdge Postgres MCP Server connects AI agents and LLM-powered applications to your database. Configure it with your preferred LLM provider (Anthropic, OpenAI, or self-hosted Ollama), and your agents get tools for querying data, inspecting schemas, running EXPLAIN plans, and performing vector similarity searches. You can run it as a pure tool server (where the connecting client supplies its own LLM) or enable the built-in LLM proxy for direct HTTP chat. The MCP server supports fine-grained control over which tools are exposed and whether write access is permitted, so you can give AI agents exactly the access they need and nothing more.pgEdge RAG Server enables retrieval-augmented generation workflows using your database as a knowledge store. Configure multiple pipelines, each targeting specific tables with their own embedding models and search tuning. Point it at your documentation tables, your support ticket history, your product catalog, and the RAG server handles chunking, embedding, and retrieval. It works with Anthropic, OpenAI, Voyage, and Ollama for embeddings, so air-gapped deployments with local models are fully supported.PostgREST automatically generates a REST API from your PostgreSQL schema. No backend code, no endpoint definitions, no ORM configuration. PostgREST reads your schema, respects your row-level security policies, and exposes your tables and views as RESTful endpoints. The Control Plane handles JWT configuration, connection pooling, and CORS settings through the same declarative config.<h3>Deployment Flexibility</h3>Services are independent of your database node topology. You can co-locate a service on the same host as a database node to minimize latency, run it on a dedicated host to isolate the workload, or deploy multiple instances across hosts for redundancy. The  block in the service spec lets you control exactly which database node each service connects to and whether it targets the primary or a standby, so you can point read-heavy services at replicas and write-heavy services at the primary without any manual connection string management.<h2>systemd Support: Your Hosts, Your Way</h2>The other half of this release is a Preview feature that tackles a different constraint entirely, one that's been blocking a whole segment of enterprise customers from adopting the Control Plane. The Control Plane has required Docker Swarm since day one, and for many teams that's fine. But for production database teams in regulated industries (financial services, healthcare, government), containers on database hosts could be a non-starter. Their security teams may not approve of Docker. Their operational runbooks assume standard deployments of system package managers and services. Their monitoring, their backup scripts, their muscle memory all assume Postgres is a service on a host, not a process inside a container.These are exactly the kinds of customers who need what the Control Plane offers (declarative management, automated HA, backup scheduling, rolling upgrades), but they've been locked out by the container requirement. systemd support removes that lock.The same declarative API, the same Patroni-based high availability, the same backup/restore integration, but without Docker anywhere in the picture. Your databases and the Control Plane server run as systemd units on standard OS file paths. The entire stack is standard Linux processes that any sysadmin or security auditor already understands. The RHEL beta is available now, with Debian support to follow shortly.<h3>What This Means in Practice</h3>The Control Plane discovers what's installed on the host (Postgres versions, extensions, existing data directories) and works with what's there. Standard OS file locations (/) mean your existing monitoring, log aggregation, and backup tooling continues to work without reconfiguration.The API is identical regardless of your orchestrator, and that's the important part. Whether your Control Plane is orchestrating Docker Swarm services or systemd units, the endpoints don't change. The same spec that creates a distributed database on containers creates one directly on the host. Your automation, your CI/CD pipelines, your infrastructure-as-code workflows don't care which orchestrator is running underneath. You pick the deployment model that fits your environment, and the Control Plane adapts.This also opens the door for existing pgEdge customers that run directly on the host to migrate to the Control Plane without changing their deployment model. If you're running pgEdge Distributed Postgres on bare-metal hosts today, the systemd orchestrator meets you exactly where you are: same hosts, same packages, same file locations, with the Control Plane's declarative API and operational tooling layered on top.<img src="https://a.storyblok.com/f/187930/624x437/cd071e9a95/control-plane-architecture-diagram.png" ><h2>The Bigger Picture</h2>Take a step back and look at what the Control Plane now covers, because the full picture is worth seeing. You can start with a single-node Postgres database and scale to multi-master distributed Postgres with Spock replication, without re-platforming. You can deploy AI services (MCP, RAG) and data access services (PostgREST) alongside your database, managed as a unit through the same declarative spec. You can run all of it in containers via Docker Swarm or on bare metal via systemd, with the same API either way. And the whole thing is built on sensible defaults that get you running fast while leaving every knob exposed when you need to tune.We went looking for another Postgres vendor doing all of this from a single declarative API, and we couldn't find one. There are enterprise management platforms, and there are services platforms, and there are distributed database platforms. But we haven't found anyone else combining single-to-distributed progression, AI and data access services, and bare-metal-to-container deployment flexibility in a single declarative model with sensible defaults.The Control Plane goes further than anything else we've seen in the Postgres world: one spec, one API, one operational surface for everything. The database, the services, the deployment model, and the operational lifecycle, all declared together and managed together. That's the territory we're pushing into, and as far as we can tell, it's uncharted.The Supporting Services framework is designed to grow. MCP, RAG, and PostgREST are the first three service types, but the architecture is built for expansion. Connection poolers, monitoring agents, and other tools that belong alongside an enterprise database are all candidates. The pattern is the same: add a service to your spec, the Control Plane handles the rest.<h2>Try Out The New Features</h2>Both features are available now in the latest Control Plane Release. Supporting Services works with the Docker Swarm orchestrator today. The systemd support preview is available for RHEL, with Debian coming soon.Get started:<ul><li><a href="https://docs.pgedge.com/control-plane/">Control Plane Documentation</a></li></ul><ul><li><a href="https://docs.pgedge.com/control-plane/api/reference">Control Plane API Reference</a></li></ul><ul><li><a href="https://github.com/pgEdge/control-plane">Control Plane on GitHub</a></li></ul><ul><li><a href="https://www.pgedge.com/download/enterprise-postgres">pgEdge Enterprise Postgres Downloads</a></li></ul></p> ]]></description>
            <guid>https://www.pgedge.com/blog/pgedge-control-plane-adds-supporting-services-and-a-preview-of-systemd-support</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>pgEdge,PostgreSQL</category>
            <title><![CDATA[I Built Three GitHub Codespaces Walkthroughs for Our Products. Would You Use Them?]]></title>
            <link>https://www.pgedge.com/blog/i-built-three-github-codespaces-walkthroughs-for-our-products-would-you-use-them-</link>
            <pubDate>Wed, 06 May 2026 09:18:33 GMT</pubDate>
            <description><![CDATA[ <p>I need your feedback to either convince Marketing that I’m a genius and they should put these GitHub Codespaces Walkthroughs on our website, or to tell me I need to keep looking for different ways to make Quickstarts easier.Bi-directional logical replication is not a simple thing. It's a genuinely complicated problem, and getting it right across multiple nodes in a distributed PostgreSQL cluster is hard. That's what makes what we do at pgEdge special: we've done the hard engineering so you don't have to. Multi-master replication, conflict resolution, failover, all of it wrapped up so you can have this capability without needing to be a rocket scientist.But there's still a gap between "this product exists" and "I've actually tried it," and that gap is almost always the setup. You need multiple Postgres instances, a replication extension configured between them, and enough infrastructure to actually prove it's working. By the time you've got all of that running on your laptop, you've burned an afternoon and you haven't learned anything about distributed Postgres yet. You've learned about Docker networking.I wanted to see if I could close that gap. I’ve created three GitHub Codespaces walkthroughs, each targeting a different pgEdge product, each designed to take you from zero to a running environment without installing a single thing on your machine. The issue is that I have no idea whether developers would actually find them useful until I get some data - and that is where you, dear Reader, can help me out, just by trying them and letting me know.<h2>Why Codespaces</h2>GitHub Codespaces gives you a full Linux development environment in a browser tab, backed by a container running on GitHub's infrastructure. The free tier gives individual developers 120 core-hours per month, (so 60 for a 2-core, 30 for a 4-core machine) which is more than enough to run through all three of these walkthroughs multiple times. For us, the appeal was simple: if you can click a link, you can be inside a working environment in about 60 seconds.The alternative is usually along the lines of asking developers to clone a repo, install Docker, pull a bunch of images, and configure networking. We know what happens with that approach because we've watched it happen: most people bounce somewhere around step two, and the ones who make it through are already sold on the product before they start.<h2>Three Walkthroughs, Three Products</h2>Each walkthrough targets a different pgEdge product and a different audience. Two of them focus on multi-master replication with Spock (our core distributed Postgres capability) and show you that it actually works in real time. The third takes a completely different angle and puts AI-powered natural language queries in front of Postgres.<h3>pgEdge Helm on Kubernetes</h3>This one is the most involved, and probably the most interesting if you're already running Postgres in Kubernetes and wondering how distributed replication fits into that world. The walkthrough takes you through a progressive build: start with a single primary node using our Helm chart, add a standby, then expand into a full multi-master topology with Spock handling replication across nodes. The whole thing runs on a local Kubernetes cluster inside the Codespace using minikube, so you're working with real  and  commands against a real cluster, not a simulation.It takes roughly 20 to 30 minutes if you read everything, faster if you skip ahead. The walkthrough uses Runme (a VS Code extension that turns markdown into executable notebook cells) so you're clicking "Run" on each step rather than copy-pasting commands into a terminal, which means you can actually read the explanations between steps instead of just racing through them. By the end you'll have a running multi-node distributed Postgres cluster on Kubernetes, with Spock replicating writes between nodes, and you'll have done it without touching your own infrastructure.<a href="https://github.com/codespaces/new?repo=pgEdge/pgedge-helm&devcontainer_path=.devcontainer/walkthrough/devcontainer.json"><u>Open the Helm walkthrough in Codespaces</u></a><h3>pgEdge Control Plane</h3>The Control Plane is pgEdge's REST API for managing distributed Postgres clusters, and this walkthrough is the fastest way to see it in action. You spin up the control plane stack, create a distributed database through the API, verify that Spock (our Multi-Master extension) is replicating data between nodes, and then deliberately kill a node to watch the cluster handle it. That resilience test is what gets people's attention. You write data to node A, kill node B, bring it back, and watch Spock reconcile everything automatically. The entire flow takes about 10 to 15 minutes, and most of that time is waiting for containers to start.This is the walkthrough for someone who wants to understand what our management layer looks like, the person who's already past "what is distributed Postgres?" and asking "okay, but how can I actually operate this thing?" If you're evaluating pgEdge for a production workload and you want to see the API before you talk to sales, this is where you start.<a href="https://codespaces.new/pgEdge/control-plane?devcontainer_path=.devcontainer/walkthrough/devcontainer.json"><u>Open the Control Plane walkthrough in Codespaces</u></a><h3>pgEdge Postgres MCP Server</h3>This one is different from the other two. Instead of focusing on replication and cluster management, it puts an MCP Server in front of a PostgreSQL database loaded with sample data and lets you talk to it in plain English. MCP (Model Context Protocol) is the open standard for connecting AI tools to data sources, and pgEdge's MCP Server is the implementation that turns natural language questions into SQL queries against your Postgres database. Ask it "what are our top products by revenue?" or “find indexing improvements I can make to the DB” and it translates that into SQL, runs it, and returns the results. The walkthrough comes pre-loaded with a Northwind dataset so you have interesting data to query from the moment you start.The walkthrough needs an API key from Anthropic or OpenAI to run the included chatbot interface that sits on top of the MCP Server itself (set as a Codespace secret before you launch), but the setup instructions walk you through that. Once the environment is running you get both a web UI and the raw MCP Server API, so you can try natural language queries in the browser and then see exactly what's happening under the hood through the API endpoint. If you've been hearing about MCP and wondering what it actually looks like when it's connected to a real database with real data, this is probably the fastest way to find out.<a href="https://github.com/pgEdge/pgedge-postgres-mcp"><u>Open the MCP Server walkthrough in Codespaces</u></a><h2>What Actually Goes Into Making "Click and Go" Work</h2>There's a reason I wanted to write this post beyond just pointing people at three Codespaces links and asking for feedback. Making something feel effortless takes a lot of effort. The engineering behind these walkthroughs is invisible when it works correctly, and I think it's worth laying out some of what goes into it. You never know, others may follow in my path.Each walkthrough starts with a devcontainer configuration, which is basically a JSON file that tells GitHub Codespaces how to build the environment. I specify an Ubuntu base image, Docker-in-Docker support (because we need to run containers inside the Codespace), the right CLI tools pre-installed for each deployment model, port forwarding rules, and a post-creation script that handles all the setup steps that would normally take a developer a bunch of copy-pasting from a README or Quickstart. When the Codespace finishes building, the walkthrough opens in VS Code with Runme installed, a VS Code extension that turns markdown files into executable notebook cells. The walkthrough instructions aren't just documentation you read and then manually type commands from. They are the commands, and you run them by clicking a button next to each code block.I was also creating interactive guides that followed the same “demo” script, and trying to make that all the same code, including being able to run the interactive guide in Codespaces too (if someone preferred that to the runMe) or being able to view it as a regular .md documentation file. Getting that "it just works" feeling required solving a bunch of problems that only surface when you're building for environments you don't control. The Helm walkthrough has idempotency detection built in because users will inevitably re-run steps, either accidentally or because they want to experiment. It checks whether a Kubernetes cluster already exists before trying to create one, uses marker files to track which setup phases have completed, and runs operator pre-checks so you don't end up in a broken state if you hit "Run" on the same cell twice. The Control Plane walkthrough handles platform-aware port detection with fallbacks because Codespaces can run on different underlying architectures, manages Docker initialization (which behaves differently on Linux versus macOS), and orchestrates container lifecycle management so the whole stack (control plane, database nodes, monitoring) comes up in the right order every time. The MCP walkthrough manages multi-provider API key configuration so you can use either Anthropic or OpenAI as your LLM backend, runs layered health checks against multiple services before declaring the environment ready, and handles Codespace URL detection so the forwarded ports resolve correctly in your browser instead of pointing at localhost.Finally, once I had the Codespaces walkthroughs set up, I realized that its possible our own devs might use Codespaces to get work done - so I had to go create alternative “Dev Profile” devcontainers, and then figure out how to deeplink to the walkthrough profile.None of this is visible to the person clicking the link, and none of it shows up in the walkthrough itself. But it represents real engineering time and real problem-solving aimed at one goal: making distributed Postgres accessible to someone who has never touched it before. I think there's value in at least giving a heads up of some of the things you might run into if you decide to make codespaces demos of your own.<h2>Please Give Me Feedback</h2>I built these walkthroughs because I believe the biggest barrier to trying server-based products is the setup tax. Every database vendor has documentation, and most of it assumes you already have a running environment configured the way they expect. I wanted to remove that assumption entirely and find out what happens when you can go from "I'm curious about distributed Postgres" to "I'm looking at a running cluster" in a few minutes.I think they work well, but I honestly don't know how many developers use Codespaces regularly, or would be willing to try one, or would wait the 60ish seconds for the instance to spin up. What I do know is that it's a really easy way to try something out without having to install anything at all, and that should be worth something if you're still in the "should I even bother looking at this?" phase.If you've been curious about distributed Postgres, or MCP Servers, and never got around to spinning one up, or if you’re just curious as to to how Codespaces can work as a demo/teaching environment, pick a walkthrough, try it out, and give me feedback - did they work? Are they useful? is Codespaces a viable way to show things off?  <a href="https://github.com/codespaces/new?repo=pgEdge/pgedge-helm&devcontainer_path=.devcontainer/walkthrough/devcontainer.json"><u>Kubernetes and Helm</u></a> if you're running containers in production, <a href="https://codespaces.new/pgEdge/control-plane?devcontainer_path=.devcontainer/walkthrough/devcontainer.json"><u>Control Plane</u></a> if you want to see the management API, or <a href="https://github.com/pgEdge/pgedge-postgres-mcp"><u>MCP Server</u></a> if you want to talk to Postgres in plain English. Each one takes less time than it took to read this blog.Depending on where you read this blog, please give feedback as comments, or worst case, just email me at <a href="mailto:antony@pgedge.com"><u>antony@pgedge.com</u></a></p> ]]></description>
            <guid>https://www.pgedge.com/blog/i-built-three-github-codespaces-walkthroughs-for-our-products-would-you-use-them-</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,pgEdge,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[Introducing the AI DBA Workbench: PostgreSQL Monitoring That Diagnoses, Not Just Reports]]></title>
            <link>https://www.pgedge.com/blog/introducing-the-ai-dba-workbench-postgresql-monitoring-that-diagnoses-not-just-reports</link>
            <pubDate>Wed, 22 Apr 2026 05:54:40 GMT</pubDate>
            <description><![CDATA[ <p>PostgreSQL is dominating the database market, and the monitoring tools haven't noticed.More teams run Postgres in production every year. More of those deployments are distributed, multi-region, and mission-critical. And the tooling most of those teams rely on was built for a simpler world: a single instance, a handful of threshold alerts, and a senior DBA who can interpret what the graphs mean at 3 AM. That works when you have one cluster and one person who knows where the bodies are buried. It falls apart the moment you scale past either of those constraints.We built the pgEdge AI DBA Workbench to close that gap, and today it's entering public beta. We think it's the best PostgreSQL monitoring and management platform you've seen, and the rest of this post explains why any postgres 14+. Local installs, self-hosted enterprise estates, Supabase, Amazon RDS - If you can connect to it, you can monitor it.<h2>Three Services, One Platform</h2>The Workbench is a self-hosted platform that combines three services into a single deployment. A collector gathers metrics from every monitored PostgreSQL instance. An alerter evaluates those metrics against threshold rules and a layered anomaly detection system. A server ties everything together through a web UI, a REST API, and a Model Context Protocol (MCP) endpoint that lets AI tools talk directly to your databases.It runs on your infrastructure, ships under the PostgreSQL license, and stores nothing outside your network.<h2>34 Probes, Zero Agents</h2>The collector is the foundation. Point it at your PostgreSQL instances (any PostgreSQL 14 or later, not just pgEdge) and it starts pulling metrics across 34 built-in probes covering query performance, replication health, active connections, WAL throughput, vacuum activity, checkpoints, database conflicts, IO statistics, system-level CPU and memory, disk usage, and more.Two things matter about how collection works. First, the collector connects remotely over standard PostgreSQL connections. There are no agents to install on your database servers, no binaries to deploy and maintain on each node, no version compatibility headaches between agent and server. You give the collector a connection string and it handles the rest. Second, data management is automatic. The collector partitions metrics tables by time and enforces retention policies, so you don't end up nursing a monitoring datastore that's grown larger than the databases it's supposed to be watching.For distributed PostgreSQL specifically, the collector tracks replication lag, slot usage, subscription status, and node roles as first-class metrics. If you're running Spock multi-master replication, these aren't bolted-on extras but rather baked into the probe system from the ground up.<h2>Five Levels of Drill-Through</h2>The web client ships with a hierarchical dashboard system that lets you go from fleet-wide overview to individual index statistics without switching tools. At the estate level, you see every cluster in your fleet with aggregate health indicators and AI-generated summaries. The cluster view renders replication topology with color-coded edges showing lag and health, so you can spot a degraded replication link at a glance. Server dashboards break down per-instance metrics: connections, transactions, tuple operations, cache hit ratios. Database and object views take you down to individual table and index statistics, including bloat, sequential scan ratios, and dead tuple counts.Every dashboard level includes AI-powered summaries that highlight what needs attention. The summaries aren't canned text templates filled with numbers but are generated by an LLM that has access to the actual metrics context, so the summary for a cluster with rising replication lag reads differently from one with connection pool exhaustion.<h2>Three-Tier Anomaly Detection</h2>Traditional monitoring tools give you threshold alerting: if replication lag exceeds 30 seconds, fire an alert. That catches the obvious problems, but it completely misses the subtle ones, like a slow drift in query latency that doesn't cross any single threshold but indicates an index that's quietly degrading, or a change in checkpoint frequency that correlates with a workload shift you haven't noticed yet.The Workbench's alerter runs a three-tier detection system. The first tier is statistical baselines. The alerter builds rolling baselines for every metric and flags deviations that fall outside expected ranges. This catches the standard stuff efficiently without needing manual threshold configuration for every metric on every server.The second tier is vector similarity. The alerter generates embeddings of anomalous metric patterns and compares them against a library of known anomaly signatures. If the current pattern looks like something that caused an outage last month, it gets flagged even if the raw numbers haven't crossed a threshold yet, which is pattern recognition rather than just pattern matching.The third tier is LLM classification. When the statistical and embedding tiers surface something ambiguous, the alerter can call an LLM to classify what's happening. The LLM gets the metric context, the anomaly history, and the system state, and provides a classification that goes beyond "this number is higher than usual" to "this looks like a connection pool leak combined with long-running transactions," because it can reason about context rather than just reacting to numbers.On top of detection, the alert management system respects the reality of on-call life. Blackout scheduling with cron-based recurrence so your maintenance windows don't generate noise. Hierarchical threshold overrides so production gets tighter bounds than staging. Automatic alert clearing when conditions normalize, because nothing drains on-call morale like manually closing alerts that already fixed themselves. Multi-channel notifications to Slack, email, Mattermost, or webhooks.<h2>Meet Ellie</h2>Ellie is the Workbench's built-in AI assistant, and she's the feature that changes how you actually interact with your monitoring data. Most monitoring tools bolt a chatbot onto their dashboard and call it "AI-powered." Ellie is different because she has access to 21 MCP tools that let her take real action. She can run  on your slow queries and walk you through the execution plan. She can inspect your schema, query historical metrics from the datastore, search the pgEdge documentation knowledge base, and retrieve alert history. When you ask "why is this query slow?", she doesn't give you a generic optimization checklist. She looks at the actual query plan, the actual table statistics, and the actual index usage, and tells you what's happening with your specific data.Ellie supports multi-step diagnostic workflows with up to 50 iterations of tool use per conversation. That means she can follow a thread: start with a slow query, check the execution plan, look at the table's vacuum statistics, notice the bloat ratio is high, check when the last vacuum ran, and recommend either a manual vacuum or a configuration change. That's the kind of investigation a senior DBA does instinctively, and Ellie does it conversationally, step by step, showing her work the entire way.She also remembers what you tell her. Pin a memory ("our busiest period is 2-4 PM EST," "the orders table is being deprecated in Q3," "always check replication lag before recommending schema changes") and it gets included in every conversation automatically, whether it's facts, preferences, instructions, or operational context. You don't have to re-explain your architecture every session, and no other PostgreSQL monitoring tool has this kind of persistent AI memory.<h2>How This Stacks Up</h2>The PostgreSQL monitoring market has options: pganalyze, Datadog, Percona PMM, pgwatch, EDB PEM. Here's where the Workbench sits relative to each.pganalyze gives you index recommendations and query insights, but it's a SaaS product where your monitoring data lives on their infrastructure and AI capabilities stop at static recommendations. Datadog offers ML-based anomaly detection and is strong for multi-service observability, but the PostgreSQL monitoring is one piece of a much larger platform and there's no agentic AI assistant that can actually run queries against your databases. Percona PMM is open source and capable, but its AI features are newer and less deeply integrated into the monitoring workflow. pgwatch is lightweight and open source, but it's pure metrics collection with no AI layer at all.The Workbench's differentiators come down to three things: the agentic AI assistant with real tool access, the three-tier anomaly detection that goes beyond what any of these tools offer, and native support for distributed PostgreSQL. If you're running Spock multi-master replication, nothing else on this list monitors it natively.<h2>Bring Your Own AI…</h2>The Workbench supports Anthropic Claude, OpenAI, Google Gemini, and any OpenAI-compatible local model runner, including Ollama, LM Studio, llama.cpp, and EXO. For teams in regulated industries or air-gapped environments, you can run the entire platform with a local model and nothing leaves your network, and you can switch providers anytime without changing your workflow.The MCP endpoint (that's Model Context Protocol, the open standard for connecting AI tools to data sources) also means any MCP-compatible client can connect to your Workbench, from Claude Code and Cursor to VS Code with GitHub Copilot and Windsurf. Your AI development tools get the same access to your monitoring data that Ellie has.<h2>…Or Don’t</h2>Not a fan of AI? The Workbench can be used as a traditional monitoring system for PostgreSQL – just don’t configure a model provider. You’ll get all of the features you would expect from a monitoring system: alerting, dashboards, cluster visualisation and more, just without the AI features.<h2>Try the Beta</h2>The AI DBA Workbench is available today as a public beta. Getting started with Docker Compose takes a few steps: clone the repo, generate a shared secret for encrypting connection passwords, create a datastore password file, and bring up the services.Once the services are running, point your browser at , create your admin user, add your PostgreSQL connection strings, and the collector starts gathering metrics immediately. The full documentation (including the <a href="https://docs.pgedge.com"><u>Quick Start Guide</u></a>) walks through every step, and the source code is on <a href="https://github.com/pgEdge/ai-dba-workbench"><u>GitHub</u></a> under the PostgreSQL License.This is a beta, which means we want your feedback and we want it loud. Leave feedback by filing issues on the <a href="https://github.com/pgEdge/ai-dba-workbench"><u>GitHub Repository</u></a> to tell us what probes you need that we're missing, what Ellie got wrong, or even what made you grin. We built this because we saw a gap in how PostgreSQL teams monitor and manage their databases, and we believed there was a fundamentally better way to do it. Go try it and tell us if we were right.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/introducing-the-ai-dba-workbench-postgresql-monitoring-that-diagnoses-not-just-reports</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>pgEdge,PostgreSQL,PostgreSQL High Availability,postgres,PostgreSQL</category>
            <title><![CDATA[How to Use the pgEdge Control Plane: From Zero to Multi-Master and Beyond]]></title>
            <link>https://www.pgedge.com/blog/how-to-use-the-pgedge-control-plane-from-zero-to-multi-master-and-beyond</link>
            <pubDate>Tue, 21 Apr 2026 09:37:57 GMT</pubDate>
            <description><![CDATA[ <p>A couple of months back, the CEO challenged product and marketing to revamp the developer experience on our website in three weeks. I vibe-coded a proof of concept full of "try it now" buttons and interactive guides, the CEO loved it, and then I had to deal with almost every one of those interactive guides being a placeholder card. Engineering was fully booked, and the Control Plane product I needed to write guides for was one I knew inside out at the architecture level but had never personally operated end-to-end through the API.So I sat down and learned the pgEdge Control Plane the hard way: by using it. What follows is what I found, organized as the guide I wish I'd had when I started. If you're evaluating Control Plane, deploying it for the first time, or trying to understand what Day 2 operations actually look like, this is for you.<h2>What Is the Control Plane?</h2>pgEdge Control Plane is a lightweight orchestrator for PostgreSQL. It manages the full database lifecycle (creation, replication, failover, backup, restore, scaling) through a declarative REST API. You describe the database you want in a JSON spec, POST it, and Control Plane handles the rest: configuration, networking, Spock multi-master replication, Patroni for high availability, pgBackRest for backups. All of it.The important thing to understand is that setup is only half the story. There are enough tools out there that can get you a running cluster if you know what you're doing. The hard part, the part where most tools leave you on your own, is Day 2. Modifying a running HA cluster. Adding a node to a live distributed database. Performing a rolling upgrade without downtime. Restoring from backup while keeping replication intact across the remaining nodes. That's where the complexity lives, and that's where Control Plane earns its keep.<h2>Getting Started: Zero to Multi-Master in Five Minutes</h2><h3>Caveat</h3>I’m not promising that every line of code in here will run as-is - It's real, but as I learned the hard way while building the interactive guides, you can’t control everything - so consider it Illustrative.If you'd rather skip all of that, the Codespaces environment at the end of this post has everything pre-installed and ready to go.<h3>Prerequisites</h3>You'll need Docker, curl, jq, and psql (the PostgreSQL client) installed on your machine. On macOS, you can get psql via  (or whichever major version you prefer). On Linux, your distribution's  package will do the job.If you're using Docker Desktop, there's one gotcha that will bite you if you skip it: you need to enable host networking manually. Go to Docker Desktop > Settings > Resources > Network, check "Enable host networking," and restart Docker. Without this, Control Plane won't be accessible via localhost.<h3>Start the Control Plane</h3>Control Plane uses Docker Swarm to manage the database containers, so the first step is initializing Swarm mode if it isn't already active. Then you pull the image, start the container, and initialize the cluster.The  flag is required because Control Plane needs stable IP addresses for both inter-machine communication (between Control Plane instances on different hosts) and intra-machine communication with Patroni and Postgres. The Docker socket mount () is how Control Plane creates and manages the Postgres containers it orchestrates. And if  fails because you have multiple network interfaces, you may need to specify which IP to advertise: .  In a production cluster, this should be an IP address that's accessible from all other machines in your cluster.Once  returns, the API is listening on port 3000 and the cluster is ready to accept database specs.<h3>Create a Distributed Database</h3>A single POST request with one JSON payload gives you three nodes with multi-master replication.If ports 5432-5434 are already in use on your machine (maybe you have a local Postgres running), just change the port numbers in the spec. Any available ports will work.That JSON spec is the entire declaration. Three nodes, each one a full Postgres primary, with Spock replication configured bidirectionally between all of them. No replication slot configuration, no publication/subscription SQL, no manual wiring of logical replication channels. You describe what you want, and Control Plane builds it.Database creation is asynchronous. The API returns a task ID immediately, and Control Plane works in the background to pull images, start containers, configure Postgres, and wire up Spock replication. On the first run this takes a couple of minutes because it's pulling container images. Poll the database endpoint until the state flips to : <h3>Prove It Works</h3>Create a table on node 1:Insert a row on node 2:Read it back from node 1:Row written on n2, readable on n1. Spock replicated it in milliseconds. Every node accepts reads and writes, and every change propagates to every other node automatically.<h3>Node Failure and Recovery</h3>This is the part that sold me when I was running through this myself. Take node 2 offline:Node 2 is dead, gone from the cluster. Now write data while it's missing:Read from n3, which doesn't care that a node is missing:All three rows. Now bring n2 back:Wait for it to come up, read from n2, and all three rows are there. Including the one written while n2 was down. Spock caught it up automatically, with no manual intervention, no replication conflict resolution, no panicked DBA at 3am running . The recovery, not the initial replication, is the demo that sells itself.<h2>Day 2 Operations: The Hard Part, Made Simple</h2>Initial setup is the easy part. Plenty of tools can get you a running cluster on Day 1. Where Control Plane separates itself is Day 2: modifying, scaling, and maintaining a running cluster without requiring a PhD in distributed systems administration.<h3>High Availability with Read Replicas</h3>Each node in a distributed database can have its own read replicas. You configure this by adding more host IDs to the  array. The first host gets the primary, the rest become replicas managed by Patroni. Our examples so far run on a single machine, but here's what a production spec might look like with hosts across AWS regions:That's a 3-node multi-master database spanning three AWS regions, each node with a read replica in a different availability zone. Six Postgres instances, bidirectional replication between the primaries, streaming replication to the replicas, automatic failover via Patroni. All from a single JSON spec.<h3>Switchover vs. Failover</h3>Control Plane exposes two distinct tools for handling primary transitions, both built on Patroni under the hood. The value isn't in reinventing what Patroni already does well, it's in wrapping those operations in the same declarative API you use for everything else, so you can trigger and observe them from anywhere in the cluster.Switchover is for planned maintenance. It's a graceful transition from a primary to a replica, and you can even schedule it for a specific time:Control Plane validates cluster health before proceeding, promotes the specified replica, and demotes the old primary to replica status. You can run this during a maintenance window and go to bed knowing it will handle the transition at 2am without you.Failover is for when things have already gone wrong. A primary is unreachable and you need to promote a replica immediately:It even has a  flag for disaster recovery scenarios where the cluster is already in a degraded state and you just need to get a new primary up. The control is there when you need it, and the automation handles the rest when you don't.<h3>Scaling: Adding and Removing Nodes</h3>This is one of the things I didn't appreciate until I dug into the codebase. You can scale a running distributed database by updating the spec and POSTing it back. Want to add a fourth node? Update the  array:Control Plane figures out the delta between the old spec and the new one, provisions the new node, configures Spock replication to and from the existing nodes, and syncs the data. The same declarative model you used on Day 1 works on Day 200. You don't need to learn a different set of commands for modifying a running cluster than the ones you used to create it in the first place.<h3>Instance-Level Control</h3>Sometimes you don't need to operate on a whole node. You need to bounce a specific Postgres instance, maybe to pick up a configuration change or clear shared buffers. Control Plane lets you do that at the instance level:You can restart, stop, and start individual instances, and each operation supports scheduling for a future time. The  parameter lets you operate on instances even when the database is in a degraded state, which is exactly when you tend to need fine-grained control the most.<h3>PostgreSQL Configuration</h3>You can pass PostgreSQL configuration parameters directly through the database spec, at both the database level and per-node:Set it globally in the spec and it applies to every node. Override it at the node level for nodes that serve different workloads (maybe n1 handles your OLTP traffic and needs more connections, while n3 handles analytics with larger ). Control Plane applies the changes and handles the restarts where needed.<h2>Backups and Restore</h2>Control Plane integrates pgBackRest and makes backup configuration declarative, following the same philosophy as everything else. Add a  to your database spec:Nightly full backup, hourly incrementals, to S3. The  array supports S3, GCS, Azure Blob Storage, and POSIX/CIFS mounts, and you can configure multiple repositories for geographic redundancy. You can also trigger manual backups through the API when you want one outside the schedule:<h3>Point-in-Time Recovery</h3>Restore is a single API call, and it supports three different targeting modes depending on how precise you need to be:That restores to a specific timestamp. You can also target a specific WAL LSN () or a transaction ID () if you need byte-level precision about exactly how far to roll back. Control Plane orchestrates pgBackRest to handle the whole restore sequence: tears down replication subscriptions, stops the instance, runs the restore, brings it back up, reconnects replication. A whole chain of operations that have to happen in the right order, with the right error handling, or you end up with a split-brain cluster and a very bad morning. Automated into a single POST.You can also create entirely new databases from backups, or seed new nodes from existing backup data. Adding a fourth node to your cluster without copying data over the wire from a running primary? Just point it at the backup repository.<h2>Deploy Supported Services Alongside Your Database</h2>This capability is in beta, so it's not in any of the marketing materials yet. Control Plane can deploy and manage services alongside your databases, using the same declarative spec model. Right now, that means MCP and RAG servers and PostgREST instances (more will be added):You declare the service type, resource limits, and which database node it should connect to, and Control Plane handles deployment, health checking, and lifecycle management. The  block lets you control session attributes, so you can point read-heavy services at standby replicas and write-heavy services at the primary.This is particularly interesting for the AI use case. You can spin up a distributed Postgres database with pgVector, configure Spock replication across regions, and deploy an MCP server on top of it, all from a single JSON spec. The database and the services that consume it are managed as a unit.<h2>Monitoring Operations with Tasks</h2>Every mutating operation in Control Plane (create, update, delete, backup, restore, switchover, failover, instance restart) produces a task that you can track through the API. You already saw this earlier when we polled for the database creation to complete, but you can dig much deeper:Tasks have states (pending, running, completed, failed, canceled), and the log endpoint supports streaming with pagination so you can follow long-running operations in real time. If something goes wrong during a restore or a scale-out, the task log tells you exactly where it failed and why.You can also query tasks at the cluster level with scope filters, so a "show me everything that happened to database X in the last 24 hours" query is straightforward. This is the kind of observability that makes the difference between "something went wrong" and "here's exactly what went wrong, and here's the log entry that tells us why."<h2>The Full Spec: What You Can Declare</h2>Here's a quick reference of everything you can configure in a database spec. The declarative model means all of these fields work the same way: set the value you want, POST the spec, and Control Plane makes it so.The key insight is that the spec is your single source of truth, the same infrastructure-as-code pattern you'd recognize from Terraform or Kubernetes manifests. You don't learn one set of commands for creation and a different set for modifications. Change the spec, POST it, and Control Plane calculates the diff and applies it.<h2>What's Coming Next</h2>The team isn't slowing down.systemd support is in the works, which means Control Plane won't require Docker on every host. The ability to deploy supporting services alongside your databases is expanding (think connection poolers, monitoring agents, and AI/ML tooling managed through the same declarative API). Extensions like pgVector, PostGIS, pgAudit, and the pgEdge Vectorizer are already supported and configurable through the spec.The capability surface keeps expanding, and the interaction model stays the same: describe what you want, POST it, let Control Plane figure out the how.<h2>How I Built This Knowledge (And Why It Matters)</h2>I want to circle back to how this blog came to exist, because there's a lesson in it for anyone building developer products.I'd been the PM for Control Plane for months. I'd read the architecture docs, attended the design reviews, written the user stories, built the roadmap. And yet my understanding of the product shifted when I sat down and ran every API call myself. There's a difference between knowing what your product does and feeling it respond to your inputs through a terminal. Both matter, but the second one is what lets you write a guide that actually helps someone.That three-week sprint to build the interactive guides forced me through every operation in this blog post. Along the way, I made every rookie mistake you'd expect: I originally had setup scripts that automatically installed Docker and curl and jq on people's machines (engineering rightly pulled me back from that cliff). I built too many delivery formats before figuring out which ones developers actually wanted. I went spectacularly, enthusiastically overboard.The final result was tighter for all that trimming. But the real output wasn't the guides. It was this deeper understanding of the product that only comes from using it fully. If you're a PM and you haven't REALLY used your own product end-to-end recently, go do it. You'll find things that surprise you,things you had forgotten, things that frustrate you, and things that make you think "damn, this is actually good." All are worth knowing about.<h2>Cleaning Up</h2>When you're done experimenting, tear everything down in three steps:<h2>Try It Yourself</h2>Everything I built during those three weeks is open source and ready to run.The fastest path: open it in GitHub Codespaces and you'll have a working environment in under a minute:<a href="https://codespaces.new/pgEdge/control-plane?devcontainer_path=.devcontainer/walkthrough/devcontainer.json"><u>Open in Codespaces</u></a>On your own machine, one command bootstraps everything:Prefer VS Code? Install the <a href="https://marketplace.visualstudio.com/items?itemName=stateful.runme"><u>Runme extension</u></a>, open the walkthrough, and click Execute Cell on each block.How long to go from zero to a running distributed Postgres database? About five minutes. I timed it.Control Plane is open source from pgEdge. And I’m unashamedly proud of what we’ve built. <a href="https://docs.pgedge.com/control-plane/"><u>Documentation</u></a> | <a href="https://docs.pgedge.com/control-plane/api/reference"><u>API Reference</u></a> | <a href="https://github.com/pgEdge/control-plane"><u>GitHub</u></a> | <a href="https://www.pgedge.com/download/enterprise-postgres"><u>Enterprise Postgres Downloads</u></a></p> ]]></description>
            <guid>https://www.pgedge.com/blog/how-to-use-the-pgedge-control-plane-from-zero-to-multi-master-and-beyond</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>Distributed Postgres</category>
            <title><![CDATA[pgEdge MCP Server for Postgres Is Now GA. Here’s Why That Matters]]></title>
            <link>https://www.pgedge.com/blog/pgedge-mcp-server-for-postgres-is-now-ga-here-s-why-that-matters</link>
            <pubDate>Thu, 02 Apr 2026 12:08:30 GMT</pubDate>
            <description><![CDATA[ <p>If you’re building agentic AI applications, you’ve probably already hit the wall where your LLM needs to actually talk to a database. Not just dump a schema and hope for the best, but genuinely understand the data model, write reasonable queries, generate code for new UIs and even entire applications, and do it all without you holding its hand through every interaction. That’s the problem MCP servers are supposed to solve, and most of them do a decent enough job of it when you’re prototyping on your laptop.Production is a different story.The pgEdge MCP Server for Postgres is now generally available, and it’s also available as a managed service inside <a href="/products/pgedge-cloud"><u>pgEdge Cloud</u></a>. We built it because the gap between “works in a demo” and “runs in production with real security, real availability requirements, and real compliance constraints” is wider than most people realize, and the existing MCP servers out there weren’t closing it.<h2>The Problem With Most MCP Servers</h2>Here’s what typically happens. You grab an MCP server, wire it up to Claude Code or Cursor, point it at a local Postgres instance, and everything works great. Your LLM can introspect the schema, write queries, generate application code and UIs, and even suggest optimizations. You feel like you’re living in the future.Then someone asks you to run it against the production database. The one with PII in it. The one that needs to stay in eu-west-1 or on-prem for compliance reasons. The one that can’t go down because three other services depend on it. And suddenly you’re staring at a tool that doesn’t support TLS, doesn’t have real authentication, can’t enforce read-only access, and definitely wasn’t designed to run in an air-gapped environment.So we built one that closes it. The pgEdge MCP Server works with any standard Postgres database running v14 or newer (not just pgEdge’s own products), and it’s designed from the ground up for the kind of environments where “just spin it up and see what happens” isn’t an acceptable deployment strategy.<h2>What Ships in pgEdge MCP Server v1.0</h2>Let’s talk about what’s in the box, because the feature list matters more than the marketing language around it.The first thing you’ll notice is the schema introspection. The server doesn’t just hand the LLM a list of table names and call it a day. It pulls primary keys, foreign keys, indexes, column types, constraints, and even partitioned table hierarchies, which means the LLM can actually reason about how your data model fits together instead of blindly firing off SELECT * queries and hoping something useful comes back. For databases with time-based partitioning (where you might have hundreds of child tables cluttering the context window), it recognizes partitioned parents and hides the children by default, keeping things clean for the model to work with.On the security side, this isn’t a “we’ll add auth later” situation. The server supports stdio, HTTP, and HTTPS with TLS out of the box, along with user and token authentication that hot-reloads so you can rotate credentials without restarting anything. Read-only mode is enforced by default, and the 1.0 release goes further with active defense against bypass attacks, rejecting queries that try to manipulate  settings through PL/pgSQL DO blocks or  calls. The net effect is that you can give your LLM agent access to production data without giving it the keys to the castle.Multi-database support lets you connect to dev, staging, and production from the same server and switch between them, which sounds simple until you realize most MCP setups require a separate configuration for every environment you touch.<h2>What’s New in the GA Release</h2>The pgEdge MCP Server v1.0 release adds a bunch of capabilities that came directly out of developer feedback during the beta, and a few of them fundamentally change what you can do with the server.The biggest one is probably custom tools. You can now extend the MCP server by writing tools in SQL, Python, Perl, or JavaScript, defining them in a YAML file and dropping them into your configuration. They show up as first-class MCP tools alongside the built-in ones, which means if your team has a specific workflow or analysis you run regularly, you can package it as something the LLM can invoke directly. This is where things start to get genuinely interesting for teams with domain-specific database operations they want to bring into their AI workflows.To show what that looks like in practice, we’re shipping a <a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/examples/pgedge-postgres-mcp-dba.yaml"><u>DBA starter pack as a drop-in YAML</u></a> definitions file. It comes with three pre-built tools:  for analyzing your most resource-consuming queries,  for running a seven-category health check, and  for two-tier index recommendations with optional HypoPG simulation. Think of it as giving your LLM a solid foundation of DBA knowledge out of the box, so it can start being useful for performance work without you having to teach it everything from scratch. (Read the blog <a href="https://www.pgedge.com/blog/replicating-crystaldba-with-pgedge-mcp-server-custom-tools"><u>Replicating CrystalDBA With pgEdge MCP Server Custom Tools</u></a> to learn more about this).On the operational side, there are a few things worth calling out. Multi-host connection support means the server handles HA and failover natively, using  to route reads to standbys and writes to the primary through libpq-compatible connection strings. Write query confirmation prompts you before the server executes any DDL or DML when write access is enabled, and the server sets MCP tool annotations (, ) so third-party clients can implement their own confirmation flows. And one-command installers for Claude Code and Claude Desktop automate the binary download, config generation, and client registration, so you don’t have to manually edit JSON files just to get started.There’s also a set of changes you’ll feel more than see. The server now uses tab-separated values instead of JSON for query results, paginates automatically, and applies context window compaction, all of which reduces token usage significantly. If you’re running agents that make a lot of database calls (and the interesting ones always do), this hits your API bill in a good way.<h2>Works With What You’re Already Using</h2>The pgEdge MCP Server supports the tools developers are actually reaching for today: Claude Code, Claude Desktop, Cursor, Windsurf, VS Code Copilot. On the model side, it works with Anthropic and OpenAI frontier models, plus locally hosted models through Ollama, LM Studio, and anything else that speaks the OpenAI API. The production CLI client includes Anthropic prompt caching, which cuts costs by up to 90% for repeated interactions against the same schema. You don’t have to rearchitect your AI stack to use this.<h2>Deploy It Your Way</h2>Every database product talks about deployment flexibility, but here’s what it actually looks like. If you want the fully managed path, <a href="https://www.pgedge.com/products/pgedge-cloud"><u>pgEdge Cloud</u></a> deploys the MCP Server alongside your database cluster with nothing extra to set up or maintain. If you want to run it yourself, there are Docker images on GitHub Container Registry and Docker Compose support for multi-instance deployments, giving you full control over configuration and networking in your own cloud environment. And if you need to run on-premises, including in air-gapped environments where nothing touches the public internet, the server compiles to a fully static binary with no C compiler dependency, so there’s nothing to fight with in locked-down environments.The whole thing is open source under the PostgreSQL license, and pgEdge customers with paid subscriptions for Enterprise Postgres or Distributed Postgres get support at no extra cost.<h2>Get Started</h2>If you want the managed path, <a href="https://app.pgedge.com"><u>pgEdge Cloud</u></a> has the MCP Server ready to go. Spin up a database, enable the MCP Server, and start querying.For self-hosted deployments, see the <a href="https://docs.pgedge.com/#pgedge-agentic-ai-toolkit-for-postgres"><u>documentation</u></a> or <a href="/download/ai-toolkit"><u>download </u></a>now.  It works with any standard Postgres v14+, so you can point it at whatever database you’re already running.If you want to kick the tires without any setup at all, the <a href="https://github.com/pgEdge/pgedge-postgres-mcp"><u>GitHub Codespaces demo</u></a> gives you a one-click browser-based environment with sample data loaded and ready to query.And if you’re in New York on April 2-3, come find us at the <a href="https://events.linuxfoundation.org/mcp-dev-summit-north-america/"><u>MCP Dev Summit 2026</u></a> at booth S13. We’ll be running live demos and would love to show you what this looks like in practice.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/pgedge-mcp-server-for-postgres-is-now-ga-here-s-why-that-matters</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,Distributed Postgres,pgEdge,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[Replicating CrystalDBA With pgEdge MCP Server Custom Tools]]></title>
            <link>https://www.pgedge.com/blog/replicating-crystaldba-with-pgedge-mcp-server-custom-tools</link>
            <pubDate>Wed, 01 Apr 2026 07:44:03 GMT</pubDate>
            <description><![CDATA[ <p>A disclaimer before we start: I'm product management, no longer an engineer. I can read code, I can write it … incredibly slowly. I understand PostgreSQL at a product level, and I know what questions to ask. But the code in this project was written by Claude - specifically, Claude Code running in my terminal as a coding agent. I directed the architecture, made the design calls, reviewed the output, and did the testing. Claude wrote the code. This is a vibe-coding story as much as it is a technical one.The pgEdge Postgres MCP Server has a custom tool system. You write PL/pgSQL in a YAML file, drop it into the config, and the server exposes it as an MCP tool. I wanted to find out how far that system could go - not with toy examples, but with something genuinely hard.<h2>The pgEdge Postgres MCP Server</h2>The <a href="https://github.com/pgEdge/pgedge-postgres-mcp"><u>pgEdge Postgres MCP Server</u></a> connects AI agents and tools - Claude Code, Claude Desktop, Cursor, and others - directly to any PostgreSQL database. It supports Postgres 14 and newer, including standard community Postgres, Amazon RDS, and managed services. It's open source under the PostgreSQL licence.The server handles multi-database connections, user and token authentication, TLS, read-only enforcement, and optional write access. It runs over both  and HTTP transports. You can read more about it on the <a href="https://www.pgedge.com/ai-toolkit"><u>pgEdge AI Toolkit</u></a> page or in the <a href="https://docs.pgedge.com/pgedge-postgres-mcp-server/development/"><u>documentation</u></a>.The feature that matters for this post is <a href="https://docs.pgedge.com/pgedge-postgres-mcp-server/development/developers/mcp-protocol/"><u>custom tools</u></a>. You define MCP tools in a YAML file - SQL queries, PL/pgSQL code blocks, or stored function calls - and the server exposes them alongside its built-in tools. No recompilation, no plugins. Drop in the YAML, point the config at it, restart.<h2>CrystalDBA: The Benchmark</h2>The <a href="https://github.com/crystaldba/postgres-mcp"><u>Crystal DBA</u></a> team built one of the most popular Postgres MCP servers on GitHub - over 2,000 stars, and deservedly so. Their  server packs genuine DBA intelligence into a clean MCP interface: health checks, top query analysis, and an index tuner that uses real cost-based simulation. It's impressive work.One thing worth noting: the project's development has slowed. The most recent commit is from January 2026, and there are 25 open issues - some dating back to April 2025 - with many unanswered. That's not a criticism. Open source maintainers don't owe anyone their time, and what the Crystal DBA team shipped is genuinely good, but if you're evaluating MCP servers for production use, activity matters. It was another reason I wanted to see if our custom tools system could cover the same ground - not to replace CrystalDBA, but to offer an actively maintained alternative for teams that need DBA tooling in their MCP workflow.<h2>How CrystalDBA Works</h2>CrystalDBA is a standalone MCP server for PostgreSQL database administration - health checks, top query analysis, and index tuning with hypothetical index simulation. The kind of things a DBA reaches for daily.Under the hood, it's a ~4,000-line Python application with a non-trivial architecture. Three things matter:pglast is a Python SQL parser that wraps PostgreSQL's own parser (libpg_query) via C bindings. CrystalDBA uses it to parse normalised queries from  into abstract syntax trees, then walks those trees to extract every column referenced in WHERE clauses, JOIN conditions, ORDER BY, and GROUP BY. That's how it generates index candidates - it knows exactly which columns each query filters and joins on.HypoPG is a PostgreSQL extension for hypothetical indexes. You create a "what-if" index without actually building it, then run EXPLAIN to see whether the query planner would use it. CrystalDBA pairs pglast's candidate extraction with HypoPG's cost simulation to validate whether each candidate actually helps.The execution model is server-side Python. CrystalDBA runs pglast in the Python process that hosts the MCP server. It connects to PostgreSQL as a client, pulls queries, parses them locally, generates candidates, then reaches back into PostgreSQL to create hypothetical indexes and run EXPLAIN. The intelligence lives in the server, not in the database.That's the system I wanted to recreate, as a drop-in YAML file. No changes to the pgEdge server binary, no new dependencies, just PL/pgSQL running inside PostgreSQL.<h2>Why Not Just Use query_database?</h2>Before building anything, I checked whether named tools were even necessary. The pgEdge server already has . Every DBA diagnostic is just a SQL query:Top queries - done. Unused indexes? One query against . Buffer cache hit rates? . The capability already exists.But LLMs don't always reach for it. The  tool describes itself as a tool for "structured, exact data retrieval" with examples like "How many orders were placed last week?" An LLM scanning that description thinks "business data tool," not "DBA diagnostic tool." When a user asks "why is my database slow?" the LLM doesn't reach for  to check  if it doesn't know it should.CrystalDBA has a tool named  with a description about "slowest or most resource-intensive queries." which is impossible to miss.This is a UX problem, not a capability problem. Named tools with DBA-oriented descriptions solve it. And there's a bonus effect - the mere presence of a DBA tool in the tool list shifts how the LLM perceives the other tools. Once it sees , it understands this is a DBA-capable server and becomes more willing to use  for adjacent diagnostic queries too. One tool shifts the entire context.<h2>How Custom Tools Work</h2>It helps to understand the execution model, because it shapes every design decision that follows.The pgEdge server supports three custom tool types:  (a plain query),  (a temporary stored function), and  (an anonymous code block). All three execute inside PostgreSQL. The server is written in Go, but custom tools don't run Go code, they run database code.I used  for all three DBA tools. Here's the flow:<ul><li>The LLM sends a standard MCP </li><li>tools/call</li><li> request with the tool name and parameters as JSON.</li></ul><ul><li>The Go server wraps the YAML </li><li>code:</li><li> block into a PostgreSQL anonymous DO block. It injects two variables: </li><li>args</li><li> (input parameters as JSONB) and </li><li>result</li><li> (a JSONB variable you populate with your output).</li></ul><ul><li>PostgreSQL executes the DO block. The PL/pgSQL code reads parameters from </li><li>args</li><li>, runs whatever queries it needs, and builds a JSON result.</li></ul><ul><li>To return data, the code calls </li><li>set_config('mcp.tool_result', result::text, true)</li><li>. This stashes the result in a transaction-local configuration variable.</li></ul><ul><li>The Go server reads the result back with </li><li>current_setting('mcp.tool_result', true)</li><li> and packages it into the MCP response.</li></ul>JSONB in,  out, PL/pgSQL in between. That's the entire interface.A custom tool can do anything PostgreSQL can do - query system catalogues, call extensions, run EXPLAIN, create hypothetical indexes with HypoPG, loop and branch with procedural logic. But it cannot do anything PostgreSQL can't - no HTTP calls, no filesystem access, no shelling out. The tool runs inside the database, not inside the server process. This is the constraint that makes the CrystalDBA comparison interesting.<h2>The pglast Problem</h2>CrystalDBA runs pglast in its server process - a Python library wrapping a C parser. I can't do that. Custom tools run inside PostgreSQL, and PL/pgSQL has no SQL parser.The Go equivalent (pg_query_go) wraps the same C library, which means it requires CGO. We literally just spent effort removing CGO from the project to get pure Go static binaries. Adding it back would undo that work.I could use pglast inside a PL/Python custom tool running inside PostgreSQL via plpython3u, but most managed Postgres services don't support untrusted PL languages. Not a great dependency for a "drop-in" toolkit.So the question became: what does pglast actually give CrystalDBA that I can't approximate without it?<h2>Regex Beats AST (When You Have a Validator)</h2>pglast extracts columns from query conditions precisely by using an Abstract Syntax Tree, which is a fancy way of stripping away the surface level syntax and leaving only the meaningful structural relationships. Regex pattern matching does it noisily - it catches most columns but also picks up some it shouldn't.Here's the thing: it doesn't matter that much. Every candidate index gets tested through HypoPG. A bad candidate shows no cost improvement and gets discarded. The algorithm self-corrects. Noisy candidate generation produces the same final recommendations - it just evaluates a few extra candidates along the way.  Table alias resolution?  is a regex pattern.  is a regex pattern. You don't need an AST for that.  Column extraction from WHERE clauses?  is a regex pattern.  is a regex pattern. Imprecise, sure. But HypoPG handles the imprecision.The worst case is slightly slower analysis (more HypoPG rounds), not worse recommendations, and I gave the search loop a 30-second time budget anyway.<h2>Graceful Degradation</h2>CrystalDBA requires HypoPG - If the extension isn't installed, the tool errors out, full stop.  I wanted to make something useful, even without HypoPG, so the toolkit has two tiers:Tier 1 runs without any extensions. It checks system catalogues for missing foreign key indexes, tables with excessive sequential scans, unused indexes, and duplicate indexes. This is heuristic, no cost simulation, but it's useful information CrystalDBA doesn't offer at all.Tier 2 runs when HypoPG is available. Full simulation-based analysis with regex candidate generation and a greedy search loop.The tool auto-detects what's installed and uses the best available tier. One tool, two modes, always useful.<h2>Token Budget</h2>More tools means more token usage. Every tool definition (name, description, parameter schema) gets sent to the LLM on every request. Research shows LLMs start making worse tool choices as the list grows.The pgEdge server has about 11 built-in tools. Adding three more is roughly 600-900 extra tokens per request. That's fine at this scale, but I used the token budget concern to drive design decisions anyway. Instead of replicating CrystalDBA's seven separate health check tools, I built one combined  with a  parameter. One tool in the list, seven capabilities behind it.<h2>What CrystalDBA Actually Does</h2>I had Claude go through CrystalDBA's codebase and break down what each tool does and how. Here's what it found.Health checks: Seven categories: index, connection, vacuum, sequence, replication, buffer, constraint. Some are sophisticated (the index bloat calculation is a multi-CTE query that estimates btree page counts). Some are surprisingly basic (the connection health check just counts connections against hardcoded thresholds of 500 total and 100 idle). I wanted to match the sophisticated parts and improve the basic ones.Top queries: Three sort modes including a "resources" mode that computes fractional consumption across five dimensions - execution time, shared blocks accessed/read/ dirtied, and WAL bytes. I matched all three modes and added a  filter to cut noise from one-off queries.Index tuning: CrystalDBA's index tuner is a ~670-line Python system. It uses pglast for candidate extraction and HypoPG for validation. The approach here was to replace pglast with regex, add Tier 1 heuristics, and keep the HypoPG simulation loop. The Tier 2 replaces  bind parameters with NULL before running EXPLAIN - cruder than CrystalDBA's type-aware substitution via pglast, but functional. The planner still produces a reasonable execution plan, and the before/after cost comparison holds because both runs use the same substitution.<h2>What I Shipped</h2>Three custom tools in a single YAML file:<ul><li>get_top_queries</li><li> - Three sort modes, extension detection, PG version-aware column names.</li></ul><ul><li>analyze_db_health</li><li> - Seven health check categories, JSON output with summary scoring.</li></ul><ul><li>recommend_indexes</li><li> - Two-tier degradation, regex-based candidate generation, HypoPG simulation.</li></ul>No core server changes. No new dependencies. No CGO. Drop in the YAML file, add one line to the config, restart.The final YAML file is 1,668 lines. Most of that is PL/pgSQL inside  blocks. The  tool alone is ~780 lines because it packs seven diagnostic categories into a single anonymous DO block. Verbose, sure. But it's one tool in the tool list, not seven.<h2>PL/pgSQL Will Hurt You</h2>Testing on PG 18 surfaced two bugs worth documenting because they're the kind of thing that silently eats your results and leaves no trace.The jsonb array concatenation bug. PL/pgSQL's  operator behaves differently depending on how you give it its operands. This works:This doesn't:Same types. Same values. Different results. When the right-hand side comes from the  operator inline, PL/pgSQL treats it as an object merge instead of an array append. The empty array quietly becomes a plain object. Every downstream  check fails. Every  loop gets  as its upper bound and silently skips. Zero simulation results. No error. No warning. Nothing.The fix is :Explicit. Unambiguous. Works every time. I hit this in both the candidate deduplication and the existing-index filter - two places where the simulation pipeline builds arrays by appending extracted elements.The int2vector indexing bug. PostgreSQL's  column is an  - a compact array of column attribute numbers. Standard PostgreSQL arrays are 1-based.  is 0-based. So  is the first indexed column, and  is the second.I wrote . For single-column indexes,  returns 0. No real column has . So the existing-index filter matched nothing and let every candidate through - including columns that already had perfectly good indexes. The simulation would then show 0% improvement (the index already exists), and the candidate would be quietly discarded. Correct final result, completely wrong intermediate logic, impossible to spot without tracing every step.Both bugs shared the same signature: silent failure producing plausible-looking output. The tool returned heuristic recommendations and reported . Everything looked right. it just had zero simulation results where there should have been three. Without end-to-end testing against a real database with known-good data, I'd have shipped it.<h2>Did It Work?</h2>Testing confirmed all three tools compile and execute on PG 18 without errors. The test database surfaced four missing foreign key indexes, a stack of duplicate indexes, and, with the bugs fixed, simulation results showing 99.6% cost reduction for the obvious missing index on . Exactly the kind of finding that makes a DBA tool worth having.So - can you replicate a 4,000-line Python MCP server with custom tools in a YAML file? Yes. 1,668 lines of PL/pgSQL, covering regex-based SQL analysis, HypoPG simulation, and multi-tier graceful degradation. The constraint - running inside PostgreSQL instead of outside it - turned out to be a feature, not a limitation. Zero server-side dependencies. Drop in the YAML. Restart.<h2>What's Next</h2>The toolkit covers the CrystalDBA feature gap, but there's room to grow:<ul><li>Multi-column index candidates.</li><li> The regex extraction currently generates single-column candidates. Composite indexes (covering multiple WHERE columns or matching ORDER BY sequences) would improve Tier 2 recommendations.</li></ul><ul><li>Greedy selection refinement.</li><li> The current Tier 2 evaluates each candidate independently. A full greedy loop - pick the best, keep it active, re-evaluate remaining candidates - would catch index interactions where adding one makes another redundant.</li></ul><ul><li>Additional toolkit packs.</li><li> Security audit, migration helper, and monitoring packs could follow the same YAML drop-in pattern. The custom definitions system handles it.</li></ul><ul><li>Dynamic tool discovery.</li><li> If the number of custom tools grows past 20-30, the three-tool pattern (search_tools, describe_tools, execute_tool) could replace static listing and cut token usage by 96%. Nowhere near that threshold today, but the architecture supports it.</li></ul><h2>Vibe-Coding Learnings</h2>I said up top that Claude wrote the code. That's true. But "Claude wrote the code" undersells the amount of steering involved. Here's what I actually learned about using an AI coding agent for a project like this.Architecture is your job. Claude will happily write 1,668 lines of PL/pgSQL, but it won't tell you to use custom tools instead of modifying the server core. It won't decide that three tools is better than seven. It won't pick the regex-over-pglast approach or design the two-tier degradation. Those are product and architecture decisions, and if you don't make them, you get whatever the agent's default instincts produce. Direct the what and the why. Let the agent handle the how.You have to know what to look at. When I pointed Claude at CrystalDBA and said "tell me how the index tuner works," it came back with a clear breakdown of pglast, HypoPG, and the candidate-generation pipeline. But I had to know to ask. The agent doesn't go looking for comparable projects on its own. Competitive analysis is still a human skill.Testing is where you earn your keep. Claude generated code that compiled, ran without errors, and produced plausible-looking output. It also had two silent bugs that meant the most important feature (HypoPG simulation) returned zero results. Everything looked right. The tool reported  and returned heuristic recommendations. Without testing against a real database with known-good data, I'd have shipped broken code with confidence. The agent can write tests, but you have to be the one who notices the output doesn't match reality.Push back on the agent. Claude's first instinct on several occasions was wrong - suggesting approaches I had to reject, making changes I didn't ask for, or missing context I had from elsewhere in the project. The more domain knowledge you bring, the more useful the agent becomes. It's a force multiplier, not a replacement.The bug you fix yourself hits different. During this project I found a genuine bug in the pgEdge server's custom tool execution, unrelated to the DBA toolkit, but surfaced by testing the toolkit against it. I had Claude diagnose the root cause and write the fix, then submitted the PR myself. Without Claude I'd have been flailing and pulling an engineer off real work to help me. Instead, product management found, diagnosed, and fixed a server bug without bothering anyone. That's new.The whole project (research, design, implementation, debugging, testing) took about three days of evening sessions. Most of that time was me reading output, testing in the browser, and telling Claude what to fix. The ratio of my typing to Claude's typing was probably 1:50. The ratio of my decision-making to Claude's decision-making was the inverse.<h2>Try It</h2>Download <a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/examples/pgedge-postgres-mcp-dba.yaml"><u>pgedge-postgres-mcp-dba.yaml</u></a> into your MCP server directory and add one line to your MCP server config:Restart the server. The three tools appear in the tool list immediately. For , your database needs the  extension enabled. For Tier 2 index simulation, install <a href="https://hypopg.readthedocs.io/"><u>HypoPG</u></a>. Both are optional - the tools degrade gracefully without them.Once the tools are loaded, try these prompts:<ul><li>"What are the slowest queries on this database?" - triggers </li><li>get_top_queries</li></ul><ul><li>"Run a health check on this database" - triggers </li><li>analyze_db_health</li></ul><ul><li>"Are there any missing indexes I should add?" - triggers </li><li>recommend_indexes</li><li> Tier 1 (heuristic)</li></ul><ul><li>"Simulate which indexes would improve my top queries" - triggers </li><li>recommend_indexes</li><li> Tier 2 (HypoPG)</li></ul>Links:<ul><li>pgEdge Postgres MCP Server</li><li> - </li><li>GitHub</li><li> | </li><li>Documentation</li><li> | </li><li>Getting started</li></ul><ul><li>DBA Toolkit YAML</li><li> - </li><li>examples/pgedge-postgres-mcp-dba.yaml</li></ul><ul><li>pgEdge</li><li> - </li><li>pgedge.com</li><li> | </li><li>AI Toolkit</li></ul></p> ]]></description>
            <guid>https://www.pgedge.com/blog/replicating-crystaldba-with-pgedge-mcp-server-custom-tools</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>pgEdge,Distributed Postgres,PostgreSQL,PostgreSQL High Availability,Distributed Postgres,pgEdge,PostgreSQL,postgres,PostgreSQL High Availability</category>
            <title><![CDATA[MM-Ready - An Origin Story]]></title>
            <link>https://www.pgedge.com/blog/mm-ready-an-origin-story</link>
            <pubDate>Mon, 23 Mar 2026 06:34:31 GMT</pubDate>
            <description><![CDATA[ <p>I'm a Product Manager. Not a developer. I want to be upfront about that because everything that follows only makes sense if you understand that I have no business writing software - and I did it anyway.I built <a href="https://github.com/AntTheLimey/mm-ready">MM-Ready</a>, an open-source CLI tool that scans a PostgreSQL database and tells you exactly what needs to change before you can run multi-master replication with pgEdge Spock. It checks 56 things across your schema, replication config, extensions, sequences, triggers, and SQL patterns. It gives you a severity-graded report - CRITICAL, WARNING, CONSIDER, INFO - with specific remediation steps for each finding. It runs against a live database, a schema dump file, or an existing Spock installation.I built the first version in about four hours using Claude Code while operating in a zombie-like half-asleep state. Not so much vibe-coding as trance-coding.<h2>The Priority that's never priority enough</h2>Every customer evaluating the move to multi-master replication has the same question: "What, if anything, do I need to change in my database before I can turn this on?"The answer could be "nothing", or it could potentially touch dozens of things. Tables without primary keys are insert-only replication - with logical replication, UPDATE and DELETE require a unique row identifier. Foreign keys with CASCADE actions can fire on the origin and create conflicts because foreign keys aren't validated at the subscriber. Certain sequence types don't replicate well. Some extensions aren't compatible. Deferrable constraints get silently skipped for conflict resolution.  doesn't guarantee uniqueness for logical replication conflict resolution. So on and so forth. These things aren't bugs or issues - they're a natural consequence of a shift in operational mindset. It's akin to moving from a one-steering-wheel driving experience, and getting in a skid-steer with its highly maneuverable double-joystick.Our Customer Success team was having to do this analysis mostly by hand. They had scripts they'd written, tribal knowledge passed between team members, and a process that worked but didn't scale. Our Slack was littered with such comments as "Wouldn't it be nice if we had a tool that checked everything?" I'd wanted to build a tool to automate this for months. It was on the roadmap. But engineering was deep in development, and this kept sliding - the thing that would make sales smoother losing to the thing that makes the product work. Classic problem. Has happened at every single place I have worked.<h2>3 AM on remote calls, I regret quitting caffeine.</h2>Spockfest is pgEdge's annual planning and hack week. This year it was in Istanbul. I decided to join remotely - a decision that turned out to be accidentally brilliant when a snowstorm grounded every flight that day anyway. What I forgot about was the time difference. Eight hours ahead.So there I was, alarm going off at 1 AM, stumbling to my desk to join the calls for eight hours before my actual day had even started. For a week. I was living on one- and two-hour catnaps, no caffeine (I quit that 20 years ago), and trying to maintain the appearance of a functioning human being.The team would break for lunch. Stretch their legs. Clear their heads. Normal stuff. I'm stuck at the keyboard. Can't go back to sleep - they'll be back in 45 minutes. Can't really start anything meaningful. Just sitting there in the dark, lit only by the glow of Google Meet, trying not to let my forehead hit the desk. So I decided to see if I could make a readiness checker with Claude Code.<h2>Light Speed is too slow, We're going to Ludicrous Speed!</h2>I knew the problem domain. I'd been editing the Spock documentation, talking to CS, fielding customer questions about replication readiness for months. I'd read the Spock source code - the actual C, not just the docs, because I wanted to be SURE. And to be clear, no, I can't write C code, but I can follow logic. I knew what  does when you set REPLICA IDENTITY FULL without a primary key. I knew that delta-apply columns need NOT NULL constraints. I knew trigger firing behavior under .I knew what to check. I'm just 20 years removed from writing real code efficiently, effectively, coherently or with any sense of speed or developer decorum.So I grabbed every source of knowledge I could. The Spock source code. Our documentation. The Customer Success knowledge base from JIRA. Every note I'd ever taken on things to watch out for when converting from single-primary to multi-master. I threw it all at Claude to churn through and plan out.Then I started describing the tool. A Python CLI. A registry pattern where checks auto-discover themselves - drop a new file in the right directory and it just works. A base class with a  method that returns findings. Severity levels as dataclasses. System catalog queries using , not . HTML output because I wanted something I could actually read without squinting at JSON.The iteration speed was incredible. Describe how I wanted it to work, enjoy some silly Claude thinking messages, see it work, describe the next one. A few minutes per feature or fix. Fifty-six checks in a few hours. The hard part wasn't the code. It was knowing what to check and why - the domain knowledge accumulated for two years. By the time the team came back from lunch, I had a functioning scanner. Thanks to Claude, it had unit tests, CI tests, and documentation.<img src="https://a.storyblok.com/f/187930/480x270/04d8d6eb73/its-alive.jpg"><h2>How Fast Can This Thing Pivot?</h2>I was all excited to show what I had to Customer Success. It didn't go quite as awesome as I'd hoped. "This is great," they said. "But we don't connect to customer databases. They send us a  file and we analyze that."I'd built the right tool for the wrong workflow. The entire scanner assumed live database access - querying system catalogs, checking GUC settings, inspecting . None of that works against a SQL dump file.This is where I have to be honest about what happened, because it's embarrassing and instructive in equal measure. I fell into two classic product management traps. Traps I teach other people to avoid. Traps I have given actual presentations about.<img src="https://a.storyblok.com/f/187930/800x518/c58b6fd261/facepalm.jpg" >Trap one: "I already know the problem." I'd sat in enough CS conversations, read enough JIRA tickets, heard enough customer calls to feel confident I understood the workflow. I didn't verify. A quick Slack message - "hey, when a customer asks about Spock readiness, walk me through exactly what you do step by step" - would have saved me building the wrong thing first.Trap two: scope creep dressed up as vision. I didn't just build a schema checker. I built a live database scanner, an audit mode for existing Spock installations, a monitoring mode for long-running observation, configurable check filtering, three output formats. All before validating that CS even needed any of it. Classic "while I'm in here I might as well..." thinking - except "in here" was 3 AM and I was running on fumes and hubris.I do think the extra modes have value. A customer scanning their own live database before a migration conversation is a real use case. Auditing a running Spock deployment for things that slipped through is a real use case. But I built four tools when I could have built one, validated it, and then expanded.Anyway. I went back and built  mode - a schema parser that extracts tables, constraints, indexes, and sequences from a dump file and runs the 19 checks that work from schema structure alone. The other 37 get marked as skipped with a note saying "requires live database connection." That's the mode CS actually uses.<h2>Apart from that, Mrs. Lincoln, How was the Play?</h2>I want to give some specifics here, because the "PM builds a tool with AI" headline writes itself, and the reality is more nuanced than that.What I brought: domain knowledge. Eighteen months of Spock internals, customer conversations, and schema analysis experience. I could describe exactly what each check should look for, why it matters, and what the remediation should say. That part was irreplaceable. Of course, I also brought my umpteen years of product management experience - being able to describe the requirements and the reasons for them.What Claude Code brought: the ability to turn "I need a check that finds tables without primary keys" into working, tested Python. I described the architecture. Claude wrote it.Along the way I picked up new tools. I saw another pgEdge project using CodeRabbit for automated code review and basically ripped off their setup - their , their sub-agents, the lot. Then I learned to customize. I guess that's how everyone actually learns this stuff. You see someone doing it, you copy it, you make it yours.If I had to pick out the big lessons I learned:1) Setting up the <a href="https://claude.md/">CLAUDE.md</a> file, and the .claude folder for sub-agents makes a WORLD of difference. I have the honour of working with the inestimable Dave Page, and it's his setup I ripped off. I downloaded his project next to mine, and told Claude to go examine his project for inspiration, bring over everything relevant to my project, and customize appropriately. It ended up writing an additional Spock Expert along the way.2) Using the /plugin command and finding useful tools. One tool especially I found to be particularly amazing: <a href="https://github.com/obra/superpowers">https://github.com/obra/superpowers</a> - the brainstorming, design and planning tools are amazing. Especially brainstorming. I wish I'd known about this one before I began.3) Automated code review. I fell in love with <a href="https://www.coderabbit.ai/">CodeRabbit</a> (especially its poems). My absolute favourite thing is that, apart from being free for open source projects, is that it provides a single "paste this into your AI to fix the things I found" prompt.There were many other lessons, but they can wait for another time.<img src="https://a.storyblok.com/f/187930/1812x1378/2b4e102c27/vscode-screenshot.png" >What would I warn people about? Vibe-coding gives you speed but not understanding. You end up owning code you can't fully explain. When something breaks, you're back to describing the problem and hoping the AI can fix what the AI wrote. The test coverage exists because Claude wrote the tests too - which means the blind spots are probably shared. That thought keeps me up at night. Well … more than I was already up at night. One more thing I did bring to the table was the oversight. I watched what it was doing, like a hawk. I would frequently interrupt it, to ask it to explain what it was doing and WHY it was doing it like that. Many times I was then able to ask "why aren't you doing it this way instead?" or "wouldn't that cause this other problem?" and force it to rethink or justify its answer. Maybe that could be summarized as Wisdom or Experience - Maybe you know the saying, "Knowledge is knowing a tomato is a fruit. Wisdom is not putting it in a fruit salad".<img src="https://a.storyblok.com/f/187930/1024x561/527fed9b0a/whatd-you-do.jpg" >Oh, and the product engineers prefer Go. So I asked Claude to convert the entire project to Go. New repo. Full port. Took about an hour. I called it "MM-Ready-Go." The dad in me thought that was very funny. I had nobody to share it with at 4 AM.<h2>What MM-Ready Actually Does</h2>Four modes, one goal: tell you if your PostgreSQL database is ready for multi-master replication. - Connect to a live PostgreSQL database and run all 56 checks against the system catalogs. You get an HTML report (or Markdown, or JSON) with severity-graded findings and specific remediation steps. This is for teams who want to self-assess before talking to us. - Feed it a  file. No database connection needed. Runs the 19 checks that work from schema structure alone. This is what our Customer Success team actually uses when a prospect sends over their schema for evaluation. - Point it at a database that already has Spock installed and running. Checks subscription health, replication set configuration, and things that might have slipped through initial setup. For existing pgEdge customers who want a health check. - Long-running observation mode. Takes snapshots of  data and parses logs over time. For when you want to watch behavior, not just check structure.What does it check? Primary keys. Foreign key cascade actions. Sequence types and ownership. Trigger firing behavior under replication. Unsupported extensions. Encoding mismatches. Deferrable constraints. REPLICA IDENTITY settings. WAL configuration. Advisory lock usage. TRUNCATE patterns in . And about 40 more things I won't list here because you get the point.The checks are pluggable. Drop a new Python file in the right subdirectory, subclass , implement , and it auto-discovers. No registration needed, no rebuild. CS can add their own checks when they find new things to watch for. That was deliberate from day one - I wanted this to grow without me being the bottleneck.<img src="https://a.storyblok.com/f/187930/1266x903/86c9adf16f/mm-ready-report.png"><h2>Where It Stands - Honestly</h2><a href="https://github.com/AntTheLimey/mm-ready">MM-Ready is on my personal GitHub repo</a>. Not the pgEdge organization. I'm too nervous to find out everything I'd need to fix to get it "up to snuff" for the company repository. It works. CS is using the analyze mode. But it was built by a sleep-deprived PM with an AI copilot during a week where I was averaging three hours of sleep a night, and I am under no illusions about what that means for code quality.I'm not selling you on this tool. I'm telling you it exists, it solves a real problem, and I'm excited - and terrified - to get feedback on it.<h2>If This Is Useful to You</h2>If you're thinking about multi-master PostgreSQL replication - or you're already evaluating pgEdge Spock - grab the tool and try it:The <a href="https://github.com/AntTheLimey/mm-ready"><u>GitHub repo</u></a> has the full documentation. File issues. Tell me what's broken. Tell me what's missing. I can actually fix things now - I have an AI that doesn't need sleep even if I do. - MM-Ready tells you what needs to change. <a href="/home-2026">pgEdge</a> is the team that actually makes the replication work. Multi-master across regions, automatic conflict resolution, standard PostgreSQL. No application changes required. If the report surfaces things you want help with, that's what we're here for. - I'm not going to tell you to go vibe-code it at 3 AM. That was dumb. But I will tell you this: the gap between "I know exactly what this should do" and "I can build it" is smaller than it used to be. Domain knowledge matters more than it ever has. The question isn't whether you can build it. It's whether you should build it before talking to the people who'll actually use it.I should have sent that Slack message first. Do better than I did.<i>Ant (Antony Pegg) is Director of Product at </i><a href="/home-2026">pgEdge</a><i>, where he works on distributed PostgreSQL and tries very hard not to build things at 3 AM anymore. MM-Ready is available on </i><a href="https://github.com/AntTheLimey/mm-ready">GitHub</a><i>. MM-Ready-Go exists too, because sometimes the joke is the feature.</i></p> ]]></description>
            <guid>https://www.pgedge.com/blog/mm-ready-an-origin-story</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[How to Use the pgEdge MCP Server for PostgreSQL with Claude Cowork]]></title>
            <link>https://www.pgedge.com/blog/how-to-use-the-pgedge-mcp-server-for-postgresql-with-claude-cowork</link>
            <pubDate>Thu, 29 Jan 2026 05:18:20 GMT</pubDate>
            <description><![CDATA[ <p>The rise of agentic AI is transforming how we build applications, and databases are at the center of this transformation. As AI agents become more sophisticated, they need reliable, real-time access to data.If you’ve <a href="https://www.pgedge.com/blog/rag-servers-vs-mcp-servers-choosing-the-right-approach-for-ai-powered-database-access"><u>decided to use an MCP server</u></a> for exposing data to large language models (LLMs) to build internal tools for trusted users, apply sophisticated database schema changes, or translate natural language into SQL, you might find the pgedge-postgres-mcp project (available on <a href="https://github.com/pgEdge/pgedge-postgres-mcp/"><u>GitHub</u></a>) useful to try.This 100% open source Natural Language Agent for PostgreSQL provides a connection between any MCP-compatible client (including AI assistants like Claude) and any standard flavor of Postgres, whether you’re creating a new greenfield project or are using an existing database.<h2>Connecting AI agents to PostgreSQL with pgedge-postgres-mcp</h2>The Model Context Protocol (MCP) is a standardized way for AI assistants to communicate with external data sources. Think of MCP as a universal adapter; just as USB-C provides a standard connection for devices, MCP provides a standard way for AI agents to connect to tools, databases, and services.pgedge-postgres-mcp implements this protocol specifically for PostgreSQL, creating a bridge between AI assistants and your data. It enables users to:<ul><li>Query databases using natural language</li></ul><ul><li>Execute SQL queries safely in read-only transactions</li></ul><ul><li>Access local or distributed PostgreSQL instances</li></ul><ul><li>Work with production data safely (read-only by default)</li></ul><ul><li>Interact with database schemas and metadata</li></ul>Instead of writing custom integration code for each AI application, you get a ready-to-use connection between AI agents and your PostgreSQL database. It works with any PostgreSQL instance, whether you're running locally for development or hosting remotely.<h2>Benefits for AI Development</h2>The pgEdge MCP Server solves a specific problem: giving AI assistants database access without building custom middleware, with configurable controls for authentication, read-only transactions, and per-database permissions.For developers, it means:<ul><li>No custom database connectors to write and maintain</li></ul><ul><li>Faster prototyping of AI-powered applications</li></ul><ul><li>Natural language database operations through Claude</li></ul><ul><li>The ability to test and iterate on AI workflows without writing integration code</li></ul>For teams running production systems:<ul><li>Read-only transaction enforcement for safety</li></ul><ul><li>Works with both single-node and distributed PostgreSQL</li></ul><ul><li>Open source with production-ready performance</li></ul><ul><li>Support for complex queries with semantic search capabilities</li></ul>The pgEdge MCP Server also provides flexibility. You can start with quick prototypes or enterprise-grade production AI applications.<h2>Claude vs. Claude Cowork: What's the Difference?</h2>Claude is Anthropic's AI assistant available through claude.ai and API. It's conversational and helps with analysis, writing, code generation, and problem-solving. When you chat with Claude in your browser or integrate it into applications through the API, you're using the core AI model.Claude Cowork is a research preview feature within the Claude Desktop application that brings agentic capabilities to Claude. Unlike standard chat interactions where Claude responds to one message at a time, Cowork enables Claude to take on complex, multi-step tasks and execute them autonomously on your behalf.The key differences:<ul><li>Agentic execution</li><li>: Cowork can break complex projects into parallel workstreams and coordinate sub-agents to complete them</li></ul><ul><li>Extended processing</li><li>: Work on complex tasks for extended periods without conversation timeouts or context limits</li></ul><ul><li>Direct file access</li><li>: Read and write files on your local system without manual uploads and downloads</li></ul><ul><li>Professional outputs</li><li>: Generate Excel files with formulas, PowerPoint presentations, formatted documents, and more</li></ul>When you connect the pgEdge MCP Server to Claude Cowork, the AI can query your database, analyze the results, and produce complete deliverables—all autonomously. For example, you could ask Cowork to "analyze our customer data and create a quarterly report with charts," then return later to find a finished document ready for review.<h2>Getting Started: The Simplest Example</h2>If you have Go installed and a PostgreSQL database accessible from your machine, the fastest way to get started is to build the MCP server from source and connect it directly to Claude Desktop using stdio.<h3>Prerequisites</h3><ul><li>Claude Desktop application installed.  The MCP Server works with Claude Desktop, even without CoWork</li></ul><ul><li>Claude Pro, Max, Team, or Enterprise subscription with access to Cowork</li></ul><ul><li>Go 1.24.0 or later installed</li></ul><ul><li>PostgreSQL running locally or on your network</li></ul><h3>Step 1: Build the MCP Server</h3>Clone the repository and build the server binary:This produces the binary at <h3>Step 2: Configure Claude Desktop</h3>Edit the Claude Desktop configuration file at Replace the path and database credentials with your own. Restart Claude Desktop, and the pgEdge MCP Server will appear as an available tool.This approach runs the MCP server as a subprocess of Claude Desktop, communicating over stdio. There is no HTTP server, no authentication tokens, and no Node.js dependency. The trade-off is that the database must be directly reachable from the machine running Claude Desktop.<h2>Getting Started: The Deep Dive</h2>The stdio approach is ideal for quick experimentation, but for team environments, remote access, or production deployments, you will want to run the MCP server as a network service. The following sections cover deploying the server in HTTP mode—either as a standalone binary or as a Docker container—and configuring authentication, multiple databases, and Claude Desktop connectivity.<h3>Step 1: Install the pgEdge MCP Server</h3>The pgEdge MCP Server can be deployed as a standalone binary managed by systemd, or as a Docker container. Choose the option that best fits your infrastructure.Option A: Download the BinaryDownload the pre-built binary for your platform from the <a href="https://docs.pgedge.com/enterprise/"><u>pgEdge Enterprise repository</u></a><a href="https://docs.pgedge.com/enterprise/">.</a> This option works well with systemd for service management.You will need to configure authentication tokens for Claude Desktop to connect. See Step 2 for configuration details.Option B: Docker ComposeFor a containerized deployment, use the Docker Compose setup from the GitHub repository:Edit the file with your configuration. At minimum, you need to set:The  setting creates an API token that you will use to connect Claude Desktop to the MCP server. Choose a secure token value and save it—you will need it in Step 3.For local development only, you can disable authentication by not setting , but this is not recommended for production or network-accessible deployments.After configuring your file, start the containers:This deploys both the MCP server (port 8080) and a web interface (port 8081). The web interface at  is useful for testing your connection and exploring the server's capabilities.The  approach is convenient for development, but for production deployments you will want to use a YAML configuration file instead. The repository includes a that mounts a local directory for the server configuration file, giving you full control over all server options. If you use the production compose file, read Step 2 below for guidance on the YAML configuration format and token creation. For details, see the <a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/docs/guide/deploy_docker.md"><u>Docker deployment guide</u></a><a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/docs/guide/deploy_docker.md">.</a><h3>Step 2: Configure the MCP Server (Binary Deployments)</h3>If you are using the standard  with an  file, you can skip this step—configuration is handled through environment variables.For binary deployments, the MCP server reads its configuration from a YAML file named . By default, the server searches for this file in first, then in the same directory as the binary. You can specify a different location using the  flag:Basic Configuration for HTTP ModeIf you are deploying the binary as a network service (e.g., with systemd), create a configuration file with HTTP mode enabled:Creating Authentication TokensFor HTTP mode deployments, you need to create authentication tokens before clients can connect. Use the built-in token management command:This generates a new token and displays it. Save this token—you will need it to configure Claude Desktop in Step 3.You can also list existing tokens:For local development, you can disable authentication by setting  in your configuration file, or by passing the flag when starting the server. This is not recommended for production or network-accessible deployments.Configuring Multiple DatabasesIn beta 3 or later the MCP server supports multiple database connections with agents like Claude, allowing users to switch between different databases at runtime. This is useful for environments with separate development, staging, and production databases:Each database must have a unique  that users reference when switching connections. The  field controls which users can access each database—an empty list means all authenticated users have access.The  and  options (available from v1.0.0-beta3) allow agents such as Claude to select a different database connection using MCP Tools (unlike the pgEdge Natural Language Agents which use more tightly controlled REST APIs).For a complete configuration reference with all available options, see the <a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/docs/reference/config-examples/server.md"><u>server configuration documentation</u></a>.<h3>Step 3: Connect Claude Desktop to the MCP Server</h3>How you connect Claude Desktop to the MCP server depends on your deployment method.Claude Desktop Configuration FileClaude Desktop reads MCP server configurations from a JSON file. On macOS, this file is located at:If this file does not exist, create it with a text editor.For HTTP Mode (Docker or Binary with systemd)When the MCP server is running in HTTP mode, you need to use the  proxy to connect. This requires passing the authentication token you configured in Step 1 or Step 2:Replace  with the token you configured in your  file (for Docker) or created with (for binary deployments).Replace  with your server's address if it is running on a different host or port. The  path is required.Note: This method requires Node.js to be installed on your system. The  command (included with Node.js) automatically downloads and runs the  package. If you don't have Node.js installed, you can install it on macOS with Homebrew:Alternatively, download it from <a href="https://nodejs.org/"><u>nodejs.org</u></a><a href="https://nodejs.org/">.</a>After saving the file, restart Claude Desktop for the changes to take effect.For Remote HTTPS ServersWhen the MCP server is deployed remotely with HTTPS enabled, you can use Claude Desktop's built-in connector feature instead of editing the configuration file:<ul><li>Open Claude Desktop and go to </li><li>Settings > Connectors</li></ul><ul><li>Click </li><li>Add custom connector</li></ul><ul><li>Enter your MCP server URL (e.g., </li><li>https://mcp.example.com:8080</li><li>)</li></ul><ul><li>Click </li><li>Add</li></ul>This method requires HTTPS—Claude Desktop's custom connector feature does not support plain HTTP URLs.<h3>Step 4: Query Your Database</h3>Now you can ask Claude Cowork to work with your database using plain English:You: "Show me the top 10 customers by revenue."Claude Cowork: Connects to the database and executes a queryReturns results in a formatted tableThe AI understands your request, translates it into appropriate SQL (inferring the semantics of the database from the schema object names it finds), executes the query in a read-only transaction, and presents the results in a readable format.<h3>Step 5: Combine Database Queries with File Operations</h3>The real power of Cowork emerges when you combine database queries with its file system capabilities:<ul><li>"Pull last month's sales data and create an Excel spreadsheet with charts showing trends by region"</li></ul><ul><li>"Analyze customer purchase patterns and generate a PowerPoint presentation with key insights"</li></ul><ul><li>"Query the orders table for delayed shipments and create a formatted report I can send to the logistics team"</li></ul><ul><li>"Compare this quarter's revenue to last quarter and save a summary document to my Desktop"</li></ul>Because Claude Cowork can interact with both your database and your file system, it can turn database insights into actionable outputs—reports, charts, presentations, or spreadsheets—all without manual intervention. You can describe the outcome you want, let Cowork run, and return to find completed deliverables.<h3>Step 6: Explore Your Database Structure</h3>You can also ask Claude Cowork to help you understand your databases:You: "What tables are in the database and what do they contain?"You: "Show me the schema for the orders table."You: "What columns have foreign key relationships?"This is helpful when working with unfamiliar databases or when you need to understand how data is structured before writing more complex queries.<h2>Common Use Cases</h2>Here are some practical ways to use the pgEdge MCP Server with Claude Cowork.Ad-hoc reporting: "Pull all orders from Q4 where the shipping cost exceeded 10% of the order value, and put it in a spreadsheet with a summary tab." Instead of writing the query, exporting to CSV, and formatting in Excel, you describe the output and get a finished file.Investigating issues: A customer reports a billing discrepancy. Ask Cowork to trace their order history, compare invoice amounts to order totals, and flag any mismatches. You get a summary of what it found rather than spending an hour joining tables.Understanding unfamiliar databases: You inherit a database with 200 tables and no documentation. Ask Cowork to explore the schema, identify the core entities, and explain how they relate. It's faster than reading through pg_catalog yourself.Data quality checks: "Are there any orders with a ship date before the order date? Any customers with duplicate email addresses?" Run sanity checks in plain English and get results you can act on.'Building queries: When you know what you want but not the exact SQL, describe it and let Cowork write the query. Review what it produces, learn the schema as you go, and iterate until you have what you need.<h2>Why This Matters</h2>The pgEdge MCP Server combined with Claude Cowork changes how you interact with databases. Instead of SQL being a barrier between you and your data, an AI agent becomes an intelligent intermediary that understands both what you want and how your database is structured—and can deliver complete, polished outputs.<h3>Enterprise-Grade Postgres Features for Production Use</h3>As you move from experimentation to production use, the foundation matters. The pgEdge MCP Server is built with enterprise requirements in mind:<ul><li>Security:</li><li> Read-only transactions by default, TLS support, token authentication</li></ul><ul><li>Reliability:</li><li> Production-tested connection handling and query execution</li></ul><ul><li>Performance:</li><li> Connection pooling and efficient query processing</li></ul><ul><li>Compatibility:</li><li> Support for standard PostgreSQL features, extensions, and tools including pgvector for semantic search</li></ul><h3>The Distributed Postgres Advantage</h3>This becomes particularly important for distributed PostgreSQL deployments. pgEdge's infrastructure is built for multi-region, active-active database architectures. As your database scales across nodes and regions, having an AI agent that can intelligently query the right data becomes <br>increasingly valuable.<h2>Getting Started Today</h2>There are ever more possibilities for interacting with databases in new ways as technology progresses. With Claude Cowork and the pgEdge MCP Server, you can describe the outcome you want—a report, an analysis, a data export—and let an AI agent handle the queries, analysis, and document creation. It’s a new way to work with data that can help expedite information management in the database.The pgEdge MCP Server is open source under the PostgreSQL license and is ready to use. Visit the <a href="https://github.com/pgEdge/pgedge-postgres-mcp"><u>pgEdge MCP Server GitHub repository</u></a><a href="https://github.com/pgEdge/pgedge-postgres-mcp"> </a>for documentation and configuration examples.To download the MCP Server binary, visit the <a href="https://docs.pgedge.com/enterprise/"><u>pgEdge Enterprise repository</u></a><a href="https://docs.pgedge.com/enterprise/">.</a> For containerized deployments, clone the repository and use the included Docker Compose setup, or follow the <a href="https://github.com/pgEdge/pgedge-postgres-mcp/blob/main/docs/guide/deploy_docker.md"><u>Docker deployment guide</u></a>.The pgEdge MCP Server is part of the larger pgEdge Agentic AI Toolkit, which includes additional tools and integrations for building AI-powered applications on PostgreSQL. Learn more at <a href="https://pgedge.com/ai"><u>pgedge.com/ai</u></a>.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/how-to-use-the-pgedge-mcp-server-for-postgresql-with-claude-cowork</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>pgEdge,PostgreSQL,PostgreSQL</category>
            <title><![CDATA[Zero-Downtime PostgreSQL Maintenance with pgEdge]]></title>
            <link>https://www.pgedge.com/blog/zero-downtime-postgresql-maintenance-with-pgedge</link>
            <pubDate>Fri, 12 Dec 2025 06:02:57 GMT</pubDate>
            <description><![CDATA[ <p>PostgreSQL maintenance doesn't have to mean downtime anymore. With pgEdge's zero-downtime node addition, you can perform critical maintenance tasks like version upgrades, hardware replacements, and cluster expansions without interrupting production workloads. Your applications stay online. Your users stay connected. Your business keeps running.This capability is available across both single-primary deployments (with integration of the open source extension Spock) and globally distributed deployments (by default), giving you the same operational advantages whether you're running a single-region deployment or a globally distributed system. Now, for an incredibly quick and easy approach to zero-downtime node addition, you can use the pgEdge Postgres Control Plane (<a href="https://github.com/pgEdge/control-plane">hosted on GitHub</a>). This approach provides drastically simplified management and orchestration of Postgres databases using a declarative API, whether you're running a single-primary or a globally distributed deployment of PostgreSQL clusters. Spock and other high-availability components come built-in accompanying community PostgreSQL for quick database administration with simple commands.And because pgEdge and all associated components are 100% open source under the PostgreSQL license, using 100% core community PostgreSQL, you get to leverage the high-availability components that enable zero-downtime maintenance without vendor lock-in or compatibility concerns.<h2>What Is the Spock Extension?</h2>Spock is pgEdge's advanced logical replication extension for PostgreSQL that enables active-active (multi-master) replication in clusters with row filtering, column projection, conflict handling, and more. Even though Spock originated from earlier projects like pgLogical and BDR 1, it has seen enormous growth. Our dedicated team of PostgreSQL experts continues to push Spock forward, making it a high-performance, enterprise-grade replication system built for distributed environments.<h2>Zero-Downtime Node Addition: Maintenance Without Interruption</h2>Adding a new node to a live cluster used to force you to choose between downtime and complexity. When using the latest versions of Spock, you no longer need to choose - now, you can get both seamless operation and simplicity.This feature lets you add a new PostgreSQL node to an existing Spock cluster without any downtime on the origin or existing subscriber nodes. The process creates a temporary replication slot and subscription, clones the origin's state in parallel, and then promotes the new node to a fully active peer once synchronization completes.<h3>Why Zero-Downtime Node Addition Changes Everything</h3> Your cluster stays live throughout the entire process, with no replication pause and no application downtime. Add capacity when you need it without scheduling maintenance windows or warning users about outages. The process runs through standard Spock CLI commands or scripted workflows - no complex coordination needed. Perform in-place major PostgreSQL version upgrades across your entire cluster without taking anything offline.<h2>How Zero-Downtime Upgrades Work</h2>Regardless of whether you are running a single primary node, or are choosing to run a multi-master deployment,  it’s a straightforward process - even more so if you are using the Control Plane to handle deployment and orchestration for you.First, introduce a new node running the higher PostgreSQL version you want. That node joins the cluster using the zero-downtime addition workflow. Once it's synchronized and active, you remove or replace the older-version nodes one at a time.The only difference if you are running a single primary deployment is you would need to first enable Spock prior to proceeding with the above steps; afterwards, you would need to either remove Spock or just disable the subscriptions between your old node and the new node.If you are using the Control Plane, the Spock extension comes already included and enabled in your database server. You would additionally not need to disable the subscription between your old node and your new node as you would when deploying with a single-primary instance; once the database operation is complete, you would simply need to update your configuration to remove the old node.This rolling upgrade approach means you maintain full read and write availability throughout the entire upgrade process. The workflow involves a coordinated process to ensure clean, consistent cluster integration: The new node configures a disabled subscription to each existing node except the designated source. Each existing node begins buffering new transactions for the new node through a dedicated replication slot. Before initiating the data copy, the source node fully synchronizes with all in-flight transactions from other nodes. The  and  functions guarantee this precondition is met. The source node establishes a subscription to the new node, synchronizes data and structure, and initiates a snapshot-based copy to bring the new node current to the point of the copy. Once the copy completes, the new node activates its subscriptions to the other existing nodes. Spock calculates the last commit timestamp for each node as seen via the source and advances each node's replication slot to that timestamp. This prevents duplicate transactions and allows replication to begin at the right point, receiving only new transactions that occurred after the initial copy.This approach works with no interruption of activity in the existing cluster when running the Control Plane or a distributed cluster with multi-master logical replication across regions. If running just a single primary instance, as noted above, you will need to integrate Spock as a starting step and remove it again at the end of the upgrade process.<h2>Implementation Resources</h2>The <a href="https://docs.pgedge.com/spock-v5/modify/zodan/">Spock documentation</a> provides a complete step-by-step guide to the full process when deploying with an already distributed-enabled cluster. <i>A video tutorial is coming soon on how to perform this operation from start-to-finish on a single-primary node!</i>Additionally, when using Control Plane as your method of deployment, documentation is available for <a href="https://docs.pgedge.com/control-plane/using/upgrade-db/#major-version-upgrades">how to perform a major upgrade</a> leveraging this zero-downtime add node feature from Spock, or on <a href="https://docs.pgedge.com/control-plane/using/update-db/">updating a database</a> outside the context of an upgrade.The Spock GitHub repository includes several working examples in the <a href="https://github.com/pgEdge/spock/tree/main/samples/Z0DAN">samples/Z0DAN</a> directory:<ul><li>A Python-based orchestration script that runs outside the database and coordinates node addition through external automation tools</li></ul><ul><li>A stored procedure version that performs the entire process within PostgreSQL using the dblink extension, offering a fully internal option for controlled or restricted environments</li></ul><h2>LSN Checkpointing: The Technical Foundation</h2>Spock 5+ includes functions that make seamless node addition possible. LSN checkpointing using  and  creates a logical checkpoint in the WAL stream on the source node. You can then monitor another node for the arrival of that checkpoint's LSN to ensure all transactions have completed.When adding a node, you use this to guarantee schema or data changes have fully replicated to your source node before continuing. In the context of zero-downtime node addition, these functions are critical. Once you confirm all in-flight transactions from all nodes have arrived on the designated source node, you initiate the data copy to the new node. Without this precise synchronization, adding a node without interrupting cluster usage wouldn't be possible.<h2>pgEdge Enterprise Postgres and pgEdge Distributed Postgres</h2>These zero-downtime capabilities are available across both pgEdge offerings. gives you production-grade PostgreSQL with built-in support for high availability, advanced backup and restore, monitoring integration, connection pooling, and auditing. It comes with the Spock extension, making it possible to be distributed-ready from day one - just create the extension within the PostgreSQL server itself to enable it. You can start with a standard single-region deployment and seamlessly upgrade to multi-region when your needs evolve. takes this further with active-active, multi-master replication across geographic regions. It delivers low-latency access, data residency compliance, and ultra-high availability through its distributed architecture. is a distributed application designed to simplify the management and orchestration of Postgres databases. It provides a declarative API for defining, deploying, and updating databases across multiple hosts. It seamlessly handles anything from a single primary database all the way up to a globally deployed active-active (multi-master) cluster with attached read-only replicas.All three products are 100% open source under the PostgreSQL license. You get all the power of standard PostgreSQL with no proprietary forks, no compatibility issues, and no vendor lock-in. Use pgvector, PostGIS, JSONB, foreign data wrappers, and more - all out of the box.<h2>Our Core Values</h2>pgEdge's commitment to open source is absolute. Everything runs on standard PostgreSQL. The Spock extension is open source. The tools are open source. You're not buying into a proprietary system that traps you.This matters when you're making infrastructure decisions. You can move between pgEdge Enterprise Postgres and pgEdge Distributed Postgres as your needs change. You can deploy on-premises, in the cloud, or in containers. You can switch hosting providers. You maintain complete control over your data and your deployment.We implemented a core philosophy in Spock 5+: to eliminate avoidable disruption and make high-availability PostgreSQL truly hands-off at scale.This takes away a lot of the risk and allows dev teams to upgrade PostgreSQL versions as soon as new releases are available, to expand capacity in response to traffic spikes, or replace hardware without scheduling maintenance windows weeks in advance.The result?<ul><li>Reduced operational complexity</li></ul><ul><li>Lower risk of human error during maintenance</li></ul><ul><li>Improved cluster elasticity and resilience</li></ul><ul><li>Real-time distributed applications that stay online and synchronized</li></ul><ul><li>True zero-downtime for reads and writes across all nodes in a cluster while expanding or upgrading</li></ul><h2>Getting Started</h2>Looking for a visual how-to to help you hit the ground running? Our solutions engineer Paul Rothrock created a video on how to use the Zero Downtime Add Node feature in Spock to enable seamless cluster scaling and major version upgrades with no service interruption. You can <a href="/video/new-feature-highlight-zero-downtime-add-node">walk through the process, here</a>.It’s easy to start using pgEdge Enterprise Postgres and pgEdge Distributed Postgres <a href="/download">no matter how you choose to deploy</a>. You can get started as a fully managed SaaS offering through pgEdge Cloud, run on your own infrastructure with VM installations, or deploy to containers in Kubernetes environments.Both products include 24x7x365 support from PostgreSQL experts who are core contributors to the community. Whether you need standard support or dedicated Forward Deployed Engineer services, you get access to people who know PostgreSQL inside and out.To learn more about pgEdge's distributed multi-master replication technology and zero-downtime maintenance capabilities, visit the <a href="/solutions/benefit/multi-master">pgEdge website</a> or explore the <a href="https://docs.pgedge.com/?_gl=1*y38ztw*_gcl_au*MTE4NjE3NjIwLjE3NjE3MDgxNjE.">pgEdge documentation</a>.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/zero-downtime-postgresql-maintenance-with-pgedge</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,postgres,pgEdge</category>
            <title><![CDATA[Meeting High Availability Requirements in Non-Distributed PostgreSQL Deployments]]></title>
            <link>https://www.pgedge.com/blog/meeting-high-availability-requirements-in-non-distributed-postgresql-deployments</link>
            <pubDate>Fri, 31 Oct 2025 18:25:07 GMT</pubDate>
            <description><![CDATA[ <p>High availability in PostgreSQL doesn't always require a globally distributed architecture. Sometimes you need reliable failover and replication within a single datacentre or region. pgEdge Enterprise Postgres handles this scenario with a production-ready PostgreSQL distribution that includes the tools you need for high availability out of the box.<h2>What You Get</h2>pgEdge Enterprise Postgres bundles PostgreSQL with components you'd otherwise install and configure separately. The distribution includes pgBouncer for connection pooling, pgBackRest for backup and restore, pgAudit for audit logging, and pgAdmin for database management. You also get PostGIS and pgVector if you need geographic or vector comparison capabilities.High availability support is added in through the addition of open-source PostgreSQL extensions from pgEdge (all available to work with on GitHub), like <a href="https://github.com/pgEdge/spock"><u>spock</u></a>, <a href="https://github.com/pgEdge/lolor"><u>lolor</u></a>, and <a href="https://github.com/pgEdge/snowflake"><u>snowflake sequences</u></a>.The package supports PostgreSQL versions 16, 17, and 18 running on Red Hat Enterprise Linux v9 and v10 (including Rocky, Alma, and Oracle Enterprise) on both x86 and ARM architectures. You can currently deploy it as a VM, and we'll be adding container and managed cloud editions soon (keep an eye on our social channels for updates along the way).<h2>High Availability Through Physical and Logical Replication</h2>pgEdge Enterprise Postgres supports physical replication with automated failover for traditional high availability setups. This works for the standard scenario where you need a primary database with standby replicas ready to take over if the primary fails.For more advanced scenarios, Spock comes bundled out-of-the-box to enable logical multi-master replication. Unlike physical replication that copies entire database clusters at the block level, Spock uses PostgreSQL's logical decoding to replicate individual table changes between nodes. This enables active-active deployments where multiple nodes can accept writes simultaneously.<h2>Spock: Multi-Master Replication for High Availability</h2>Spock provides multi-master replication for PostgreSQL 15 and later. The extension comes pre-integrated with pgEdge Enterprise Postgres, built against a patched 100% standard PostgreSQL installation that integrates necessary hooks directly into the database engine.<h3>Core Capabilities</h3>Spock replicates data between nodes using logical decoding. You can configure provider and subscriber nodes, add tables to replication sets, and create subscriptions that keep data synchronised across multiple PostgreSQL instances. The extension supports replicating tables from the same schemas across nodes.The extension tracks commit timestamps for conflict resolution. When multiple nodes update the same row simultaneously, Spock uses these timestamps to determine which change takes precedence. This is important for high availability because you can continue processing transactions on any available node without waiting for failover procedures.<h3>Beyond Table Data</h3>Spock integrates with additional extensions included in pgEdge Enterprise Postgres:LOLOR (Large Object Logical Replication) extends Spock to replicate PostgreSQL large objects. Standard logical replication doesn't handle large objects, but LOLOR adds this capability so you can replicate binary data stored using PostgreSQL's large object facility.Snowflake Sequences handles sequence replication across nodes. In a multi-master setup, you need sequences that won't generate conflicting values across different nodes. This extension provides sequences that work correctly in distributed scenarios where multiple nodes generate IDs simultaneously.<h3>Schema Requirements</h3>Spock requires tables to have identical structures across all nodes. Tables must have the same names, schemas, column definitions, and primary keys. Check constraints and not null constraints need to be the same or more permissive on the subscriber than on the provider.This strictness exists because logical replication applies changes based on row identity through the primary key. The extension needs matching schemas to reliably identify and update rows across nodes.<h3>High Availability Without Downtime</h3>With Spock configured across multiple nodes, you can route traffic to any node in your cluster. If one node fails, your application continues writing to the remaining nodes without downtime. You don't wait for a failover election or for a standby to get promoted to primary.This differs from physical replication where only the primary accepts writes. With Spock, every node can accept writes, giving you true active-active capability within your datacentre or region.<h2>Connection Pooling and Workload Management</h2>pgBouncer handles connection pooling to manage database connections efficiently. This is important for high availability because connection storms during failover events can overwhelm a database. With connection pooling in place, you can limit and manage connections to prevent resource exhaustion when traffic shifts between nodes.The bundled pgBouncer configuration works with the rest of the stack, so you don't need to figure out how to integrate a separate connection pooler with your replication setup.<h2>Backup and Recovery</h2>pgBackRest provides advanced backup and restore capabilities. In a high availability setup, you need reliable backups that work with your replication architecture. pgBackRest handles full and incremental backups, parallel backup and restore operations, and can work across multiple repositories.Having integrated backup tooling means your backup strategy accounts for your replication setup from the start, rather than bolting on backup solutions that might not understand your multi-node configuration.<h2>The Open Source Approach</h2>pgEdge Enterprise Postgres runs on 100% standard PostgreSQL (bar some small patches which do not affect compatibility) with no proprietary forks. The distribution is fully open source, licenced under the permissive PostgreSQL Licence. You can use any extensions or tools from the PostgreSQL ecosystem that you'd like.Need support for PostgreSQL and the included tools in your deployments? pgEdge offers 24x7x365 support subscriptions with access to PostgreSQL experts. Forward Deployed Engineer services are available for organisations that need dedicated assistance with architecture reviews, performance tuning, and ongoing guidance.<h2>When to Use pgEdge Enterprise Postgres</h2>Consider using pgEdge Enterprise Postgres when you need high availability within a single geographic region or datacentre, when you want to avoid managing multiple PostgreSQL installations and extensions separately, or when you need the option to scale to multi-master replication without changing your entire infrastructure.If you value having a tested, integrated stack over assembling individual components yourself, this package is designed for you. The logical replication capabilities through Spock add flexibility that physical replication can't provide, particularly for scenarios where you need multiple writable nodes or want to minimise downtime during maintenance windows.Read more about how to get started in the<a href="https://www.pgedge.com/docs"> </a><a href="https://docs.pgedge.com/enterprise/"><u>official pgEdge Enterprise Postgres docs</u></a><a href="https://docs.pgedge.com/enterprise/">,</a> or check out the<a href="https://github.com/pgEdge"> </a><a href="https://github.com/pgEdge"><u>pgEdge GitHub page</u></a> to browse our code repositories. Have any questions along the way? You can join our official Discord community channel - we're here to help!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/meeting-high-availability-requirements-in-non-distributed-postgresql-deployments</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,postgres,pgEdge</category>
            <title><![CDATA[When Failure Isn't an Option: Choosing Postgres for Critical Operations]]></title>
            <link>https://www.pgedge.com/blog/when-failure-isn-t-an-option-choosing-postgres-for-critical-operations</link>
            <pubDate>Mon, 22 Sep 2025 04:44:50 GMT</pubDate>
            <description><![CDATA[ <p><i>just about any use case</i> Using Postgres means you get total control over your data and how it’s managed; it’s the ultimate lens for understanding your infrastructure, reducing costs, and optimizing your workload for performance, high availability, and resiliency.With decades of development backing the project and a global community contributing from every industry and background, Postgres has become a solid choice as a <a href="https://www.pgedge.com/PostgresHAsurvey"><u>data management solution for mission critical applications</u></a> across any kind of workload, including geospatial, vector, time-series, IoT, OLTP, and OLAP.Postgres adoption is growing rapidly, and businesses that end up using it often have an “a-ha!” moment that leads them to understand that Postgres really does work for a diverse array of use cases.However, as enterprise demand for scalability and flexibility grows, it’s still common to see concerns arise, like:<h1>Can Postgres Handle Our Uptime Requirements?</h1>A <a href="https://www.pgedge.com/PostgresHAsurvey"><u>survey published July 10, 2025</u></a> from <a href="https://foundryco.com/"><u>Foundry</u></a> focused on the evaluation of "PostgreSQL Usage in Mission-Critical Operations: From High Availability to Cloud Outages". 212 IT professionals working at companies (with 500+ employees) using Postgres were surveyed on their use of Postgres in development and production deployments.The conclusion:  Postgres can be adapted to handle workloads that are in the terabytes in real-time, with fault tolerance, consistency, and availability. So much so, that in the survey it was found that the vast majority (62%) of organizations using Postgres have a hard requirement that there can be no more than four minutes of downtime a month (99.99%). 24% actually required that there be <i>less than 30 seconds a month of downtime</i> (99.999%).These results show Postgres is trusted by companies with extremely high availability standards to handle their data, across industries like FBSI, Software & Computing, and Manufacturing.Beyond the survey, you'll find Postgres powering well-known emerging platforms like Mastodon, established services like Groupon and Trivago, financial services companies like Revolut, and countless government institutions and international banks.  Even the internet-based grocery delivery service <a href="https://www.infoq.com/news/2025/08/instacart-elasticsearch-postgres/"><u>Instacart recently announced</u></a> they chose to switch from Elasticsearch to PostgreSQL and saw “nearly 80% savings on storage and indexing costs, reduced dead-end searches, and [overall improved] customer experience.”The common thread? These organizations chose Postgres not because it was free, but because it delivered the reliability, performance, and scalability their business demanded.<h1>Choosing a Postgres HA Solution</h1>Postgres is 100% free-and-open-source (under its own licensing). As a result, the Postgres ecosystem is vast - you should leverage it to make the most of the power of Postgres. Many distributed Postgres extensions are out there, but only a few are in complete alignment with the open-source core of Postgres, ensuring you’re not subject to unexpected limitations. (<i>A great resource for comparing some of the options is </i><a href="https://pgscorecard.com/"><u>PGScorecard</u></a><i>.</i>)The Foundry survey shows organizations are employing many different solutions for database failover and redundancy management. Of the solutions available:<ul><li>41% are Built-in cloud provider solutions.</li></ul><ul><li>33% are commercial high availability products.</li></ul><ul><li>29% are open-source (including Patroni, CloudNativePG, repmgr, and pg_autofailover) or are custom-built.</li></ul>But as with anything, it’s important to compare solutions and know the drawbacks.<h1>Cloud Failures Do Happen</h1>41% of solutions for handling Postgres failover and redundancy management are built into cloud provider solutions; it is worth noting that 21% of survey respondents directly experienced cloud region failures in the past 12 months that <i>exceeded downtime goals</i>. This problem is specific to the cloud and the nature of how it operates; if uptime is a hard requirement for your organization, you should consider the implementation of solutions such as multi-cloud or multi-region deployments.Among organizations with built-in cloud solutions, the list of cloud providers is topped by AWS.<ul><li>AWS RDS: 55%</li></ul><ul><li>AWS cross-region backups: 55%</li></ul><ul><li>AWS Aurora Global Database: 45%</li></ul><ul><li>Azure Cosmos DB: 29%</li></ul><ul><li>Google Cloud SQL: 24%</li></ul><ul><li>Other cloud provider technologies: 12%</li></ul><h1>Postgres has the Ecosystem Advantage</h1>Postgres extensibility is what separates it from other commercial options on the market. Extensibility means you're not locked into a single vendor's vision of your database requirements.<ul><li>Need to add time-series capabilities? TimescaleDB extends Postgres without breaking compatibility.</li></ul><ul><li>Want full-text search? Postgres' built-in features rival dedicated search engines.</li></ul><ul><li>Interested in vector similarity search? pgvector (and many others, including pgvectorscale and pgai) handles AI workloads at scale.</li></ul>But here's the critical part: stick with extensions that maintain Postgres compatibility. Avoid solutions that require proprietary SQL syntax or lock you into specific deployment patterns. The power of Postgres is that it remains Postgres, regardless of how you extend it.For distributed deployments and high availability scenarios, solutions like pgEdge provide multi-master replication while maintaining full Postgres compatibility. You get the benefits of a distributed system without the vendor lock-in or learning curve of platform-specific alternatives.<h1>Enterprise-Grade Availability</h1>If you have an enterprise or project with a use case for 99.99% of high availability and above, Postgres can deliver on your requirements. pgEdge Enterprise Postgres is a great example of a robust Postgres ecosystem that delivers low latency, ultra-high availability, redundancy, and reliability. When you consider a solution like <a href="https://pgedge.com"><u>pgEdge</u></a>, you'll find you can run fully open, fully distributed Postgres with advantages over other database alternatives, such as Oracle's GoldenGate, CockroachDB, or AWS RDS. pgEdge Enterprise Postgres comes without vendor lock-in or the steep learning curves associated with platform-specific SQL syntax, or other obstacles to seamless integration.<h1>More Than Just "Good Enough"</h1>Postgres is no longer just "good enough"; it does so much more, and the thriving community behind the project ensures that it will continue to adjust to handle modern day use cases for years to come. Have specific requests or things you'd like to see implemented in the project? Consider contributing time and/or code to the Postgres project and be the change you want to see.The question isn't whether Postgres can handle your critical workloads; companies across every industry have already proven it can. The question is whether you're ready to make the move.When failure truly isn't an option, Postgres delivers. The technology is robust, the community is strong, the ecosystem is growing, and the operational costs are predictable. What's taking you so long to make the switch and join us?</p> ]]></description>
            <guid>https://www.pgedge.com/blog/when-failure-isn-t-an-option-choosing-postgres-for-critical-operations</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL</category>
            <title><![CDATA[Scaling Without Stopping: Inside pgEdge Distributed Postgres Zero-Downtime and Exception-Resilient Replication]]></title>
            <link>https://www.pgedge.com/blog/scaling-without-stopping-inside-pgedge-distributed-postgres-zero-downtime-and-exception-resilient-replication</link>
            <pubDate>Thu, 21 Aug 2025 06:16:00 GMT</pubDate>
            <description><![CDATA[ <p>pgEdge Distributed Postgres v25 introduces several new features that make managing and scaling distributed PostgreSQL clusters dramatically easier and more resilient. Among these, zero-downtime node addition and a new Apply-Replay mechanism for replication exception handling stand out for their ability to improve operational efficiency and system stability, allowing teams to scale and more effectively handle a wide range of runtime scenarios with minimal disruption.<h2>What Is the Spock Extension?</h2>Spock is pgEdge’s advanced logical replication extension for PostgreSQL. It powers active-active, multi-master clusters with support for row filtering, column projection, conflict handling, and more. Spock is a core component underpinning both the self-hosted pgEdge Distributed postreSQL: VM Edition and the managed pgEdge Distributed postreSQL: Cloud Edition from pgEdge.While Spock is descended from earlier projects like pgLogical and BDR 1, it has evolved far beyond its roots. Backed by a dedicated team of PostgreSQL experts, Spock has undergone continuous innovation and improvement to become a high-performance, enterprise-grade replication system built for distributed environments.<h2>Feature Spotlight: Zero-Downtime Node Addition</h2>Adding a new node to a live cluster has historically meant a tradeoff between downtime and complexity. In Spock 5.0, zero-downtime node addition eliminates this tradeoff entirely.This feature allows you to add a new PostgreSQL node to an existing Spock cluster without requiring any downtime on the origin or existing subscriber nodes. The feature works by creating a temporary replication slot and subscription, and allowing the new node to clone the origin’s state in parallel. Once synchronization is complete, the temporary slot is retired, and the new node is promoted to a fully active peer.<h3>Benefits of Zero-Downtime Node Addition:</h3><ul><li>No service interruption or replication pause</li></ul><ul><li>Safe scaling of production clusters</li></ul><ul><li>Minimal manual intervention</li></ul><ul><li>Works with the standard Spock CLI or via scripted workflows</li></ul>This workflow is based on a coordinated process that ensures the new node joins the cluster cleanly and consistently:<ol></ol>This approach ensures a seamless and accurate integration of the new node without interrupting activity in the existing cluster.This same workflow can be used to perform seamless, in-place major PostgreSQL version upgrades across your entire cluster. By introducing a new node running the desired higher version of PostgreSQL, and following the coordinated steps for data synchronization and slot management, you can bring an updated node into the cluster without interrupting read or write traffic. Once the new node is in place, older-version nodes can be removed or replaced one at a time, performing a rolling upgrade with zero downtime. This approach provides a safe, flexible method for upgrading infrastructure while maintaining full application availability.For users ready to implement zero-downtime node addition in their own environment, the <a href="https://docs.pgedge.com/spock_ext/modify/zodan_tutorial">Spock documentation</a> offers a step-by-step guide to the full process.To further support implementation, the Spock GitHub repository provides several working examples:<ul><li>A Python-based orchestration script, designed to run outside the database and coordinate the node addition via external automation tools.</li></ul><ul><li>A stored procedure version that performs the entire process within PostgreSQL itself using the dblink extension, offering a fully internal option ideal for controlled or restricted environments.</li></ul>You can explore these examples in the <a href="https://github.com/pgEdge/spock/tree/main/samples/Z0DAN">samples/Z0DAN directory</a><a href="https://github.com/pgEdge/spock/tree/main/samples/Z0DAN">.</a>For more complex or scripted rollouts, pgEdge also provides spockctrl, a lightweight command-line orchestrator written in C. Spockctrl accepts a structured JSON plan that defines the sequence of operations to perform, including a sample JSON file demonstrating how to add a new node. Both the tool and sample configurations can be found in the<a href="https://github.com/pgEdge/spock/tree/main/utils/spockctrl"> </a><a href="https://github.com/pgEdge/spock/tree/main/utils/spockctrl">spockctrl directory</a><a href="https://github.com/pgEdge/spock/tree/main/utils/spockctrl">.  </a><h2>Ensuring Seamless Node Addition with LSN Checkpointing</h2>pgEdge’s Spock 5.0 extension includes functions that make seamless node addition with zero downtime possible.  LSN checkpointing (using the spock.sync_event() and spock.wait_for_sync_event() functions) allows you to create a logical checkpoint in the WAL stream on the source node, and then monitor another node for the arrival of that checkpoint's LSN to ensure that all transactions have completed.  When adding a node, you can use this to guarantee that schema or data changes have been fully replicated to your source node before you continue.Used in the context of zero-downtime node addition, these functions are critical. When you have confirmed that any in-flight transactions (from all nodes) have arrived on the designated source node, you can initiate a data copy to the new node.  Without this precise synchronization, adding a node without interrupting cluster usage would not be possible; Spock's checkpointing guarantees safety and consistency as part of the overall orchestration strategy.<h2>Feature Spotlight: Apply-Replay for Exception Handling</h2>Spock has always been more resilient than earlier logical replication solutions. Neither pgLogical nor BDR 1 included any automatic error recovery. Spock, by contrast, introduced automated exception handling early in its lifecycle.<h3>Exception Handling in Early Spock Versions</h3>nitially, when a data conflict occurred during replication that couldn't be automatically resolved, the apply worker would restart itself, re-request the transaction from the origin, and continue from there. This was already a step above legacy approaches. Spock also allows you to configure how these exceptions are handled: you can pause replication and discard the problematic transaction, or Spock can step through each sub-transaction to isolate only the part of the transaction that caused the exception.Although better than the other open source solutions, this exception handling process came at a cost:<ul><li>The apply worker had to terminate and restart</li></ul><ul><li>The transaction had to be re-fetched over the network</li></ul><ul><li>XID resources were consumed during replay</li></ul><ul><li>The overall process </li><li>could introduce some</li><li> replication lag and resource overhead</li></ul><h3>Taking a New Approach in Spock 5.0</h3>With Spock 5.0, this process has been vastly improved via a new “Apply-Replay” mechanism.Now, when a replicated transaction encounters an exception, the apply worker does not terminate. Instead, Spock buffers transactions in memory up to a default of 4MB (configurable via a new GUC: spock.exception_replay_queue_size). If an exception occurs, the apply worker enters exception-handling mode and simply replays the buffered transaction from memory.  <h3>Benefits of Apply-Replay:</h3><ul><li>No worker restart required</li></ul><ul><li>No need to re-fetch from origin</li></ul><ul><li>Dramatic reduction in lag during exception handling</li></ul><ul><li>Reduced XID usage and less WAL churn</li></ul>This enhancement significantly improves replication stability, especially in high-throughput environments or under intermittent network conditions where transient data conflicts could previously cause substantial delays.In a synthetic benchmark test involving 10,000 intentionally triggered conflicts that would cause an exception, the Apply-Replay feature resolved all exceptions in just 3 to 4 seconds—compared to over 5 minutes using the previous approach. This represents a dramatic leap in both speed and efficiency for exception handling in distributed PostgreSQL clusters.Large transactions that exceed the memory size will still use the old process, but we are currently at work making those transactions that exceed the allotted memory to be written to, and replayed, from disk, allowing us to completely retire the previous approach.<h2>Why These Features Matter</h2>Both zero-downtime node addition and Apply-Replay reflect a core philosophy behind Spock 5.0: to eliminate avoidable disruption and make high-availability PostgreSQL truly hands-off at scale. These improvements:<ul><li>Reduce operational complexity</li></ul><ul><li>Lower the risk of human error</li></ul><ul><li>Improve cluster elasticity and resilience</li></ul><ul><li>Enable real-time distributed applications to stay online and in sync</li></ul><ul><li>Truly allow full zero-downtime for reads and writes across all nodes in a cluster while expanding or upgrading.</li></ul><h2>Final Thoughts</h2>The pgEdge Spock 5.0 extension isn’t just a version bump—it is a major change to how robust, fast, and easy logical replication can be. Whether you’re managing global clusters or network edge deployments, the new features highlighted here will help your team scale smarter and operate more effectively. And these are just part of a broader set of enhancements: <a href="#v50-on-july-15-2025"><u>Spock 5.0 also includes</u></a> improved automatic conflict resolution that can now handle more conflict scenarios without user intervention, along with other performance and usability upgrades that make it the most capable version yet.Spock 5.0 is available now as a fully integrated component of both the self-hosted pgEdge Distributed Postgres: VM Edition 25.2 and the managed SaaS pgEdge Distributed Postgres: Cloud Edition offering. Whether you’re running your infrastructure on-premises or in the cloud, you can take advantage of the powerful features described in this post, plus many others, to build fast, resilient, globally distributed PostgreSQL applications.To learn more about pgEdge’s distributed multi-master replication technology, visit the <a href="https://www.pgedge.com/solutions/benefit/multi-master">pgEdge website</a> or explore the <a href="https://docs.pgedge.com/">pgEdge documentation</a><a href="https://docs.pgedge.com/">.</a></p> ]]></description>
            <guid>https://www.pgedge.com/blog/scaling-without-stopping-inside-pgedge-distributed-postgres-zero-downtime-and-exception-resilient-replication</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL,pgEdge,PostgreSQL High Availability,pgEdge,postgres,PostgreSQL</category>
            <title><![CDATA[pgEdge Distributed PostgreSQL Now Available on Akamai Cloud]]></title>
            <link>https://www.pgedge.com/blog/pgedge-distributed-postgresql-now-available-on-akamai-cloud</link>
            <pubDate>Wed, 18 Jun 2025 13:47:47 GMT</pubDate>
            <description><![CDATA[ <p>Today your applications face unprecedented demands: they must be always-on, globally responsive, and capable of serving users anywhere with the option of meeting complex data residency requirements. For web and AI applications that rely on PostgreSQL, we're excited to announce that the <a href="https://www.pgedge.com/landing-pages/platform-download-for-akamai"><u>pgEdge Distributed PostgreSQL platform is now available on Akamai Cloud</u></a> (formerly Linode), a cloud vendor that "brings core cloud computing and edge computing together, along with industry-leading security — all on the most distributed network on the planet." Together, pgEdge and Akamai Cloud create a powerful solution, bringing performance and availability for database infrastructure to the network edge.<h2>Providing Consistency to Akamai Cloud with pgEdge</h2>While <a href="https://www.linode.com/"><u>Akamai</u></a> has long made it easy for developers to place their applications close to users through their extensive global infrastructure (with over 4,100 points of presence), many databases have remained centralized, creating latency bottlenecks and single points of failure. Now, with pgEdge running on Akamai Cloud, you can <a href="https://docs.pgedge.com/#deployment-options"><u>deploy distributed active-active multi-master PostgreSQL databases</u></a> at or near the edge to ensure your applications deliver consistently fast performance regardless of where your users are located.<h3>How pgEdge Makes a Difference</h3>pgEdge is a fully distributed PostgreSQL database optimized for high availability and low latency. As a true multi-master (active-active) distributed database, pgEdge facilitates read and/or write operations on every node in a cluster, providing several key advantages:100% Standard PostgreSQL: pgEdge <a href="https://pgscorecard.com/"><u>maintains full compatibility with PostgreSQL</u></a>, including complete language support, triggers, stored procedures, all data types, functions, operators, and the full SQL syntax. This means you can leverage existing PostgreSQL expertise and tools without modification.100% Open (Source Available): Our source code is completely open and available for review, ensuring transparency and security for deployments everywhere.Write Anywhere Architecture: Unlike traditional read-replica setups, pgEdge's multi-master architecture allows you to write to any node in your cluster. Our logical replication system keeps nodes synchronized while providing automatic conflict resolution and load distribution.Fault Tolerance: If one data center experiences an outage—whether from infrastructure failure or something as simple as a severed fiber optic cable—when applications are configured for zero-downtime operations, traffic can be automatically rerouted to another database node for optimal resilience.Interested in learning more about pgEdge on your own? Don’t forget, you can:<ol></ol><h2>Real-World Impact</h2>Customers, ranging from startups to Fortune 500 companies and governmental organizations like the European Parliament, leverage pgEdge's flexibility to address diverse use cases such as:<ul></ul><ul></ul><ul></ul><ul></ul><h2>AI at the Edge</h2>By distributing your PostgreSQL databases globally, you can:<ul><li>Share state and context across geographic locations</li></ul><ul><li>Maintain data locality for compliance requirements</li></ul><ul><li>Reduce response times for AI inference requests</li></ul><ul><li>Scale AI workloads efficiently across regions</li></ul>One of the most compelling use cases for pgEdge on Akamai Cloud is enabling AI at the edge. This distributed data layer opens up additional flexibility for handling complex AI use cases in a performant manner while maintaining application availability.<h2>pgEdge Deployment Options for Akamai Cloud</h2>pgEdge is certified to run on Akamai Cloud with flexible self-hosted <a href="https://docs.pgedge.com/#deployment-options"><u>deployment options</u></a>:<ul></ul><ul></ul><h2>Looking to the Future</h2>The combination of pgEdge's distributed PostgreSQL capabilities and Akamai Cloud's global infrastructure represents a fundamental shift in how we think about database architecture. Instead of accepting the latency and availability trade-offs of a centralized database, you can deliver consistently fast, reliable data access to users anywhere in the world, self-hosting your instances with vendors like Akamai Cloud or receiving fully managed services with <a href="https://www.pgedge.com/products/pgedge-cloud"><u>pgEdge Cloud</u></a>.Whether you're building the next generation of web applications, deploying AI workloads at scale, or modernizing existing systems for global reach, pgEdge on Akamai Cloud provides the foundation you need to succeed.Ready to get started? Head to the <a href="https://www.pgedge.com/landing-pages/platform-download-for-akamai"><u>Platform Download for Akamai page</u></a> to start deploying distributed PostgreSQL on Akamai Cloud, <a href="https://www.pgedge.com/contact"><u>contact our team</u></a> to discuss your specific use case and requirements, or <a href="https://pages.pgedge.com/schedule-demo"><u>schedule a live demo</u></a> to see how it all works. Stay tuned for our upcoming tutorial on setting up pgEdge specifically with Akamai Cloud, coming soon!</p> ]]></description>
            <guid>https://www.pgedge.com/blog/pgedge-distributed-postgresql-now-available-on-akamai-cloud</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>PostgreSQL</category>
            <title><![CDATA[Unlocking The Power of Multi-Master: 7 Migration Design Considerations]]></title>
            <link>https://www.pgedge.com/blog/unlocking-the-power-of-multi-master-7-migration-design-considerations</link>
            <pubDate>Mon, 14 Apr 2025 06:25:09 GMT</pubDate>
            <description><![CDATA[ <p><h2>From Sundials to Chronometers: The Shift to Multi-Master</h2>Making the move from a traditional single-master database to a multi-master system is like trading in your sundial for a marine chronometer. A sundial is simple, reliable… and completely dependent on a single source of truth (the sun overhead). It works, but only if conditions are perfect and you're standing in the same place. A chronometer, on the other hand, lets you navigate the open seas, across longitudes, giving you freedom you never had before, but it demands precision, discipline, and an entirely new way of thinking about time.The same is true when moving from a single-master (single-write) database to a multi-master (also known as active-active) database. You gain the benefits of global availability, reduced latency, and higher resilience, but you are also now dealing with changes happening simultaneously across space ( on distributed nodes). To take complete advantage of a multi-master replication cluster, you may want to make some fundamental changes to your original underlying schema design, application distribution, and possibly data residency.Switching to a multi-master architecture isn’t just about changing how you replicate data, it’s about unlocking a fundamentally new capability for your applications. This blog outlines the limitations of a centralized system and how to move to a globally distributed, highly available writes and reads system.<h2>Multi-Master vs Single-Master: The Big Picture</h2>In a single-master database, there is only ever one source of truth at any given moment. Every , , or  happens in a predictable, linear timeline. With multi-master replication, you are working with multiple timelines converging, synchronizing, and sometimes even colliding.This post covers a few of the most immediate and major considerations to help get you started.<h2>Why Every Table Needs a Primary Key In Multi-Master Replication</h2>Primary Keys aren’t just best practice, they are essential.  In a multi-master system, replication depends on knowing exactly which row is which, even when written to different nodes at the same time. Without a primary key, your replication system has no guaranteed way of identifying and synchronizing rows correctly.Every replicated table should have a unique primary key. If you're missing them, you’re going to want to add them. BUT!  Read on before you do.<h2>Uniqueness Across Space: Generating Unique Primary Key IDs Safely</h2>In a single-master system, simple auto-incremented sequences or UUIDs often suffice. On a multi-master cluster, they can become dangerous. Imagine two nodes both inserting new rows at exactly the same time — using the same sequence starting at 1. You'll end up with duplicate primary keys and immediate replication conflicts.Here’s a hypothetical scenario:A replicated table of “customers” currently has 200 rows in it, with primary key IDs from 1 - 200.<ul><li>At 12:00 PM, Node A in Germany creates a new customer row locally, with the next auto-assigned ID of 201, with a customer name of “John Smith”</li></ul><ul><li>Simultaneously, Node B in California also creates a new row locally, also with the ID of 201, for a customer with the name of “Jane Doe”</li></ul>Both new rows are then replicated to each other.  Each node receives the other row, but they both have the same primary key of 201. So, is 201 meant to be John Smith, or Jane Doe?The solution is to use an ID generation strategy designed for distributed systems.  There are a few ways this could be approached, such as using node-specific ID ranges or UUIDs (but be cautious of index bloat).  For pgEdge, we recommend using Snowflake sequences.  A Snowflake sequence is a globally unique, time-ordered ID that is unique across nodes. <a href="https://docs.pgedge.com/platform/advanced/snowflake"><u>You can read more about them in the pgEdge documentation</u></a>.Snowflake sequences are composite values that let you:<ul><li>add or modify data in different regions while ensuring a unique transaction sequence.</li></ul><ul><li>preserve unique transaction identifiers without manual/administrative management of a numbering scheme.</li></ul><ul><li>accurately identify the order in which globally distributed transactions are performed.</li></ul>pgEdge provides <a href="https://docs.pgedge.com/platform/advanced/snowflake#example-converting-an-existing-sequence"><u>functions to conver</u></a><a href="https://ace.pgedge-docs-sandbox.pages.dev/platform/advanced/snowflake#example-converting-an-existing-sequence"><u>t</u></a> your PostgreSQL sequence key field to a Snowflake sequence field. After converting a table to use a Snowflake sequence, old keys remain in the original format, but new keys are a unique composite value that contains information about the row.  For example, in the following query, you can see a mix of PostgreSQL-style sequences and Snowflake sequences:In the first column, you can see that the id assigned to each new row changes from a simple value to a more complex Snowflake sequence after the first seven rows - that change indicates the point at which the table was converted to use a Snowflake sequence for its primary key.Since a Snowflake sequence is a composite value, it provides a bonus; you can use a Snowflake function to extrapolate information from each unique id; for example:<h2>Multi-Master Conflict Management</h2><h3>What are Conflicts and Why Do They Happen?</h3>In a multi-master system, conflicts arise when two nodes make concurrent changes to the same data before they have a chance to synchronize.  Examples include:<ul></ul><ul></ul><ul></ul>In a single-master system, like a single sundial, the linear timeline is enforced by WRITE transactions happening on a single node in the order in which they occur - conflicts can't happen. In a multi-master system however, like the ships roaming the seas with chronometers that must stay in sync with Greenwich Mean Time, there are now multiple, simultaneous writers that are geographically distributed. This means there is replication lag that must be dealt with.<img src="https://a.storyblok.com/f/187930/1600x792/5da9fe6f8a/design_consideration.png" >Timeline:Without timestamps: Write X might incorrectly overwrite Write Y.With timestamps: Write Y wins, as it happened later.<h3>Approaches to Conflict Management</h3><h4>Resolution: Handling Conflicts After They Happen</h4>The most common approach is to detect and resolve conflicts when synchronizing, through the use of accurate timestamps<ul><li>Last-Update-Wins: The change with the latest timestamp wins.</li></ul><ul><li>Insert-Insert conflict resolution: When two inserts collide, the one with the latest timetamp is converted into a full-row update, ensuring no data is lost.</li></ul>This makes accurate clocks critical. pgEdge ensures this by maintaining a monotonically increasing logical clock on each node, preventing clock drift from causing inconsistent conflict resolution.<h4>Avoidance: Designing to Prevent Conflicts</h4>A more elegant approach is, wherever possible, to avoid conflicts entirely through smart data modeling. Two of the main approaches here are:<ul><li>CRDTs (Conflict-Free Replicated Data Types): Data structures designed to automatically merge without conflict.</li></ul><ul><li>Immutable Data Patterns: Prefer insert-only or append-only models where possible.</li></ul><h2>Summed-Value Fields Need Special Handling - The Delta_Apply CRDT </h2>In pgEdge, one of the most practical tools for multi-master conflict avoidance is the  mechanism. Instead of sending the final value, pgEdge can replicate the change itself (the delta), allowing each node to apply the adjustment rather than overwrite the value.Think about values that are naturally summed over time:<ul><li>Bank account balances</li></ul><ul><li>Inventory quantities</li></ul><ul><li>Game scores</li></ul>If Node A and Node B both adjust the same field concurrently, which value should win? If you rely on simple "last write wins" logic, you'll lose data.By replicating deltas instead of full values for numeric fields, you remove the possibility of overwriting concurrent increments or decrements. This is especially important for fields like balances, counters, and inventory levels.This ensures that all concurrent adjustments are merged correctly, rather than just overwritten. This one function can handle all numeric column types, without requiring any additional schema changes.<h3>Why delta_apply Matters: A Concrete Example</h3>Imagine you are managing a simple bank account balance replicated across two nodes (Node A and Node B). The account starts with a balance of $100.<h4>Without delta_apply:</h4>Each node independently updates its local copy of the balance based on concurrent transactions.In a multi-master system, as replication occurs each node will try to overwrite the other node’s balance with its own final value:<ul><li>Node A will send: balance = $70</li></ul><ul><li>Node B will send: balance = $150</li></ul>Depending on conflict resolution (typically Last-Update-Wins), you might end up with either $70 or $150, but the correct value should have been: $100 - $30 + $50 = $120<img src="https://a.storyblok.com/f/187930/1600x631/8d5c152b3f/design_consideration2.png"><h4></h4><h4>With delta_apply enabled:</h4>When the system replicates the transactions, it no longer tries to send full values. Instead, it sends deltas:<ul><li>Node A sends: delta = -30</li></ul><ul><li>Node B sends: delta = +50</li></ul>The receiving nodes will now apply both changes, no matter the order:$100 - $30 + $50 = $120<img src="https://a.storyblok.com/f/187930/1600x631/3609f8a087/design_consideration3.png" >This is why  is essential for fields that store a summed value. It preserves correctness even when updates happen concurrently across nodes.<h2>Rethinking Backup & Restore in a Multi-Master World</h2>In a single-master (or primary-replica) database, backup and restore are relatively straightforward:<ul><li>You take a snapshot (base backup + WAL logs) from the primary node.</li></ul><ul><li>You can restore this snapshot to a new replica or a replacement primary.</li></ul>But in a multi-master system, things are more complex, because:<ul><li>There is no single "source of truth" — all nodes are simultaneously authoritative.</li></ul><ul><li>The replication state (including sequence numbers, logical replication positions, and timestamps) is part of the system’s integrity.</li></ul><ul><li>Restoring from an old backup can cause immediate replication conflicts or inconsistencies if not done carefully.</li></ul><h3>Why Standard Backups Can Go Wrong</h3>Imagine restoring a node from a snapshot that is 1 hour old:<ul><li>The node will start with stale data and outdated replication state.</li></ul><ul><li>Upon reconnecting, it may replicate old changes as if they were new, or it may incorrectly attempt to overwrite more recent updates from other nodes.</li></ul><ul><li>Worse, it can trigger primary key conflicts or timestamp regressions that violate system integrity.</li></ul><h3>Principles of Multi-Master Backup & Restore</h3><h4>Coordinated Backups</h4>If you're using multi-master replication, your backups should be taken from all nodes (or at least a designated consistent set) at the same logical point in time, not just one node. This ensures you can rebuild the whole cluster without conflicting histories.<h4>Consistent Restore</h4>You cannot restore just one node in isolation and let it rejoin an existing cluster unless you are certain that:<ul><li>the backup is recent enough to be safely replayed.</li></ul><ul><li>the logical replication state is reconciled with the other nodes.</li></ul><h4>Node Replacement Strategy</h4>If you lose a single node, it’s generally safer to:<ul><li>Remove the failed node from the cluster.</li></ul><ul><li>Deploy a fresh node from a recent backup.</li></ul><ul><li>Let the new node perform a sync from a healthy peer to catch up.</li></ul>This avoids introducing stale data back into the cluster.<h2>Rethinking Application Connectivity for Multi-Master</h2>When using a traditional single-master replicated database, application connection patterns typically look like this:<ul></ul><ul></ul>In this model, developers often hard-code or configure their applications to direct all write traffic to a specific host (the master) and distribute read-only traffic across replicas.<h3>What changes in Multi-Master?</h3>In a multi-master system like pgEdge:<ul><li>Every node is writable, not just readable.</li></ul><ul><li>Each node participates equally in accepting writes and replicating them globally.</li></ul><ul><li>This means you can now:</li></ul><ul><li>Write locally, significantly reducing round-trip latency.</li></ul><ul><li>Still read locally, as before.</li></ul><h3>Why This Matters</h3>If you continue using a single-master connection strategy:<ul><li>You may still be sending writes across the globe unnecessarily.</li></ul><ul><li>You might not fully benefit from the performance improvements of using local writes.</li></ul><ul><li>You are underutilizing the core feature of multi-master replication.</li></ul><h3>Practical Connection Changes You May Need</h3><h4>Topology-Aware Connection Strings</h4>Configure applications to connect to the nearest node (e.g., via DNS, load balancer, or topology-aware connection string). By doing this, your applications gain the benefits of low-latency for both reads and writes.<h4>Connection Pool Adjustments</h4>In some frameworks, connection pools may be tuned under the assumption that writes are slow due to network latency introduced by remote write transactions. You may want to revisit timeouts, pool sizes, and retry logic now that writes can be local and fast.<h4>Multi-Region Application Awareness</h4>In multi-region deployments, you may want each regional deployment of your application to connect to its co-located pgEdge node:<ul><li>Region A app → Region A database node</li></ul><ul><li>Region B app → Region B database node</li></ul><h4>Failover Considerations</h4>Since all nodes are writable, application failover logic may also be simplified. Instead of failing over to a remote master, or being stuck waiting for a physical standby to be promoted, you may simply redirect the connection to another nearby writable node.Example: Before vs AfterBy adapting your application’s connection strategy, you unlock one of the biggest practical benefits of pgEdge and multi-master replication: low-latency writes anywhere.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/unlocking-the-power-of-multi-master-7-migration-design-considerations</guid>
            <author><name>Antony Pegg</name></author>
            </item>
            <item>
            <category>pgEdge,PostgreSQL</category>
            <title><![CDATA[Scaling AI Inference at the Edge]]></title>
            <link>https://www.pgedge.com/blog/scaling-ai-inference-at-the-edge</link>
            <pubDate>Thu, 20 Feb 2025 06:15:00 GMT</pubDate>
            <description><![CDATA[ <p>In the rapidly evolving landscape of artificial intelligence (AI), the demand for real-time data processing has never been more critical. Traditional cloud-based AI inference often introduces latency by transmitting data to centralized servers for analysis. In a distributed world, a centralized inference pipeline becomes a bottleneck. This delay can be detrimental in applications requiring immediate responses, such as autonomous vehicles, production line monitoring or real-time analytics. This challenge can now be addressed by moving AI inference closer to where the inference is used, distributing AI across your network to local devices near the data source. This approach significantly reduces latency, enhances data security, and improves overall system efficiency.However, implementing AI inference at the edge presents its own set of challenges, particularly concerning data availability and consistency across distributed environments. This is where pgEdge Distributed PostgreSQL becomes indispensable. By providing a distributed PostgreSQL architecture optimized for the network edge, pgEdge ensures that data remains consistently accessible across multiple nodes, even in the face of network issues or hardware failures. This multi-master (active-active) setup guarantees high availability, lowers response times, and facilitates seamless data replication and synchronization, which are crucial for maintaining the integrity of AI models and their inferences.<h2>Key Benefits of pgEdge for AI Applications</h2><h3>High Availability</h3>pgEdge's multi-master (active-active) architecture ensures that read and write operations can occur at any node within a geographically distributed cluster. This design eliminates single points of failure, providing continuous data availability even during maintenance or unexpected outages. This resilience is crucial for AI applications that demand uninterrupted access to data for real-time processing and decision-making.<h3>Distributed Processing</h3>By enabling data to be stored and processed across multiple locations, pgEdge facilitates distributed AI workloads. This allows for parallel processing of large datasets, enhancing the efficiency of tasks such as training machine learning models or executing complex inference algorithms. For instance, in a three-node cluster managing a 900,000-row table, each node can process 300,000 rows concurrently, significantly reducing overall processing time, with the resulting data automatically distributed across all nodes without needing to repeat the computations.  This can be especially valuable adding and maintaining embeddings for large data sets with constant and geographically distributed read/write activity.<h3>Data Consistency Across Nodes</h3>pgEdge employs advanced replication and conflict resolution mechanisms to maintain data consistency across all nodes in an active-active Multi-Master configuration. This ensures that AI models operate on accurate and up-to-date information, which is essential for generating reliable predictions and insights. The platform's support for synchronous read replicas within regions further enhances data integrity, making it a dependable choice for mission-critical AI applications.<h3>Flexibility in Deployment</h3>pgEdge's architecture supports deployment across various cloud regions and data centers, as well as on-premise or in air-gapped deployments. This unparalleled flexibility and resilience is particularly beneficial for AI applications that require scalability and adaptability to different operational environments. By integrating pgEdge into their AI infrastructure, organizations can effectively overcome the data limitations associated with centralized AI inference, thereby achieving faster decision-making processes and enhanced user experiences.<h3>Considerations for Distributed AI Compute</h3>While an high availability, multi-master distributed data environment is an essential foundation of a distributed AI inference implementation, it's only half of the story.  The AI Compute itself also needs to be distributed to realize the full benefits.When a distributed database is integrated with a single, centralized AI compute environment, you can still encounter latency, as all nodes are required to send data to and await responses from the centralized compute resource. This bottleneck undermines the key advantages of a distributed database.<h3>Implementing Localized AI Compute Instances</h3>To mitigate this issue, you can deploy a localized AI compute instance in proximity to each database node. This approach ensures that AI processing, such as vector generation, occurs locally, thereby minimizing latency and reducing the need for data transmission over potentially congested networks. By processing data closer to its source, your system can achieve faster inference times and improved overall performance.<h3>Parallel Processing Through Distributed AI Compute</h3>Distributing AI compute resources across multiple nodes not only alleviates centralized bottlenecks but also enables parallel processing of large datasets. For instance, a smart city project can process sensor data (e.g., traffic, weather, or transport) at nodes near data sources, enabling real-time decisions like rerouting traffic or adjusting bus schedules, with global synchronization of critical updates. This parallelism accelerates data processing and leverages the inherent scalability of distributed systems, leading to more efficient AI workflows.  New data received at one node can be vectorized locally for immediate use by the local application, with other nodes receiving the new or updated embeddings through logical replication.<h3>Integrating AI Compute with Databases</h3>The goal of course is to get the Compute as close to the data, at the point of usage, as possible.  The question is just how close do you want to get?In-Database Processing: Integrating AI capabilities directly within the database using extensions like <a href="https://github.com/postgresml/postgresml"><u>PostgresML</u></a> (PGML) allows for execution of machine learning tasks without the need for an external compute system. This tight integration reduces data movement and can enhance performance for certain workloads, but there are quite a few caveats with this approach.  Extensions like PGML are usually limited to running on very specific OS flavours, and can come with heavy Python library and version dependencies.  The additional AI workload will have a direct impact on hardware capability considerations for the database instance, especially if you intend to use GPU acceleration, as this will also require NVIDIA CUDA libraries.In a distributed environment, other limitations are likely to emerge.  For example, in our local testing and experimentation with PGML, we found that it was using Primary Key Sequences in its model storage. With active-active replication, this will cause duplicate Primary Key conflicts on insert-insert scenarios (where two nodes insert a new row with the same Primary Key value).  To make PGML function in a distributed environment, pgEdge patches it to use our snowflake sequence extension, so that if a <a href="https://huggingface.co/models"><u>HuggingFace</u></a> model was first used on one node, after download, it would replicate successfully to the other nodes. PGML also caches the model as part of its usage, making it necessary to invalidate the cache on the receiver node so that it would be rebuilt and function correctly.Sidecar Deployments: Implementing AI models as sidecar services using frameworks such as ONNX or OLLAMA enables AI processing to occur alongside the database. This configuration offers flexibility, allowing for the use of specialized hardware or software environments tailored to AI tasks while maintaining close proximity to the database. This can be seen in extensions such as <a href="https://github.com/mudler/LocalAI"><u>localAI</u></a> (which uses ONNX), <a href="https://github.com/tembo-io/pg_vectorize"><u>pg_vectorize</u></a>, and <a href="https://github.com/timescale/pgai"><u>TimescaleDB’s PGAI </u></a>(both of whom use OLLAMA), that support remote calls to openAI or calls to a local framework/API. Maintaining a close proximity with the Compute environment also enables the ability to add asynchronous processes to update vectors when the underlying information is added or updated, without having to use triggers to invoke and wait for the return of the updated embedding.<h2>Practical Use Cases</h2>This post provides a rather high level overview, but let's end with a few practical use cases:<h3>Accelerating Vector Search</h3>Vector search is pivotal in AI applications, enabling similarity comparisons essential for recommendation systems and semantic search. pgEdge has integrated the <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> extension, providing efficient storage and querying of vector embeddings directly within a distributed PostgreSQL database. This integration facilitates low-latency, distributed access to embeddings, ensuring that AI-powered search operations are both swift and scalable.<h3>Parallelizing Vectorization of Large Datasets</h3>Handling large datasets is a common challenge in AI workflows. pgEdge's distributed architecture enables an accelerated parallel vectorization of data across multiple nodes. For example, a global e-commerce platform can use a multi-master database to process transaction logs locally on regional nodes, identifying fraud or issues in real-time, while replicating key insights across the cluster for global visibility.<h3>Real-Time Updates to AI Models and Embeddings</h3>In AI applications, especially those involving real-time data processing, the ability to update models and embeddings promptly is crucial. pgEdge's multi-master replication ensures that updates made to AI models or vector embeddings on any node are propagated across the entire cluster in near real-time. This capability guarantees that all nodes operate with the most current data, enhancing the accuracy and reliability of AI-driven insights.<h3>Enhancing AI Inference at the Edge</h3>Deploying AI inference closer to end-users reduces latency and improves responsiveness. pgEdge's support for the pgvector extension allows AI inference and similarity search requests to be processed nearer to users, delivering faster search results regardless of their location.<h3>Implementing Edge AI for Real-Time Analytics</h3>Edge AI enables real-time data processing and analysis without constant reliance on cloud infrastructure. By bringing computation closer to the source of data, edge AI reduces latency, optimizes bandwidth usage, and enables faster decision-making.</p> ]]></description>
            <guid>https://www.pgedge.com/blog/scaling-ai-inference-at-the-edge</guid>
            <author><name>Antony Pegg</name></author>
            </item>    
    
        </channel>
    </rss>