Teaching an LLM What It Doesn't Know About PostgreSQL
Large language models know a remarkable amount about PostgreSQL. They can write SQL, explain query plans, and discuss the finer points of MVCC with genuine competence. But there are hard limits to what any model can know, and when you're building tools that connect LLMs to real databases, those limits become apparent surprisingly quickly.
The core issue is training data. Models learn from whatever was available at the time they were trained, and that corpus is frozen the moment training ends. PostgreSQL 17 might be well represented in a model's training data, but PostgreSQL 18 almost certainly isn't if the model was trained before the release. Extensions and tools from smaller companies fare even worse, because the public documentation, blog posts, and Stack Overflow discussions a model would need to learn from simply don't exist in sufficient volume. And products released after the training cutoff are entirely invisible.
This is the problem we set out to solve with the knowledgebase system in the pgEdge Postgres MCP Server. Rather than hoping the LLM already knows what it needs, we give it a tool that lets it search curated, up-to-date documentation at query time and incorporate the results into its answers. It's RAG, in essence, but tightly integrated into the MCP tool workflow so the LLM can use it as naturally as it would run a SQL query.
Products the LLM has never heard of
To understand why this matters, consider a few of the products whose documentation we index.
Spock is an open source PostgreSQL extension that provides asynchronous multi-master logical replication. It allows multiple PostgreSQL nodes to accept both reads and writes simultaneously, with automatic conflict resolution between nodes. It supports automatic DDL replication, configurable conflict resolution strategies, row filtering, column projection, and cross-version replication for zero-downtime upgrades. Spock grew out of earlier work on pgLogical and BDR2, but has been substantially enhanced since pgEdge first introduced it in 2023.
If you ask an LLM about Spock without any supplementary context, you'll most likely get an answer about the Java testing framework of the same name, or at best a vague and outdated reference to the PostgreSQL extension. The model has no way of knowing about the current configuration syntax, the available conflict resolution modes, or how to set up a multi-node cluster with the latest release. The documentation simply wasn't in its training data, and for a niche product in a specialised corner of the PostgreSQL ecosystem, it never will be in sufficient detail.
The pgEdge RAG Server is another example. It's a Go-based API server for Retrieval-Augmented Generation that uses PostgreSQL with pgvector as its backend, combining vector similarity search with BM25 text matching for hybrid retrieval. The entire product was announced in December 2025 as part of the pgEdge Agentic AI Toolkit, which means any model trained before that date knows nothing about it whatsoever.
The same applies to other pgEdge components like the pgEdge Platform itself, which bundles standard PostgreSQL with Spock replication, the ACE consistency engine, Snowflake Sequences for globally unique IDs, and over twenty popular extensions into a self-managed distributed PostgreSQL distribution. Each of these products has its own documentation covering installation, configuration, and troubleshooting, and none of it is likely to appear in a model's training data with any reliability.
Even PostgreSQL itself presents a moving target. The official documentation runs to thousands of pages and changes with every major release. A model trained on PostgreSQL 16 documentation will give subtly wrong answers about features that were added or changed in version 17 or 18, and it has no way of knowing that its information is out of date.
How we built the knowledgebase
The knowledgebase is built offline by a dedicated builder tool that processes documentation from a variety of sources and stores the results in a SQLite database. The builder supports several input formats, including Markdown, HTML, reStructuredText, DocBook XML, and the SGML format used by the official PostgreSQL documentation. Each format is converted to clean Markdown before chunking, with format-specific handling to preserve the structure of the original content.
The sources themselves can be git repositories or local filesystem paths, which makes the system flexible enough to index far more than just product documentation. For git repositories, the builder clones each one and checks out the appropriate branch or tag for each version. Local paths can point at anything on the filesystem, including exported blog posts, internal support knowledge base articles, or runbooks that your team has accumulated over time. If it can be converted to Markdown, HTML, or one of the other supported formats, it can go into the knowledgebase.
A single configuration file defines all the documentation sources, and we currently index documentation for PostgreSQL versions 14 through 18, several versions of pgAdmin, and a range of pgEdge products including Spock, the RAG Server, the Postgres MCP Server, pgEdge Platform, PostGIS, pgvector, pgBouncer, and pgBackRest. But the same mechanism works equally well for your own content. A team that maintains a collection of blog posts about their database architecture, or an internal wiki with troubleshooting guides and operational procedures, can add those as local path sources and have them appear alongside the official product documentation in the knowledgebase. The LLM doesn't distinguish between the two; it simply searches the entire corpus and returns whatever is most relevant to the query.
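To make the shape of a source definition concrete, here is a minimal sketch in Go of the kind of information each entry carries: a product name, the versions to index, where the content lives, and its format. The field names and layout are illustrative assumptions only; the builder's actual configuration schema isn't reproduced in this article.

```go
package kbconfig

// Hypothetical sketch of what a documentation source definition might carry.
// The builder's real configuration schema and field names are not shown in
// this article, so everything here is illustrative.
type Source struct {
	Product  string   // e.g. "spock" or "postgresql"
	Versions []string // one entry per documented release
	Repo     string   // git URL, checked out at the right branch or tag per version
	Path     string   // or a local filesystem path (blog exports, runbooks, wikis)
	Format   string   // "markdown", "html", "rst", "docbook", or "sgml"
}

// An internal runbook collection added alongside the official product docs.
var opsRunbooks = Source{
	Product:  "ops-runbooks",
	Versions: []string{"current"},
	Path:     "/srv/wiki/export",
	Format:   "markdown",
}
```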
Chunking
Converting whole documents into something useful for semantic search requires breaking them into chunks that are small enough to be meaningful as individual search results but large enough to carry sufficient context. We use a two-pass hybrid algorithm that preserves the structural elements of the source documents.
In the first pass, the algorithm parses the Markdown content into structural elements: code blocks, tables, lists, blockquotes, and paragraphs. It never splits within a structural element, because a code block that's been cut in half is useless as a search result. Instead, it splits at the boundaries between elements, targeting around 250 words per chunk. When an individual element exceeds the target size, it uses type-specific splitting strategies. Code blocks split at line boundaries with fencing re-added to each piece. Tables split at row boundaries with the header row preserved in each chunk. Lists split at top-level item boundaries, and paragraphs split at sentence boundaries.
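As one concrete illustration of those type-specific strategies, here is a small Go sketch of splitting an oversized Markdown table at row boundaries while repeating the header in each piece. It's a simplification under assumed names and a fixed row budget; the builder's real splitter handles more edge cases.

```go
package chunker

import "strings"

// splitTable breaks a Markdown table into pieces of at most maxRows body rows,
// repeating the header row and the separator row in every piece so each chunk
// remains a valid, self-describing table. Illustrative sketch only.
func splitTable(table string, maxRows int) []string {
	lines := strings.Split(strings.TrimSpace(table), "\n")
	if len(lines) <= 2 {
		return []string{table}
	}
	header := lines[:2] // header row plus the |---|---| separator row
	body := lines[2:]

	var chunks []string
	for start := 0; start < len(body); start += maxRows {
		end := start + maxRows
		if end > len(body) {
			end = len(body)
		}
		piece := append(append([]string{}, header...), body[start:end]...)
		chunks = append(chunks, strings.Join(piece, "\n"))
	}
	return chunks
}
```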
The second pass merges undersized chunks. Any chunk smaller than 100 words is merged with an adjacent chunk, provided the combined result doesn't exceed 300 words or 3,000 characters. The size constraints are deliberately conservative to maintain compatibility with Ollama models that have lower token limits, but they also happen to produce chunks that work well with all the embedding providers we support.
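The merge pass itself can be sketched in a few lines. The thresholds match the ones described above (a 100-word minimum, with 300-word and 3,000-character ceilings), but the structure and names here are illustrative rather than the builder's actual code.

```go
package chunker

// Chunk is a simplified stand-in for the builder's chunk record.
type Chunk struct {
	Text  string
	Words int
}

// mergeSmallChunks folds undersized chunks into their preceding neighbour,
// provided the combined chunk stays within the word and character budgets.
func mergeSmallChunks(chunks []Chunk, minWords, maxWords, maxChars int) []Chunk {
	var out []Chunk
	for _, c := range chunks {
		if len(out) > 0 {
			prev := &out[len(out)-1]
			tooSmall := c.Words < minWords || prev.Words < minWords
			fits := prev.Words+c.Words <= maxWords &&
				len(prev.Text)+len(c.Text)+1 <= maxChars
			if tooSmall && fits {
				prev.Text += "\n" + c.Text
				prev.Words += c.Words
				continue
			}
		}
		out = append(out, c)
	}
	return out
}
```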
One detail that turned out to be more important than we expected is heading hierarchy tracking. As the chunker works through a document, it maintains a stack of headings at each level. When it creates a chunk, it records the full heading path, so a chunk about OAuth configuration might carry the hierarchy "API Reference > Authentication > OAuth". This context significantly improves the quality of search results, because the embedding captures not just the content of the chunk but its position in the broader document structure.
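A minimal sketch of that heading-stack bookkeeping, assuming ATX-style Markdown headings, looks something like the following. The real chunker tracks more state, but the idea is the same: remember the latest heading seen at each level, clear anything deeper, and join the non-empty entries into a path.

```go
package chunker

import "strings"

// headingStack records the most recent heading seen at each Markdown level.
type headingStack struct {
	levels [7]string // indices 1..6 correspond to heading levels
}

// observe updates the stack when a line is a heading, clearing deeper levels.
func (h *headingStack) observe(line string) {
	if !strings.HasPrefix(line, "#") {
		return
	}
	level := len(line) - len(strings.TrimLeft(line, "#"))
	if level < 1 || level > 6 {
		return
	}
	h.levels[level] = strings.TrimSpace(strings.TrimLeft(line, "# "))
	for l := level + 1; l <= 6; l++ {
		h.levels[l] = "" // deeper headings are no longer in scope
	}
}

// path returns the full heading path, e.g. "API Reference > Authentication > OAuth".
func (h *headingStack) path() string {
	var parts []string
	for _, title := range h.levels[1:] {
		if title != "" {
			parts = append(parts, title)
		}
	}
	return strings.Join(parts, " > ")
}
```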
Embeddings
Each chunk is embedded using all three supported providers: OpenAI (using the text-embedding-3-small model by default), Voyage AI (using voyage-3), and Ollama (using nomic-embed-text for fully offline operation). The embeddings from every provider are generated in parallel and stored together as compact float32 binary blobs in the SQLite database, which is considerably more space-efficient than storing them as JSON arrays.
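For the curious, packing a vector into a float32 blob is straightforward: each value becomes four bytes. The sketch below assumes little-endian byte order, which may or may not match the actual on-disk layout, but it shows why the binary form is so much more compact than a JSON array of decimal strings.

```go
package kb

import (
	"encoding/binary"
	"math"
)

// encodeEmbedding packs a float32 vector into a compact binary blob,
// four little-endian bytes per value. Byte order is an assumption here.
func encodeEmbedding(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(v))
	}
	return buf
}

// decodeEmbedding reverses encodeEmbedding.
func decodeEmbedding(blob []byte) []float32 {
	vec := make([]float32, len(blob)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[i*4:]))
	}
	return vec
}
```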
The reason for embedding with all three providers at build time is purely practical. By shipping a knowledgebase database that already contains OpenAI, Voyage AI, and Ollama embeddings side by side, the system administrator installing the MCP server can simply choose whichever embedding provider suits their environment. An organisation that uses OpenAI for everything can use the OpenAI embeddings. A team that needs fully offline operation can use the Ollama embeddings without having to regenerate the entire database themselves. At query time, the tool automatically selects the embeddings that match the configured provider, with a smart fallback to other providers if the preferred one happens to be missing for a particular chunk.
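The selection logic amounts to a preference with a fallback, roughly like the sketch below. The provider keys and the data structure are assumptions made for illustration, not the server's actual representation.

```go
package kb

// chunkEmbeddings maps a provider name to the stored vector for one chunk.
type chunkEmbeddings map[string][]float32

// selectEmbedding prefers the configured provider's embedding and falls back
// to any other provider that has one for this chunk.
func selectEmbedding(embs chunkEmbeddings, preferred string) ([]float32, string, bool) {
	if v, ok := embs[preferred]; ok {
		return v, preferred, true
	}
	for _, provider := range []string{"openai", "voyageai", "ollama"} {
		if v, ok := embs[provider]; ok {
			return v, provider, true // fallback when the preferred one is missing
		}
	}
	return nil, "", false
}
```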
The builder is incremental. It uses SHA-256 checksums to detect which source files have changed since the last build, and only re-processes files that are new or modified. It also deduplicates across versions, since documentation that hasn't changed between PostgreSQL releases doesn't need to be chunked and embedded again. For a full build covering all PostgreSQL versions from 14 to 18 plus all the pgEdge products, the result is a database of roughly 150,000 chunks, and generating embeddings for them takes around 25 to 50 minutes with the cloud providers.
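The change-detection check is conceptually simple: hash the file and compare it with the checksum recorded during the previous build. A hedged sketch, with the checksum store reduced to a plain map:

```go
package builder

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
)

// needsRebuild reports whether a source file's SHA-256 checksum differs from
// the one recorded in the previous build, and returns the new checksum so the
// caller can record it. The map-based store is a simplification.
func needsRebuild(path string, previous map[string]string) (bool, string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, "", err
	}
	sum := sha256.Sum256(data)
	checksum := hex.EncodeToString(sum[:])
	return previous[path] != checksum, checksum, nil
}
```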
How the LLM uses the knowledgebase
The knowledgebase is exposed to the LLM as a single MCP tool called search_knowledgebase. The tool accepts a natural language query and returns the most semantically similar chunks from the database. Behind the scenes, it converts the query into a vector embedding using whichever provider is configured for the MCP server, then calculates cosine similarity against the corresponding embeddings stored in the knowledgebase and returns the top results.
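The scoring step is ordinary cosine similarity over the stored vectors. The brute-force sketch below shows the idea; the tool's actual implementation details (SQL access, batching, filtering, and so on) aren't reproduced here.

```go
package kb

import (
	"math"
	"sort"
)

// cosine returns the cosine similarity between two vectors of equal length.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the indices of the k chunks most similar to the query vector.
func topK(query []float32, chunks [][]float32, k int) []int {
	scores := make([]float64, len(chunks))
	idx := make([]int, len(chunks))
	for i, c := range chunks {
		scores[i] = cosine(query, c)
		idx[i] = i
	}
	sort.Slice(idx, func(i, j int) bool { return scores[idx[i]] > scores[idx[j]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}
```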
The tool supports filtering by product name and version, which is important both for relevance and for token efficiency. If the user is asking about Spock replication, there's no point returning chunks from the PostgreSQL 14 documentation or the pgBouncer manual. The LLM can also call the tool with a list_products parameter to discover what documentation is available before performing a search, so it doesn't have to guess at product names, which must match the filter exactly.
A typical interaction looks something like this. The user asks a question about configuring Spock multi-master replication. The LLM recognises that this is a topic it may not have reliable training data for, so it calls search_knowledgebase with list_products set to true. It sees that documentation for Spock 5.0.4 is available, and calls the tool again with a targeted query and the product name filter. The tool returns the five most relevant chunks from the Spock documentation, which the LLM reads, synthesises, and presents to the user as a coherent answer with accurate configuration details and version-specific information.
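Expressed as tool arguments, that interaction boils down to two calls: a discovery call followed by a filtered search. The parameter and product-name strings below are illustrative guesses based on the description above, not the tool's published schema.

```go
package kb

// SearchKnowledgebaseArgs is a hypothetical shape for a search_knowledgebase
// call, covering only the parameters described in this article: a query,
// product and version filters, and a list_products discovery flag.
type SearchKnowledgebaseArgs struct {
	Query        string `json:"query,omitempty"`
	Product      string `json:"product,omitempty"`
	Version      string `json:"version,omitempty"`
	ListProducts bool   `json:"list_products,omitempty"`
}

// First discover what documentation exists, then search with a filter.
var discover = SearchKnowledgebaseArgs{ListProducts: true}

var search = SearchKnowledgebaseArgs{
	Query:   "configure multi-master replication between nodes",
	Product: "spock",
	Version: "5.0.4",
}
```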
The key insight is that the LLM doesn't need to know about Spock in advance. It just needs to know that the search_knowledgebase tool exists and that it can search for documentation on products it isn't confident about. The tool descriptions include guidance that encourages this behaviour, and in practice we find that LLMs are quite good at recognising when they're uncertain and reaching for the knowledgebase rather than guessing.
What makes this different from generic RAG
The distinction between the knowledgebase and a generic RAG setup is worth drawing out. A general-purpose RAG system typically indexes whatever documents you throw at it and returns results based purely on semantic similarity. The knowledgebase is more opinionated. It understands the concept of products and versions, so it can filter results to a specific release. It uses a chunking algorithm that was designed specifically for technical documentation, preserving code blocks, tables, and heading hierarchies rather than splitting blindly on token counts. And because it's integrated into the MCP tool framework, the LLM can use it alongside the database query tools in the same conversation, checking the documentation for a feature before writing a query that uses that feature.
The practical difference is that the LLM can give accurate, version-specific answers about products and features that are completely absent from its training data. That's not something you get from prompt engineering or fine-tuning, because neither approach can inject knowledge about a product that was released after the model was trained. The knowledgebase is simply the most practical way to bridge the gap between what the model knows and what the user needs.
Try it yourself
The pgEdge Postgres MCP Server is open source under the PostgreSQL licence, and the knowledgebase builder and search tool are included. You can build a knowledgebase from your own documentation sources, or use the pre-built database that ships with the project's releases. Full documentation is available at docs.pgedge.com.