Building a RAG Server with PostgreSQL - Part 1: Loading Your Content
Retrieval-Augmented Generation (RAG) has become one of the most practical ways to give Large Language Models (LLMs) access to your own data. Rather than fine-tuning a model or hoping it somehow knows about your documentation, RAG lets you retrieve relevant content from your own sources and provide it as context to the LLM at query time. The result is accurate, grounded responses based on your actual content.
In this three-part series, I'll walk through building a complete RAG server using PostgreSQL as the foundation. We'll cover:
Part 1 (this post): Creating a schema and loading your documents
Part 2: Chunking documents and generating embeddings with pgEdge Vectorizer
Part 3: Deploying a RAG API server for your applications
By the end of the series, you'll have a working RAG system that can answer questions using your own documentation or knowledge base.
Why PostgreSQL for RAG?
If you're already running PostgreSQL (and let's face it, you probably are), adding RAG capabilities to your existing infrastructure makes a lot of sense. With the pgvector extension, Postgres becomes a capable vector database without requiring you to deploy and manage yet another specialised system. Your documents, embeddings, and application data can all live in one place, with the transactional guarantees and operational tooling you already know.
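To give a quick flavour of what that looks like (we'll set this up properly with pgEdge Vectorizer in Part 2), here's a minimal pgvector sketch. The example_chunks table and the tiny 3-dimension vectors are purely illustrative; real embedding models produce hundreds or thousands of dimensions:

-- Illustrative only: enable pgvector and run a similarity search
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical table of text chunks and their embeddings
CREATE TABLE example_chunks (
    id SERIAL PRIMARY KEY,
    chunk TEXT NOT NULL,
    embedding vector(3)  -- toy dimension; real models use far more
);

-- Find the five chunks nearest to a query embedding (cosine distance)
SELECT chunk
FROM example_chunks
ORDER BY embedding <=> '[0.25, 0.1, 0.8]'::vector
LIMIT 5;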
The Architecture
Our RAG system consists of three components:
Document Loader - Converts your source documents (HTML, Markdown, reStructuredText) into a consistent format and stores them in PostgreSQL (GitHub)
Vectorizer - Chunks the documents into smaller pieces and generates vector embeddings for semantic search (GitHub)
RAG Server - Provides an API that retrieves relevant chunks and sends them to an LLM for response generation (GitHub)
In this first post, we'll focus on getting your documents into the database.
Setting Up the Database
First, let's create a database and the schema we'll need. I'm assuming you have PostgreSQL 14 or later installed (I’m using PostgreSQL 18 on Debian Trixie, installed using pgEdge Enterprise Postgres):
-- Connect as a superuser
psql -U postgres

-- Create the database
CREATE DATABASE ragdb;

-- Connect to it
\c ragdb

Now let's create our documents table. This schema is designed to support the full RAG pipeline:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT NOT NULL,
source BYTEA,
filename TEXT UNIQUE NOT NULL,
file_modified TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index for filename lookups (used during updates)
CREATE INDEX idx_documents_filename ON documents(filename);

-- Full-text search index (useful for hybrid search later)
CREATE INDEX idx_documents_content_fts ON documents
USING gin(to_tsvector('english', content));

A few notes on this schema:
content stores the document converted to Markdown format, which provides a clean, consistent format regardless of the source
source stores the original document as binary data, useful if you ever need to reprocess or reference the original
filename has a UNIQUE constraint, which allows us to update documents when they change rather than creating duplicates
The full-text search index isn't strictly necessary for RAG (we'll use vector search), but it's useful for hybrid search approaches and debugging - see the example query below
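For example, once documents are loaded, a keyword query over the content can use the GIN index directly. The search terms here are just an example:

-- Keyword search over the loaded content using the full-text index
SELECT title, filename
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'backup restore')
ORDER BY ts_rank(to_tsvector('english', content),
                 plainto_tsquery('english', 'backup restore')) DESC
LIMIT 5;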
Now let's create a user for the document loader:
-- Create a user for the loader
CREATE USER docuser WITH PASSWORD 'your_secure_password';

-- Grant the necessary permissions
GRANT SELECT, INSERT, UPDATE ON documents TO docuser;
GRANT USAGE, SELECT ON SEQUENCE documents_id_seq TO docuser;

Installing the Document Loader
The pgEdge Document Loader is a command-line tool that handles the conversion and loading of documents. It supports HTML, Markdown, and reStructuredText files, automatically extracting titles and metadata.
To install from source:
git clone https://github.com/pgEdge/pgedge-docloader.git
cd pgedge-docloader
make build
sudo make install

This installs the pgedge-docloader binary to /usr/local/bin. You can verify the installation:

pgedge-docloader version

To see the supported formats:

pgedge-docloader formats

Note that we will be adding packages for pgedge-docloader and the other projects used in this blog series to our pgEdge Enterprise Postgres repositories over the coming weeks.
Loading Documents
Let's say you have documentation in a docs directory. The simplest way to load it is:
export PGPASSWORD='your_secure_password'
pgedge-docloader \
--source ./pgadmin4/docs/en_US \
--db-host localhost \
--db-name ragdb \
--db-user docuser \
--db-table documents \
--col-doc-title title \
--col-doc-content content \
--col-source-content source \
--col-file-name filename \
--col-file-modified file_modified \
--col-row-created created_at \
--col-row-updated updated_at

Note that for convenience, I’m using the documentation from pgAdmin 4. The tool will:
Scan the source directory for supported files
Convert each document to Markdown format
Extract the title from the document
Insert everything into the database in a single transaction
If anything fails, the entire operation is rolled back, so you won't end up with partially loaded content.
Using a Configuration File
For repeated use, a configuration file is more convenient. Create a file called docloader.yml:
# Source documents
source: "./pgadmin4/docs/en_US"
# Database connection
db-host: localhost
db-port: 5432
db-name: ragdb
db-user: docuser
db-sslmode: prefer
db-table: documents
# Column mappings
col-doc-title: title
col-doc-content: content
col-source-content: source
col-file-name: filename
col-file-modified: file_modified
col-row-created: created_at
col-row-updated: updated_at
# Update existing documents when re-running
update: true

Now you can simply run:

pgedge-docloader --config docloader.yml

The update: true setting is particularly useful. It enables upsert behaviour: if a document with the same filename already exists, it will be updated rather than causing a duplicate key error. This makes it easy to keep your database in sync as documentation changes.
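Under the hood this is standard PostgreSQL upsert behaviour, keyed on the UNIQUE filename column. The loader handles it for you; the following is just a conceptual sketch of the equivalent SQL, with made-up values, not the loader's actual statement:

-- Conceptual equivalent of update: true (illustrative values)
INSERT INTO documents (title, content, source, filename, file_modified)
VALUES ('Backup Dialog', '# Backup Dialog ...', NULL, 'backup_dialog.rst', NOW())
ON CONFLICT (filename) DO UPDATE
SET title         = EXCLUDED.title,
    content       = EXCLUDED.content,
    source        = EXCLUDED.source,
    file_modified = EXCLUDED.file_modified,
    updated_at    = CURRENT_TIMESTAMP;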
Working with Different Source Formats
The loader handles format conversion automatically based on file extension:
HTML files (.html, .htm) are converted to Markdown, with the title extracted from the <title> tag. This is particularly useful if you're loading documentation generated by tools like Sphinx or MkDocs.
Markdown files (.md) are stored as-is, with the title extracted from the first level-1 heading.
reStructuredText files (.rst) are converted to Markdown, with titles extracted from underlined headings. RST directives are processed where possible.
Using Glob Patterns
For more control over which files to load, you can use glob patterns:
# Load only Markdown files
pgedge-docloader --source "./docs/**/*.md" --config docloader.yml

# Load HTML files from a specific subdirectory
pgedge-docloader --source "./output/html/*.html" --config docloader.yml

The ** pattern matches any number of directories, so ./docs/**/*.md will find all Markdown files anywhere under the docs directory.
Verifying the Load
After loading, you can verify your documents are in the database:
-- Connect to the database
psql -h localhost -U docuser -d ragdb

-- Check document count
SELECT COUNT(*) FROM documents;

-- View some titles
SELECT title, filename FROM documents LIMIT 10;

-- Check a specific document
SELECT title, LEFT(content, 500) AS content_preview
FROM documents
WHERE filename LIKE '%backup%';

Handling Multiple Documentation Sets
If you're loading documentation from multiple products or versions, you might want to track that metadata. You can add custom columns to your schema (note: you must run this as the postgres user, as it owns the tables):
ALTER TABLE documents ADD COLUMN product TEXT;
ALTER TABLE documents ADD COLUMN version TEXT;
CREATE INDEX idx_documents_product ON documents(product);

Then use the --set-column flag to set these values during loading:
pgedge-docloader \
--config docloader.yml \
--set-column product="pgAdmin" \
--set-column version="9.10"

This allows you to load multiple documentation sets into the same table while keeping them logically separated.
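With those columns populated, each documentation set can be queried independently, for example:

-- Count documents per documentation set
SELECT product, version, COUNT(*) AS docs
FROM documents
GROUP BY product, version;

-- Browse titles from a single set
SELECT title, filename
FROM documents
WHERE product = 'pgAdmin' AND version = '9.10'
LIMIT 10;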
Next Steps
At this point, you have your documents loaded into PostgreSQL in a clean, consistent Markdown format. The content is ready for the next stage of our RAG pipeline: chunking and embedding generation.
In Part 2, we'll use pgEdge Vectorizer to break these documents into smaller, semantically meaningful chunks and generate vector embeddings. These embeddings are what enable semantic search - finding content based on meaning rather than just keyword matching.
Stay tuned!