Building a RAG Server with PostgreSQL - Part 1: Loading Your Content
Retrieval-Augmented Generation (RAG) has become one of the most practical ways to give Large Language Models (LLMs) access to your own data. Rather than fine-tuning a model or hoping it somehow knows about your documentation, RAG lets you retrieve relevant content from your own sources and provide it as context to the LLM at query time. The result is accurate, grounded responses based on your actual content.
In this three-part series, I'll walk through building a complete RAG server using PostgreSQL as the foundation. We'll cover:
Part 1 (this post): Creating a schema and loading your documents
Part 2: Chunking documents and generating embeddings with pgEdge Vectorizer
Part 3: Deploying a RAG API server for your applications
By the end of the series, you'll have a working RAG system that can answer questions using your own documentation or knowledge base.
Why PostgreSQL for RAG?
If you're already running PostgreSQL (and let's face it, you probably are), adding RAG capabilities to your existing infrastructure makes a lot of sense. With the pgvector extension, Postgres becomes a capable vector database without requiring you to deploy and manage yet another specialised system. Your documents, embeddings, and application data can all live in one place, with the transactional guarantees and operational tooling you already know.
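To give a quick flavour of what that looks like (we'll set this up properly with pgEdge Vectorizer in Part 2), here's a minimal pgvector sketch. The example_chunks table and the tiny 3-dimension vectors are purely illustrative; real embedding models produce hundreds or thousands of dimensions:

-- Illustrative only: enable pgvector and run a similarity search
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical table of text chunks and their embeddings
CREATE TABLE example_chunks (
    id SERIAL PRIMARY KEY,
    chunk TEXT NOT NULL,
    embedding vector(3)  -- toy dimension; real models use far more
);

-- Find the five chunks nearest to a query embedding (cosine distance)
SELECT chunk
FROM example_chunks
ORDER BY embedding <=> '[0.25, 0.1, 0.8]'::vector
LIMIT 5;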
The Architecture
Our RAG system consists of three components:
Document Loader - Converts your source documents (HTML, Markdown, reStructuredText) into a consistent format and stores them in PostgreSQL (GitHub)
Vectorizer - Chunks the documents into smaller pieces and generates vector embeddings for semantic search (GitHub)
RAG Server - Provides an API that retrieves relevant chunks and sends them to an LLM for response generation (GitHub)
In this first post, we'll focus on getting your documents into the database.
Setting Up the Database
First, let's create a database and the schema we'll need. I'm assuming you have PostgreSQL 14 or later installed (I’m using PostgreSQL 18 on Debian Trixie, installed using pgEdge Enterprise Postgres):
-- Connect as a superuser
psql -U postgres

-- Create the database
CREATE DATABASE ragdb;

-- Connect to it
\c ragdb

Now let's create our documents table. This schema is designed to support the full RAG pipeline:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT NOT NULL,
source BYTEA,
filename TEXT UNIQUE NOT NULL,
file_modified TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index for filename lookups (used during updates)
CREATE INDEX idx_documents_filename ON documents(filename);

-- Full-text search index (useful for hybrid search later)
CREATE INDEX idx_documents_content_fts ON documents
USING gin(to_tsvector('english', content));

A few notes on this schema:
content stores the document converted to Markdown format, which provides a clean, consistent format regardless of the source
source stores the original document as binary data, useful if you ever need to reprocess or reference the original
filename has a UNIQUE constraint, which allows us to update documents when they change rather than creating duplicates
The full-text search index isn't strictly necessary for RAG (we'll use vector search), but it's useful for hybrid search approaches and debugging - see the example query below
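For example, once documents are loaded, a keyword query over the content can use the GIN index directly. The search terms here are just an example:

-- Keyword search over the loaded content using the full-text index
SELECT title, filename
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'backup restore')
ORDER BY ts_rank(to_tsvector('english', content),
                 plainto_tsquery('english', 'backup restore')) DESC
LIMIT 5;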
Now let's create a user for the document loader:
-- Create a user for the loader
CREATE USER docuser WITH PASSWORD 'your_secure_password';

-- Grant the necessary permissions
GRANT SELECT, INSERT, UPDATE ON documents TO docuser;
GRANT USAGE, SELECT ON SEQUENCE documents_id_seq TO docuser;

Installing the Document Loader
The pgEdge Document Loader is a command-line tool that handles the conversion and loading of documents. It supports HTML, Markdown, and reStructuredText files, automatically extracting titles and metadata.
To install from source:
git clone https://github.com/pgEdge/pgedge-docloader.git
cd pgedge-docloader
make build
sudo make install

This installs the pgedge-docloader binary to /usr/local/bin. You can verify the installation:

pgedge-docloader version

To see the supported formats:

pgedge-docloader formats

Note that we will be adding packages for pgedge-docloader and the other projects used in this blog series to our pgEdge Enterprise Postgres repositories over the coming weeks.
Loading Documents
Let's say you have documentation in a docs directory. The simplest way to load it is:
export PGPASSWORD='your_secure_password'
pgedge-docloader \
--source ./pgadmin4/docs/en_US \
--db-host localhost \
--db-name ragdb \
--db-user docuser \
--db-table documents \
--col-doc-title title \
--col-doc-content content \
--col-source-content source \
--col-file-name filename \
--col-file-modified file_modified \
--col-row-created created_at \
--col-row-updated updated_at

Note that for convenience, I’m using the documentation from pgAdmin 4. The tool will:
Scan the source directory for supported files
Convert each document to Markdown format
Extract the title from the document
Insert everything into the database in a single transaction
If anything fails, the entire operation is rolled back, so you won't end up with partially loaded content.
Using a Configuration File
For repeated use, a configuration file is more convenient. Create a file called docloader.yml:
# Source documents
source: "./pgadmin4/docs/en_US"
# Database connection
db-host: localhost
db-port: 5432
db-name: ragdb
db-user: docuser
db-sslmode: prefer
db-table: documents
# Column mappings
col-doc-title: title
col-doc-content: content
col-source-content: source
col-file-name: filename
col-file-modified: file_modified
col-row-created: created_at
col-row-updated: updated_at
# Update existing documents when re-running
update: true

Now you can simply run:

pgedge-docloader --config docloader.yml

The update: true setting is particularly useful. It enables upsert behaviour: if a document with the same filename already exists, it will be updated rather than causing a duplicate key error. This makes it easy to keep your database in sync as documentation changes.
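Under the hood this is standard PostgreSQL upsert behaviour, keyed on the UNIQUE filename column. The loader handles it for you; the following is just a conceptual sketch of the equivalent SQL, with made-up values, not the loader's actual statement:

-- Conceptual equivalent of update: true (illustrative values)
INSERT INTO documents (title, content, source, filename, file_modified)
VALUES ('Backup Dialog', '# Backup Dialog ...', NULL, 'backup_dialog.rst', NOW())
ON CONFLICT (filename) DO UPDATE
SET title         = EXCLUDED.title,
    content       = EXCLUDED.content,
    source        = EXCLUDED.source,
    file_modified = EXCLUDED.file_modified,
    updated_at    = CURRENT_TIMESTAMP;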
Working with Different Source Formats
The loader handles format conversion automatically based on file extension:
HTML files (.html, .htm) are converted to Markdown, with the title extracted from the <title> tag. This is particularly useful if you're loading documentation generated by tools like Sphinx or MkDocs.
Markdown files (.md) are stored as-is, with the title extracted from the first level-1 heading.
reStructuredText files (.rst) are converted to Markdown, with titles extracted from underlined headings. RST directives are processed where possible.
Using Glob Patterns
For more control over which files to load, you can use glob patterns:
# Load only Markdown files
pgedge-docloader --source "./docs/**/*.md" --config docloader.yml

# Load HTML files from a specific subdirectory
pgedge-docloader --source "./output/html/*.html" --config docloader.yml

The ** pattern matches any number of directories, so ./docs/**/*.md will find all Markdown files anywhere under the docs directory.
Verifying the Load
After loading, you can verify your documents are in the database:
-- Connect to the database
psql -h localhost -U docuser -d ragdb

-- Check document count
SELECT COUNT(*) FROM documents;

-- View some titles
SELECT title, filename FROM documents LIMIT 10;

-- Check a specific document
SELECT title, LEFT(content, 500) AS content_preview
FROM documents
WHERE filename LIKE '%backup%';

Handling Multiple Documentation Sets
If you're loading documentation from multiple products or versions, you might want to track that metadata. You can add custom columns to your schema (note: you must run this as the postgres user, as it owns the tables):
ALTER TABLE documents ADD COLUMN product TEXT;
ALTER TABLE documents ADD COLUMN version TEXT;
CREATE INDEX idx_documents_product ON documents(product);

Then use the --set-column flag to set these values during loading:
pgedge-docloader \
--config docloader.yml \
--set-column product="pgAdmin" \
--set-column version="9.10"

This allows you to load multiple documentation sets into the same table while keeping them logically separated.
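With those columns populated, each documentation set can be queried independently, for example:

-- Count documents per documentation set
SELECT product, version, COUNT(*) AS docs
FROM documents
GROUP BY product, version;

-- Browse titles from a single set
SELECT title, filename
FROM documents
WHERE product = 'pgAdmin' AND version = '9.10'
LIMIT 10;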
Next Steps
At this point, you have your documents loaded into PostgreSQL in a clean, consistent Markdown format. The content is ready for the next stage of our RAG pipeline: chunking and embedding generation.
In Part 2, we'll use pgEdge Vectorizer to break these documents into smaller, semantically meaningful chunks and generate vector embeddings. These embeddings are what enable semantic search - finding content based on meaning rather than just keyword matching.
Stay tuned!