In Part 1 we loaded our documentation into PostgreSQL. In Part 2 we chunked those documents and generated vector embeddings. Now it's time to put it all together with an API that your applications can use.
In this final post, we'll deploy the pgEdge RAG Server to provide a simple HTTP API for asking questions about your content. By the end, you'll have a working RAG system that can answer questions using your own documentation.

What the RAG Server Does

The RAG server sits between your application and the LLM, handling the retrieval part of Retrieval-Augmented Generation. When a query comes in, it:

  • Converts the query to a vector embedding

  • Searches for relevant chunks using both semantic (vector) and keyword (BM25) matching

  • Combines and ranks the results

  • Formats the top results as context for the LLM

  • Sends the context and query to the LLM

  • Returns the generated answer

This hybrid search approach - combining vector similarity with traditional keyword matching - tends to give better results than either method alone. Vector search catches semantically related content even when the exact words differ, while BM25 ensures you don't miss obvious keyword matches.
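To make the "combines and ranks" step concrete, here's a toy sketch of one common way to fuse two ranked result lists, reciprocal rank fusion. This is purely illustrative and not necessarily the exact algorithm the RAG server uses internally:

# Toy illustration of fusing two ranked result lists (e.g. vector and BM25 hits).
# This is NOT the RAG server's internal implementation, just the general idea.

def reciprocal_rank_fusion(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Score each chunk by 1/(k + rank) in every list it appears in, then sort."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunk "c2" ranks well in both lists, so it comes out on top.
print(reciprocal_rank_fusion(["c1", "c2", "c3"], ["c2", "c4", "c1"]))

A chunk that appears near the top of both lists beats a chunk that only scores well on one, which is exactly the behavior you want from hybrid search.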

Prerequisites

Before we start, you'll need:

  • The database we set up in Parts 1 and 2, with documents and embeddings

  • An API key for your chosen LLM provider (Anthropic, OpenAI, or local Ollama)

  • Go 1.23 or later for building from source

Installing the RAG Server

Clone and build the server:

git clone https://github.com/pgEdge/pgedge-rag-server.git
cd pgedge-rag-server
make build

This creates the binary at bin/pgedge-rag-server.

Configuration

The RAG server uses a YAML configuration file. Here's a basic setup:

server:
  listen_address: "0.0.0.0"
  port: 8080

api_keys:
  openai: "/var/lib/postgresql/.openai-api-key"
  anthropic: "/var/lib/postgresql/.anthropic-api-key"

pipelines:
  - name: "docs"
    description: "Search our documentation"
    database:
      host: "localhost"
      port: 5432
      database: "ragdb"
      username: "docuser"
      password: "your_secure_password"
      ssl_mode: "prefer"
    tables:
      - table: "documents_content_chunks"
        text_column: "content"
        vector_column: "embedding"
    embedding_llm:
      provider: "openai"
      model: "text-embedding-3-small"
    rag_llm:
      provider: "anthropic"
      model: "claude-sonnet-4-20250514"
    token_budget: 4000
    top_n: 10

Save this as config.yaml. Let's break down the key sections:

server - Where the API listens. Default is port 8080 on all interfaces.

api_keys - Paths to files containing your API keys. Each file should contain just the key, nothing else. Make sure they have restrictive permissions (chmod 600).

pipelines - This is where it gets interesting. A pipeline defines a complete RAG configuration: which database to query, which tables to search, and which LLM providers to use. You can define multiple pipelines for different use cases.

tables - Points to our chunk table from Part 2. The text_column is used for BM25 keyword search, and the vector_column is used for semantic search.

embedding_llm - The model used to convert queries into vectors. This must match what you used in Part 2 to generate the document embeddings.

rag_llm - The model used to generate answers. This can be different from the embedding model.

token_budget - How many tokens of context to send to the LLM. More context means more information but higher costs and slower responses.

top_n - How many chunks to retrieve before applying the token budget.

Setting Up API Keys

Create key files with appropriate permissions (adjust the paths so they match the api_keys entries in your configuration):

# OpenAI (for embeddings)
echo "sk-your-openai-key" > ~/.openai-api-key
chmod 600 ~/.openai-api-key

# Anthropic (for completions)
echo "sk-ant-your-anthropic-key" > ~/.anthropic-api-key
chmod 600 ~/.anthropic-api-key

You can also use environment variables:

export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"

Running the Server

Start the server:

./bin/pgedge-rag-server -config config.yaml

You should see output indicating the server is running. Test it with the health endpoint:

curl http://localhost:8080/v1/health

You should get:

{"status": "healthy"}

Making Your First Query

Now let's ask a question. The main endpoint is POST /v1/pipelines/{pipeline-name}:

curl -X POST http://localhost:8080/v1/pipelines/docs \
  -H "Content-Type: application/json" \
  -d '{"query": "What is pgAdmin?"}'

The server will:

  • Convert your question to a vector using OpenAI

  • Search the chunks table for relevant content

  • Send the best matches to Claude

  • Return the generated answer

You'll get a response like:

{
    "answer":"Based on the provided context, pgAdmin is the leading Open Source management tool for Postgres (PostgreSQL), the world's most advanced Open Source database. \n\nKey characteristics of pgAdmin include:\n\n- **Purpose**: It's designed to meet the needs of both novice and experienced Postgres users alike\n- **Interface**: Provides a powerful graphical interface that simplifies the creation, maintenance and use of database objects\n- **Functionality**: Supports all PostgreSQL features, from writing simple SQL queries to developing complex databases\n- **Real-time capability**: Designed to query an active database in real-time, allowing you to stay current with modifications and implementations\n\nThe tool features a highly-customizable display with drag-and-drop panels, a tree control for managing servers and database objects, and includes powerful developer tools for executing complex SQL commands, managing data, and debugging code.\n\npgAdmin can be deployed in multiple ways: as a desktop application, in server mode behind a webserver, or as a container deployment.",
  "tokens_used":3654
}

Including Source Documents

If you want to see which documents were used to generate the answer:

curl -X POST http://localhost:8080/v1/pipelines/docs \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure a database connection?",
    "include_sources": true
  }'

The response includes the source chunks:

{
  "answer": "Based on the provided context, here's how to configure a database connection:\n\n## Server Connection Configuration...",
  "sources": [
    {
      "id": "(2,9)",
      "content": "\n# Database Dialog\n\nUse the *Database* dialog to define or modify a database...",
      "score": 0.016129032258064516
    },
    {
      "content": "file* field to specify the location of the key file.\n* If the SSH host...",
      "score": 0.015625
    }
  ],
  "tokens_used": 3373
}

This is useful for debugging, showing citations to users, or building UIs that let users explore the source material.

Streaming Responses

For chat-style interfaces, you probably want streaming responses so users see the answer as it's generated:

curl -N -X POST http://localhost:8080/v1/pipelines/docs \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain the architecture", "stream": true}'

The response uses Server-Sent Events:

data: {"type":"chunk","content":"The architecture "}

data: {"type":"chunk","content":"consists of three "}

data: {"type":"chunk","content":"main components..."}

data: {"type":"done"}

In JavaScript, you'd consume this with the EventSource API or a fetch with streaming.
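If you prefer Python, here's a minimal sketch that reads the stream with the requests library, assuming the event format shown above:

import json
import requests

# Minimal SSE consumer for the streaming endpoint. Assumes the event format
# shown above: {"type":"chunk","content":...} followed by {"type":"done"}.
url = "http://localhost:8080/v1/pipelines/docs"
payload = {"query": "Explain the architecture", "stream": True}

with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event["type"] == "chunk":
            print(event["content"], end="", flush=True)
        elif event["type"] == "done":
            print()
            break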

Filtering Results

If you have multiple products or versions in your database (remember the --set-column option from Part 1?), you can filter results using a structured filter format. Note that for this example to work, we need to create a view that includes the product name and version along with the chunks:

CREATE VIEW product_docs AS
  SELECT c.*, d.product, d.version
    FROM documents_content_chunks c
    JOIN documents d ON c.source_id = d.id;

GRANT SELECT ON product_docs TO docuser;

Then, we can use the view as the pipeline's source table.
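A minimal tables entry pointing at the view might look like this (same column names as the earlier configuration):

tables:
  - table: "product_docs"
    text_column: "content"
    vector_column: "embedding"

With the pipeline pointing at the view, run a filtered query: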

curl -X POST http://localhost:8080/v1/pipelines/docs \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I enable SSL?",
    "filter": {
      "conditions": [
        {"column": "product", "operator": "=", "value": "pgAdmin"},
        {"column": "version", "operator": "=", "value": "9.10"}
      ],
      "logic": "AND"
    }
  }'

The API filter uses a structured format with explicit conditions, operators, and logic. This prevents SQL injection by only allowing whitelisted operators (=, !=, <, >, <=, >=, LIKE, ILIKE, IN, NOT IN, IS NULL, IS NOT NULL) and safely parameterizing values. The filter is applied to both the vector search and the BM25 search.
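From application code, a filtered query is just another JSON body. For example, a small Python helper (the ask_filtered name is hypothetical; the payload shape matches the curl call above):

import requests

RAG_URL = "http://localhost:8080/v1/pipelines/docs"

def ask_filtered(question: str, product: str, version: str) -> dict:
    """Ask a question restricted to one product/version via a structured filter."""
    payload = {
        "query": question,
        "filter": {
            "conditions": [
                {"column": "product", "operator": "=", "value": product},
                {"column": "version", "operator": "=", "value": version},
            ],
            "logic": "AND",
        },
    }
    response = requests.post(RAG_URL, json=payload)
    response.raise_for_status()
    return response.json()

print(ask_filtered("How do I enable SSL?", "pgAdmin", "9.10")["answer"])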

You can also set a default filter in the configuration:

tables:
  - table: "documents_content_chunks"
    text_column: "content"
    vector_column: "embedding"
    filter:
      conditions:
        - column: "status"
          operator: "="
          value: "published"

Note: Configuration file filters can use either the structured format shown above or raw SQL for complex queries (like subqueries). Since config files are controlled by administrators, raw SQL is safe there. API request filters must always use the structured format for security.

Conversation History

For multi-turn conversations, you can include previous messages:

curl -X POST http://localhost:8080/v1/pipelines/docs \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure that?",
    "messages": [
      {"role": "user", "content": "What authentication methods are supported?"},
      {"role": "assistant", "content": "The system supports OAuth2, LDAP, and internal database authentication."}
    ]
  }'

The conversation history gives the LLM context about what "that" refers to, enabling natural follow-up questions.
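As a quick illustration, here's a small command-line chat loop that accumulates the conversation and sends it with each request (a sketch built on the request and response fields shown above):

import requests

# Tiny interactive chat loop against the "docs" pipeline. Each turn sends the
# accumulated conversation in "messages" so follow-up questions have context.
RAG_URL = "http://localhost:8080/v1/pipelines/docs"
messages = []

while True:
    question = input("You: ").strip()
    if not question:
        break
    response = requests.post(RAG_URL, json={"query": question, "messages": messages})
    response.raise_for_status()
    answer = response.json()["answer"]
    print(f"Assistant: {answer}\n")
    # Remember both sides of the exchange for the next turn.
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})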

Multiple Pipelines

One server can host multiple pipelines for different use cases:

pipelines:
  - name: "pgadmin-docs"
    description: "pgAdmin documentation"
    database:
      database: "ragdb"
    tables:
      - table: "documents_content_chunks"
        text_column: "content"
        vector_column: "embedding"
        filter: "source_id IN (SELECT id FROM documents WHERE product='pgAdmin')"
    # ... LLM config

  - name: "pgedge-docs"
    description: "pgEdge platform documentation"
    database:
      database: "ragdb"
    tables:
      - table: "documents_content_chunks"
        text_column: "content"
        vector_column: "embedding"
        filter: "source_id IN (SELECT id FROM documents WHERE product='pgEdge')"
    # ... LLM config

  - name: "internal-kb"
    description: "Internal knowledge base"
    database:
      database: "internal_kb"
    tables:
      - table: "kb_chunks"
        text_column: "content"
        vector_column: "embedding"
        filter:
          conditions:
            - column: "status"
              operator: "="
              value: "published"
    # ... LLM config

Note that the IN filters in the examples above are illustrative but not ideal for performance. You may prefer to create a view that includes the product name or other metadata with each chunk, and use that view as the pipeline's source table.

List available pipelines:

curl http://localhost:8080/v1/pipelines
{"pipelines":[{"name":"pgadmin-docs","description":"pgAdmin documentation"},{"name":"pgedge-docs","description":"pgEdge platform documentation"},{"name":"internal-kb","description":"Internal knowledge base"}]}

Alternative LLM Providers

The configuration I showed uses OpenAI for embeddings and Anthropic for completions, but you have options.

All OpenAI

embedding_llm:
  provider: "openai"
  model: "text-embedding-3-small"
rag_llm:
  provider: "openai"
  model: "gpt-4o-mini"

Voyage AI for Embeddings

Voyage offers high-quality embeddings, often at lower cost:

api_keys:
  voyage: "/home/user/.voyage-api-key"

# In your pipeline:
embedding_llm:
  provider: "voyage"
  model: "voyage-3"

Note that Voyage embeddings have 1024 dimensions, so you'd need to adjust your vectorizer configuration in Part 2 accordingly.

Local with Ollama

For complete privacy and no API costs:

embedding_llm:
  provider: "ollama"
  model: "nomic-embed-text"
rag_llm:
  provider: "ollama"
  model: "llama3.2"

No API keys needed. You'll need Ollama running locally with the models pulled:

ollama pull nomic-embed-text
ollama pull llama3.2

Local models are usually slower than API calls (depending on your hardware) but give you complete control over your data.

Production Deployment

For production use, you'll want to consider a few things:

TLS/HTTPS

Enable TLS in the configuration:

server:
  port: 443
  tls:
    enabled: true
    cert_file: "/etc/ssl/certs/server.pem"
    key_file: "/etc/ssl/private/server.key"

Authentication

The RAG server doesn't include authentication - it's designed to sit behind your infrastructure. Common approaches:

  • Put it behind a reverse proxy (nginx, Caddy) with authentication

  • Use an API gateway

  • Run it on a private network accessible only to your application servers

Systemd Service

Create a service file at /etc/systemd/system/pgedge-rag-server.service:

[Unit]
Description=pgEdge RAG Server
After=network.target postgresql.service

[Service]
Type=simple
User=pgedge
ExecStart=/usr/local/bin/pgedge-rag-server -config /etc/pgedge/rag-server.yaml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Note that the "ExecStart" line must contain the full command and the path to the config file on a single line, even if it appears wrapped in your browser.

Then:

sudo systemctl daemon-reload
sudo systemctl enable pgedge-rag-server
sudo systemctl start pgedge-rag-server

Tuning Performance

A few parameters affect performance and quality:

token_budget - Higher values give the LLM more context but increase latency and cost. Start with 4000 and adjust based on your content and response quality.

top_n - How many chunks to retrieve. The token budget will ultimately determine how many are actually sent to the LLM, but retrieving more candidates can improve result quality. 10-20 is usually sufficient.

Chunk size (from Part 2) - Smaller chunks give more precise retrieval but may lack context. Larger chunks provide more context but may include irrelevant content. The 400-token default is a reasonable starting point.
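Since token_budget and top_n live in the configuration file, a simple way to tune them is to time a few representative questions, change the configuration, restart the server, and compare. A rough sketch (the question list is just an example):

import time
import requests

# Rough benchmarking sketch: time a few representative questions and report
# tokens_used. Re-run after changing token_budget, top_n, or chunk size.
RAG_URL = "http://localhost:8080/v1/pipelines/docs"
QUESTIONS = [
    "What is pgAdmin?",
    "How do I configure a database connection?",
    "How do I enable SSL?",
]

for question in QUESTIONS:
    start = time.monotonic()
    response = requests.post(RAG_URL, json={"query": question})
    response.raise_for_status()
    elapsed = time.monotonic() - start
    tokens = response.json().get("tokens_used")
    print(f"{elapsed:6.2f}s  {tokens} tokens  {question}")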

Putting It All Together

We now have a complete RAG system: documents loaded into PostgreSQL (Part 1), chunks and embeddings (Part 2), and an HTTP API for querying them (this post).

The entire pipeline runs on PostgreSQL plus a single Go binary. No message queues, no separate vector databases, no complex orchestration. Just SQL and HTTP.

To update your knowledge base, re-run the document loader. The vectorizer automatically processes changes, and the RAG server immediately serves the updated content.

Example Integration

Here's a simple Python client:

import requests

RAG_URL = "http://localhost:8080/v1/pipelines/docs"

def ask(question: str, include_sources: bool = False) -> dict:
    response = requests.post(RAG_URL, json={
        "query": question,
        "include_sources": include_sources
    })
    response.raise_for_status()
    return response.json()

# Simple usage
result = ask("How do I configure backups?")
print(result["answer"])

# With sources for citations
result = ask("What are the system requirements?", include_sources=True)
print(result["answer"])
for source in result.get("sources", []):
    print(f"  - {source['content'][:100]}... (score: {source['score']:.2f})")

What's Next?

You now have a working RAG system. Some ideas for extending it:

  • Add a web UI for interactive querying

  • Integrate with your existing chatbot or support system

  • Set up scheduled document loading to keep content fresh

  • Add logging and monitoring for production observability

  • Experiment with different chunk sizes and token budgets

The beauty of this approach is that it's all built on PostgreSQL. You can use standard database tooling for backups, replication, and monitoring. Your documents, embeddings, and application data can all live together, benefiting from PostgreSQL's reliability and your existing operational expertise.

Happy querying!