Semantic Caching in PostgreSQL: A Hands-On Guide to pg_semantic_cache
Your LLM application is probably answering the same question dozens of times a day. It just doesn't realize it because the words are different each time.
The Problem with Exact-Match Caching
If you're running an AI-powered application such as a chatbot, a RAG pipeline, or an analytics assistant, you've likely added a cache to cut down on expensive LLM calls. Most caches match on the exact query string: same string, cache hit; different string, cache miss.
The trouble is that humans don't repeat themselves verbatim. These three queries all want the same answer:
"What was our Q4 revenue?"
"Show me last quarter's sales numbers"
"Q4 2024 revenue figures"
A traditional cache sees three unique strings and triggers three separate LLM calls. In production AI applications, studies report that 40-70% of all queries are semantic duplicates: different words, same intent. That translates directly into wasted API calls, wasted latency, and a bloated cloud bill.
Semantic caching fixes this by matching on meaning instead of text. It uses vector embeddings to recognize that "Q4 revenue" and "last quarter's sales" are asking for the same thing, and serves the cached result in milliseconds instead of making another round trip to the LLM.
pg_semantic_cache is a PostgreSQL extension that brings this capability directly into your database. In this post, we'll set it up from scratch in a Docker container using pgEdge Enterprise Postgres 17 and walk through working examples you can run yourself.
What You'll Build
By the end of this post, you'll have:
A Docker container running pgEdge Enterprise Postgres 17 with pgvector and pg_semantic_cache.
A working semantic cache that matches queries by meaning.
Hands-on experience with caching and retrieval.
A clear understanding of how semantic similarity matching works in practice.
Setting Up the Environment
For our example, we'll use a Rocky Linux 9 container with pgEdge Enterprise Postgres 17, which bundles pgvector out of the box.
Dockerfile
First, we create a file called Dockerfile that defines the content of our container:
FROM rockylinux:9
# Install build tools and EPEL
RUN dnf install -y epel-release && \
dnf config-manager --set-enabled crb && \
dnf install -y gcc make git redhat-rpm-config
# Install pgEdge Enterprise Postgres 17 and pgvector
RUN dnf install -y https://dnf.pgedge.com/reporpm/pgedge-release-latest.noarch.rpm && \
dnf install -y pgedge-enterprise-postgres_17 pgedge-pgvector_17 pgedge-postgresql17-devel
# Add PostgreSQL binaries to PATH
ENV PATH="/usr/pgsql-17/bin:${PATH}"
ENV PG_CONFIG="/usr/pgsql-17/bin/pg_config"
# Clone and build pg_semantic_cache
RUN git clone https://github.com/pgedge/pg_semantic_cache.git /tmp/pg_semantic_cache && \
cd /tmp/pg_semantic_cache && \
make PG_CONFIG=/usr/pgsql-17/bin/pg_config && \
make PG_CONFIG=/usr/pgsql-17/bin/pg_config install && \
rm -rf /tmp/pg_semantic_cache
# Initialize the database as postgres user
USER postgres
RUN /usr/pgsql-17/bin/initdb -D /var/lib/pgsql/17/data
# Configure PostgreSQL to accept local connections
RUN echo "host all all 0.0.0.0/0 trust" >> /var/lib/pgsql/17/data/pg_hba.conf && \
echo "listen_addresses = '*'" >> /var/lib/pgsql/17/data/postgresql.conf
EXPOSE 5432
CMD ["/usr/pgsql-17/bin/postgres", "-D", "/var/lib/pgsql/17/data"]
Build and Run
Use the following commands to build and run the container:
docker build -t pg-semantic-cache .
docker run -d --name semcache -p 5432:5432 pg-semantic-cache
After waiting a few seconds for pgEdge Enterprise Postgres 17 to start, use the following command to connect to the Postgres server:
docker exec -it semcache psql -U postgres
Enable the Extensions
Next, we'll use the psql command line to create a database and the extensions that we'll be using:
CREATE DATABASE semcache_demo;
\c semcache_demo
-- pgvector is bundled with pgEdge Enterprise
CREATE EXTENSION IF NOT EXISTS vector;
-- Enable semantic cache
CREATE EXTENSION IF NOT EXISTS pg_semantic_cache;
After creating the extensions, query the semantic_cache.cache_stats() function to verify that the cache is empty:
-- Verify
SELECT * FROM semantic_cache.cache_stats();
You should see:
 total_entries | total_hits | total_misses | hit_rate_percent
---------------+------------+--------------+------------------
             0 |          0 |            0 |                0
The cache is ready. Now let's put something in it.
How It Works: A 60-Second Overview
pg_semantic_cache stores query results alongside their vector embeddings in a Postgres table. When a new query comes in, the extension uses pgvector's cosine distance operator (<=>) to find the closest match in the cache. If the similarity exceeds your threshold (default 0.95, meaning 95% similar), it returns the cached result. If not, it's a miss: your application computes the result the normal way and stores it for next time.
The key insight: vector embeddings capture semantic meaning. Two sentences that mean the same thing produce vectors that are geometrically close together, regardless of the exact words used. This is a property learned by modern embedding models from billions of text examples. No hand-crafted rules, no synonym dictionaries needed.
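To make that concrete, here's a small standalone sketch in plain Python (no database needed) that computes the same cosine similarity measure pgvector's <=> operator is built on. The toy 8-dimensional vectors mirror the demo embeddings used later in this post:

```python
import math

def cosine_similarity(a, b):
    # similarity = 1 - cosine_distance; pgvector's <=> operator returns the distance
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Paraphrased questions produce geometrically close vectors; unrelated ones don't
q1 = [0.12, 0.85, 0.44, 0.31, 0.67, 0.22, 0.91, 0.15]  # "How does PostgreSQL handle transactions?"
q2 = [0.13, 0.83, 0.46, 0.29, 0.65, 0.24, 0.89, 0.17]  # "Explain ACID and transactions in Postgres"
q3 = [0.95, 0.05, 0.10, 0.02, 0.88, 0.03, 0.50, 0.40]  # unrelated topic

print(round(cosine_similarity(q1, q2), 4))  # close to 1.0 -> hit at a 0.95 threshold
print(round(cosine_similarity(q1, q3), 4))  # much lower -> miss
```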
Similarity Score = 1 - cosine_distance(embedding_a, embedding_b)
0.98+ -> Near-identical meaning ("reset password" vs "password reset")
0.93-0.97 -> Same intent, different phrasing ("Q4 revenue" vs "last quarter sales")
0.85-0.92 -> Related topic, possibly different intent
< 0.85 -> Different topic entirely
Working Example: Caching AI Responses
Let's simulate a real scenario. In production, your application generates embeddings using a model like OpenAI's text-embedding-3-small or a local model via Ollama. For this tutorial, we'll use small 8-dimensional vectors to keep things readable. The same principles apply whether your vectors have 8 dimensions or 3,072.
Step 1: Configure for Small Vectors
-- Set dimension to 8 for our demo
SELECT semantic_cache.set_vector_dimension(8);
-- Apply the dimension change (clears cache, alters column, rebuilds index)
SELECT semantic_cache.rebuild_index();
Step 2: Cache Some Query Results
Imagine three users asked questions about PostgreSQL, and your LLM generated answers. We'll store those results, as well as the vector string associated with the results:
-- First user asks about PostgreSQL transactions
SELECT semantic_cache.cache_query(
'How does PostgreSQL handle transactions?',
'[0.12, 0.85, 0.44, 0.31, 0.67, 0.22, 0.91, 0.15]',
'{"answer": "PostgreSQL implements MVCC (Multi-Version Concurrency
Control) for transaction management. Each transaction sees a snapshot
of the database, providing isolation without blocking readers."}'::jsonb,
7200,
ARRAY['postgres', 'concepts']
);
-- Second user asks about indexing
SELECT semantic_cache.cache_query(
'What types of indexes does PostgreSQL support?',
'[0.33, 0.18, 0.72, 0.55, 0.11, 0.88, 0.29, 0.64]',
'{"answer": "PostgreSQL supports B-tree (default), Hash, GiST,
SP-GiST, GIN, and BRIN indexes. Each type is optimized for different
query patterns and data types."}'::jsonb,
7200,
ARRAY['postgres', 'indexing']
);
-- Third user asks about replication
SELECT semantic_cache.cache_query(
'How do I set up replication in PostgreSQL?',
'[0.78, 0.42, 0.15, 0.93, 0.37, 0.56, 0.08, 0.71]',
'{"answer": "PostgreSQL supports streaming replication (physical) and
logical replication. For HA, configure a primary with one or more
standbys using pg_basebackup and streaming replication slots."}'::jsonb,
7200,
ARRAY['postgres', 'replication']
);
When we check what's in the cache, it lists the three entries:
SELECT * FROM semantic_cache.cache_stats();
 total_entries | total_hits | total_misses | hit_rate_percent
---------------+------------+--------------+------------------
             3 |          0 |            0 |                0
Three entries, no lookups yet.
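In an application, those cache_query calls and the lookup in the next step are wrapped in a single cache-or-compute function. Here's a sketch of that pattern: embed and call_llm are placeholders for your embedding model and LLM client, conn is a DB-API connection (e.g. psycopg) to the semcache_demo database, and the SQL assumes the function signatures shown above:

```python
import json

def to_vector_literal(embedding):
    # pgvector accepts vectors as '[x1, x2, ...]' text literals
    return "[" + ", ".join(str(x) for x in embedding) + "]"

def answer(conn, question, embed, call_llm, threshold=0.95, ttl=7200):
    """Return a cached answer on a semantic hit; otherwise call the LLM and cache it."""
    vec = to_vector_literal(embed(question))
    cur = conn.cursor()
    cur.execute(
        "SELECT found, result_data FROM semantic_cache.get_cached_result(%s::vector, %s)",
        (vec, threshold),
    )
    row = cur.fetchone()
    if row and row[0]:                         # semantic hit: skip the LLM round trip
        return row[1]
    result = {"answer": call_llm(question)}    # miss: pay for one LLM call
    cur.execute(
        "SELECT semantic_cache.cache_query(%s, %s::vector, %s::jsonb, %s, %s)",
        (question, vec, json.dumps(result), ttl, ["postgres"]),
    )
    return result
```

On a hit the function never touches the LLM; on a miss it pays for one call and stores the result, so the next paraphrase of the same question is served from the cache.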
Step 3: Query with a Semantically Similar Embedding
Now a fourth user comes along and asks: "Explain ACID and transactions in Postgres." This is a different question from the first one, but semantically very close; your embedding model should produce a vector near the one stored when the earlier user asked about Postgres transactions.
-- This embedding is close to our "transactions" entry
SELECT * FROM semantic_cache.get_cached_result(
'[0.13, 0.83, 0.46, 0.29, 0.65, 0.24, 0.89, 0.17]',
0.95
);
 found | result_data                                  | similarity_score | age_seconds
-------+----------------------------------------------+------------------+-------------
 t     | {"answer": "PostgreSQL implements MVCC ..."} | 0.9991           | 42
This is a cache hit. The similarity score is 0.9991, meaning the two embeddings are nearly identical. The user gets the cached answer in a couple of milliseconds instead of waiting for another LLM round trip.
Step 4: Query for Something Different
What happens when we look up an embedding that doesn't match anything in the cache?
-- An embedding about a completely different topic
SELECT * FROM semantic_cache.get_cached_result(
'[0.95, 0.05, 0.10, 0.02, 0.88, 0.03, 0.50, 0.40]',
0.95
);
 found | result_data | similarity_score | age_seconds
-------+-------------+------------------+-------------
 f     |             | 0.6821           |
This is a cache miss. The closest match is only 68% similar, well below our 95% threshold. The extension still reports that closest similarity score, which is useful for tuning your threshold. Your application would now compute the answer the usual way and store it with cache_query(), so the next user who asks something similar gets a hit.
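That closest-match score is worth logging. If you record it both for queries you later judge to be true paraphrases and for genuinely distinct questions, you can pick a threshold that separates the two populations. A minimal sketch of that bookkeeping; the score values are made up for illustration, and the midpoint rule is just one reasonable heuristic:

```python
def suggest_threshold(duplicate_scores, distinct_scores):
    # Pick a cut-off above every observed distinct-topic score but below
    # typical paraphrase scores, so true misses never get stale answers.
    floor = max(distinct_scores)      # highest score seen between different questions
    ceiling = min(duplicate_scores)   # lowest score seen between true paraphrases
    if floor >= ceiling:
        return None                   # distributions overlap: no safe single threshold
    return round((floor + ceiling) / 2, 3)

# Scores logged from get_cached_result() during a pilot run (illustrative)
dups = [0.97, 0.993, 0.96, 0.999]
distinct = [0.68, 0.82, 0.88]
print(suggest_threshold(dups, distinct))  # midpoint between 0.88 and 0.96 -> 0.92
```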
Step 5: Check the Stats
SELECT * FROM semantic_cache.cache_stats();
 total_entries | total_hits | total_misses | hit_rate_percent
---------------+------------+--------------+------------------
             3 |          1 |            1 |               50
One hit, one miss, a 50% hit rate. In production with real user traffic, semantic caches typically achieve 60-80% hit rates, compared to the 15-25% typical of exact-match caching. That's a 3-4x improvement in cache effectiveness, which translates directly into proportional savings on LLM API costs.
As the cache fills with real traffic, the hit rate climbs and cache efficiency improves further.
The Real-World Impact of Semantic Caching
Rather than projecting specific dollar amounts (which vary wildly based on your model, volume, and pricing tier), here's how to think about the impact:
Without semantic caching, your traditional cache catches only queries that are character-for-character identical. That's typically 15-25% of traffic. The remaining 75-85% of requests hit your expensive backend every time.
With semantic caching, you catch the semantic duplicates too. Hit rates jump to 60-80%, meaning only 20-40% of traffic reaches your backend.
The backend call reduction is the key number. If 65-75% fewer requests reach your LLM provider, your API spend drops by the same percentage. And every cache hit returns in 2-3ms instead of the 500ms-2s typical of an LLM call, so users get dramatically faster responses.
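To see how hit rate flows through to spend and latency, here's a back-of-the-envelope calculation. The per-call cost and latency figures are illustrative assumptions, not benchmarks; plug in your own numbers:

```python
def savings(requests, hit_rate, llm_cost=0.02, llm_ms=1200, cache_ms=3):
    # hit_rate: fraction of requests served from the semantic cache
    hits = requests * hit_rate
    misses = requests - hits
    spend = misses * llm_cost                            # only misses reach the LLM provider
    avg_ms = (hits * cache_ms + misses * llm_ms) / requests
    return round(spend, 2), round(avg_ms, 1)

# 100k requests/month: exact-match cache (20% hits) vs semantic cache (70% hits)
print(savings(100_000, 0.20))  # -> (1600.0, 960.6)  monthly spend ($), mean latency (ms)
print(savings(100_000, 0.70))  # -> (600.0, 362.1)
```

Under these assumptions, moving from a 20% to a 70% hit rate cuts LLM spend by almost two thirds and mean response time by more than half.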
For RAG pipelines, the impact compounds further. Each query served from the cache skips the most expensive parts of the pipeline: vector retrieval from the knowledge base and LLM completion. Embedding generation still occurs (you need the embedding to query the cache), but at a fraction of the cost: typically around $0.0001 per embedding versus $0.01-0.03 for an LLM call. The LLM call dominates, so avoiding it on cache hits still delivers significant savings.
What's Next
We've seen how semantic caching works and why it matters. But a cache is only as good as its management. In the next post, we'll go deeper into the operational side: organizing entries with tags, invalidation strategies, eviction policies, cost tracking, vector dimension tuning, HNSW indexes for large-scale caches, and a complete Python integration example. Everything you need to take semantic caching from a demo to production.
pg_semantic_cache is open source under the PostgreSQL license.
Repository: github.com/pgedge/pg_semantic_cache


