Lessons Learned Writing an MCP Server for PostgreSQL
Over the past few months, we've been building the pgEdge Postgres MCP Server, an open source tool that lets LLMs talk directly to PostgreSQL databases through the Model Context Protocol. It supports Claude, GPT, local models via Ollama, and pretty much any MCP-compatible client you can throw at it. Along the way, we've learned quite a lot about what it takes to make AI and databases work well together, and the single biggest lesson has been about tokens.
If you've used an LLM for any length of time, you'll know that context windows are finite and tokens cost money. When you're working with a database, however, the problem becomes acute in a way that catching up on email or writing prose simply doesn't prepare you for. A single SELECT * on a modest table can return tens of thousands of rows, each with a dozen columns, and every character of that output consumes tokens. Multiply that across a conversation where the LLM is exploring a schema, running queries, and refining its understanding, and you can burn through a context window before anything genuinely useful has been accomplished.
This post covers the strategies we developed to keep token usage under control whilst still giving the LLM enough information to be helpful. If you're building an MCP server, or just curious about the practicalities of connecting LLMs to structured data, I hope some of these lessons will save you a few wrong turns.
Choosing the right output format for tabular data
When we first built the query_database tool, we returned results as JSON. It seemed like the obvious choice since every LLM understands JSON and it's what most APIs speak natively. The problem became apparent almost immediately: JSON is extraordinarily wasteful for tabular data.
Consider a simple query returning employee records. In JSON, every single row repeats the column names as keys, wraps every value in quotes, and adds colons, commas, and braces as structural overhead. For a table with columns id, name, email, and created_at, a ten-row result might look something like this:
[
{"id": 1, "name": "Alice Smith", "email": "[email protected]",
"created_at": "2025-03-15T10:30:00Z"},
{"id": 2, "name": "Bob Jones", "email": "[email protected]",
"created_at": "2025-03-16T14:22:00Z"}
]Every row carries the full weight of those repeated keys and the surrounding punctuation. At ten rows that's merely annoying, but at a hundred rows it adds up to a significant number of wasted tokens.
We considered CSV as an alternative. It eliminates the repeated keys by using a header row, which is a substantial improvement, but it introduces its own overhead. Values containing commas need quoting, which means you end up with quotes around many string values, and any value that itself contains a quote needs escaping with doubled quotes. For database output, which frequently contains commas in text fields and sometimes even embedded quotes, CSV can get messy quickly.
We settled on TSV (tab-separated values), and it turned out to be a surprisingly good fit. Tabs almost never appear in database values, so quoting is rarely needed. The format is dead simple: a header row with column names separated by tabs, followed by data rows in the same format. The result is compact, unambiguous, and easy for both humans and LLMs to parse:
id	name	email	created_at
1	Alice Smith	alice@example.com	2025-03-15T10:30:00Z
2	Bob Jones	bob@example.com	2025-03-16T14:22:00Z

In our testing, TSV typically uses 30 to 40 percent fewer tokens than the equivalent JSON representation. For large result sets, that saving is the difference between fitting the data into the context window and blowing right past the limit. The rare edge cases where a value does contain a tab or newline are handled by simple escape sequences (\t and \n), which the LLM has no trouble understanding.
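To make the format concrete, here's a rough sketch of a TSV serialiser in Python. The function names and the NULL-as-empty-cell convention are my own illustration rather than the server's actual code; the escaping matches the scheme described above.

def escape_tsv_value(value):
    # NULLs become empty cells; the rare embedded tab or newline becomes
    # a visible escape sequence rather than breaking the row layout.
    if value is None:
        return ""
    return str(value).replace("\t", "\\t").replace("\n", "\\n")

def rows_to_tsv(columns, rows):
    # Header row of column names, then one tab-separated line per data row.
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(escape_tsv_value(v) for v in row))
    return "\n".join(lines)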
One thing worth noting is that LLMs are perfectly capable of reading TSV without any special prompting. We had initially worried that the format might confuse models that are more accustomed to JSON, but in practice every model we tested, from Claude to GPT to local Ollama models, parsed TSV correctly without any additional guidance.
Pagination and filtering: don't send what you don't need
Even with an efficient output format, sending a thousand rows to an LLM is rarely a good idea. Most of those rows won't be relevant to the question being asked, and the LLM will struggle to extract the signal from all that noise. The solution is to avoid sending unneeded data in the first place.
Our query_database tool defaults to returning 100 rows, with a configurable limit that can go up to 1,000. We implement this by injecting a LIMIT clause into SELECT queries that don't already have one, and we fetch one extra row beyond the limit so we can tell the LLM whether more data exists. The response includes a helpful nudge like "100 rows shown, more available - use offset=100 for next page or count_rows for total", which gives the LLM the information it needs to request additional pages if it genuinely needs them.
The offset parameter enables proper pagination, so the LLM can work through a large result set in manageable chunks rather than trying to swallow it whole. In practice, we find that LLMs rarely need more than the first page of results. They tend to refine their queries with better WHERE clauses rather than paging through thousands of rows, which is exactly the behaviour you want.
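Here's roughly what the limit-plus-one pattern looks like in Python, assuming a psycopg-style cursor and a plain SELECT that doesn't already carry its own LIMIT; the function name is illustrative rather than the server's actual code.

def run_select(cur, sql, limit=100, offset=0):
    # Fetch one extra row so we can tell the LLM whether more data exists.
    cur.execute(f"{sql.rstrip(';')} LIMIT %s OFFSET %s", (limit + 1, offset))
    rows = cur.fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    note = ""
    if has_more:
        note = (f"{limit} rows shown, more available - "
                f"use offset={offset + limit} for next page or count_rows for total")
    return rows, note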
For schema exploration, the savings are even more dramatic. Our get_schema_info tool supports several filtering parameters that let the LLM ask for exactly what it needs. Passing a schema_name can reduce the output by 90 percent compared to dumping the entire database structure. Adding a table_name narrows it further to just the columns of a single table, which can cut the output by 95 percent. There's also a compact mode that returns only table and column names without types, constraints, or other details, and a vector_tables_only filter for when the LLM is specifically interested in pgvector-enabled tables.
When a database has more than ten tables and the LLM hasn't applied any filters, the tool automatically switches to a summary mode. It returns a compact overview showing the first few tables per schema with a count of how many more exist, rather than dumping the full details of every table. This nudges the LLM to narrow its focus before requesting the details it actually needs.
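The sketch below shows roughly how those filters and the summary mode fit together. The catalogue structure (a nested dict of schemas, tables, and column descriptions) and the choice of five table names per schema in summary mode are assumptions for illustration.

def get_schema_info(catalog, schema_name=None, table_name=None, compact=False):
    # 'catalog' is assumed to look like {schema: {table: [column dicts]}}.
    schemas = {schema_name: catalog[schema_name]} if schema_name else dict(catalog)
    if table_name:
        schemas = {s: {t: cols for t, cols in tables.items() if t == table_name}
                   for s, tables in schemas.items()}
    total_tables = sum(len(tables) for tables in schemas.values())
    if total_tables > 10 and not (schema_name or table_name):
        # Summary mode: a few table names per schema plus a count of the rest.
        summary = {}
        for s, tables in schemas.items():
            names = sorted(tables)
            extra = len(names) - 5
            summary[s] = names[:5] + ([f"... and {extra} more"] if extra > 0 else [])
        return summary
    if compact:
        # Compact mode: table and column names only, no types or constraints.
        return {s: {t: [c["name"] for c in cols] for t, cols in tables.items()}
                for s, tables in schemas.items()}
    return schemas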
We also built a dedicated count_rows tool that returns nothing but a single integer (plus a tiny amount of metadata). Before querying a large table, the LLM can check how many rows it contains and plan an appropriate LIMIT accordingly. It's a trivially simple tool, but it prevents the single most common source of token waste: an exploratory query on a table that turns out to have a million rows.
Progressive disclosure for search results
Our similarity search tool takes the filtering concept a step further with three distinct output formats. The full format returns complete text chunks with metadata, which typically runs to around a thousand tokens. The summary format returns only titles and short snippets, compressing the output to roughly fifty tokens. And the ids_only format returns just row identifiers and distance scores, weighing in at around ten tokens.
This progressive disclosure pattern lets the LLM start with a lightweight scan, identify the results that look promising, and then request full details only for those specific items. In a typical session, the LLM might run a summary search first, decide that results 2 and 5 look relevant, and then fetch only those two in full. The token savings compared to always returning full results are substantial, often in the range of 90 to 99 percent.
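In code, the three formats amount to little more than a switch over how much of each result to serialise. This sketch assumes each result is a dict with id, distance, title, and text fields, which is an illustrative shape rather than the server's actual one.

def format_results(results, output_format="summary"):
    if output_format == "ids_only":
        # Roughly ten tokens: identifiers and distance scores only.
        return [(r["id"], round(r["distance"], 4)) for r in results]
    if output_format == "summary":
        # Roughly fifty tokens: titles and short snippets.
        return [{"id": r["id"], "title": r["title"], "snippet": r["text"][:120]}
                for r in results]
    # "full": complete text chunks with their metadata.
    return results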
Conversation compaction: keeping a long conversation useful
Database work tends to involve long, exploratory conversations. The user asks the LLM to look at the schema, run a few queries, adjust the approach based on what the data reveals, and iterate until they get the answer they need. These conversations accumulate context rapidly, and without intervention the context window fills up with stale query results and superseded schema information that the LLM no longer needs.
We built a conversation compaction system that addresses this problem by classifying each message according to its long-term value. The classifier assigns every message to one of five categories:
Anchor messages contain schema information or other structural context that should almost always be preserved.
Important messages include substantial query results and analysis.
Contextual messages provide useful background that can be summarised if space is tight.
Routine messages are ordinary conversational turns.
Transient messages are short acknowledgements and similar low-value content that can be dropped without loss.
When the conversation exceeds the token budget (which defaults to 100,000 tokens), the compactor kicks in. It always preserves a window of recent messages so the LLM doesn't lose track of the current thread, and it keeps all anchor messages regardless of age. Everything else is evaluated by importance, with lower-value messages being dropped or summarised first. Tool call and result pairs are always kept together; separating them causes API errors since the LLM expects to see both halves of every tool interaction.
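Stripped of the summarisation step and the tool-pairing bookkeeping, the core of the compactor looks something like the sketch below. The ten-message recent window and the category field name are assumptions for illustration.

DROP_ORDER = ["transient", "routine", "contextual", "important"]  # anchors are never dropped

def compact(messages, estimate_tokens, budget=100_000, recent_window=10):
    # Keep all anchors and the most recent messages; drop older messages from
    # least to most important until the conversation fits the budget.
    total = sum(estimate_tokens(m) for m in messages)
    if total <= budget:
        return messages
    protected = set(range(max(0, len(messages) - recent_window), len(messages)))
    kept = dict(enumerate(messages))
    for category in DROP_ORDER:
        for i, m in enumerate(messages):
            if total <= budget:
                break
            if i in kept and i not in protected and m["category"] == category:
                del kept[i]
                total -= estimate_tokens(m)
    return [kept[i] for i in sorted(kept)]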
One subtlety worth mentioning is that we maintain different token estimation parameters for different LLM providers. Claude tokenises text at roughly 3.8 characters per token, whilst GPT uses closer to 4.0, and Ollama models vary but tend toward 4.5. SQL content tokenises less efficiently than natural language because of all the keywords and punctuation, so we apply a multiplier for SQL-heavy messages. These adjustments might seem like overkill, but when you're trying to maximise the use of a fixed context window, the precision matters.
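The estimator itself is just arithmetic over those ratios. The per-provider figures below are the ones mentioned above; the SQL multiplier value is illustrative.

CHARS_PER_TOKEN = {"claude": 3.8, "gpt": 4.0, "ollama": 4.5}
SQL_MULTIPLIER = 1.2  # illustrative value; SQL tokenises less efficiently than prose

def estimate_tokens(text, provider="claude", is_sql=False):
    # A budgeting estimate, not a billing-grade count.
    tokens = len(text) / CHARS_PER_TOKEN.get(provider, 4.0)
    if is_sql:
        tokens *= SQL_MULTIPLIER
    return int(tokens)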
The compaction results are cached using a SHA-256 hash of the message history and configuration, so repeated compaction of the same conversation state is effectively free. In practice, this means the system can check proactively whether compaction is needed without worrying about the cost of doing so.
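The cache key itself is straightforward to compute. A minimal sketch, assuming the messages and configuration are JSON-serialisable:

import hashlib
import json

def compaction_cache_key(messages, config):
    # The same message history and configuration always hash to the same key,
    # so repeated compaction checks become cache hits.
    payload = json.dumps({"messages": messages, "config": config},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()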
Rate limits: the other token problem
Token efficiency isn't just about context windows. Most LLM providers impose rate limits measured in tokens per minute, and a single large query result can consume a substantial fraction of your allowance. We found that 30,000 input tokens per minute is a common threshold, and it's surprisingly easy to hit when you're working with databases.
Rather than relying solely on server-side controls, we embed rate-limit guidance directly into our tool descriptions. The query_database tool, for instance, includes advice telling the LLM to start with limit=10 for exploratory queries and to use WHERE clauses to filter results rather than fetching everything and sifting through it. The similarity search tool recommends starting with the summary output format. This approach works because LLMs actually read their tool descriptions and (usually!) follow the guidance, so you can influence their behaviour without any hard restrictions.
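As an illustration of the idea (the wording below is paraphrased rather than the server's exact description), an MCP tool definition with the guidance baked in might look like this:

QUERY_DATABASE_TOOL = {
    "name": "query_database",
    "description": (
        "Run a read-only SQL query. Start with limit=10 for exploratory "
        "queries, and prefer WHERE clauses over fetching everything and "
        "filtering afterwards: large results count against the provider's "
        "tokens-per-minute limit."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string"},
            "limit": {"type": "integer", "default": 100, "maximum": 1000},
            "offset": {"type": "integer", "default": 0},
        },
        "required": ["sql"],
    },
}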
What we'd do differently
If I were starting this project from scratch, I'd design for token efficiency from day one rather than re-engineering it after the fact. Our initial prototype returned JSON with no pagination and no filtering, and whilst it made for impressive demos on small databases, it fell apart the moment we pointed it at anything resembling a production dataset.
I'd also invest in better observability earlier. We added token estimation logging that records the approximate token count for every tool result, and it's been invaluable for identifying wasteful patterns. Knowing that a particular tool call consumed an estimated 2,500 tokens makes it much easier to decide whether the output format needs tightening or whether a new filtering parameter would help.
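The logging itself doesn't need to be sophisticated. Something along these lines is enough to spot the expensive tool calls (the logger name and characters-per-token ratio are illustrative):

import logging

log = logging.getLogger("mcp.token_usage")

def log_result_size(tool_name, result_text, chars_per_token=3.8):
    # Record an estimated token count for every tool result so that
    # wasteful output patterns show up in the logs.
    estimated = int(len(result_text) / chars_per_token)
    log.info("tool=%s estimated_tokens=%d chars=%d",
             tool_name, estimated, len(result_text))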
Try it yourself
The pgEdge Postgres MCP Server is open source under the PostgreSQL licence. It works with Claude Desktop, Claude Code, Cursor, and any other MCP-compatible client, and it connects to pgEdge Enterprise Postgres, standard community PostgreSQL, Amazon RDS, and pretty much any Postgres variant running version 14 or newer. Full documentation is available at docs.pgedge.com.
If you're building your own MCP server for database access, I hope some of these lessons are useful. The fundamental challenge of connecting LLMs to databases isn't the protocol or the connectivity; it's managing the sheer volume of data that databases can produce, and ensuring that the tokens you spend are spent on information the LLM actually needs.



