Introducing pg-healthcheck: PostgreSQL Health Diagnostics

After more than 20 years working with PostgreSQL, I keep seeing the same problems surface at the worst possible times: bloat that sneaks up on you, replication slots quietly holding back WAL, transaction ID wraparound that nobody caught in time, and backups that silently stopped working weeks ago. There are also data and catalog corruption issues, such as TOAST table corruption or a mismatch between heap state and visibility map (VM) state that causes problems for vacuum operations. What I always wanted was a single tool I could point at any PostgreSQL instance and get a clear, actionable picture of its health. So I built one.

pg-healthcheck is an open source utility written in Go that runs 180+ checks across 14 groups, querying live PostgreSQL system catalog views directly: no estimates, no simulated data. It works against single PostgreSQL instances as well as pgEdge multi-node Spock clusters, and gives you either coloured terminal output or structured JSON you can feed into a monitoring pipeline.

Getting Started

Pre-built binaries for Linux (amd64), macOS, and Windows are on the releases page. Linux ARM64 users should build from source - see below.

Linux amd64:

curl -L https://github.com/pgEdge/pg-healthcheck/releases/latest/download/pg-healthcheck_linux_amd64.tar.gz | tar xz

Linux ARM64 (build from source):

git clone https://github.com/pgEdge/pg-healthcheck.git
cd pg-healthcheck
go build -o pg-healthcheck ./cmd/...
chmod +x pg-healthcheck

Run a full health check:

./pg-healthcheck --host localhost --port 5432 --dbname mydb --user postgres

If your PostgreSQL user requires a password, prefix commands with PGPASSWORD=yourpassword or set it in your environment. pg-healthcheck uses the standard PostgreSQL connection environment variables (PGPASSWORD, PGHOST, PGPORT, PGUSER, PGDATABASE), so any of the usual approaches work.

On first run you immediately get a colour-coded report across all 14 check groups. Each finding shows severity (OK, INFO, WARN, or CRITICAL), what was observed, what is recommended, and a link to the relevant PostgreSQL documentation. Exit codes follow the convention used by Nagios-style monitoring plugins: 0 for all clear, 1 for warnings, 2 for critical findings, which makes it straightforward to use in scripts or CI pipelines.
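
As a rough illustration of acting on those exit codes from your own tooling, here is a minimal Go sketch. It assumes the binary sits at ./pg-healthcheck and that connection settings come from the PG* environment variables; the helper names (severityFromExit, runHealthcheck) are hypothetical and not part of the tool.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// severityFromExit maps the documented exit codes to a label.
// Treating any other code as "unknown" is an assumption of this sketch.
func severityFromExit(code int) string {
	switch code {
	case 0:
		return "ok"
	case 1:
		return "warning"
	case 2:
		return "critical"
	default:
		return "unknown"
	}
}

// runHealthcheck runs the binary at path and returns its exit code.
// A non-nil error means the process could not be started at all.
func runHealthcheck(path string, args ...string) (int, error) {
	err := exec.Command(path, args...).Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err // binary missing, not executable, etc.
}

func main() {
	code, err := runHealthcheck("./pg-healthcheck", "--groups", "G05,G09,G14")
	if err != nil {
		fmt.Println("could not run pg-healthcheck:", err)
		return
	}
	fmt.Println("health:", severityFromExit(code))
}
```

A CI job could fail the build on "critical" and only post a notification on "warning".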

What Gets Checked

The 14 check groups cover everything I would look at during a health review or a production incident. Rather than walking through all of them, here are the ones I find most valuable in practice.

Vacuum & Bloat (G05)

Dead tuple accumulation, table and index bloat, and TXID wraparound proximity. Wraparound is one of those things that can take down a production database if you miss it, and it’s easy to miss on a busy system. This group surfaces it early.
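
To make "wraparound proximity" concrete, here is a small Go sketch of the arithmetic. It assumes the XID age has already been read from the server (for example with SELECT age(datfrozenxid) FROM pg_database); the 50% and 75% thresholds are illustrative assumptions, not the tool's actual defaults.

```go
package main

import "fmt"

// xidLimit is 2^31. PostgreSQL transaction IDs are 32-bit and compared
// modulo 2^32, so only about two billion XIDs can lie "in the past".
const xidLimit = 1 << 31

// wraparoundPercent returns how far an XID age is toward the limit.
func wraparoundPercent(xidAge int64) float64 {
	return float64(xidAge) / float64(xidLimit) * 100
}

// classify applies hypothetical thresholds (assumptions of this sketch).
func classify(xidAge int64) string {
	pct := wraparoundPercent(xidAge)
	switch {
	case pct >= 75:
		return "CRITICAL"
	case pct >= 50:
		return "WARN"
	default:
		return "OK"
	}
}

func main() {
	for _, age := range []int64{200_000_000, 1_200_000_000, 1_800_000_000} {
		fmt.Printf("age=%d -> %.1f%% of limit, %s\n",
			age, wraparoundPercent(age), classify(age))
	}
}
```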

WAL & Replication Slots (G09)

Slot lag, inactive and invalidated slots, logical replication worker status, and subscription sync state. A stuck replication slot holding back WAL is one of the most common causes of disk exhaustion I see; this catches it before it becomes a problem.

WAL Growth & Generation Rate (G14)

Rather than just checking the current pg_wal directory size, G14 maintains a rolling baseline and fires an alert when the current rate is more than 3x the historical average. A static threshold alone gives you too many false positives on busy systems; the spike detection approach is much more useful in practice.
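
The spike detection idea can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not the tool's implementation: the rateBaseline type, the fixed-size window, and the MB/min unit are all hypothetical; only the 3x factor comes from the description above.

```go
package main

import "fmt"

// rateBaseline keeps the most recent WAL generation rates (e.g. MB/min).
type rateBaseline struct {
	window  int       // maximum number of samples to keep
	samples []float64 // oldest first
}

// add records a new sample and drops the oldest one if the window is full.
func (b *rateBaseline) add(rate float64) {
	b.samples = append(b.samples, rate)
	if len(b.samples) > b.window {
		b.samples = b.samples[1:]
	}
}

// average returns the mean of the stored samples, or 0 with no history.
func (b *rateBaseline) average() float64 {
	if len(b.samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range b.samples {
		sum += s
	}
	return sum / float64(len(b.samples))
}

// isSpike reports whether current exceeds factor times the baseline.
// With no usable history it never reports a spike.
func (b *rateBaseline) isSpike(current, factor float64) bool {
	avg := b.average()
	return avg > 0 && current > factor*avg
}

func main() {
	b := &rateBaseline{window: 5}
	for _, r := range []float64{10, 12, 11, 9, 13} {
		b.add(r)
	}
	fmt.Println(b.isSpike(20, 3)) // false: baseline is 11, threshold 33
	fmt.Println(b.isSpike(40, 3)) // true
}
```

Comparing against a moving average rather than a fixed number is what lets a busy system with a steady, high WAL rate stay quiet while a sudden burst still trips the check.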

TOAST & Data Integrity (G07)

TOAST access patterns and amcheck B-tree structural verification. Corruption here is rare, but when it happens it’s serious, and it’s not something most monitoring tools check for.

Visibility Map (G08)

This one I’m particularly happy with. G08-006 uses pg_check_visible() and pg_check_frozen() from the pg_visibility extension to detect VM/heap mismatches: pages marked ALL_VISIBLE in the VM that contain tuples not visible to every transaction, and pages marked ALL_FROZEN that still contain unfrozen tuples. This is the kind of corruption that VACUUM cannot detect, and it can persist silently through major version upgrades. It’s exactly the sort of thing a health check tool should surface.

Composite Alerts

The tool produces composite alerts when multiple related findings are simultaneously critical. If G02 archiving and G14 WAL growth are both CRITICAL at the same time, for example, it prints a combined banner explaining why the two together are especially dangerous. That kind of context is something you normally only get from an experienced DBA who knows how the pieces fit together.
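
One way to picture the composite logic is a small rule table evaluated after all groups have run. The Go sketch below is an assumption about the general shape, not the tool's code: the compositeRule type and the banner wording are hypothetical, and only the G02 + G14 pairing comes from the example above.

```go
package main

import "fmt"

// compositeRule fires when every listed group is CRITICAL at once.
type compositeRule struct {
	groups []string
	banner string
}

// compositeAlerts returns the banners of all rules whose groups are
// simultaneously CRITICAL in the given per-group severity map.
func compositeAlerts(severity map[string]string, rules []compositeRule) []string {
	var out []string
	for _, r := range rules {
		all := true
		for _, g := range r.groups {
			if severity[g] != "CRITICAL" {
				all = false
				break
			}
		}
		if all {
			out = append(out, r.banner)
		}
	}
	return out
}

func main() {
	rules := []compositeRule{
		// Hypothetical banner text for the G02 + G14 example.
		{[]string{"G02", "G14"}, "G02 + G14: archiving and WAL growth are both CRITICAL"},
	}
	severity := map[string]string{"G02": "CRITICAL", "G14": "CRITICAL", "G05": "WARN"}
	for _, banner := range compositeAlerts(severity, rules) {
		fmt.Println(banner)
	}
}
```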

pgEdge / Spock Cluster (G12)

For pgEdge environments, G12 checks Spock exception logs, conflict counters and DCA counters from spock.channel_summary_stats, and replication LSN lag in MB between each node pair from spock.progress. All Spock catalog queries are verified against the live pgEdge Spock schema, and checks that reference tables or columns not present on the installed version skip gracefully with an INFO message.

Targeting Specific Groups

You can run all groups or target specific ones - useful when you want to focus on a particular area without running the full suite:

# Vacuum, WAL slots, and WAL growth only
./pg-healthcheck --groups G05,G09,G14
# Upgrade readiness check against PostgreSQL 17
./pg-healthcheck --groups G10 --target-version 17
# Full check against a pgEdge cluster
./pg-healthcheck --mode cluster --nodes node1:5432,node2:5432,node3:5432 --dbname mydb

JSON output makes it easy to pipe results into dashboards or alerting systems:

./pg-healthcheck --output json | jq '.summary'

The ask Subcommand

This is the part I’ve been most excited about. The ask subcommand - built by Kazim Hussain - lets you describe what you want to check in plain English. The tool maps it to the right check groups and runs them:

pg-healthcheck ask "check for dead tuples and bloat" --host db1 --dbname mydb
pg-healthcheck ask "how is WAL disk usage?" --output json
pg-healthcheck ask "are there any lock contention issues?"
pg-healthcheck ask "how safe are we from transaction ID wraparound?"
pg-healthcheck ask "check WAL health and replication lag"

Under the hood, ask maps the query to one or more check groups using an LLM. Three providers are supported: Ollama (default, runs locally, no API key needed, works in air-gapped environments), OpenAI, and Google Gemini.

Ollama is what I use myself:

ollama pull llama3.2
pg-healthcheck ask "check index health" --host db1 --dbname mydb

If the LLM provider is unreachable or no API key is found, ask falls back to built-in keyword matching automatically, with no error and no crash. The status line tells you which path was used. Note that query-to-group mapping can differ slightly between the LLM and keyword fallback paths; the LLM tends to resolve multi-concept queries more broadly.

Multi-group queries work naturally too:

pg-healthcheck ask "check corruption, bloat, and index issues"
# → G05 (Vacuum & Bloat), G06 (Indexes), G07 (TOAST & Data Integrity)
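
For a sense of how a keyword fallback like the one described above might behave, here is a minimal Go sketch. The keyword table is entirely hypothetical (the tool's real mapping is not shown in this post); only the group IDs and their topics are taken from the descriptions earlier in the article.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// keywordGroups is an illustrative keyword-to-group table.
// The actual keywords used by pg-healthcheck may differ.
var keywordGroups = map[string][]string{
	"bloat":       {"G05"},
	"dead tuple":  {"G05"},
	"wraparound":  {"G05"},
	"index":       {"G06"},
	"corruption":  {"G07"},
	"toast":       {"G07"},
	"replication": {"G09"},
	"wal":         {"G09", "G14"},
}

// matchGroups returns the sorted, de-duplicated group IDs whose
// keywords appear in the query (case-insensitive substring match).
func matchGroups(query string) []string {
	q := strings.ToLower(query)
	seen := map[string]bool{}
	for kw, groups := range keywordGroups {
		if strings.Contains(q, kw) {
			for _, g := range groups {
				seen[g] = true
			}
		}
	}
	out := make([]string, 0, len(seen))
	for g := range seen {
		out = append(out, g)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(matchGroups("check corruption, bloat, and index issues"))
	// prints [G05 G06 G07]
}
```

Plain substring matching is deliberately naive here; it is the kind of approach that explains why an LLM can resolve multi-concept queries more broadly than a keyword path.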

Configuration

All thresholds have sensible defaults but are tunable through a healthcheck.yaml file. The tool loads it from the current directory automatically, and you can maintain one per environment:

./pg-healthcheck --config prod.yaml

CLI flags always win over YAML, which wins over built-in defaults. Simple and predictable.
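
That precedence order is easy to express in code. The Go sketch below is illustrative only: the resolve helper and the idea of using nil pointers for "not set" are assumptions, and the default of 3 simply echoes the 3x WAL spike factor mentioned earlier.

```go
package main

import "fmt"

// resolve returns the first value that was explicitly set, checking the
// CLI flag first, then the YAML file, then the built-in default.
// A nil pointer means "not set at this level".
func resolve(cli, yaml *float64, def float64) float64 {
	if cli != nil {
		return *cli
	}
	if yaml != nil {
		return *yaml
	}
	return def
}

func main() {
	yamlVal := 4.0 // value from a config file such as prod.yaml
	cliVal := 5.0  // value passed on the command line

	fmt.Println(resolve(nil, nil, 3))           // 3: built-in default
	fmt.Println(resolve(nil, &yamlVal, 3))      // 4: YAML overrides default
	fmt.Println(resolve(&cliVal, &yamlVal, 3))  // 5: CLI overrides YAML
}
```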

Get It

pg-healthcheck is open source under the PostgreSQL License - permissive, compatible with commercial use, and embeddable in your own tooling - and is available at github.com/pgEdge/pg-healthcheck. Pre-built binaries for Linux amd64, macOS, and Windows are on the releases page; Linux ARM64 users should build from source as described above. There’s also a full check reference PDF in the repository if you want a printable summary of all 180+ checks.

If you run it against your PostgreSQL environment and have feedback, or if there are checks you think are missing, I’d love to hear from you - feel free to open an issue or pull request on GitHub.

Ahsan Hadi is a Director of Customer Success at pgEdge with 20+ years of database experience, 15+ of those with PostgreSQL.