The GNU C Library (glibc) version 2.28 entered the world on August 1st, 2018, and Postgres hasn't been the same since. Among its many changes was a massive update to locale collation data, bringing it in line with the 2016 Edition 4 release of the ISO 14651 standard and Unicode 9.0.0. This was not a subtle tweak. It was the culmination of roughly 18 years of accumulated locale modifications, all merged in a single release.

Nobody threw a party.

What followed was one of the most significant and insidious data integrity incidents in the history of Postgres. Indexes silently became corrupt, query results changed without warning, and unique constraints were no longer trustworthy. The worst part? You had to know to look for it. Postgres didn't complain. The operating system didn't complain. Everything appeared normal, right up until it wasn't.

This is the story of how a library upgrade quietly corrupted databases around the world, what the Postgres community did about it, and how to make sure it never happens to you again.

What even is a Collation?

Before we can understand what broke, we need to understand what a collation actually does. At its core, a collation defines how text is compared and sorted. That sounds simple enough, but collation rules become far more intricate once you step outside the English alphabet.

Consider the German letter ß. Does it sort the same as "ss"? Usually. What about accented characters like é and è? Should they be treated as equivalent to "e" for sorting purposes, or should they have their own distinct positions? What about the Swedish alphabet, where ä and ö come after z rather than being treated as variants of a and o?

Every language has its own answer to these questions, and a collation encodes those answers into a set of rules for a database to follow. When Postgres needs to sort a column of text, enforce a unique constraint, or build a B-tree index, it asks the collation: "Which of these two strings comes first?" The collation's answer determines everything from query results to whether an index lookup finds any data at all.

Historically, Postgres delegated that question to the operating system's C library. Postgres doesn't have its own implementation of American English sorting rules baked in, so databases created with the en_US.UTF-8 locale rely on external libraries. More specifically, Postgres simply invoked strcoll() from glibc (on Linux systems) and trusted whatever answer came back.
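
Python's locale module wraps the very same libc routine, which makes the delegation easy to see. Here's a minimal sketch, pinned to the always-available C locale so the result doesn't depend on which glibc release happens to be installed:

```python
import locale

# Pin the C locale: its behavior (plain code point comparison) is the only
# one POSIX guarantees to be stable. Under en_US.UTF-8, the answer would
# depend on the installed glibc version -- which is the whole story here.
locale.setlocale(locale.LC_COLLATE, "C")

# strcoll() returns <0, 0, or >0, just like the C function Postgres calls.
print(locale.strcoll("Zebra", "apple") < 0)   # True: 'Z' (0x5A) sorts before 'a' (0x61)
print(locale.strcoll("apple", "apple") == 0)  # True: identical strings compare equal
```

Under a linguistic locale, strcoll() would instead place "apple" before "Zebra", and its answer can change whenever the underlying locale data does.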

That trust worked fine for years. And then one day it didn't.

The Day the World Changed

So what actually changed in glibc 2.28? Consider a simple example using the en_US.UTF-8 locale. Before the update, strings containing special characters sorted like this:

a, $a, a$, A, b, $b, b$, B

After glibc 2.28, those same strings sorted like this:

$a, $b, a, A, a$, b, B, b$

That's not a minor adjustment. The relative positions of strings containing punctuation, mixed case, and special characters shifted dramatically. Another well documented example: the sort order of 'a-a' and 'a+a' simply flipped between the old and new versions. Data containing strings with hyphens, underscores, or currency symbols now experienced seemingly inconsistent sorting rules.
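
For contrast, here is how a pure code point comparison (what the C locale does) orders those same eight strings. Python also compares strings by code point, so it makes for a convenient demonstration. The result matches neither glibc ordering, but it can never change out from under you:

```python
# '$' (0x24) sorts before 'A'..'Z' (0x41..), which sort before 'a'..'z' (0x61..).
strings = ["a", "$a", "a$", "A", "b", "$b", "b$", "B"]
print(sorted(strings))
# ['$a', '$b', 'A', 'B', 'a', 'a$', 'b', 'b$']

# The famous flipped pair is equally unambiguous here: '+' (0x2B) < '-' (0x2D).
print(sorted(["a-a", "a+a"]))
# ['a+a', 'a-a']
```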

This wasn't a bug in glibc. The glibc developers were correcting years of accumulated divergence from the Unicode standard. The new sort orders were arguably more correct. But correctness is cold comfort for indexes founded upon faulty assumptions.

Anatomy of a Silent Catastrophe

Why did this cause so much trouble, though? Let's examine how Postgres B-tree indexes work with text. When Postgres builds an index on a text column, it sorts the values according to the active collation and stores them in that order. The on-disk tree structure is literally built around that sort order. Later, Postgres navigates the B-tree by comparing search terms against stored keys, following the tree left or right based on which string "comes first" according to the collation.

Imagine the library underneath changes its mind about which string comes first. The physical layout of the index reflects the old sort order, but every new comparison uses the new sort order. Postgres navigates right when it should go left, or left when it should go right, and previously valid data becomes invisible. The row is still in the table, a sequential scan still finds it, but the index lookup misses it entirely.
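
This failure mode is easy to reproduce in miniature. The sketch below is a toy simulation, not Postgres internals: old_rules and new_rules are hypothetical comparators standing in for the pre- and post-upgrade locale behavior, the sorted list stands in for the frozen on-disk B-tree, and the binary search stands in for an index scan:

```python
def build_index(rows, key):
    # "CREATE INDEX": physically lay out the entries in sorted order,
    # using whatever collation is in force at build time.
    return sorted(rows, key=key)

def index_lookup(index, needle, key):
    # "Index scan": binary-search the frozen physical order, comparing
    # with whatever collation is in force *now*.
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(index[mid]) < key(needle):
            lo = mid + 1
        else:
            hi = mid
    return lo < len(index) and index[lo] == needle

old_rules = lambda s: s                 # code point order: uppercase before lowercase
new_rules = lambda s: (s.lower(), s)    # case-folded first, like the post-upgrade locale

idx = build_index(["APPLE", "BANANA", "CHERRY", "apple", "banana", "cherry"], old_rules)

print("BANANA" in idx)                             # True: a sequential scan still sees the row
print(index_lookup(idx, "BANANA", key=old_rules))  # True: searched with the rules it was built under
print(index_lookup(idx, "BANANA", key=new_rules))  # False: same index, new rules, and the row vanishes
```

The row never moved. Only the definition of "comes first" changed, and that was enough to make the lookup walk into the wrong leaf.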

The consequences cascaded outward from there:

  • Invisible rows. Queries using index scans could silently skip existing rows. A SELECT might return 999 rows when 1,000 actually matched the predicate.

  • Phantom uniqueness violations. Unique constraints backed by B-tree indexes could fail to detect actual duplicates because the index traversal couldn't find the existing entry. Alternatively, they might wrongly reject valid entries because the traversal landed on the wrong node.

  • Wrong query results. Any query relying on ORDER BY with a text column would produce different results before and after the upgrade. Merge joins, which depend on both inputs being sorted identically, could produce silently incorrect results.

  • Replication divergence. If a primary ran one glibc version and a replica ran another, identical queries against identical data produced different results. The 2014 TripAdvisor incident demonstrated this exact scenario with streaming replicas, though few understood the full implications at the time.

All of this happened silently. Nobody knew (or could know) anything was wrong until it was already too late.

The Domino Effect

The glibc 2.28 release didn't hit every Linux distribution simultaneously. Instead, it rippled outward over the course of about a year as each distribution adopted it on its own schedule:

  • October 2018: Fedora 29 and Ubuntu 18.10 shipped with glibc 2.28.

  • March 2019: Christoph Berg, the Debian PostgreSQL maintainer, raised the alarm on the debian-glibc mailing list, describing the situation as critical. He proposed automated warnings for Debian users with PostgreSQL clusters.

  • May 2019: RHEL 8 and CentOS 8 arrived with glibc 2.28, making the leap from glibc 2.17 in RHEL 7. That's an 11-version jump in a single upgrade cycle.

  • July 2019: Debian 10 (Buster) followed suit.

The RHEL jump was particularly brutal. Many enterprise shops run on CentOS or RHEL, and an OS upgrade from version 7 to 8 was an ordinary, inevitable event. Nobody expected that a routine distribution upgrade would quietly corrupt their database indexes. Arch Linux users, running the bleeding edge as always, were among the first canaries in the coal mine.

Daniel Vérité published "Beware of your next glibc upgrade" in August 2018, one of the earliest public warnings. The Postgres Wiki created dedicated pages for Locale data changes and Collations to track the evolving situation. Blog posts from Crunchy Data, Citus Data, and CYBERTEC followed, each emphasizing the same uncomfortable truth: if you upgraded glibc and didn't rebuild your indexes, your data might already be corrupt.

Finding the Damage

The first step for any affected system was identifying at-risk indexes. Any B-tree index on a text, varchar, char, or citext column using a locale-dependent collation (anything other than C or POSIX) was potentially corrupt. The Postgres community settled on a diagnostic query that looked something like this:

SELECT indrelid::regclass AS table_name,
       indexrelid::regclass AS index_name,
       collname AS collation_name,
       pg_get_indexdef(indexrelid) AS index_definition
  FROM (SELECT indexrelid, indrelid,
               indcollation[i] AS coll
          FROM pg_index,
               generate_subscripts(indcollation, 1) g(i)
       ) s
  JOIN pg_collation c ON coll = c.oid
 WHERE collprovider IN ('d', 'c')
   AND collname NOT IN ('C', 'POSIX');

Every index returned by that query needed to be rebuilt. For Postgres 12 and later, that meant:

REINDEX INDEX CONCURRENTLY index_name;

The CONCURRENTLY option proved to be a lifesaver for production systems, as it allows the rebuild to happen without locking the table for the entire rebuild duration. For those stuck on Postgres 11 or earlier, the workaround was uglier: create a replacement index concurrently, drop the old one, and rename the new one. Primary key indexes and unique constraints made this particularly painful.

The scale of the problem was staggering. A database with hundreds of tables and thousands of text indexes needed every single one rebuilt. And this wasn't a one-time fix. Any future glibc upgrade that changed collation rules would require a repeat performance.

Déjà vu

Ironically, the glibc 2.28 incident wasn't even the first time glibc caused index corruption in Postgres. Several glibc versions in 2015 shipped with a buggy strcoll() implementation that produced results inconsistent with strxfrm(), violating both ISO C90 and POSIX standards.

Postgres 9.5 had introduced "abbreviated keys" to speed up text index builds, and the glibc bugs caused these to produce corrupt indexes. The fix in Postgres 9.5.2 was to disable abbreviated keys for non-C locales entirely, a performance regression that persists to this day for libc-based collations. Users had to REINDEX then, too.
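
The mechanism behind that bug is easy to model. In the sketch below (a toy illustration, not the actual Postgres code), full_cmp plays the role of strcoll() as the ground truth, while abbrev() plays the strxfrm()-derived prefix, deliberately made inconsistent with it the way the buggy 2015 glibc builds were:

```python
import functools

def full_cmp(a, b):
    # "strcoll()": the authoritative comparison (case-insensitive here).
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)

def abbrev(s):
    # "strxfrm() prefix": a cheap fixed-size proxy. Correctness demands that
    # it agree with full_cmp; this one compares raw code points, so it doesn't.
    return s[:4]

def abbreviated_cmp(a, b):
    if abbrev(a) != abbrev(b):          # fast path: trust the abbreviation
        return -1 if abbrev(a) < abbrev(b) else 1
    return full_cmp(a, b)               # ties fall back to the full comparison

rows = ["apple", "BANANA", "cherry"]
print(sorted(rows, key=functools.cmp_to_key(full_cmp)))
# ['apple', 'BANANA', 'cherry'] -- the correct order
print(sorted(rows, key=functools.cmp_to_key(abbreviated_cmp)))
# ['BANANA', 'apple', 'cherry'] -- the "optimized" sort disagrees: a corrupt index
```

When the shortcut and the real comparator disagree, the sort order depends on which path each comparison happened to take, which is exactly how those index builds went wrong.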

Two major incidents in three years, both caused by the same fundamental problem: Postgres trusted an external library for a critical operation, and that library's behavior was neither stable nor guaranteed. The writing was on the wall.

The Long March Away from glibc

The Postgres community's response to these incidents was measured but determined, playing out over nearly a decade of incremental progress.

Postgres 10 (2017) introduced initial support for ICU (International Components for Unicode) as an alternative collation provider. Peter Eisentraut's work here was prescient, landing a full year before glibc 2.28 shipped. For the first time, it was possible to create a collation backed by ICU instead of libc:

CREATE COLLATION german (provider = icu, locale = 'de-DE');

ICU maintains its own collation data, independent of the operating system. Its rules still change over time, but they change through a strict versioning scheme, so Postgres can detect the change and emit a warning instead of silently serving wrong results.

Postgres 13 (2020) added collation version tracking for glibc. Postgres began recording the collation version during index creation, and issuing warnings when the underlying version changed. This was the first real "early warning system" for the problem. It couldn't prevent corruption, but at least the logs told the full story.

Postgres 15 (2022) was an event many anticipated: ICU could now be used as the default collation provider for an entire database cluster:

initdb --locale-provider=icu --icu-locale=en-US

Before this, ICU was only available for individual collation objects, which was both inconvenient and prone to mistakes.

Postgres 17 (2024) delivered what many consider the real solution: the builtin collation provider. This provider compiles collation logic directly into Postgres itself, with no external glibc or ICU dependency. It shipped with pg_c_utf8, and Postgres 18 added pg_unicode_fast:

  • pg_c_utf8: Unicode code point sorting with POSIX-compatible pattern matching and simple case mapping.

  • pg_unicode_fast: Unicode code point sorting with full Unicode case mapping and standard pattern matching behavior.

Both are guaranteed to be immutable within a major Postgres version. The entire class of "the OS changed my sort order" bugs simply cannot happen with these collations.

The Right Way to Initialize a Cluster (Today)

Despite all of this history, libc is still the default collation provider as of Postgres 18. Newly initialized clusters will rely on glibc (or the platform's libc) unless told otherwise. That means every new database is potentially vulnerable to the same class of problem that glibc 2.28 caused, just waiting for the next major library update to trigger it.

As a result, I recommend always specifying the builtin provider when creating a new cluster:

initdb --locale-provider=builtin --locale=C.UTF-8

Two flags to eliminate an entire category of data corruption risk. The C.UTF-8 locale provides proper UTF-8 character handling while sorting by Unicode code point order without any lingering surprises.

For those creating databases within an existing cluster, the same principle applies:

CREATE DATABASE mydb
  LOCALE_PROVIDER = builtin
  BUILTIN_LOCALE = 'C.UTF-8'
  TEMPLATE = template0;

Specifying template0 as the template is required here, because a new database normally cannot use a collation different from its template's.

Until the Postgres project changes the default (and there are ongoing discussions about exactly that), every DBA needs to make this a conscious choice for every new cluster or database.

Never a Free Lunch

If the new internally-provided collations are so great, why isn't everyone using them?

The first reason is that few stop to consider the topic at all. Many simply trust that the problem is being handled and will be fixed by some future version, as if by magic. Nobody wants to face the ugly reality of a data migration. Perhaps they're simply new users who missed the uproar.

The other reason is more subtle. Both pg_c_utf8 and pg_unicode_fast sort by Unicode code point value. This is essentially a byte-order sort for UTF-8 encoded text. It's deterministic, fast, and perfect for indexes. But it doesn't match what a human would expect for linguistically correct sorting in most languages.
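
Python happens to compare strings by code point as well, so the equivalence is easy to check: for valid UTF-8, sorting by encoded bytes and sorting by code point always agree, which is what makes these collations as cheap as a memcmp():

```python
names = ["Müller", "Muller", "Ötzi", "Ozzy", "Über", "Upper"]

by_codepoint = sorted(names)                               # code point order
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))  # raw UTF-8 byte order

print(by_codepoint)
# ['Muller', 'Müller', 'Ozzy', 'Upper', 'Ötzi', 'Über']
print(by_codepoint == by_bytes)  # True: byte order and code point order coincide
```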

Consider German names:

SELECT name FROM (VALUES
         ('Müller'), ('Muller'), ('Ötzi'), ('Ozzy'), ('Über'), ('Upper')
       ) AS t(name)
 ORDER BY name COLLATE "pg_c_utf8";
 
  name  
--------
 Muller
 Müller
 Ozzy
 Upper
 Ötzi
 Über

Yet a native German speaker would expect this instead:

SELECT name FROM (VALUES
         ('Müller'), ('Muller'), ('Ötzi'), ('Ozzy'), ('Über'), ('Upper')
       ) AS t(name)
 ORDER BY name COLLATE "de-x-icu";
 
  name  
--------
 Muller
 Müller
 Ötzi
 Ozzy
 Über
 Upper

For most application workloads, this doesn't actually matter. APIs return JSON, frontends sort data client-side, and search operations care about matching rather than ordering. But linguistic sort order is still relevant for applications that display sorted lists directly to users, such as a directory, a catalog, a report, and so on.

Applying Linguistic Sorting

This is where ICU earns its keep. Clusters can run on the builtin provider for safety and performance, and then apply ICU collations precisely where necessary. There are two approaches.

Column-level collation is ideal when a particular column always needs linguistic sorting:

CREATE TABLE customers (
    id serial PRIMARY KEY,
    name text COLLATE "de-x-icu",
    email text
);

Now the name column always sorts according to German linguistic rules, while every other text column in the database uses the safe, deterministic builtin collation. Indexes on the name column will use the ICU collation, so that specific index becomes dependent on ICU's versioning. ICU tends to exercise more discipline about version management than glibc, so this isn't quite as risky.

Expression-level collation is better for occasional linguistic sorting:

SELECT name
  FROM customers
 ORDER BY name COLLATE "de-x-icu";

This applies the German ICU collation to this one sort operation only. The underlying column and its indexes remain on the builtin collation, so you get linguistic sorting without altering storage or index behavior. That makes it the safest approach, though admittedly the most verbose.

The Ghost of Clusters Past

New clusters are easy. Specify the builtin provider at init time and move on with your life. But what about the millions of existing Postgres clusters already running on libc collations? Those don't just go away, and it's not possible to change a database's default collation after creation. That said, there are a few options:

  • Migrate to a new cluster. Use pg_dump and pg_restore or logical replication to move data into a fresh cluster initialized with the builtin provider. This is the cleanest approach, but requires planning and potential downtime.

  • Move to ICU or builtin collations manually. Individual columns can use a different collation within a libc-based cluster. Perform a staged rollout by moving tables or affected columns to a safer collation, and always create new columns with that collation. A future cluster migration is still necessary for a permanent fix, but this process provides a path to safety.

  • Monitor collation versions. Postgres 13+ will log warnings when a collation's underlying version changes. Pay attention to those warnings; they're screaming that a REINDEX is necessary. Look for these messages:

    WARNING:  collation "xx-x-icu" has version mismatch
    DETAIL:  The collation in the database was created using version 1.2.3.4, but the operating system provides version 2.3.4.5.

  • REINDEX after every major OS upgrade. The safest approach for glibc-based clusters is to rebuild every text index following a major distribution upgrade. No exceptions. Use the diagnostic query from earlier to identify affected indexes, and REINDEX INDEX CONCURRENTLY for identified candidates.

Moving On

The glibc 2.28 incident changed how the entire Postgres community thinks about external dependencies. Before 2018, the idea that an OS library update could result in database corruption was something only a handful of people worried about. That it did so silently—allowing the corruption to fester for weeks, months, or even years—just poured extra salt in the wound.

The Postgres community responded in its typical heroic fashion. Collation version tracking, ICU provider support, and ultimately builtin collations, show just how far the Postgres devs are willing to go to solve a problem. It's not a complete rebellion against trusting OS-provided libraries, but decoupling from external collation resources remains a prudent reaction given the circumstances.

And yet we shouldn't rest on our laurels; glibc remains the default even now, suggesting the lesson hasn't fully sunk in for everyone. Every initdb run without --locale-provider=builtin is another cluster carrying the same risk that nearly spelled disaster for many. I've personally encountered clusters with this kind of corruption as recently as 2025, fully seven years since everything went wrong. Why propagate that mistake?

So the next time you spin up a Postgres cluster, do yourself a favor and use the builtin locale provider. Your future self, the one who just upgraded to the latest Ubuntu LTS without thinking twice about it, will thank you.