Understanding Snowflake Sequences in a Postgres Extension

In a PostgreSQL database, sequences provide a convenient way to generate a unique identifier, and are often used for key generation. From the community, PostgreSQL provides functions and SQL language to help manage sequence generation, but the sequences themselves are not without limitations in a multi-master environment. As a result, we introduced 100% open-source Snowflake sequences that work seamlessly in a multi-master PostgreSQL cluster to remove those limitations and ensure your data can thrive at the network edge. (Check out the extension on GitHub!)

The easiest way to get started using Snowflake sequences is to try out pgEdge Postgres, either the Enterprise or the Distributed edition, available for self-hosting or fully managed hosting on virtual machines, containers, or the cloud. The pgEdge Control Plane (also 100% open source and available on GitHub) makes it even easier to deploy and orchestrate Postgres clusters, and comes with the snowflake extension by default (amongst others).

Why are Sequences an Issue?

In a distributed multi-master database system, sequences can get complicated. Ensuring consistency and uniqueness across the nodes in your cluster is a problem if you use PostgreSQL sequences. The Snowflake extension is designed to step up to automatically mitigate this issue.

PostgreSQL sequence values are prepared for assignment in a table in your PostgreSQL database; as each sequence value is used, the next sequence value nextval is incremented. Changes to the next available sequence value are not replicated to the other nodes in your replication cluster.

In a simple example, you might have a table on node n1, with 10 rows, each with a primary key that is assigned a sequence value from 1 to 10; the next prepared sequence value nextval on n1 will be 11. Rows are replicated from n1 to n2 without issue until you add a row on n2.

The PostgreSQL sequence value table on n2 has not been incrementing sequence values in step with the sequence value table on n1. When you add a row on n2, it will try to use the next available sequence value (nextval will be 1 if you haven't added a row on n2), and the INSERT will fail because a row with the primary key value of 1 already exists.

This disorder can be monitored and corrected by manually coordinating the PostgreSQL sequences between nodes in the cluster, but that quickly becomes complicated and potentially impacts the user experience as you add more nodes to the cluster.

Introducing Snowflake Sequences

An alternative to using PostgreSQL sequences is to use a guaranteed unique Snowflake sequence. Snowflake sequences are represented externally as bigint values. A Snowflake sequence is made up of:

The timestamp is a 41-bit unsigned value representing millisecond precision and an epoch of 2023-01-01.
A unique ID is allocated as a 12-bit unsigned value. This provides for 4096 unique IDs per millisecond, or 4 million IDs per second.
The node number is a 10-bit unique identifier of the PostgreSQL instance in a global cluster. This value must be set with the GUC snowflake.node in the postgresql.conf file.

This combination ensures that a unique identifier is always available; even the most aggressive allocation of sequences cannot exceed the current assignment capabilities. Should it be possible in the future to require more than 4096 Snowflakes per millisecond, the algorithm will bump the timestamp one millisecond into the future to keep Snowflake IDs unique.

You can use Snowflake functions from pgEdge to extrapolate data from a Snowflake sequence for auditing or for use in other transaction processing needs. Information about the node on which a transaction occurred, or the specific time a transaction occurred is built into your data and easily accessed when you use a Snowflake sequence.

Using Snowflake Sequences with pgEdge

pgEdge automatically installs and creates the snowflake extension when you install pgEdge Postgres (both the Distributed and the Enterprise packages). It is also automatically installed on all pgEdge Cloud Developer Edition databases.

You can also follow the instructions in the README to build Snowflake from source code.

The spock node-create command has been updated to set the Snowflake node id for you if you use the node naming convention n1, n2, n3 (up to n9). If you use another node naming convention, you will need to manually set snowflake.node to a unique value:

./pgedgespocknode-createn2'host=10.2.1.2user=pgedgeport=5432dbname=demo'demonode_create:560818415snowflake.node=2

After creating all of your database objects, you can use a pgEdge CLI command to convert your existing sequences into snowflake sequences. When calling the spock sequence-convert function, pass the name of the sequence to be converted and the database name. Like other pgEdge CLI functions, you can include quoted wildcards to include all sequences in that database (‘*’) or to match a specific pattern like all sequences in a schema (‘public.*’).

./pgedgespocksequence-convert‘public.*’postgres

During the conversion, the data type of your sequence fields are changed to bigint; any existing values in your database remain the same. When you add new rows to your database, the new sequences will use the format of a snowflake sequence:

SELECT snowflake.format(id) FROM table1;
           id         |                          format
-------------------+-----------------------------------------------------------
 18014518154714241 | {"node": 1, "ts": "2023-10-16 18:47:12.257+00", "count": 0}
 18014518154714242 | {"node": 1, "ts": "2023-10-16 18:47:12.258+00", "count": 0}
(2 rows)

Getting Started

Snowflake sequences are just one benefit you get out-of-the-box when using pgEdge Enterprise Postgres or pgEdge Distributed Postgres to construct, manage, and host your distributed PostgreSQL databases.

Want to try out the complete experience for free? Download the self-hosted edition for bare metal, virtual machine or containers (or check out the fully-managed cloud edition) here: https://www.pgedge.com/get-started