pgEdge Posts from Asif Naeem

Simplifying Cluster-Wide SQL Execution in pgEdge with exec_node()

Fri, 14 Nov 2025 05:02:00 GMT

pgEdge Distributed Postgres is a database system built on top of standard open‑source Postgres, extended to support global, distributed, and multi‑master (active‑active) deployments.

In the evolving landscape of distributed databases, efficient query execution across nodes is essential to leverage the full power of a distributed architecture. Specifically for distributed Postgres environments, managing a multi-node cluster often requires executing SQL commands that don't automatically replicate. This includes critical operations like executing DDL statements, performing administrative tasks, and altering cluster configuration: actions that must be applied only on specific nodes.

To solve this operational challenge, I created the exec_node() function: a utility designed to make remote SQL execution across pgEdge nodes simple, consistent, and scriptable directly from within the database.

Why I Created exec_node()

As part of general administration tasks, there is commonly a need to execute SQL commands on a specific node or on all nodes within a pgEdge distributed cluster. pgEdge leverages Spock for logical replication, but many important SQL commands, particularly DDL, maintenance commands, and Spock-specific cluster management functions, do not replicate by design. This includes operations like:

VACUUM

Spock-specific commands like spock.repset_add_table and spock.node_add_interface

In a traditional setup, executing these commands safely and consistently across all or specific nodes requires manual logins, scripts, or orchestration tools. This is time-consuming and requires additional steps.
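To make the cost of the traditional approach concrete, here is a minimal sketch of the kind of external script it tends to require. The node names, connection strings, and the `run_on_nodes` and `dry_run` helpers are all hypothetical and exist only for illustration; a real script would replace `dry_run` with a function that opens a connection to each node and executes the statement.

```python
from typing import Callable, Dict, Optional

# Hypothetical node inventory: the names and connection strings are
# placeholders, not real pgEdge settings.
NODES: Dict[str, str] = {
    "node1": "host=node1.example.com dbname=appdb",
    "node2": "host=node2.example.com dbname=appdb",
    "node3": "host=node3.example.com dbname=appdb",
}


def run_on_nodes(
    sql: str,
    runner: Callable[[str, str], str],
    nodes: Dict[str, str] = NODES,
    target: Optional[str] = None,
) -> Dict[str, str]:
    """Run sql on one node (target) or on every node; return per-node results."""
    if target is not None and target not in nodes:
        raise KeyError(f"unknown node: {target}")
    selected = [target] if target is not None else sorted(nodes)
    return {name: runner(nodes[name], sql) for name in selected}


def dry_run(dsn: str, sql: str) -> str:
    """Stand-in runner that only records what would be executed."""
    return f"would run {sql!r} on [{dsn}]"


if __name__ == "__main__":
    for node, outcome in run_on_nodes("VACUUM", dry_run).items():
        print(node, "->", outcome)
```

Even this small script needs its own node inventory, credentials handling, and error reporting, all maintained outside the database.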

Using exec_node()

With exec_node(), you can issue commands directly from the database, through SQL, and target the exact node you want. The function signature is:

With exec_node() you can:

Run SQL on Any or All Nodes, Remotely and Natively

Whether you're running a maintenance command, executing DDL statements, or configuring Spock, you can use exec_node() to do it all from a single SQL interface.

Example: Running a data maintenance command only on node1:

Example: Applying a maintenance operation across all nodes:

Execute Non-Replicating Commands Where They Belong

Some SQL commands are intentionally not replicated in pgEdge, either to avoid conflicts or because they are inherently local. exec_node() allows these commands to be sent only to the relevant node(s), avoiding misconfiguration or inconsistencies.

Common non-replicating commands include:

Example: Setting a GUC value on node 3 (only):

Example: Changing a GUC on all the cluster nodes:

Execute Spock Cluster Management Functions

pgEdge clusters are built on top of the Spock extension for logical replication, but Spock management commands must be run on a specific node, and they don't replicate. exec_node() makes this easy to automate and manage.

Example: Adding a table to a replication set:

Example: Creating a table on all the pgEdge nodes without adding it to the replication set:

Without exec_node(), these operations would require logging into each node or writing external scripts; now they can be run as SQL from any connected client or script.

Supports Targeted DDL Deployment

Sometimes a function or schema change is not required on every node. Instead, you might want:

Example: Targeted function deployment:

You now have precise control over where that function lives.

Improves Automation and Operational Safety

Because exec_node() works like any SQL function, it integrates seamlessly into:

It removes the need for external scripting or SSH automation and reduces the risk of human error by centralising command execution in a controlled and auditable way.

Use Cases

Use exec_node() to help with:

Deploying a non-replicated data maintenance command, DDL, or functions to a specific node.

Running Spock configuration commands (repset_add_table, node_add_interface, etc.).

Executing maintenance commands (VACUUM, REINDEX, ANALYZE) cluster-wide.

Setting or altering system parameters per node (ALTER SYSTEM).

Creating or dropping databases on individual nodes.

Controlled rollout of feature flags or logic to a subset of nodes.

Best Practices

Use with awareness: exec_node() is powerful, so take care when executing write operations across all nodes. Ensure that commands are safe and secure before running them.

Log executions: In automation scripts, consider logging the use of exec_node() for auditability.

Validate SQL: Validate the structure and scope of each statement before executing it to avoid unintended changes.

Test on dev/staging: For complex cluster operations, test exec_node() in non-production environments before rollout.

Using Snowflake Sequences with the Hibernate IDENTITY column

Tue, 27 May 2025 05:28:00 GMT

A PostgreSQL native sequence is a great way to generate unique IDs, but there is a risk of duplicate ID generation if you use it in a distributed cluster environment. The pgEdge Snowflake extension is designed for use in a distributed environment, helping you generate truly unique identifiers with ease.

Snowflake functions create unique identifiers that are designed to support parallel execution across multiple nodes. Using a Snowflake sequence ensures that each generated number is consistent and addresses the need for uniqueness across all of the pgEdge cluster nodes.

This is the first in a series of blogs in which I will share different ways to adapt sequences in the pgEdge cluster environment.

Snowflake Sequence Overview

The Snowflake extension creates a replication-friendly identifier that you can use to replace problematic PostgreSQL sequence definitions in tables that reside in a distributed replication environment. pgEdge PostgreSQL automatically installs and creates the extension in each pgEdge cluster.

Snowflake sequences let you:

Add or modify data in different regions while ensuring a unique transaction sequence.

Preserve unique transaction identifiers without manual/administrative management of a numbering scheme.

Accurately identify the order in which globally distributed transactions are performed.

Each Snowflake ID is a 64-bit value that is composed of multiple parts:

Timestamp: 41 bits that contain the number of milliseconds since 2023-01-01

Node number: 10 bits identifying a unique node number (set as a PostgreSQL GUC)

Counter number: 12 bits that increment to distinguish multiple IDs generated in the same millisecond
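The three fields add up to 63 bits (41 + 10 + 12), which leaves one bit of the 64-bit value unused. The following is a minimal sketch of how such an ID could be packed and unpacked. It is not the extension's implementation: the field order (timestamp in the high bits, then counter, then node number), the UTC epoch, and the function names `compose`, `decompose`, and `created_at` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Field widths as listed above.
TIMESTAMP_BITS = 41
NODE_BITS = 10
COUNTER_BITS = 12

# Assumption: the epoch is midnight UTC; the text only gives the date.
EPOCH = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Assumption: timestamp in the high bits, then counter, then node number.
COUNTER_SHIFT = NODE_BITS
TIMESTAMP_SHIFT = NODE_BITS + COUNTER_BITS


def compose(ms_since_epoch: int, node: int, counter: int) -> int:
    """Pack the three fields into one integer, rejecting out-of-range values."""
    for value, bits, label in (
        (ms_since_epoch, TIMESTAMP_BITS, "timestamp"),
        (node, NODE_BITS, "node"),
        (counter, COUNTER_BITS, "counter"),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{label} does not fit in {bits} bits")
    return (ms_since_epoch << TIMESTAMP_SHIFT) | (counter << COUNTER_SHIFT) | node


def decompose(snowflake_id: int) -> Tuple[int, int, int]:
    """Return (ms_since_epoch, node, counter) for an ID built by compose()."""
    ms = snowflake_id >> TIMESTAMP_SHIFT
    counter = (snowflake_id >> COUNTER_SHIFT) & ((1 << COUNTER_BITS) - 1)
    node = snowflake_id & ((1 << NODE_BITS) - 1)
    return ms, node, counter


def created_at(snowflake_id: int) -> datetime:
    """Recover the creation time encoded in the timestamp field."""
    return EPOCH + timedelta(milliseconds=decompose(snowflake_id)[0])
```

Under this layout, two nodes generating an ID in the same millisecond still produce different values because their node numbers differ, and IDs sort by creation time because the timestamp occupies the most significant bits.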

Hibernate is an object-relational mapping (ORM) tool for Java developers that helps map Java objects (classes) to database tables, and automatically handles converting data between the two. It's one of the most popular ORM frameworks in the Java ecosystem and implements the Java Persistence API (JPA) standard. In our examples, we'll use a Snowflake sequence in a Hibernate environment.

If you're working in a traditional replication environment (without the benefits of multi-master replication), using a Hibernate identity column as an auto-generated primary key implies that the database will automatically generate IDs. However, in distributed systems like pgEdge multi-master replication, identity columns can lead to collisions when multiple nodes generate the same ID concurrently.

The following code snippets implement Snowflake sequences in a Hibernate environment hosted on a pgEdge cluster:

A Hibernate IDENTITY column relies on PostgreSQL's SERIAL or BIGSERIAL type. Hibernate will create the table with a 64-bit ID; in SQL, the table would look like:

In a distributed environment, adding data to this table could create conflicting identifiers, which in turn result in replication interruption and maintenance overhead. The collision issue can be avoided with minimal changes by adopting Snowflake sequences. As shown in the following code snippets, it's simple to replace the PostgreSQL-style sequences with Snowflake sequences; simply:

Automating Snowflake Sequence Implementation

You can use a PostgreSQL event trigger to ensure the use of Snowflake sequences when you create a table. The following example defines a PostgreSQL event trigger that automatically replaces column sequence definitions with a Snowflake sequence.

After you create this trigger in your PostgreSQL database, it fires any time you run a CREATE TABLE command and looks for sequence definitions that are part of the command; if it finds any, it replaces the column's default value with a Snowflake sequence call.

Embedding near the edge: pgEdge Distributed PostgreSQL with pgVector

Wed, 20 Sep 2023 06:14:00 GMT

Introduction

We are excited to announce that we now support the increasingly popular pgvector Postgres extension for storing and searching vector embeddings in AI-powered applications. Bringing pgvector and pgEdge's distributed capabilities together makes for a powerful combination that greatly improves performance for users regardless of their geographic location.

In this blog we'll demonstrate how to configure pgvector with pgEdge to provide similarity search functionality across a pgEdge Distributed PostgreSQL cluster. I will start with a brief summary of the products mentioned in the title of this blog:

pgEdge is fully-distributed PostgreSQL, optimized for the network edge and deployable across multiple cloud regions or data centers. pgEdge is available as pgEdge Platform, self-hosted software available for download from [download link]; or as pgEdge Cloud, a fully managed service. This blog is applicable to both pgEdge Cloud and pgEdge Platform.

pgvector is an open source extension for PostgreSQL that enables efficient similarity search and other vector-based operations. It's often used for applications like recommendation systems and image search. The pgvector extension provides an indexable vector data type that stores vectors in a PostgreSQL database. pgvector supports the index, which implements the method of indexing.

Vector Database

A vector database stores data as high-dimensional vectors, which are mathematical representations of features or attributes. The number of dimensions in a vector ranges from tens to thousands, depending on the complexity and granularity of the data. The main advantage of a vector database is that it allows for fast and accurate similarity search and retrieval of data based on vector distance or similarity. So instead of searching data with predefined criteria, exact matches, or wildcards, you can use the vector database to find similar or relevant data based on semantic or contextual meaning.

Vector databases enable accurate and efficient search and analysis of large datasets by utilizing the characteristics of vectors. A vector database's capacity to locate comparable items is its key benefit. Two statements with comparable meanings will produce vectors that are close to one another, which allows you to use the vector database to locate all the vectors that are near a given vector. For example, a vector database can be used to find:

images that are similar to a given image based on visual content and style.

documents that are similar to a given document based on topic and content.

products that are similar to a given product based on features and ratings.

Vector databases are currently a popular choice. With the rise of large language models (LLMs), efficiently managing and searching large-scale, high-dimensional data has become a tremendously important use case. The solution to this challenge lies in vector databases: a powerful and increasingly popular data storage technology that enables faster and more accurate searches.

With the addition of the open-source pgvector extension, PostgreSQL is being used as a vector database. There is a lot of excitement about using PostgreSQL as a vector database, but there is more innovation to come, and work to be done to make the vector workload more secure, performant, and scalable.

Vector Data

Before showing an example of how pgEdge works with the pgvector extension, it is important to understand the dynamics of vector data, and how it is stored in the database. Vector data refers to a type of data representation where each data point is described by a set of numerical values arranged in a specific order. These values are usually referred to as components or features, and they capture different aspects or attributes of a data point. Vectors are commonly used to represent a wide range of information in many fields: mathematics, computer science, data science, and machine learning.

Real-world applications utilize far more than just two dimensions; OpenAI embeddings may use more than a thousand dimensions to vectorize data. One method for converting high-dimensional data into a low-dimensional space is embedding. Embedding allows us to extract data from multiple dimensions and sources, including text, photos, audio, and video, and convert it into vectors. Embedding is a widely-used technique in machine learning and natural language processing (NLP) to represent sparse symbols or objects as continuous vectors.

For example, objects like a car, truck, motorcycle, bicycle, helicopter, or hoverboard may all be converted into vectors using embeddings. A two-dimensional embedding is shown next to each object in the following list:

car: embedding [2.0, 2.3]

truck: embedding [3.4, 5.9]

motorcycle: embedding [0.5, 1.2]

bicycle: embedding [0.2, 0.8]

helicopter: embedding [13.2, 19.8]

hoverboard: embedding [0.1, 0.2]

A review of the list shows that a bicycle and a motorcycle are similar and that their vectors (if charted) would be fairly close in distance. Vehicle characteristics can also be categorized along dimensions that include color, model, year, and manufacturer. The finer-grained your data is when describing an object, the more precise the resulting vehicle grouping will be.

Vector databases can efficiently find items that satisfy a query using vector representations. They use similarity metrics like Euclidean distance, cosine similarity, or Manhattan distance to determine how close data points are, and return the most similar items as results.
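The three metrics named above can be sketched in a few lines of plain Python using the two-dimensional vehicle embeddings from the list. This is only an illustration of the arithmetic; the helper names (`euclidean`, `manhattan`, `cosine_similarity`, `nearest`) are made up for this sketch and are not part of pgvector.

```python
import math
from typing import Callable, Dict, List

# The two-dimensional embeddings from the list above.
EMBEDDINGS: Dict[str, List[float]] = {
    "car": [2.0, 2.3],
    "truck": [3.4, 5.9],
    "motorcycle": [0.5, 1.2],
    "bicycle": [0.2, 0.8],
    "helicopter": [13.2, 19.8],
    "hoverboard": [0.1, 0.2],
}


def euclidean(a: List[float], b: List[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute per-dimension differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(name: str, metric: Callable[[List[float], List[float]], float] = euclidean) -> str:
    """Return the other object whose embedding is closest under a distance metric."""
    others = (n for n in EMBEDDINGS if n != name)
    return min(others, key=lambda n: metric(EMBEDDINGS[name], EMBEDDINGS[n]))


if __name__ == "__main__":
    print("closest to bicycle:", nearest("bicycle"))
    print("distance:", euclidean(EMBEDDINGS["bicycle"], EMBEDDINGS["motorcycle"]))
```

With Euclidean distance, the closest neighbour of the bicycle is the motorcycle, which matches the observation above. Note that `nearest` expects a distance (smaller is closer); cosine similarity runs the other way, so it would need to be converted to a distance (for example, one minus the similarity) before being passed in.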

pgvector syntax

The pgvector extension introduces a vector data type that can be used as the column type in a PostgreSQL database. The simple examples that follow show how to use the vector data type in statements, and search the vector data. Invoke the following commands with the psql client:

Creating a Sample Table

Retrieving Data

Managing Data Operations

Querying Aggregates

Creating Indexes

pgvector can create indexes for vectors that hold up to 2000 dimensions.

You can create embeddings using tools like the OpenAI API client. Similarity searches of vector embeddings have a variety of commercial uses, such as fraud detection, food industry applications, and security systems.

pgvector real world example

The following example is real-world sample code for an AI-based enquiry system that tries to automatically answer client queries. It has a limited knowledge base; if it doesn't know the answer, it replies appropriately.

This generates the following log:

The sample code demonstrates the use of the pgvector extension in a real-world AI-based enquiry system. We can divide the application into four sections:

Questions mimic client queries to drive learning. Since it is an intelligent automatic-reply enquiry system, we have fed all the client queries into the QUERIES array.

The system has a knowledge base that contains all the information that we want the system to learn. The knowledge base grows.

We perform a similarity search that exercises pgvector/PostgreSQL capabilities. We iterate and get responses from the system for each query.

We generate a response from an AI model. Since we have a limited knowledge base, if our AI model doesn't know the answer, it replies accordingly. We expect that it will reply to all the related queries.
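The four sections above can be sketched end to end in plain Python. This is a toy stand-in, not the application described here: a bag-of-words counter replaces the OpenAI embedding call, an in-memory list replaces the pgvector table, the matched knowledge base entry is returned directly instead of being passed to an AI model, and the sample data, `THRESHOLD`, and `FALLBACK` placeholder are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Tuple

# Hypothetical sample data, made up for this sketch.
QUERIES = ["What is pgvector?", "How tall is Mount Everest?"]
KNOWLEDGE_BASE = [
    "pgEdge is distributed PostgreSQL optimized for the network edge.",
    "pgvector is an open source PostgreSQL extension for vector similarity search.",
]

STOPWORDS = {"a", "an", "the", "is", "for", "what", "how"}
THRESHOLD = 0.3                   # assumed cut-off for "related enough"
FALLBACK = "<no-answer reply>"    # placeholder for the system's fallback text


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(query: str) -> Tuple[float, str]:
    """Similarity search: score every knowledge base entry, keep the best."""
    q = embed(query)
    return max((cosine(q, embed(doc)), doc) for doc in KNOWLEDGE_BASE)


def answer(query: str) -> str:
    """Return the best entry, or the fallback when nothing is related enough."""
    score, doc = best_match(query)
    return doc if score >= THRESHOLD else FALLBACK


if __name__ == "__main__":
    for q in QUERIES:
        print(q, "->", answer(q))
```

The threshold is what lets the system decline to answer: a query that shares nothing with the knowledge base scores zero and receives the fallback reply rather than an unrelated entry.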

Exercising the example

Query:

The knowledge base contains the following entry to educate the automatic system to answer correctly:

Enquiry System Response:

The system can perform a similarity search to correctly answer the query posted by the client. This is possible with the help of the PostgreSQL pgvector extension and the OpenAI embedding generation feature. When we use PostgreSQL with pgvector, we get not only vector search but also storage and other RDBMS features that help us develop a professional, industrial-quality application.

To generate a reasonable response for the client, we used an OpenAI model to generate an answer to the query. If the knowledge base provides no related knowledge, the system replies accordingly.

This application is written in basic Python code to demonstrate the real-world use of the pgvector extension. It was tested with PostgreSQL 15 (with the pgvector extension installed), OpenAI (via online internet access), and Python 3.9.