PostgreSQL High Availability Solutions – Part I: PostgreSQL Automatic Failover

Jul 10, 2026

Managing High Availability (HA) in your PostgreSQL® hosting is essential to keeping your database clusters up and performing well, so that your data is always available to your application. In an earlier blog post, we showed you how to configure high availability for PostgreSQL using streaming replication, and now we’re going to show you how to best manage client-side HA.

Key Takeaways

PostgreSQL Automatic Failover (PAF) enhances high availability by seamlessly switching to standby servers during primary server failures, minimizing downtime, and maintaining business continuity.
The key components of an automatic failover setup include the primary server for write operations, standby servers for redundancy, and a monitoring component (a dedicated monitor node in some tools) for health checks and coordination of failover events.
Implementing robust tools such as pg_auto_failover, repmgr, and PAF is essential for managing high availability and ensuring reliable PostgreSQL operations.

Understanding PostgreSQL Automatic Failover and High Availability Architecture

Achieving PostgreSQL high availability is essential to maintain exceptional uptime and robust performance. It reduces downtime and supports business continuity. Automatic failover is a critical strategy for achieving it.

In PostgreSQL, this involves redirecting database workloads to a standby server if the primary fails, preventing costly and disruptive downtimes. Effective management of failover and switchover operations is crucial for high availability.

Leveraging PostgreSQL HA tools ensures resilient and reliable clusters. The following section covers the key components of automatic failover.

Key Components of PostgreSQL Automatic Failover

Grasping the essential elements—the primary server, standby, and monitor node—is crucial for establishing high availability in PostgreSQL. Each component has a unique function that contributes to uninterrupted service and efficient transition during failover scenarios.

The primary server handles all write operations while maintaining data integrity; reads can also be served by hot standby nodes. In the event of failure, standby servers are ready to assume control, helping reduce system downtime. Meanwhile, the monitor node keeps an eye on the health of the primary and any standby nodes, orchestrating failovers when necessary.

Now, let’s delve into greater detail regarding each part.

Primary Database Server

In a Postgres cluster, the primary server manages all write operations and maintains data integrity. It handles every transaction, ensuring that data modifications are correctly processed for the application’s workload. Ensuring the health and performance of this central component is crucial, as it represents a potential single point of failure in the absence of redundant standby servers.

Standby Servers

In the event of a failover, standby servers within PostgreSQL swiftly take on the primary role to reduce downtime. Continuous data replication from the primary server keeps these standbys up-to-date. By transitioning responsibilities to a standby, system availability is preserved if the primary fails.

Because standby servers continuously replicate data from the primary PostgreSQL instance, an up-to-date copy is available at all times, which supports business continuity and guards against data loss.

Monitor Node

The monitor node is essential for performing health checks and coordinating failover processes in the PostgreSQL automatic failover system. By continuously monitoring both the primary and standby servers, it quickly detects failures and initiates the transition to a standby node when necessary, helping maintain high availability and system reliability.
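The detection step described above can be sketched as a counter of consecutive failed health checks. This is a minimal illustration, not the logic of any specific tool: the `NodeHealth` class, the `record_check` function, and the threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks consecutive failed health checks for one node (hypothetical model)."""
    name: str
    failures: int = 0


def record_check(node: NodeHealth, healthy: bool, threshold: int = 3) -> bool:
    """Record one health-check result; return True once failover should start."""
    # A single success resets the counter, so a brief blip does not trigger failover.
    node.failures = 0 if healthy else node.failures + 1
    return node.failures >= threshold


primary = NodeHealth("node1")
results = [record_check(primary, ok) for ok in (True, False, False, False)]
print(results)  # [False, False, False, True]
```

Requiring several consecutive failures is a common way to trade a slightly slower reaction for fewer false failovers.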

Implementing Automatic Failover in a PostgreSQL Deployment

Setting up automatic failover involves using specific tools and methods to detect service failures and switch over to standby servers with minimal interruption. Several options exist, each with its own attributes and advantages:

PostgreSQL Automatic Failover by ClusterLabs
repmgr (Replication Manager for PostgreSQL Clusters) by 2ndQuadrant
Patroni

In our three-part series of posts on HA for PostgreSQL, we’ll share an overview, the prerequisites, and the working and test results for each tool. In Part 1, we’ll take a deep dive into ClusterLabs’ PAF solution.

pg_auto_failover

The pg_auto_failover extension automates failover management in PostgreSQL, helping maintain high availability with minimal manual intervention. It streamlines the setup of PostgreSQL clusters capable of automatic failover using commands such as ‘pg_autoctl create’ and ‘pg_autoctl run’.

To implement ‘pg_auto_failover’, you must establish a monitor and initialize a primary node in your PostgreSQL instance. The ‘pg_autoctl run’ service must run on every participating node to continuously monitor and manage the cluster. If the primary node fails, synchronous replication keeps data consistent during failover to the standby.

‘pg_auto_failover’ provides fault tolerance by preserving data integrity across nodes while automating high availability with minimal administrative overhead.
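As a rough sketch of the setup order described above, the following Python builds the `pg_autoctl` command lines for a monitor and a data node. The hostnames and data directories are placeholders, and depending on the pg_auto_failover version, additional options (for example for authentication and SSL) may be required, so treat this as an outline rather than a complete recipe.

```python
import shlex


def monitor_command(pgdata: str, hostname: str) -> list[str]:
    """Command that initializes the monitor node."""
    return ["pg_autoctl", "create", "monitor",
            "--pgdata", pgdata, "--hostname", hostname]


def postgres_command(pgdata: str, hostname: str, monitor_uri: str) -> list[str]:
    """Command that initializes a data node and registers it with the monitor."""
    return ["pg_autoctl", "create", "postgres",
            "--pgdata", pgdata, "--hostname", hostname,
            "--monitor", monitor_uri]


# Placeholder values; replace with your own hosts and paths.
monitor_uri = "postgres://autoctl_node@monitor.example:5432/pg_auto_failover"
commands = [
    monitor_command("/var/lib/pgsql/monitor", "monitor.example"),
    postgres_command("/var/lib/pgsql/data", "node1.example", monitor_uri),
    ["pg_autoctl", "run"],  # keeps the node under pg_auto_failover's control
]
for argv in commands:
    print(shlex.join(argv))
```

The first data node registered with the monitor becomes the primary; running the same `create postgres` step on a second host adds a standby.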

repmgr

In PostgreSQL clusters, repmgr handles both replication management and automatic failover. It requires a dedicated superuser account and matching entries in the pg_hba.conf file so that the replication user can connect. Installation involves installing PostgreSQL on all participating servers and creating a repmgr.conf file, starting with the primary server (each node needs its own repmgr.conf).

Once you’ve set up repmgr.conf, run the standby clone with the dry-run option to validate the configuration before making any changes. If the dry run passes, proceed with cloning the standby server. Enable automatic failover by running the repmgr daemon (repmgrd) on both the primary and standby servers.

These capabilities help maintain database availability during server failures while minimizing manual intervention.
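To make the configuration step more concrete, here is a minimal sketch that renders a repmgr.conf for one node. The node names, addresses, the `repmgr` user and database, and the `/etc/repmgr.conf` path are assumptions for illustration; the setting names (`node_id`, `node_name`, `conninfo`, `data_directory`, `failover`, `promote_command`, `follow_command`) are standard repmgr options.

```python
def render_repmgr_conf(node_id: int, node_name: str, host: str,
                       data_directory: str) -> str:
    """Render a minimal repmgr.conf with automatic failover enabled."""
    conninfo = f"host={host} user=repmgr dbname=repmgr connect_timeout=2"
    settings = {
        "node_id": str(node_id),
        "node_name": f"'{node_name}'",
        "conninfo": f"'{conninfo}'",
        "data_directory": f"'{data_directory}'",
        # The settings below are read by the repmgrd daemon.
        "failover": "automatic",
        "promote_command":
            "'repmgr standby promote -f /etc/repmgr.conf --log-to-file'",
        "follow_command":
            "'repmgr standby follow -f /etc/repmgr.conf --log-to-file "
            "--upstream-node-id=%n'",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


conf = render_repmgr_conf(1, "node1", "10.0.0.11", "/var/lib/pgsql/data")
print(conf)
```

Each node gets its own file with a distinct `node_id` and `node_name`.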

PostgreSQL Automatic Failover

PAF is a high-availability management solution for PostgreSQL by ClusterLabs. It uses Postgres synchronous replication to guarantee that no data is lost at the time of the failover operation. It makes use of the popular, industry-standard Pacemaker and Corosync stack. With Pacemaker and Corosync applications, you can detect PostgreSQL database failures and act accordingly.

Pacemaker is a service capable of managing many resources, and it does so with the help of resource agents. Resource agents are responsible for handling a specific resource, determining how it should behave, and informing Pacemaker of its results.

Your resource agent implementation must comply with the Open Cluster Framework (OCF) specification. This specification defines resource agents’ behavior and implementation of methods like stop, start, promote, demote, and interaction with Pacemaker.

PAF is an OCF resource agent for Postgres written in Perl. Once your database cluster is built using internal streaming replication, PAF can expose to Pacemaker the current status of the PostgreSQL instance on each database node: primary, slave, stopped, catching up, load balancer, etc.
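The way a resource agent reports state to Pacemaker can be illustrated with the OCF exit codes returned by a monitor action. PAF itself is written in Perl; the Python below is only a sketch of the idea, and the `monitor_exit_code` function and its two boolean inputs are hypothetical.

```python
from enum import IntEnum


class OcfExit(IntEnum):
    """A subset of the exit codes defined for OCF resource agents."""
    SUCCESS = 0           # running (as a standby, for a promotable resource)
    ERR_GENERIC = 1       # unspecified failure
    NOT_RUNNING = 7       # cleanly stopped
    RUNNING_PROMOTED = 8  # running as the promoted (primary) instance


def monitor_exit_code(is_running: bool, in_recovery: bool) -> OcfExit:
    """Map the state of a local PostgreSQL instance to a monitor result."""
    if not is_running:
        return OcfExit.NOT_RUNNING
    # An instance in recovery is a standby; otherwise it accepts writes.
    return OcfExit.SUCCESS if in_recovery else OcfExit.RUNNING_PROMOTED


print(monitor_exit_code(is_running=True, in_recovery=False).name)
# RUNNING_PROMOTED
```

Pacemaker compares the reported state with the state it expects and schedules start, stop, promote, or demote actions accordingly.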

Configuring PostgreSQL for Automatic Failover

Building a PostgreSQL setup with automatic failover is a multi-step process. First, install PostgreSQL and set up the PostgreSQL servers (or clusters). If you are using repmgr, it must be installed on both the primary and the standby.

repmgr needs a dedicated superuser account. Adjust the replication settings as required, and make sure the repmgr.conf file on the primary server is configured correctly.

Finally, start the PostgreSQL instance on the primary server so that repmgr can connect to it.

Setting Up Synchronous Replication

Synchronous replication, first available in PostgreSQL 9.1, makes the primary wait for replicas to acknowledge receipt of a transaction’s data before the commit is reported as successful. It builds on streaming replication, which transfers WAL (Write-Ahead Log) records from the primary to the standby servers. PAF relies on this feature to avoid data loss during failover.

To set it up, configure the primary to wait for designated replicas to confirm they have received the data before a commit completes, rather than replicating asynchronously. To preserve both high availability and data integrity, list each replica’s name in the synchronous_standby_names parameter in the primary server’s configuration file.
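As an illustration of the setting, the helper below formats a synchronous_standby_names line. The standby names are placeholders, and the `FIRST`/`ANY` syntax used here assumes PostgreSQL 10 or later; older releases accept a plain comma-separated list of names instead.

```python
def synchronous_standby_names(standbys: list[str], num_sync: int = 1,
                              method: str = "FIRST") -> str:
    """Format a synchronous_standby_names line for postgresql.conf."""
    if method not in ("FIRST", "ANY"):
        raise ValueError("method must be 'FIRST' or 'ANY'")
    if not 1 <= num_sync <= len(standbys):
        raise ValueError("num_sync must be between 1 and the number of standbys")
    # Double quotes keep names containing hyphens or capitals intact.
    names = ", ".join(f'"{name}"' for name in standbys)
    return f"synchronous_standby_names = '{method} {num_sync} ({names})'"


print(synchronous_standby_names(["node2", "node3"]))
# synchronous_standby_names = 'FIRST 1 ("node2", "node3")'
```

Each name must match the application_name that the corresponding standby uses in its primary_conninfo.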

Adjusting pg_hba.conf

Altering the pg_hba.conf file is crucial for safeguarding replication connections. Tailor the configurations within this file to align with your particular network setups and needs.

Make sure to add entries specifically for the repmgr user in replication mode inside pg_hba.conf, and confirm that you have established the necessary permissions for this user to maintain a secure and effective replication process.
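The entries described above might look like the output of this small generator. The `repmgr` user, the example network, and the `scram-sha-256` method (available from PostgreSQL 10) are assumptions; use whatever user, address range, and authentication method fit your environment.

```python
import ipaddress


def replication_hba_lines(user: str, cidrs: list[str],
                          method: str = "scram-sha-256") -> list[str]:
    """Build pg_hba.conf lines: TYPE, DATABASE, USER, ADDRESS, METHOD."""
    lines = []
    for cidr in cidrs:
        # ip_network raises ValueError for malformed or non-network addresses.
        network = ipaddress.ip_network(cidr)
        # Streaming replication connections use the 'replication' pseudo-database.
        lines.append(f"host    replication    {user}    {network}    {method}")
        # repmgr also connects to its own metadata database (assumed to be
        # named after the user here).
        lines.append(f"host    {user}    {user}    {network}    {method}")
    return lines


hba = replication_hba_lines("repmgr", ["10.0.0.0/24"])
print("\n".join(hba))
```

After editing pg_hba.conf, PostgreSQL has to reload its configuration for the new rules to take effect.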

Registering Nodes

Use repmgr’s primary register command to register the primary as a node in the cluster. After cloning, register the standby server with the standby register command so that it is included in the cluster.

It is vital to register each node to successfully oversee and regulate both primary and standby databases within a replication framework. This process guarantees synchronized and streamlined failover procedures.
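The registration order can be summarized as a sequence of repmgr commands. In this sketch the `/etc/repmgr.conf` path, the `node1` hostname, and the `repmgr` user and database are assumptions, and the commands are only printed, not executed.

```python
import shlex

REPMGR_CONF = "/etc/repmgr.conf"  # assumed location of the node's config file


def registration_steps(primary_host: str, user: str = "repmgr",
                       dbname: str = "repmgr") -> dict[str, list[list[str]]]:
    """Return the repmgr commands to run on the primary and on a standby."""
    base = ["repmgr", "-f", REPMGR_CONF]
    clone = ["repmgr", "-h", primary_host, "-U", user, "-d", dbname,
             "-f", REPMGR_CONF, "standby", "clone"]
    return {
        "primary": [base + ["primary", "register"]],
        "standby": [
            clone + ["--dry-run"],  # validate prerequisites without copying data
            clone,                  # copy the data directory from the primary
            # PostgreSQL must be started on the standby before registering it.
            base + ["standby", "register"],
        ],
    }


steps = registration_steps("node1")
for role, commands in steps.items():
    for argv in commands:
        print(f"{role}: {shlex.join(argv)}")
```

Running `repmgr cluster show` afterwards is a convenient way to confirm that every node appears with the expected role.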

How Postgres Automatic Failover Works

PAF communicates the cluster state to Pacemaker and monitors the PostgreSQL database. If the primary cannot be recovered, it notifies Pacemaker, which triggers an election to promote one of the standby servers. Pacemaker then performs management actions such as starting, stopping, monitoring, and promoting PostgreSQL nodes.
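One way to picture the election is as a comparison of how much WAL each standby has received, with the most advanced standby being the preferred candidate. The sketch below is not PAF's actual code (PAF is written in Perl); the node names and LSN values are made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000148' to a comparable integer."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two 32-bit hexadecimal halves.
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(received: dict[str, str]) -> str:
    """Return the standby whose last received LSN is the highest."""
    return max(received, key=lambda name: lsn_to_int(received[name]))


standbys = {"node2": "0/3000148", "node3": "0/30001F0"}
print(most_advanced_standby(standbys))  # node3
```

Promoting the standby with the highest received LSN keeps the amount of discarded data as small as possible when replication is not fully synchronous.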

Configuring PAF for High Availability

PAF supports PostgreSQL version 9.3 and higher.
PAF is not responsible for creating or configuring primary/standby PostgreSQL deployments. Streaming replication must already be configured before using PAF.
PAF does not modify PostgreSQL configuration. However, it requires the following prerequisites:
- The standby must be configured as a hot standby. Hot standby nodes can be queried as read-only databases.
- A recovery template file (default: <postgresql_data_location>/recovery.conf.pcmk) has to be provided with the following parameters:
  - standby_mode = on
  - recovery_target_timeline = 'latest'
  - primary_conninfo must have the application_name parameter defined and set to the local node name as known to Pacemaker.
PAF exposes several configurable parameters for managing PostgreSQL resources, including:
- bindir: location of the PostgreSQL binaries (default: /usr/bin)
- pgdata: location of the PGDATA of your instance (default: /var/lib/pgsql/data)
- datadir: path to the directory set in data_directory from your postgresql.conf file
- pghost: the socket directory or IP address to use to connect to the local instance (default: /tmp)
- pgport: the port to connect to the local instance (default: 5432)
- recovery_template: the local template that will be copied as the PGDATA/recovery.conf file. This template file must exist on all nodes (default: $PGDATA/recovery.conf.pcmk)
- start_opts: additional arguments given to the Postgres process on startup. See “postgres --help” for available options. Useful when the postgresql.conf file is not in the data directory (PGDATA), e.g.: -c config_file=/etc/postgresql/9.3/main/postgresql.conf
- system_user: the system owner of your instance’s process (default: postgres)
- maxlag: maximum lag allowed on a standby before we set a negative primary score on it
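To show what the recovery template might contain, the following sketch renders one for a given node. The node name, the address used in primary_conninfo, and the `postgres` user are placeholders. Note that recovery.conf applies to PostgreSQL 11 and earlier; PostgreSQL 12 removed that file, so newer versions need a different approach.

```python
def render_recovery_template(node_name: str, primary_address: str,
                             user: str = "postgres") -> str:
    """Render the contents of a recovery.conf.pcmk template for one node."""
    # application_name must match this node's name as known to Pacemaker.
    conninfo = (f"host={primary_address} user={user} "
                f"application_name={node_name}")
    return "\n".join([
        "standby_mode = on",
        f"primary_conninfo = '{conninfo}'",
        "recovery_target_timeline = 'latest'",
    ])


template = render_recovery_template("node2", "10.0.0.50")
print(template)
```

Because application_name differs per node, each node needs its own copy of the template rather than a file shared verbatim across the cluster.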

PAF Advantages

PAF provides a free, open-source solution for configuring PostgreSQL high availability.
It can detect node failures and trigger elections when the primary becomes unavailable.
Quorum behavior can be enforced in PAF.
It provides a complete HA management solution for the database resource, including starting, stopping, and monitoring it, and handling network isolation scenarios.
It’s a distributed solution that enables the management of any node from another node.

PAF Limitations

PAF doesn’t detect a standby node that is misconfigured with an unknown or non-existent node in its recovery configuration. The node is still shown as a slave, even if the standby is running without connecting to the primary or cascading standby node.
Requires an extra port (Default 5405) to be opened for the Pacemaker and Corosync components’ communication using UDP.
Does not support NAT-based configuration.
No pg_rewind support.

High Availability for PostgreSQL Test Scenarios

We conducted a few tests to assess how well PAF manages PostgreSQL HA in some common use cases. All of these tests were run while an application was running and inserting data into the PostgreSQL database. The application was written using the PostgreSQL Java JDBC driver and relied on its connection failover capability.
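The article does not show the connection string the writer application used. As a hedged illustration of JDBC connection failover, the helper below assembles a multi-host URL of the kind the PostgreSQL JDBC driver accepts; the hostnames and database name are placeholders, and `targetServerType=primary` assumes a reasonably recent driver (older drivers use `master` for the same purpose).

```python
def jdbc_failover_url(hosts: list[str], database: str, port: int = 5432) -> str:
    """Build a multi-host PostgreSQL JDBC URL that targets the primary."""
    if not hosts:
        raise ValueError("at least one host is required")
    host_list = ",".join(f"{host}:{port}" for host in hosts)
    # The driver tries the hosts in order and keeps the first one that
    # matches the requested server type.
    return f"jdbc:postgresql://{host_list}/{database}?targetServerType=primary"


url = jdbc_failover_url(["node1", "node2", "node3"], "appdb")
print(url)
# jdbc:postgresql://node1:5432,node2:5432,node3:5432/appdb?targetServerType=primary
```

With a URL like this, the application reconnects to whichever node is currently the primary after a failover, without a configuration change.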

Standby Server Tests

No.	Test Scenario	Observation
1	Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to a running state. There was no disruption in the writer application.
2	Stop the PostgreSQL process	Pacemaker brought the PostgreSQL process back to a running state. There was no disruption in the writer application.
3	Reboot the server	The standby database server node was initially marked offline. Once the server came up after the reboot, Pacemaker started the PostgreSQL database and the server was marked online. If fencing had been enabled, the node would not have been added back to the cluster automatically. There was no disruption in the writer application.
4	Stop the Pacemaker process	The PostgreSQL process was stopped as well, and the server node was marked offline. There was no disruption in the writer application.

Master/Primary Server Tests

No.	Test Scenario	Observation
1	Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to a running state. The primary was recovered within the threshold time, so no election was triggered. The writer application was down for about 26 seconds.
2	Stop the PostgreSQL process	Pacemaker brought the PostgreSQL process back to a running state. The primary was recovered within the threshold time, so no election was triggered. The writer application was down for about 26 seconds.
3	Reboot the server	Pacemaker triggered an election once the primary had been unavailable for longer than the threshold time. The most eligible standby server was promoted as the new primary. Once the old primary came up after the reboot, it was added back to the database cluster as a standby. If fencing had been enabled, the node would not have been added back to the cluster automatically. The writer application was down for about 26 seconds.
4	Stop the Pacemaker process	The PostgreSQL process was stopped as well, and the server was marked offline. An election was triggered and a new primary was elected. There was downtime in the writer application.
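Downtime figures like the ones in these tables can be derived from the timestamps of successful writes. The helper below is a sketch of that calculation, not the measurement code used in these tests; the timestamps in the example are synthetic.

```python
def longest_gap(success_times: list[float]) -> float:
    """Return the longest interval, in seconds, between successful writes."""
    ordered = sorted(success_times)
    # Compare each timestamp with the one that follows it.
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return max(gaps, default=0.0)


# Synthetic example: writes succeed every second, with one 10-second outage.
print(longest_gap([0.0, 1.0, 2.0, 12.0, 13.0]))  # 10.0
```

Subtracting the normal write interval from the longest gap gives a closer estimate of the outage itself.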

Network Isolation Tests

No.	Test Scenario	Observation
1	Network isolate the standby server from other servers	Corosync traffic was blocked on the standby server. The server was marked offline and the PostgreSQL service was turned off due to the quorum policy. There was no disruption in the writer application.
2	Network isolate the primary server from other servers (split-brain scenario)	Corosync traffic was blocked on the primary server. The PostgreSQL service was turned off and the primary server was marked offline due to the quorum policy. A new primary was elected in the majority partition. There was downtime in the writer application.
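The quorum policy behind these results follows the usual majority rule: a partition keeps running resources only if it holds more than half of the cluster's votes. A minimal sketch, assuming one vote per node:

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """Return True if a partition holds a strict majority of the votes."""
    if not 0 <= partition_size <= cluster_size:
        raise ValueError("partition_size must be between 0 and cluster_size")
    return partition_size > cluster_size // 2


# In a three-node cluster, an isolated node loses quorum while the
# remaining two nodes keep it.
print(has_quorum(1, 3), has_quorum(2, 3))  # False True
```

This is why an isolated primary is stopped rather than left running: without a majority it cannot know whether another node has already been promoted.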

Miscellaneous Tests

No.	Test Scenario	Observation
1	Degrade the cluster by turning off all the standby servers	When all the standby servers went down, the PostgreSQL service on the primary was stopped due to the quorum policy. After this test, when all the standby servers were turned back on, a new primary was elected. There was downtime in the writer application.
2	Turn off all the servers one after the other in random order, starting with the primary, and bring them all back simultaneously	All the servers came up and joined the cluster. A new primary was elected. There was downtime in the writer application.

Is PAF the solution for PostgreSQL High Availability?

PAF provides several advantages for PostgreSQL high availability while supporting broader disaster recovery strategies. It uses IP address failover instead of rebooting the standby to connect to the new primary, minimizing disruption so the database remains operational and accessible without the need for manual intervention. The primary exception is timeline divergence, where pg_rewind may be required.

ScaleGrid provides a scalable infrastructure for PostgreSQL that allows users to scale resources effortlessly in response to fluctuating demands. This ease of management enables teams to concentrate on developing applications instead of maintaining the database.

In Part 1, we’ve explored ClusterLabs’ PAF solution. Part 2 examines Replication Manager (repmgr), while Part 3 covers tools like Patroni by Zalando and compares all three open-source solutions to help you choose the best fit for your PostgreSQL deployment.

Frequently Asked Questions

What is automatic failover in PostgreSQL?

Automatic failover automatically promotes a standby server to the primary role if the primary server fails. This minimizes downtime and helps maintain high availability.

What are the key components of PostgreSQL automatic failover?

An automatic failover setup consists of a primary server for write operations, standby servers that take over during failures, and a monitoring component (a dedicated monitor node in some tools) that detects issues and coordinates failover.

How does pg_auto_failover ensure high availability?

pg_auto_failover automates monitoring and failover while using synchronous replication to minimize data loss and maintain high availability.

What steps are involved in configuring PostgreSQL for automatic failover?

Configuring automatic failover involves installing PostgreSQL, setting up synchronous replication, configuring pg_hba.conf, and registering the primary and standby nodes using a management tool such as repmgr.

Why is testing your automatic failover setup important?

Regular testing verifies that failover works as expected, identifies potential issues, and confirms that standby servers can take over quickly when the primary server fails.

To learn more about ScaleGrid, please visit ScaleGrid.io. Connect with ScaleGrid on LinkedIn, X, Facebook, and YouTube.
