Managing High Availability (HA) in your PostgreSQL hosting is essential to ensuring your database deployment clusters maintain exceptional uptime and strong operational performance, so your data is always available to your application. In an earlier blog post, we showed you how to configure high availability for PostgreSQL using streaming replication, and now we’re going to show you how to best manage client-side HA.
Ensuring high availability in PostgreSQL involves implementing automatic failover, a critical process that maintains database operability and preserves data accessibility when unexpected failures occur. If the primary server encounters issues, operations are smoothly transitioned to a standby server with minimal interruption. This article showcases essential elements and techniques for achieving automatic failover within your PostgreSQL environment.
Key Takeaways
- PostgreSQL automatic failover enhances high availability by seamlessly switching to standby servers during primary server failures, minimizing downtime, and maintaining business continuity.
- The key components of automatic failover include the primary server for write operations, standby servers for backup, and a monitor node for health checks and coordination of failover events.
- Implementing robust tools such as pg_auto_failover, repmgr, and PostgreSQL Automatic Failover (PAF) is essential for managing high availability and ensuring reliable PostgreSQL operations.
Understanding PostgreSQL Automatic Failover
High availability is essential for PostgreSQL to maintain exceptional uptime and robust performance. It reduces downtime and supports business continuity. Automatic failover is a critical strategy to achieve this.
In PostgreSQL, automatic failover involves switching to a standby server if the primary server fails, preventing costly and disruptive downtimes. Effective management of failover and switchover operations is crucial for high availability.
Tools for PostgreSQL high availability include automatic failover, monitoring, replication, and user management. Leveraging these tools ensures resilient and reliable PostgreSQL clusters. The following section covers the key components of automatic failover.
Key Components of PostgreSQL Automatic Failover
Grasping the essential elements—the primary server, standby servers, and monitor node—is crucial for establishing high availability in PostgreSQL. Each component has a unique function that contributes to uninterrupted service and efficient transition during failover scenarios.
The primary server is responsible for handling all write operations and maintaining data accuracy. In the event of a primary server failure, standby servers are prepared to assume control, which helps reduce system downtime. Meanwhile, the monitor node keeps an eye on the health of the primary node and any standby nodes to orchestrate failovers effectively when necessary.
Now, let’s delve into greater detail regarding each part.
Primary Server
In a Postgres cluster, the primary server manages all write operations and maintains data integrity. It handles every transaction, ensuring that data modifications are correctly processed. Ensuring the health and performance of this central component is crucial as it constitutes the core of your database activities.
Standby servers are activated when the primary server fails. They activate failover mechanisms designed to sustain high availability with minimal interruption. The primary server’s pivotal role in accurately processing transactions is essential for preserving both integrity and dependability within the database cluster.
Standby Servers
In the event of a failover, standby servers within PostgreSQL swiftly take on the primary role to reduce downtime. Continuously replicating data from the primary server keeps these standbys up-to-date. By transitioning responsibilities to a standby, system availability is preserved in case the primary server experiences failure.
Standby servers support the principal PostgreSQL instance by continuously replicating its data, keeping an up-to-date copy available at all times for business continuity and safeguarding against potential data loss.
Database operations must continue without disruption to ensure high availability, even in the face of hardware or software failures. This includes keeping a hot standby ready to serve read-only queries alongside an operational standby node ready for promotion.
Monitor Node
The monitor node is essential for performing health checks and managing failover processes in the PostgreSQL automatic failover system. By consistently overseeing both primary and standby servers, it quickly identifies problems, thereby maintaining the system’s integrity and dependability.
The role of a Postgres node is pivotal within this framework. The monitor node’s continuous assessments of PostgreSQL nodes’ conditions facilitate seamless transitions during failovers. Such meticulous coordination by the monitor safeguards against data loss while sustaining high availability across all nodes involved in this setup.
Implementing PostgreSQL Automatic Failover Solutions
The process of setting up PostgreSQL automatic failover involves utilizing specific tools and methods to identify service breakdowns and switch over to standby servers without interruption. Several different options exist, each with its own distinct attributes and advantages.
Multiple tools are available to manage your PostgreSQL deployment clusters’ high availability (HA) using streaming replication. These solutions offer automatic failover capabilities, monitoring of availability and system status, replication, user management, and other useful administrative tasks for Postgres databases. Some of the prominent open-source solutions include:
- PostgreSQL Automatic Failover by ClusterLabs
- Replication Manager for PostgreSQL Clusters by repmgr (2ndQuadrant)
- Patroni
Each tool provides its own way of managing high-availability PostgreSQL clusters. In our three-part series of posts on HA for PostgreSQL, we’ll share an overview, the prerequisites, and the working and test results for each tool. In Part 1, we’ll deep-dive into ClusterLabs’ PAF solution.
pg_auto_failover
The pg_auto_failover extension is designed for automated management and failover processes within a PostgreSQL environment, ensuring high availability with minimal manual effort. This open-source tool is available from apt.postgresql.org, a Debian and Ubuntu repository. It streamlines the setup of a PostgreSQL cluster capable of automatic failover by utilizing commands like ‘pg_autoctl create’ and ‘pg_autoctl run’.
To implement pg_auto_failover, you must establish a monitor and initiate a primary node in your PostgreSQL instance. The ‘pg_autoctl run’ service should operate on each participating node to enable ongoing supervision and control over the system functions. When the initial host experiences issues, synchronous replication ensures that data remains consistent as connections are redirected to an alternative secondary host.
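As an illustration, a minimal setup along those lines might look like the following sketch (the hostnames, data directory paths, and the permissive `--auth trust` setting are assumptions; substitute values that match your environment):

```shell
# On the monitor host: create and start the monitor node
pg_autoctl create monitor --pgdata /var/lib/postgresql/monitor \
    --hostname monitor.example.com --auth trust

# On each Postgres host: create a node that reports to the monitor;
# the first node becomes the primary, later ones become standbys
pg_autoctl create postgres --pgdata /var/lib/postgresql/node1 \
    --hostname node1.example.com \
    --monitor 'postgres://autoctl_node@monitor.example.com/pg_auto_failover' \
    --auth trust

# Keep the keeper service running on every node for ongoing supervision
pg_autoctl run --pgdata /var/lib/postgresql/node1
```

Running `pg_autoctl show state --pgdata <dir>` against any node then reports the current role of each node in the formation.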
Pg_auto_failover excels in fostering environments characterized by fault tolerance while preserving data integrity across nodes. This contributes significantly to seamless high availability scenarios within any PostgreSQL setup—a testament to its robust capability in managing automated failovers without constant human oversight.
repmgr
In PostgreSQL clusters, repmgr takes the helm in handling both replication and automatic failover processes. It demands a unique superuser account and specific adjustments within the pg_hba.conf file for the effective functioning of the replication user. The installation process encompasses establishing PostgreSQL on all involved servers—both master and standby—and crafting a repmgr.conf file specifically on the master server.
Once you’ve set up repmgr.conf, it’s advisable to conduct an initial dry run to validate the configuration prior to actual implementation. If this trial completes without issues, move forward with cloning the standby server. To ensure high availability through seamless transition between servers during incidents, activate automatic failover by running the repmgrd daemon on both primary and standby servers.
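Those steps can be sketched as commands like the following (the hostname and the repmgr.conf path are assumptions; run the clone commands on the standby host):

```shell
# On the standby: validate the clone configuration without copying anything
repmgr -h primary.example.com -U repmgr -d repmgr \
    -f /etc/repmgr.conf standby clone --dry-run

# If the dry run is clean, perform the actual clone
repmgr -h primary.example.com -U repmgr -d repmgr \
    -f /etc/repmgr.conf standby clone

# Start the repmgrd daemon on both primary and standby
# to enable automatic failover
repmgrd -f /etc/repmgr.conf --daemonize
```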
By employing repmgr as part of your management toolkit, PostgreSQL clusters benefit from enhanced resilience attributable to meticulously orchestrated automatic failover paired with continuous replication functionality. This strategic setup ensures that database services remain accessible despite potential disruptions or failures of individual components within your server infrastructure.
PostgreSQL Automatic Failover (PAF)
PostgreSQL Automatic Failover (PAF) is a high-availability management solution for PostgreSQL by ClusterLabs. It uses Postgres synchronous replication to guarantee that no data is lost at the time of the failover operation. It makes use of the popular, industry-standard Pacemaker and Corosync stack. With Pacemaker and Corosync applications, you can detect PostgreSQL database failures and act accordingly.
Pacemaker is a service capable of managing many resources, and it does so with the help of resource agents. Resource agents are responsible for handling a specific resource, determining how it should behave, and informing Pacemaker of its results.
Your resource agent implementation must comply with the Open Cluster Framework (OCF) specification. This specification defines resource agents’ behavior and implementation of methods like stop, start, promote, demote, and interaction with Pacemaker.
PAF is an OCF resource agent for Postgres written in Perl. Once your database cluster is built using internal streaming replication, PAF can expose to Pacemaker the current status of the PostgreSQL instance on each database node: master, slave, stopped, catching up, etc.
Configuring PostgreSQL for Automatic Failover
Establishing a PostgreSQL setup with automatic failover requires a multi-step process to create a resilient configuration. Initially, you must install PostgreSQL and establish two clusters or servers. Both the primary server and the standby servers must have repmgr installed.
For repmgr to operate optimally, it’s necessary to set up a specialized superuser account. Modify replication configurations as required, making sure that the repmgr.conf file on the master server is configured properly.
Finally, start the PostgreSQL instance on the primary server so that repmgr can function efficiently.
Setting Up Synchronous Replication
Synchronous replication, first available in PostgreSQL 9.1, guarantees that replicas acknowledge the receipt of data before it is committed to the primary database. PAF employs this feature to avert any potential data loss during failover processes.
To set up synchronous replication, one must configure the primary database so that it waits for designated replicas to verify they have received data before committing any transactions. To preserve both high availability and data integrity, include each replica’s name in the synchronous_standby_names parameter within the configuration file of your primary server.
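For example, assuming a standby whose primary_conninfo sets application_name to `standby1` (a hypothetical name), the primary can be told to wait for its acknowledgement; a sketch using `ALTER SYSTEM` (available in PostgreSQL 9.4+):

```shell
# On the primary: wait for the named standby to acknowledge each commit;
# list several names to allow fallbacks if the first standby is down
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'standby1';"

# Apply the change without restarting the server
psql -U postgres -c "SELECT pg_reload_conf();"
```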
Adjusting pg_hba.conf
Altering the pg_hba.conf file is crucial for safeguarding replication connections. Tailor the configurations within this file to align with your particular network setups and needs.
Make sure to add entries specifically for the repmgr user in replication mode inside pg_hba.conf, and confirm that you have established the necessary permissions for this user to maintain a secure and effective replication process.
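A sketch of such entries (the subnet and the pg_hba.conf path are assumptions; adjust both to your network and installation):

```shell
# Allow the repmgr user to connect for replication
# and to the repmgr metadata database
cat >> /var/lib/pgsql/data/pg_hba.conf <<'EOF'
host    replication    repmgr    192.168.1.0/24    md5
host    repmgr         repmgr    192.168.1.0/24    md5
EOF

# Reload so the new rules take effect
psql -U postgres -c "SELECT pg_reload_conf();"
```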
Registering Nodes
Utilize the relevant command within repmgr to enlist the primary server as a recognized entity in the cluster. Following cloning, proceed to register the standby server using repmgr for its inclusion into the cluster.
It is vital to register each node to successfully oversee and regulate both primary and standby databases within a replication framework. This process guarantees synchronized and streamlined failover procedures.
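The registration commands look roughly like this (the repmgr.conf path is an assumption; on older repmgr releases the first command is `master register` instead):

```shell
# On the primary host: register it as the primary node
repmgr -f /etc/repmgr.conf primary register

# On the standby host, after cloning: register it as a standby
repmgr -f /etc/repmgr.conf standby register

# From any node: verify that both nodes are visible in the cluster
repmgr -f /etc/repmgr.conf cluster show
```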
How Postgres Automatic Failover Works
PAF communicates with Pacemaker regarding the cluster status and monitors the functioning of the PostgreSQL database. In the event of a failure, it informs Pacemaker. If there’s no chance of the current master being recovered, an election is triggered among the current standby database servers. Together with the robust Pacemaker stack, Postgres Automatic Failover performs management actions like start, stop, monitor, and failover on all the Postgres database nodes.
PostgreSQL Automatic Failover for High Availability (HA) Configuration
- PAF supports PostgreSQL version 9.3 and higher.
- PAF is not responsible for PostgreSQL master/standby creation or its setup – you must create and set up streaming replication before using PAF.
- PAF doesn’t edit any of PostgreSQL’s configuration or setup requirements. However, it requires database users to follow a few prerequisites like:
- Slave must be configured as hot standby. Hot standby slave nodes can be queried as read-only databases.
- A recovery template file (default: <postgresql_data_location>/recovery.conf.pcmk) has to be provided with the below parameters:
- standby_mode = on
- recovery_target_timeline = ‘latest’
- primary_conninfo must have the application_name parameter defined and set to local node name as in Pacemaker.
- PAF exposes multiple parameters related to the management of a Postgres resource. These can be configured to suit one’s requirements. Below are the parameters:
- bindir: location of the PostgreSQL binaries (default: /usr/bin)
- pgdata: location of the PGDATA of your instance (default: /var/lib/pgsql/data)
- datadir: path to the directory set in data_directory from your postgresql.conf file
- pghost: the socket directory or IP address to use to connect to the local instance (default: /tmp)
- pgport: the port to connect to the local instance (default: 5432)
- recovery_template: the local template that will be copied as the PGDATA/recovery.conf file. This template file must exist on all nodes (default: $PGDATA/recovery.conf.pcmk)
- start_opts: additional arguments given to the Postgres process on startup. See “postgres --help” for available options. Useful when the postgresql.conf file is not in the data directory (PGDATA), e.g., -c config_file=/etc/postgresql/9.3/main/postgresql.conf
- system_user: the system owner of your instance’s process (default: postgres)
- maxlag: maximum lag allowed on a standby before we set a negative master score on it
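Putting the prerequisites and parameters together, a PAF-managed resource can be created roughly as follows (the host address, node name, paths, and timeout values are assumptions, and `pcs` syntax varies between versions; this is a sketch, not a definitive recipe):

```shell
# 1) Provide the recovery template on every node
cat > /var/lib/pgsql/data/recovery.conf.pcmk <<'EOF'
standby_mode = on
recovery_target_timeline = 'latest'
# application_name must match this node's name as known to Pacemaker
primary_conninfo = 'host=192.168.1.10 user=replication application_name=node1'
EOF

# 2) Create the multi-state PAF resource in Pacemaker
pcs resource create pgsqld ocf:heartbeat:pgsqld \
    bindir=/usr/bin pgdata=/var/lib/pgsql/data \
    op start timeout=60s \
    op stop timeout=60s \
    op promote timeout=30s \
    op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    --master notify=true
```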
Postgres Automatic Failover Pros
- PAF gives the user a free hand over the configuration and setup of PostgreSQL itself.
- PAF can handle node failure and trigger elections when the master goes down.
- Quorum behavior can be enforced in PAF.
- It provides a complete high availability (HA) database management solution for the resource, including start, stop, and monitor operations, and handles network isolation scenarios.
- It’s a distributed solution that enables the management of any node from another node.
Postgres Automatic Failover Cons
- PAF doesn’t detect if a standby node is misconfigured with an unknown or non-existent node in its recovery configuration. The node will be shown as a slave, even if the standby is running without connecting to the master/cascading standby node.
- Requires an extra port (default: 5405) to be opened for the Pacemaker and Corosync components’ communication using UDP.
- Does not support NAT-based configuration.
- No pg_rewind support.
High Availability for PostgreSQL Test Scenarios
We conducted a few tests to determine the capability of PostgreSQL high availability (HA) management using PAF in some use cases. All of these tests were run while the application was running and inserting data into the PostgreSQL database. The application was written using the PostgreSQL Java JDBC Driver, leveraging its connection failover capability.
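For reference, the driver-side failover in these tests relies on the JDBC driver’s multi-host connection URL; a sketch with hypothetical host and database names:

```
jdbc:postgresql://node1:5432,node2:5432,node3:5432/appdb?targetServerType=master
```

The driver tries the listed hosts in order and, with `targetServerType=master`, connects only to the node accepting writes, so the application reconnects to the new master after a failover.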
Standby Server Tests
Sl. No | Test Scenario | Observation |
---|---|---|
1 | Kill the PostgreSQL process | Pacemaker brought the PostgreSQL process back to the running state. There was no disruption in the writer application. |
2 | Stop the PostgreSQL process | Pacemaker brought the PostgreSQL process back to the running state. There was no disruption in the writer application. |
3 | Reboot the server | The standby database server node was marked offline initially. Once the server came up after the reboot, PostgreSQL was started by Pacemaker and the server was marked as online. If fencing had been enabled, the node wouldn’t have been added back to the cluster automatically. There was no disruption in the writer application. |
4 | Stop the Pacemaker process | This also stopped the PostgreSQL process, and the server node was marked offline. There was no disruption in the writer application. |
Master/Primary Server Tests
Sl. No | Test Scenario | Observation |
---|---|---|
1 | Kill the PostgreSQL process | Pacemaker brought the PostgreSQL process back to the running state. The primary was recovered within the threshold time, so an election was not triggered. The writer application was down for about 26 seconds. |
2 | Stop the PostgreSQL process | Pacemaker brought the PostgreSQL process back to the running state. The primary was recovered within the threshold time, so an election was not triggered. There was a downtime of about 26 seconds in the writer application. |
3 | Reboot the server | An election was triggered by Pacemaker after the threshold time for which the master was unavailable. The most eligible standby server was promoted as the new master. Once the old master came up after the reboot, it was added back to the database cluster as a standby. If fencing had been enabled, the node wouldn’t have been added back to the cluster automatically. The writer application service was down for about 26 seconds. |
4 | Stop the Pacemaker process | This also stopped the PostgreSQL process, and the server was marked offline. An election was triggered and a new master was elected. There was downtime in the writer application. |
Network Isolation Tests
Sl. No | Test Scenario | Observation |
---|---|---|
1 | Network isolate the standby server from other servers | Corosync traffic was blocked on the standby server. The server was marked offline and PostgreSQL service was turned off due to quorum policy. There was no disruption in the writer application. |
2 | Network isolate the master server from other servers (split-brain scenario) | Corosync traffic was blocked on the master server. PostgreSQL service was turned off and master server was marked offline due to quorum policy. A new master was elected in the majority partition. There was a downtime in the writer application. |
Miscellaneous Tests
Sl. No | Test Scenario | Observation |
---|---|---|
1 | Degrade the cluster by turning off all the standby servers. | When all the standby servers went down, PostgreSQL service on the master was stopped due to quorum policy. After this test, when all the standby servers were turned on, a new master was elected. There was a downtime in the writer application. |
2 | Randomly turn off all the servers one after the other, starting with the master, and bring them all back simultaneously. | All the servers came up and joined the cluster. New master was elected. There was a downtime in the writer application. |
Is PAF the solution for PostgreSQL High Availability?
Postgres Automatic Failover provides several advantages in handling PostgreSQL high availability (HA) in many use cases. PAF uses IP address failover instead of rebooting the standby to connect to the new master during a failover event. This proves advantageous when the user does not want to restart the standby nodes. PAF also needs very little manual intervention and manages the overall health of all the Postgres database resources. The only case where manual intervention is required is a timeline data divergence, where the user can use pg_rewind.
ScaleGrid provides a scalable infrastructure for PostgreSQL that allows users to scale resources effortlessly in response to fluctuating demands. This ease of management enables teams to concentrate on developing applications instead of maintaining the database.
In Part 1, we’ve discussed the capabilities and workings of PostgreSQL Automatic Failover (PAF) by ClusterLabs, and in Part 2, we’ll discuss the same high availability aspects using the Replication Manager for PostgreSQL clusters (repmgr) by 2ndQuadrant. Make sure to check back for Part 3, where we’ll also cover Patroni by Zalando and compare all three open-source solutions to help you determine the best fit for your application.
Frequently Asked Questions
What is automatic failover in PostgreSQL?
In the event of a primary server failure, PostgreSQL’s system resilience and reliability are bolstered through an automatic failover mechanism that seamlessly transitions to a standby server. This switch maximizes high availability by substantially reducing downtime when issues arise with the primary unit.
What are the key components of PostgreSQL automatic failover?
In the PostgreSQL automatic failover system, a primary server is dedicated to handling write operations, while standby servers are ready to take over in the event of a failure of the primary. A monitor node is tasked with monitoring the health of all servers and orchestrating an efficient failover when necessary.
How does pg_auto_failover ensure high availability?
pg_auto_failover guarantees high availability through the automation of maintenance tasks and the orchestration of failover processes, significantly reducing data loss while providing consistent monitoring via synchronous replication.
What steps are involved in configuring PostgreSQL for automatic failover?
To set up PostgreSQL for automatic failover, you must first install it. Then, establish synchronous replication and adjust the pg_hba.conf file to secure connections. Finally, employ utilities such as repmgr to register each node.
Why is testing your automatic failover setup important?
It is essential to routinely test your automatic failover configuration to ensure it effectively handles malfunctions and maintains high availability. Such regular assessments reduce downtime, detect potential problems early, and confirm that standby servers can smoothly transition into the primary role when necessary.