Are you deploying PostgreSQL in the cloud and want to understand your options for achieving high availability? In our previous blog post, Managing High Availability in PostgreSQL – Part I, we discussed the capabilities and functioning of PostgreSQL Automatic Failover (PAF) by ClusterLabs. In Part II, we’re introducing you to an alternative open source tool, Replication Manager from 2ndQuadrant, to be closely followed by Part III where we dive into our third alternative, Patroni by Zalando.
- Managing High Availability in PostgreSQL – Part I: PostgreSQL Automatic Failover
- Managing High Availability in PostgreSQL – Part III: Patroni
Replication Manager (repmgr)
repmgr is an open-source tool suite developed by 2ndQuadrant for managing replication and failover of your PostgreSQL clusters. It provides the tools to setup, configure, manage, and monitor replication of PostgreSQL, and also enables you to perform manual switchover and failover tasks using repmgr utility. This free tool supports and enhances PostgreSQL’s built-in streaming replication.
Replication Manager provides two main tools to manage replication and failover of PostgreSQL.
- A command-line interface utility which enables you to perform various administrative tasks.
- repmgr enables you to setup standby servers, promote standbys, do a switchover, and monitor the status of your PostgreSQL cluster.
- It also provides dry run option for almost all of the administrative commands.
This is the daemon which:
- Actively monitors the PostgreSQL clusters and performs necessary actions based on the state of the cluster.
- Performs automatic failover in case the primary node goes down by promoting the most eligible standby as the new primary.
- Provides an option to monitor and store the data related to replication performance.
- Provides notification by invoking the user scripts for registered events.
How it Works
repmrg not only manages the replication of PostgreSQL clusters, but also has capabilities for setting up the standby servers for replication. Following the initial installation, we need to make changes to the repmgr configuration file (repmgr.conf) with the required details on each server. When a server is configured, it needs to be registered with repmgr using repmgr primary/standby register command. First, the primary node is setup and registered. Then, standby servers are created and configured using the repmgr standby clone command which clones the PostgreSQL standby node from another PostgreSQL server.
Replication Manager makes use of PostgreSQL extensions feature and creates its own schema on the cluster database to store the cluster-related information. Installation of the extension and creation of the schema happens during the registration of the primary server using repmgr. Once the setup is complete, manual administrative operations such as promote, follow, switchover, etc. can be done using repmgr utility. For switchover operation, it requires passwordless SSH to be setup between the nodes.
Automatic failover can be setup using repmgrd. repmgrd requires a shared library ‘repmgr’ to be loaded at the time of starting the PostgreSQL server. The library name should be mentioned in the shared_preload_libraries configuration parameter in the postgresql.conf file. Also, for repmgrd to work, failover=automatic parameter need to be set in repmgr.conf file. Once all these parameters are set, repmgrd daemon starts to actively monitor the cluster. If there is any failure in primary node, it will try to reconnect multiple times. When all attempts to connect to primary fail, the most eligible standby is chosen by election as the new primary by repmgrd.
repmgr also supports event notifications. It has a set of predefined events and stores each occurrence of these events in the repmgr.events table. repmgr enables event notifications to be passed to a user-defined program or script which can take further action, such as sending an email or triggering any alert. This is done by setting the event_notification_command parameter in repmgr.conf.
How Does It Handle The Split Brain Scenario?
repmgr tackles split brain scenarios using the location parameter, where each node should specify the location parameter based on the datacenter in which it is placed. In case of any network split, repmgr will ensure the promotion of the node which is in the same location as the primary. If it doesn’t find any node in that location, it will not promote any node in any location.
It also handles network isolation in the event of an even number of servers in a cluster. This is done using an extra node called the witness server. The witness server is a node which is considered only for the majority vote count. There will be no PostgreSQL installation on that server, and hence, no part to play in replication.
Are There Any Setup Requirements?
- repmgr will require a dedicated database and a user with superuser privileges. However, there’s also an option to provide a superuser if you are not willing to give the superuser access to repmgr user.
- If you want repmgr to copy configuration files which are located outside the PostgreSQL data directory, and/or to test switchover functionality, you will also need passwordless SSH connections between both servers, and rsync should be installed.
- If you intend to use service-based commands other than pg_ctl (which is used by repmgr by default) to start, stop, reload, and restart, you can specify them in repmgr configuration file (repmgr.conf).
- The basic configuration parameters required in repmgr configuration file are as follows:
node_id (int) – A unique integer greater than zero which identifies the node.node_name (string) – An arbitrary (but unique) string, using the server’s hostname or another identifier unambiguously associated with the server is recommended to avoid confusion.
conninfo (string) – Database connection information as a conninfo string. All servers in the cluster must be able to connect to the local node using this string.
data_directory (string) – The node’s data directory. This is needed by repmgr to perform operations when the PostgreSQL instance is not running and there’s no other way of determining the data directory.
- Repmgr provides utilities that help to setup primary and standby nodes and configure replication.
- It doesn’t use any extra ports for communication. If you want to perform switchover, only then will it require passwordless SSH to be configured.
- Provides notification by invoking the user scripts for the registered events.
- Performs automatic failover in case of primary server failure.
- repmgr doesn’t detect if the standby is misconfigured with an unknown or non-existent node in recovery configuration. The node will be shown as standby even if it is running without connecting to the primary/cascading standby node.
- Cannot retrieve the status of another node from a node where PostgreSQL service is down. Hence, it doesn’t provide a distributed control solution.
- It doesn’t handle recovering the health of individual nodes.
High Availability Test Scenarios
We conducted a few tests on PostgreSQL high availability management using repmgr. All of these tests were run while the application was running and inserting data to the PostgreSQL database. The application was written using PostgreSQL Java JDBC Driver leveraging the connection failover capability.
Standby Server Tests
|Sl. No||Test Scenario||Observation|
|1||Kill the PostgreSQL process||Standby server was marked as failed. There was no disruption in writer application. Manual intervention was required to start the PostgreSQL process again.|
|2||Stop the PostgreSQL process||Standby server was marked as failed. There was no disruption in writer application. Manual intervention was required to start the PostgreSQL process again.|
|3||Reboot the server||Standby server was marked as failed. Once the server came up after reboot, PostgreSQL was started manually and server was marked as running. There was no disruption in writer application.|
|4||Stop the repmgrd process||The standby server will not be a part of the automated failover situation. PostgreSQL service was found to be running. There was no disruption in writer application.|
Master/Primary Server Tests
|Sl. No||Test Scenario||Observation|
|1||Kill the PostgreSQL process||
|2||Stop the PostgreSQL process and bring it back immediately after health check expiry||
|3||Reboot the server||
|4||Stop the repmgr process||
Network Isolation Tests
|Sl. No||Test Scenario||Observation|
|1||Network isolate the primary server from other servers (all have same value for location in repmgr configuration)||
|2||Network isolate the primary server from other servers (the standby servers has same value for location but primary had a different value for location in repmgr configuration)||
repmgr provides several commands to setup and monitor PostgreSQL replication. It is feature-rich and also eases the job of the database administrator (DBA). However, it’s not a full fledged high availability management tool since it will not manage the resources. Manual intervention is required to ensure the resource is in proper state.
So, in this post, we’ve discussed the capabilities and workings of Replication Manager by 2ndQuadrant. In our next post, we’ll discuss the same high availability aspects using Patroni by Zalando. For users looking to automate their high availability in the cloud, check out our PostgreSQL on Azure and PostgreSQL on AWS fully managed solutions.