A long time ago, in a galaxy far far away, ‘threads’ were a programming novelty rarely used and seldom trusted. In that environment, the first PostgreSQL developers decided forking a process for each connection to the database is the safest choice. It would be a shame if your database crashed, after all.
Since then, a lot of water has flown under that bridge, but the PostgreSQL community has stuck by their original decision. It is difficult to fault their argument – as it’s absolutely true that:
- Each client having its own process prevents a poorly behaving client from crashing the entire database.
- On modern Linux systems, the difference in overhead between forking a process and creating a thread is much lesser than it used to be.
- Moving to a multithreaded architecture will require extensive rewrites.
However, in modern web applications, clients tend to open a lot of connections. Developers are often strongly discouraged from holding a database connection while other operations take place. “Open a connection as late as possible, close a connection as soon as possible”. But that causes a problem with PostgreSQL’s architecture – forking a process becomes expensive when transactions are very short, as the common wisdom dictates they should be.
The Connection Pool Architecture
Using a modern language library does reduce the problem somewhat – connection pooling is an essential feature of most popular database-access libraries. It ensures ‘closed’ connections are not really closed, but returned to a pool, and ‘opening’ a new connection returns the same ‘physical connection’ back, reducing the actual forking on the PostgreSQL side.
However, modern web applications are rarely monolithic, and often use multiple languages and technologies. Using a connection pool in each module is hardly efficient:
- Even with a relatively small number of modules, and a small pool size in each, you end up with a lot of server processes. Context-switching between them is costly.
- The pooling support varies widely between libraries and languages – one badly behaving pool can consume all resources and leave the database inaccessible by other modules.
- There is no centralized control – you cannot use measures like client-specific access limits.
As a result, popular middlewares have been developed for PostgreSQL. These sit between the database and the clients, sometimes on a seperate server (physical or virtual) and sometimes on the same box, and create a pool that clients can connect to. These middleware are:
- Optimized for PostgreSQL and its rather unique architecture amongst modern DBMSes.
- Provide centralized access control for diverse clients.
- Allow you to reap the same rewards as client-side pools, and then some more (we will discuss these more in more detail in our next posts)!
PostgreSQL Connection Pooler Cons
A connection pooler is an almost indispensable part of a production-ready PostgreSQL setup. While there is plenty of well-documented benefits to using a connection pooler, there are some arguments to be made against using one:
- Introducing a middleware in the communication inevitably introduces some latency. However, when located on the same host, and factoring in the overhead of forking a connection, this is negligible in practice as we will see in the next section.
- A middleware becomes a single point of failure. Using a cluster at this level can resolve this issue, but that introduces added complexity to the architecture.
- A middleware implies extra costs. You either need an extra server (or 3), or your database server(s) must have enough resources to support a connection pooler, in addition to PostgreSQL.
- Sharing connections between different modules can become a security vulnerability. It is very important that we configure pgPool or PgBouncer to clean connections before they are returned to the pool.
- The authentication shifts from the DBMS to the connection pooler. This may not always be acceptable.
- It increases the surface area for attack, unless access to the underlying database is locked down to allow access only via the connection pooler.
- It creates yet another component that must be maintained, fine tuned for your workload, security patched often, and upgraded as required.
Should You Use a PostgreSQL Connection Pooler?
However, all of these problems are well-discussed in the PostgreSQL community, and mitigation strategies ensure the pros of a connection pooler far exceed their cons. Our tests show that even a small number of clients can significantly benefit from using a connection pooler. They are well worth the added configuration and maintenance effort.
In the next post, we will discuss one of the most popular connection poolers in the PostgreSQL world – PgBouncer, followed by Pgpool-II, and lastly a performance test comparison of these two PostgreSQL connection poolers in our final post of the series.