Deploying a fully resilient PostgreSQL cluster on Diamanti

Scope of tutorial

In this tutorial we present a simple solution for containerizing an advanced database like PostgreSQL (commonly called Postgres) to make it fully resilient (available, scalable, recoverable, and upgradable). You will learn how Diamanti brings simplicity while using containers to their fullest capability, as we deploy a completely self-sustaining and self-recovering Database as a Service. The approach demonstrates how to remove failover and recovery complexity while ensuring that the Postgres pod is highly available within the Diamanti cluster.

Introduction to Diamanti

Diamanti brings to market the first appliance purpose-built for containerized applications that combines the ease of hyperconverged infrastructure with the unparalleled performance and efficiency of bare-metal Docker containers. Diamanti accelerates time-to-market, guarantees real-time service levels, and consolidates containers with 90% utilization.

Understanding PostgreSQL

Even with the emergence of modern databases like MongoDB, Cassandra, and HBase, conventional databases like Postgres are still widely used. Conventional databases are typically single-instance and very difficult to scale, recover, and make highly available. There are various methods to make them highly resilient, but even so, are those implementations able to fulfill users' expectations in today's dynamic and fast-moving world?

According to a report published by Datadog, Postgres is the 5th most popular application and the most popular database being containerized these days. Containerization helps make Postgres highly available and somewhat scalable, but configuring persistent storage and networking, and scaling and upgrading the infrastructure, are still challenging.

Below is a simple and commonly used architecture for achieving high availability for a Postgres cluster. In this example, pgpool handles the failover from master to slave. Pgpool is middleware between the Postgres servers and the client that abstracts the backend servers from the client. Similarly, the watch-pod in this example is responsible for promoting a slave to master in case of failure. But this solution is not easily scalable, recoverable, or upgradable. For instance, when a Postgres instance goes down, the user must take many steps to correctly fail over and bring the failed instance back up and running. Additionally, significant time is lost resyncing the volume of the failed instance.

Fig 1: Conventional solution for containerizing PostgreSQL

In this tutorial we are going to demonstrate how to design a containerized Postgres cluster to make it a true Database as a Service. We will focus on how you can deploy a Postgres cluster with Kubernetes, and moreover, how Diamanti reduces your operational overhead to achieve high availability, scalability and fast self-recovery. This solution will assure that you are back to full capacity quickly after a failure.

Solution Overview

First, let’s get an overview of the proposed solution, which offers both scaling flexibility and fast recovery time in seconds.

Fig 2: Diamanti's solution for containerizing PostgreSQL

In this solution, the Postgres master is not predetermined; instead, the master is selected by an election among the currently active nodes. Each Postgres node is identical when it boots up, and each node then configures itself based on its role of master or slave. The above diagram shows:

- The application talks to pgpool or the write/read services, which hide the implementation complexity and status of the Postgres cluster from the application.
- It is recommended to create 3 or more Postgres pods. A replication controller ensures that a Postgres instance is relaunched in case of failure.
- Replication from the master to the slave pods is done via the built-in asynchronous streaming replication module, where each slave follows the WAL (Write Ahead Log) of the master to keep its database in sync. Additionally, hardware-level replication among volume mirrors ensures that any given pod can be rescheduled to another node if its original node fails.
- Each pod has 3 containers:
  - Postgres container – Runs the actual Postgres database instance.
  - Consul container – Runs on each database pod and allows the cluster to elect its leader. In case of a master failure, consul triggers a re-election to elect a new master.
  - Pg-controller – A home-grown node.js based application that keeps track of who the master is and reconfigures its local Postgres container accordingly. If its pod is a slave, it configures the local Postgres instance to follow the WAL of the master for asynchronous streaming replication. This container also monitors changes in mastership and reconfigures the local Postgres instance when needed. Additionally, it can monitor instance health and kill the pod to restart it in case the instance is not reachable.
- We will create a 2-way or 3-way mirrored Diamanti volume, where each mirror physically resides on a different node. This approach ensures quick recovery of failed instances.
- The write service abstracts the current master based on a Kubernetes label. As a result, the application does not need to track which node is the master, simplifying the Postgres deployment.
- The read service load balances the slave clients to provide read scaling. A more capable load balancer like NGINX or HAProxy can be deployed for advanced load balancing of read operations.
- Pgpool is only used here to isolate writes from reads; no other pgpool functionality is utilized. If your application is aware of read vs. write and can send writes to the write service and reads to the read service, then you do not need pgpool.
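To make the slave configuration described above concrete: on a PostgreSQL 9.x slave, streaming replication is driven by a recovery.conf file, which pg-controller could generate along these lines. This is a sketch only; the connection values are illustrative, and the exact file pg-controller writes may differ.

```ini
# recovery.conf as a slave's pg-controller might write it
# (PostgreSQL 9.x style; host and user values are illustrative)
standby_mode = 'on'
primary_conninfo = 'host=pgmaster port=5432 user=replicator'
# creating this file promotes the slave to a full master
trigger_file = '/tmp/postgresql.trigger'
```

Because promotion is just the creation of the trigger file (or an equivalent pg_ctl promote call), the pg-controller sidecar can switch roles without restarting the database.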

Fig 3: Failure in Diamanti's solution for containerizing PostgreSQL

When the Kubernetes node that the master pod is running on goes down or is restarted, the Kubernetes pod and its volume also go down. But the mirror of that volume is still present on another node. In this event, two things will happen:

  1. Consul will detect that the master is down and call for an election. Once a new consul master is elected, the pg-controller sidecar detects that the master has changed, and the new master's pg-controller promotes its own Postgres instance to become the Postgres master. The other slaves then start pointing to the new master.
  2. Meanwhile, the old master will be restarted on the node where the mirror of its volume exists, and the pod will rejoin the cluster as a slave. Using the same logical volume, the Postgres pod will be up and running within seconds. Internally, the Diamanti platform automatically syncs all mirrors and switches the roles. In the event of a disk failure, the Diamanti platform automatically switches to the other mirrored volume, providing uninterrupted access to disk.

Fig 4: Failover and recovery after failure in Diamanti's solution for containerizing PostgreSQL

As you can see in the diagram above, pod-2 is now the master with the write service pointing to it. Also pod-0, which was the master earlier, has been restarted and has rejoined as a slave.

Test application

To demonstrate the Postgres solution described above, we use a simple node.js application. This application creates a RESTful API interface for any other front-end application to easily manage the backend database. For this example, we will populate a simple sample database on the Postgres instances. The application talks to either the Postgres instance or the pgpool instance, depending on the deployment approach selected. This application is modified from the original reference: http://mherman.org/blog/2016/03/13/designing-a-restful-api-with-node-and-postgres/#.WPW3346hdEI

What you need

  1. Docker image for Postgres container:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/cluster-RAS)
    Dockerhub: guptaarvindk/diamanti-postgres:v1
    Originally adapted from:
    https://github.com/CrunchyData/crunchy-containers
  2. Docker image for pg-controller sidecar container:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/images/pg-controller)
    Dockerhub: guptaarvindk/diamanti-postgres:v1
  3. Docker image for consul sidecar container:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/cluster-RAS)
    Dockerhub: consul:0.8.1
  4. Docker image for pgpool container:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/images/pgpool)
    Dockerhub: guptaarvindk/pgpool:v2
    Originally adapted from:
    https://github.com/CrunchyData/crunchy-containers
  5. Docker image for node.js based restful API application:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/images/nodeJsPgApp)
    Dockerhub: diamantisolutions/nodejs-mongo-app:v2
  6. Pod spec for launching pods based on all the above images:
    Github: https://github.com/diamantiSolutions/ (solutions/postgres/cluster-RAS)

Deploying a resilient Postgres cluster with Diamanti Mirrored Volumes

Setting up Diamanti Cluster

In this tutorial we assume that you already have a working Diamanti cluster ready for deployment.

Create a Diamanti Network object

The first step is to create a Diamanti network object to facilitate easy connectivity among the Postgres nodes. Let's create a network named "default". You will need to know the IP address range, gateway, and VLAN that will be used for your network. Please consult your network administrator if you are not sure about the values to assign; incorrect values in this step can cause unexpected behavior.


[[email protected] ~]# dctl network create default -s 172.16.137.0/24 --start 172.16.137.4 --end 172.16.137.20 -g 172.16.137.1 -v 137
[[email protected] ~]# dctl network list
NAME TYPE START ADDRESS TOTAL USED GATEWAY VLAN
default public 172.16.137.4 17 0 172.16.137.1 137

Create 2- or 3-way Diamanti Mirrored Volume

Now let’s provision a 3-way Diamanti Mirrored Volume for each Postgres instance we are going to create:


[[email protected] ~]# dctl volume create pg-vol-0 -s 20G -m 3
NAME SIZE NODE LABELS PHASE ATTACH-STATUS ATTACHED-TO DEVICE-PATH AGE
pg-vol-0 18.63GiB [] <none> Pending 0s
[[email protected] ~]# dctl volume create pg-vol-1 -s 20G -m 3
NAME SIZE NODE LABELS PHASE ATTACH-STATUS ATTACHED-TO DEVICE-PATH AGE
pg-vol-1 18.63GiB [] <none> Pending 0s
[[email protected] ~]# dctl volume create pg-vol-2 -s 20G -m 3
NAME SIZE NODE LABELS PHASE ATTACH-STATUS ATTACHED-TO DEVICE-PATH AGE
pg-vol-2 18.63GiB [] <none> Pending 0s
[[email protected] ~]# dctl volume list
NAME SIZE NODE LABELS PHASE ATTACH-STATUS ATTACHED-TO DEVICE-PATH AGE
pg-vol-0 18.64GiB [appserv82 appserv83 appserv85] <none> Available Available 0s
pg-vol-1 18.64GiB [appserv85 appserv82 appserv83] <none> Available Available 0s
pg-vol-2 18.64GiB [appserv85 appserv83 appserv82] <none> Available Available 0s

Create configmap to set up consul

Create consul config map from a configuration file:


[[email protected] ~]# kubectl create configmap consul --from-file=consul-server-configmap.json
configmap "consul" created
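The consul-server-configmap.json file is available in the GitHub repo linked above. For orientation, a minimal Consul 0.8 server configuration for a three-pod raft looks roughly like the following; the values here are illustrative, not the exact contents of that file.

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "dc1",
  "data_dir": "/consul/data"
}
```

With bootstrap_expect set to 3, the consul servers wait until three members have joined before electing a leader, which is why the later consul-join step matters.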

Create PostgreSQL services

Let's create the Postgres master and slave services, where the master service will point to the label role=pgmaster and the slave service will point to the label role=pgslave.


[[email protected] ~]# kubectl create -f pg-svc.json
service "pgmaster" created
service "pgslave" created
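The pg-svc.json spec is in the repo; the key idea is that each service is just a label selector, so re-pointing the write service after a failover is a matter of relabeling pods. The master service is roughly as follows (this sketch follows the standard Kubernetes v1 Service schema; the exact contents of pg-svc.json may differ):

```json
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": { "name": "pgmaster" },
  "spec": {
    "selector": { "role": "pgmaster" },
    "ports": [ { "port": 5432 } ]
  }
}
```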

Create PostgreSQL pod

We will deploy Postgres instances using software images and specs provided by Diamanti. Please keep in mind that a pod with this spec will contain three containers (Postgres, consul, pg-controller). Once you have the spec downloaded, deploy the Postgres nodes as follows by just changing the instance number in the template spec:


[[email protected] ~]# sed -e 's~<num>~0~g' postgres-pod.json | kubectl create -f -
replicationcontroller "pg-0" created
service "pg-0" created
[[email protected] ~]# sed -e 's~<num>~1~g' postgres-pod.json | kubectl create -f -
replicationcontroller "pg-1" created
service "pg-1" created
[[email protected] ~]# sed -e 's~<num>~2~g' postgres-pod.json | kubectl create -f -
replicationcontroller "pg-2" created
service "pg-2" created
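The sed invocation substitutes the `<num>` placeholder in the template spec and pipes the result to kubectl. Here is a tiny self-contained illustration of that substitution; the miniature template file is made up for the demo, not the real pod spec:

```shell
# Miniature stand-in for postgres-pod.json, just to show the templating
cat > /tmp/pg-template.json <<'EOF'
{ "metadata": { "name": "pg-<num>" } }
EOF

# Same substitution as above, printed instead of piped to kubectl
sed -e 's~<num>~2~g' /tmp/pg-template.json
# prints: { "metadata": { "name": "pg-2" } }
```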

You can now see that all the pods are running and that the previously created volumes are attached to the pods.


[[email protected] ~]# dctl volume list
NAME SIZE NODE LABELS PHASE ATTACH-STATUS ATTACHED-TO DEVICE-PATH AGE
pg-vol-0 18.64GiB [appserv82 appserv83 appserv85] diamanti.com/pod-name=default/pg-0-m60p1 Available Attached appserv85 /dev/nvme1n1 55m
pg-vol-1 18.64GiB [appserv85 appserv82 appserv83] diamanti.com/pod-name=default/pg-1-31x3t Available Attached appserv82 /dev/nvme1n1 55m
pg-vol-2 18.64GiB [appserv85 appserv83 appserv82] diamanti.com/pod-name=default/pg-2-26std Available Attached appserv83 /dev/nvme1n1 55m

Set up consul node to join the cluster

Wait about 30 seconds to allow all Postgres nodes to start up before joining them to the cluster.


[[email protected] ~]# kubectl create -f consul-join.yaml
job "consul-join" created

Create pgpool pod and service to route read and write requests

Now let’s create a pgpool pod and service which will route all the writes to the master service while allowing all the reads to go to the slave service.


[[email protected] ~]# kubectl create -f pgpool.json
replicationcontroller "pgpool" created
service "pgpool" created

Create nodejs application to access pgpool, pg-write, pg-read services

Next, create the nodejs app container for setting up the simple application to test our setup for functionality and failover.


[[email protected] ~]# kubectl create -f nodeJsPgApp.json
replicationcontroller "nodejs-pg-app-rc" created
service "nodejs-pg-app" created
replicationcontroller "nodejs-pg-app-rc-pgpool" created
replicationcontroller "nodejs-pg-app-rc-master" created
replicationcontroller "nodejs-pg-app-rc-slave" created

This will create various containers connecting to the different services like pgpool, master, slave or individual pods. For now, let’s only focus on the container which talks to the master/write and slave/read services.

Populate PostgreSQL database

Before we start inserting or reading data on the Postgres cluster, let's first set up a test table and insert some initial data. To do that, we need to see which pod is the master:


[[email protected] ~]# kubectl get pods -l role=pgmaster
NAME READY STATUS RESTARTS AGE
pg-1-dtxme 3/3 Running 0 14m

Now launch the container shell of the master Postgres pod and run the following SQL commands in the shell. Alternatively, you can add the SQL file to the Docker image and run it as part of the init script.


[[email protected] ~]# kubectl exec -it pg-1-dtxme /bin/sh
sh-4.2# psql -U postgres -c 'ALTER ROLE pgbench WITH CREATEDB'
ALTER ROLE
sh-4.2# psql -U pgbench << EOF
> DROP DATABASE IF EXISTS puppies;
> CREATE DATABASE puppies;
> \c puppies;
> CREATE TABLE pups (ID SERIAL PRIMARY KEY, name VARCHAR, breed VARCHAR, age INTEGER,sex VARCHAR);
> INSERT INTO pups (name, breed, age, sex) VALUES ('Tyler', 'Retrieved', 3, 'M');
> EOF
DROP DATABASE
CREATE DATABASE
You are now connected to database "puppies" as user "pgbench".
CREATE TABLE
INSERT 0 1
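At this point you can also verify that streaming replication is active, using standard PostgreSQL catalog queries via psql in the same container shell:

```sql
-- On the master: one row per connected standby
SELECT client_addr, state, sync_state FROM pg_stat_replication;

-- On any instance: returns true on a slave, false on the master
SELECT pg_is_in_recovery();
```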

Testing the solution

Now we are ready to test the solution we just created. Let’s connect to the shell on the node.js application pod and try executing some APIs. First we will connect to the shell of an application pod that talks to the master service to determine if the initialized data is there.


[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-master-c5r7v /bin/sh
# get the list of puppies
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"}],"message":"Retrieved ALL puppies"}

Now let's connect to the shell of an application pod that talks to the slave service to determine if it is consistent with the master.


[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-slave-vxqsb /bin/sh
# get the list of puppies
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"}],"message":"Retrieved ALL puppies"}

Now let's insert, update, and delete some entries in the master database using the application pod that connects to the master service.


[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-master-c5r7v /bin/sh
#insert 1 more puppy to table
/usr/src/node-app # curl --data "name=Whisky&breed=annoying&age=3&sex=f" http://localhost:3000/api/puppies
{"status":"success","message":"Inserted one puppy"}
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"},{"id":2,"name":"Whisky","breed":"annoying","age":3,"sex":"f"}],"message":"Retrieved ALL puppies"}
#update existing puppy
/usr/src/node-app # curl -X PUT --data "name=Hunter&breed=annoying&age=33&sex=m" http://localhost:3000/api/puppies/2
{"status":"success","message":"Updated puppy"}
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"},{"id":2,"name":"Hunter","breed":"annoying","age":33,"sex":"m"}],"message":"Retrieved ALL puppies"}
#insert one more and then delete first puppy
/usr/src/node-app # curl --data "name=Whisky&breed=annoying&age=3&sex=f" http://localhost:3000/api/puppies
{"status":"success","message":"Inserted one puppy"}
/usr/src/node-app # curl -X DELETE http://127.0.0.1:3000/api/puppies/1
{"status":"success","message":"Removed 1 puppy"}
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":2,"name":"Hunter","breed":"annoying","age":33,"sex":"m"},{"id":3,"name":"Whisky","breed":"annoying","age":3,"sex":"f"}],"message":"Retrieved ALL puppies"}

Now we can check if the slave databases are consistent or not using the application pod that connects to the slave service.


[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-slave-vxqsb /bin/sh
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":2,"name":"Hunter","breed":"annoying","age":33,"sex":"m"},{"id":3,"name":"Whisky","breed":"annoying","age":3,"sex":"f"}],"message":"Retrieved ALL puppies"}

In reality, you can connect to pgpool for both write and read requests, but for our testing it would be difficult to determine whether a read is going to a slave or to the master. So for this testing we have used the master/slave services directly instead of going through pgpool.

Testing High Availability

To test high availability, let's kill the master pod. As soon as we kill the master pod, the master service will fail as expected, but this should not impact slave reads. So let's immediately check whether we are still able to read from the slaves while the master is down.


[[email protected] ~]# kubectl delete pods -l role=pgmaster
pod "pg-0-z7f82" deleted
[[email protected] ~]# kubectl get pods -l role=pgmaster
No resources found.
[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-slave-vxqsb /bin/sh
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":2,"name":"Hunter","breed":"annoying","age":33,"sex":"m"},{"id":3,"name":"Whisky","breed":"annoying","age":3,"sex":"f"}],"message":"Retrieved ALL puppies"}

It will take around 30 seconds for one of the slaves to be promoted as the new master and resume all write operations. After waiting about 30 seconds, let's check whether a new master is up. If it is, try sending a write request to the master service to confirm that failover is indeed working as expected.


[[email protected] ~]# kubectl get pods -l role=pgmaster
NAME READY STATUS RESTARTS AGE
pg-2-9c8m7 3/3 Running 1 14m
[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-master-c5r7v /bin/sh
#insert 1 more puppy to table
/usr/src/node-app # curl --data "name=Jack&breed=lab&age=5&sex=m" http://localhost:3000/api/puppies
{"status":"success","message":"Inserted one puppy"}
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"},{"id":2,"name":"Whisky","breed":"annoying","age":3,"sex":"f"},{"id":100,"name":"Jack","breed":"lab","age":5,"sex":"m"}],"message":"Retrieved ALL puppies"}

Now let's check whether the old master is back as a slave. The old master should be restarted as a slave within 30-40 seconds after the failure.


[[email protected] ~]# kubectl get pods -l role=pgmaster
NAME READY STATUS RESTARTS AGE
pg-2-26std 3/3 Running 0 29m
[[email protected] ~]# kubectl get pods -l role=pgslave
NAME READY STATUS RESTARTS AGE
pg-0-m60p1 3/3 Running 0 29m
pg-1-31x3t 3/3 Running 0 29m

Let's double-check that writes to the new master are actually being propagated to all slaves. For that, let's connect to the shell of the app that talks to the slave service. Alternatively, you can connect to the shells of the individual apps that talk to each Postgres instance directly and check the consistency of the data across each node.


[[email protected] ~]# kubectl exec -it nodejs-pg-app-rc-slave-vxqsb /bin/sh
/usr/src/node-app # curl http://localhost:3000/api/puppies
{"status":"success","data":[{"id":1,"name":"Tyler","breed":"Retrieved","age":3,"sex":"M"},{"id":2,"name":"Whisky","breed":"annoying","age":3,"sex":"f"},{"id":100,"name":"Jack","breed":"lab","age":5,"sex":"m"}],"message":"Retrieved ALL puppies"}

As you can see, even when a pod is deleted or moved from one node to another, the failed pod recovers quickly with the same persistent data. Please note that if a node goes down, Kubernetes takes more than 5 minutes to detect the node failure. This Kubernetes behavior can be easily addressed with Diamanti: pg-controller supports a heartbeat mechanism for all the Postgres instances. If it detects that an instance is not responding, it kills the pod so that Kubernetes can recreate it immediately instead of waiting for 5 minutes. The polling interval can be tuned to detect quickly that a service is down while allowing enough time to avoid false alarms.
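The heartbeat logic just described can be sketched as follows. The real pg-controller sidecar is a node.js application; the function name and the stub probes here are illustrative only.

```shell
# Declare an instance unhealthy only after max_tries consecutive failed
# probes, so a single missed poll does not trigger a pod restart.
is_unhealthy() {
  check_cmd=$1
  max_tries=$2
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    if $check_cmd; then
      return 1              # one success: instance is still healthy
    fi
    i=$((i + 1))
    # a real poller would sleep between attempts
  done
  return 0                  # every probe failed: kill the pod
}

# Demo with stub probes standing in for a real check such as pg_isready:
if is_unhealthy false 3; then echo "unreachable: restart the pod"; fi
if is_unhealthy true 3; then echo "unexpected"; else echo "healthy"; fi
```

The consecutive-failure threshold is the knob that trades detection speed against false alarms.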

What’s Next

There are endless ways you can architect your Postgres solution around the Diamanti appliance. You can continue using traditional approaches to achieve availability, scalability, and recoverability, or you can utilize Diamanti features to make things simpler. You can also enhance the above solution to achieve the following:

Scalability

As discussed earlier, each Postgres instance is identical, and you can keep adding new Postgres pods to the consul raft, which will automatically boot themselves as new slaves.

Disaster Recovery

The above highly available and scalable solution provides quick in-cluster recovery. It can also serve as a disaster recovery solution by placing one of the replicas at a remote site with asynchronous streaming replication.

Endnotes

Diamanti is committed to its tagline Dev Done Fast. Ops Done Right.™ We are enabling new and innovative ideas so that developers can focus on development, not on infrastructure. In this tutorial we saw how Diamanti makes it drastically simpler to deploy a fully resilient Postgres cluster. In future tutorials we will explore how other modern and conventional applications can benefit from containerization on the Diamanti platform.