High availability section #68

@@ -0,0 +1,59 @@
---
layout: page
title: "Options"
date: 2015-01-27 22:02:36
categories:
permalink: /ha/options.html
---

Making PostgreSQL highly available
==================================
When thinking about high availability and PostgreSQL, there are two separate concerns that we need to address. The first
is data replication, that is, how we copy the data to all available nodes; the second is failover, that is, how we
detect and manage the failure of a node.

This guide deals with the standard PostgreSQL distribution, and as such the scenario it covers is that of a single master and
multiple slaves. The master serves requests and, upon failure, a new master is chosen from the available nodes.
Replication
-----------

There are a couple of different options when it comes to replication, with different tradeoffs between them.
The obvious way is to transfer, somehow, the changes in the underlying data files that PostgreSQL generates and have the slaves
wait on them until they become active. In that setup there can be only one active node serving requests; when
a failover occurs, the new active node mounts the data files, recovers whatever needs to be recovered and starts serving requests. The
main advantage of this solution is that the primary (the node that serves the requests) pays zero penalty in write performance,
but the disadvantage is that a few nodes sit inactive, waiting to resume active duties.

The second solution is to take advantage of PostgreSQL's binary (streaming) replication. With it, the master continuously communicates
the changes to the standby servers, which replay them so that all slaves stay in the same state. The
advantage is that we can use the standby servers to offload read activity (still, we can have only one writable node); the
disadvantage is a small performance drop when writing, since the master has to ship the changes to the slaves (and, in synchronous
mode, wait for confirmation that they were actually applied).

This guide covers the second solution. For read-oriented applications it gives an almost linear scaling of read performance,
as read operations (queries) can be spread across a multitude of standby servers.
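A quick way to tell which role a node currently plays once replication is set up (a hot standby accepts read-only queries but rejects writes) is to ask PostgreSQL directly:

    psql# SELECT pg_is_in_recovery();   -- 't' on a standby, 'f' on the master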
Prerequisites
=============

Network
-------

We assume that all PostgreSQL nodes are on the same subnet (e.g. 10.8.4.0/24). There are a few ways to achieve that:
Reviewer: I don't think the guide should assume this. Use of passwords and roles with the replication privilege is more secure and not even more difficult than opening up remote Here's an example:

Author: The network assumption was not made for security reasons. The separation of networks is mostly to isolate replication and cluster traffic from the normal traffic. I'll amend the pg_hba part to include use of replication users.

Edit: I reread the guide and the 10.8.4.0/24 addresses are the real IPs of the machines. The 10.8.3.0/24 subnet IPs are the VIP ones assigned by Pacemaker.
1. Physically place them on the same network.
2. If the above is not feasible (e.g. with a cloud provider, or leased machines), you can use a VPN to establish a private network between the servers, over which all communication will happen.

Operating system: we assume an Ubuntu derivative, 14.04 or later.
PostgreSQL version: 9.3 or later.

We assume that the databases will be accessed through a 10.8.3.0/24 network: the master will be 10.8.3.1 and the slave 10.8.3.2.
We also assume that the database nodes have hostnames of the form db[number].dbcluster.
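If these hostnames are not resolvable through DNS, a minimal /etc/hosts sketch on every node could look like the following (the 10.8.4.x addresses are example values for the machines' real IPs):

    # /etc/hosts on every cluster node (example addresses)
    10.8.4.1    db1.dbcluster
    10.8.4.2    db2.dbcluster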
@@ -0,0 +1,55 @@
---
layout: page
title: "Replication configuration"
date: 2015-01-27 22:02:36
categories:
permalink: /ha/replication.html
---

The master configuration
========================
The master server is responsible for distributing the changes to the standby servers. We have to change the following settings to achieve that:

    port = 5432
    listen_addresses = '*'   # bind to everything, so the same postgresql.conf can be reused on all our nodes
    wal_level = hot_standby
    max_wal_senders = 5      # total number of WAL senders that can be used concurrently by standbys or streaming base backups
    wal_keep_segments = 8    # how many WAL segments to keep; this depends on how fast the standbys consume the logs, so increase it if you have slow standbys
    archive_mode = on        # we keep archives in case we need them for slaves that have fallen behind
    hot_standby = on         # ignored by the master server
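Since archive_mode is on, the master also needs an archive_command. A minimal sketch, assuming the archives live under /db/data/pg_archive (the directory the cluster's restore_command points at later), could be:

    # postgresql.conf (sketch): copy completed WAL segments to the archive directory,
    # refusing to overwrite a segment that is already there
    archive_command = 'test ! -f /db/data/pg_archive/%f && cp %p /db/data/pg_archive/%f'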
It is much easier to keep a single postgresql.conf and share it between all your nodes.

Reviewer: Maybe. It falls down if there are slight asymmetries between nodes, e.g. one is larger than another, or the preferred file system configuration and layout has evolved. Seems like spurious advice for this section.

Author: It is mostly a comment that it is easier to keep a single postgresql.conf in such a setup. Given that the servers should be equivalent, in the sense that failover will promote one of them to assume active duties, there shouldn't be serious divergence; otherwise you may end up with an underpowered slave serving live traffic.
Apart from that, we have to permit the standby servers to connect to the master and request WAL. We need to add a line to pg_hba.conf that permits replication connections from the slaves on the same subnet:

    hostssl    replication    replicator    10.8.4.0/24    md5

Finally, we need to create the user that is allowed to connect to the server and start replication:

    psql# CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'password';
The slave configuration
=======================

Setting up a standby to consume the logs is easy. We just need a base backup of the main database, plus all the WAL that has been archived in the meantime.
The command to do it in one go is:
    $ sudo -u postgres pg_basebackup -h 10.8.4.1 -p 5432 -U replicator -D /db/data -X stream -R -W
where 10.8.4.1 is the IP address of the master from which we want to take the backup, 5432 is the port of the master, /db/data is the directory in the filesystem where
the data are to be saved, and replicator is the user we defined in the previous step (-W makes pg_basebackup prompt for its password).

The -R flag also generates a recovery.conf file that tells the PostgreSQL instance it is a standby server and where it should connect to fetch the WAL.
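The generated file typically contains little more than the standby flag and the connection string; with the values used above it should look roughly like:

    # recovery.conf (as written by pg_basebackup -R)
    standby_mode = 'on'
    primary_conninfo = 'host=10.8.4.1 port=5432 user=replicator password=password'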
At this point, we can edit recovery.conf to specify a trigger file. A trigger file is a file whose presence instructs the standby to assume master duties. We don't
need it, as we will do the failover via the cluster software; the setting nevertheless is:

    trigger_file = '/path/to/the/trigger/file'

Keep in mind that the trigger file MUST NOT exist. You create it when you want to promote the standby to master, e.g. via touch /path/to/the/trigger/file.
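Once the standby is running, you can confirm on the master that it is actually streaming, for example:

    psql# SELECT client_addr, state, sync_state FROM pg_stat_replication;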
@@ -0,0 +1,158 @@
---
layout: page
title: "Cluster configuration"
date: 2015-01-27 22:02:36
categories:
permalink: /ha/cluster.html
---

The cluster
-----------
Now that we have a correctly replicating database, we need to establish a mechanism for managing the actual failover and promotion of nodes. For this, we will create a cluster
using Pacemaker and Corosync.

Packages
========

The following steps must be run on all nodes of our DB cluster.

Let's start by installing the appropriate packages:

    $ sudo apt-get install corosync pacemaker pacemaker-cli-utils cluster-glue
Then we need to update the postgres resource agent (RA), as the one in the standard distribution is a bit old:

    $ sudo wget https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/pgsql -O /usr/lib/ocf/resource.d/heartbeat/pgsql
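Depending on how the file gets written, it may be worth making sure the agent is still executable:

    $ sudo chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql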
Corosync configuration
======================

Reviewer: It's rather handy to have an example of a corosync/pacemaker/et al configuration, but I'm confused as to what actually happens during failover. Is it actually manipulating VIPs, as indicated below? Or is it a simple promotion whereby the client will "somehow" have to be notified of the failover?

Author: They apply the proper VIPs to the nodes (master and slave) and promote one server to master. The newly introduced slave (after failover) will assume the slave serving IP.

After we have finished with the installations, it's time to configure Corosync.
The Corosync configuration should be applied to every database node in /etc/corosync/corosync.conf:
    totem {
        version: 2
        secauth: off
        cluster_name: dbcluster
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: db1.dbcluster
            nodeid: 1
        }
        node {
            ring0_addr: db2.dbcluster
            nodeid: 2
        }
    }

    logging {
        fileline: off
        to_logfile: yes
        to_syslog: no
        debug: off
        logfile: /var/log/corosync.log
        timestamp: on
        logger_subsys {
            subsys: AMF
            debug: off
        }
    }

    quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
    }
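With the configuration in place, restart both services on every node so they pick it up; on a 14.04-style system the stock init scripts are usually enough:

    $ sudo service corosync restart
    $ sudo service pacemaker restart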
Pacemaker/resources configuration
=================================

After we restart Corosync and Pacemaker, we are ready to configure Pacemaker. Pacemaker configuration is done through crm,
so we execute the following:
    crm configure property no-quorum-policy="ignore"
    crm configure property stonith-enabled="false"   # we don't need STONITH for now

    crm configure rsc_defaults resource-stickiness="INFINITY"
    crm configure rsc_defaults migration-threshold=1
    # The IP of the MASTER node
    crm configure primitive vip-master ocf:heartbeat:IPaddr2 params ip=10.8.3.1 cidr_netmask=24 \
        op start timeout="60s" interval="0s" on-fail="restart" \
        op monitor timeout="60s" interval="10s" on-fail="restart" \
        op stop timeout="60s" interval="0s" on-fail="block"

    # The IP of the SLAVE node
    crm configure primitive vip-slave ocf:heartbeat:IPaddr2 params ip=10.8.3.2 cidr_netmask=24 \
        meta \
            resource-stickiness="1" \
        op start timeout="60s" interval="0s" on-fail="restart" \
        op monitor timeout="60s" interval="10s" on-fail="restart" \
        op stop timeout="60s" interval="0s" on-fail="block"
    crm configure primitive pingCheck ocf:pacemaker:ping \
        params \
            name="default_ping_set" \
            host_list="10.8.3.1" \
            multiplier="100" \
        op start timeout="60s" interval="0s" on-fail="restart" \
        op monitor timeout="60s" interval="10s" on-fail="restart" \
        op stop timeout="60s" interval="0s" on-fail="ignore"

    crm configure clone clnPingCheck pingCheck
    crm configure primitive pgsql ocf:heartbeat:pgsql \
        params \
            pgport="5432" \
            pgctl="/usr/lib/postgresql/9.3/bin/pg_ctl" \
            psql="/usr/lib/postgresql/9.3/bin/psql" \
            pgdata="/db/data/" \
            node_list="db1.dbcluster db2.dbcluster" \
            restore_command="cp /db/data/pg_archive/%f %p" \
            primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
            master_ip="10.8.3.1" \
            stop_escalate="0" \
            rep_mode="async" \
            start_opt="-p 5432" \
        op start timeout="60s" interval="0s" on-fail="restart" \
        op monitor timeout="60s" interval="4s" on-fail="restart" \
        op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
        op promote timeout="60s" interval="0s" on-fail="restart" \
        op demote timeout="60s" interval="0s" on-fail="stop" \
        op stop timeout="60s" interval="0s" on-fail="block" \
        op notify timeout="60s" interval="0s"
    crm configure ms msPostgresql pgsql \
        meta \
            master-max="1" \
            master-node-max="1" \
            clone-max="2" \
            clone-node-max="1" \
            notify="true"
    crm configure colocation rsc_colocation-1 inf: msPostgresql clnPingCheck
    crm configure colocation rsc_colocation-2 inf: vip-master msPostgresql:Master

    # We want the slave VIP to move to the master node if the slave fails. This is optional,
    # but it helps if the read traffic is served by the slave node.
    # crm configure colocation rsc_colocation-3 inf: vip-slave msPostgresql:Slave
    crm configure order rsc_order-1 0: clnPingCheck msPostgresql
    crm configure order rsc_order-2 0: msPostgresql:promote vip-master:start symmetrical=false

    # Again optional, but required if we serve read traffic from the slave
    # crm configure order rsc_order-3 0: msPostgresql:demote vip-slave:start symmetrical=false
    crm configure location rsc_location-1 vip-slave \
        rule 200: pgsql-status eq "HS:sync" \
        rule 200: pgsql-status eq "HS:async" \
        rule 100: pgsql-status eq "PRI"

    crm configure location rsc_location-2 msPostgresql \
        rule -inf: not_defined default_ping_set or default_ping_set lt 100
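After loading the configuration, it is worth checking that both nodes are online and that the resources have settled into the expected roles; crm_mon (from pacemaker-cli-utils) gives a one-shot overview:

    $ sudo crm_mon -1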
Reviewer: I would probably cut the file/block device based failover from this. That's becoming a "special needs" solution, from what I can tell, whereas streaming replication performs well and is portable to most situations.

Author: I pondered about it when I was writing it; after all, the guide is concerned with streaming replication. I decided to keep it as a reference to what is possible if someone, for whatever reason, does not want to use streaming replication.