Linux|centos7-postgresql数据库|yum安装数据库和配置repmgr高可用集群以及repmgr的日常管理工作

本文主要是介绍Linux|centos7-postgresql数据库|yum安装数据库和配置repmgr高可用集群以及repmgr的日常管理工作，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

一、

前言

postgresql 的yum部署其实还是有点东西的，本文就做一个小小的记录，高可用方面repmgr插件还是非常不错的，但如何部署以及部署后如何使用也是一个难点，因此，也在本文里做一个记录

环境介绍：

第一台服务器：操作系统是centos7，内核版本是3.10，IP地址是192.168.123.17 VMware虚拟机，内存4G，cpu4核，计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是主节点

第二台服务器：操作系统是centos7，内核版本是3.10，IP地址是192.168.123.20 VMware虚拟机，内存4G，cpu4核，计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是从节点

二、

详细的yum安装postgreslq数据库

1，配置yum源

yum安装postgresql比较简单，源配置官方源就可用了，命令如下（2台服务器都执行）：

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

为了防止某些库安装的时候提示缺少依赖，增加阿里云的库和阿里云的epel库

wget -O /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo

增加不常用的rpm包依赖仓库：

[centos-sclo-sclo]
name=CentOS-7 - SCLo sclo
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/sclo/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo[centos-sclo-rh]
name=CentOS-7 - SCLo rh
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/rh/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

2，

安装postgresql的组件和postgresql的server服务(2台服务器都执行)

组件比较多，主要是lib库（postgresql运行的依赖库），插件扩展库（postgresql的插件库），开发库（一些postgresql的特殊组件需要使用的库），连接库（开发微服务使用的库）

后面的几个看自己的需求安装不，比如postgresql12-plperl

版本号根据自己需求来，我这里是12.4，不想安这个低版本的，修改12.4就可以了，比如全部的12.4修改为12.8

yum install postgresql12-server-12.4 postgresql12-libs-12.4 postgresql12-devel-12.4 postgresql12-contrib-12.4 postgresql12-tcl postgresql12-odbc postgresql12-12.4  postgresql12-plperl-12.4 postgresql12-pltcl-12.4 postgresql12-plpython-12.4  postgresql12-plpython3-12.4 postgresql12-llvmjit-12.4 postgresql12-devel-12.4  -y

3，

postgresql的环境变量定义和一些特殊处理（两台服务器都执行）

在/etc/profile 文件末尾添加如下变量，计划postgresql安装在/data/pgsql/data目录下

PGDATA=/data/pgsql/data
export PGDATA
PGHOME=/data/pgsql
export PGHOME
PATH=$PATH:$PGHOME/bin:$PGDATA:/usr/pgsql-12/bin/
export PATH  PGHOME PGDATA

写入完成后，激活变量：

source /etc/profile

切换用户到普通用户postgres，将该用户的环境变量文件全部注释掉，结果如下：

-bash-4.2$ whoami
postgres
-bash-4.2$ cat ~/.bash_profile 
#[ -f /etc/profile ] && source /etc/profile
#PGDATA=/var/lib/pgsql/12/data
#export PGDATA
# If you want to customize your settings,
# Use the file below. This is not overridden
# by the RPMS.
#[ -f /var/lib/pgsql/.pgsql_profile ] && source /var/lib/pgsql/.pgsql_profile

注释的原因是我们不希望使用默认的/var/lib/12目录存放数据库的数据文件，注销普通用户再次su - postgres 重新激活变量后，在普通用户下，可以看到postgresql数据库相关变量已经固定了：

[root@centos10 media]# su - postgres
Last login: Sat Mar 30 05:19:48 CST 2024 on pts/0
-bash-4.2$ echo $PGDATA
/data/pgsql/data

4，创建相关目录并赋予普通用户的权限，准备初始化工作（root用户执行，两台服务器都执行）：

mkdir -p /data/pgsql/
chown -Rf postgres. /data/pgsql

5，执行数据库初始化（仍然是切换到普通用户postgres）：

由于前面的准备工作比较充分，环境变量已经完全配置好了，因此，执行initdb就可以完成初始化了，/data/pgsql/data/目录以及其下的文件将在初始化的时候创建好

初始化日志如下：

[root@centos7 media]# su - postgres
Last login: Sat Mar 30 05:25:23 CST 2024 on pts/0
-bash-4.2$ initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".Data page checksums are disabled.creating directory /data/pgsql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Shanghai
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... okinitdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.Success. You can now start the database server using:pg_ctl -D /data/pgsql/data -l logfile start

6，

在root用户下管理postgresql的启停

由于是yum安装的，因此，启停脚本是已经写好的，可以直接用，但我们数据库的数据文件存放路径修改了，因此，需要先修改一下启停脚本：

/usr/lib/systemd/system/postgresql-12.service

Environment=PGDATA=/var/lib/pgsql/12/data/
修改为
Environment=PGDATA=/data/pgsql/data/

这个启停脚本名字太长了，修改一下名字就叫pg12吧：

mv /usr/lib/systemd/system/postgresql-12.service /usr/lib/systemd/system/pg12.service

启动postgresql数据库并查看进行确认是否启动成功：

systemctl enable pg12 && systemctl start pg12

查看postgreslq的进程：

[root@centos7 media]# ps aux |grep postgre
postgres 16161  0.2  0.4 397292 17424 ?        Ss   05:51   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres 16164  0.0  0.0 251676  2008 ?        Ss   05:51   0:00 postgres: logger   
postgres 16166  0.0  0.0 397292  2060 ?        Ss   05:51   0:00 postgres: checkpointer   
postgres 16167  0.0  0.0 397292  2288 ?        Ss   05:51   0:00 postgres: background writer   
postgres 16168  0.0  0.0 397292  2064 ?        Ss   05:51   0:00 postgres: walwriter   
postgres 16169  0.0  0.0 397844  3212 ?        Ss   05:51   0:00 postgres: autovacuum launcher   
postgres 16170  0.0  0.0 251672  2120 ?        Ss   05:51   0:00 postgres: stats collector   
postgres 16171  0.0  0.0 397844  2808 ?        Ss   05:51   0:00 postgres: logical replication launcher   
root     16173  0.0  0.0 112712   964 pts/0    S+   05:51   0:00 grep --color=auto postgre

7，

远程连接和postgres用户密码设置

修改主配置文件 /data/pgsql/data/postgresql.conf

listen_addresses = '*'          
port = 15433                            
max_connections = 1000   
wal_level = logical

这里是设置数据库连接端口是15433

修改安全连接文件pg_hba.conf：

末尾添加如下两行，这两行的意思是复制用户replication可以随意连接，但其它用户远程连接的时候需要密码，本地登录的时候不校验密码

host    replication     all             0.0.0.0/0                 trust
host    all             all             0.0.0.0/0           md5

8，

本地登录数据库设置密码

su - postgres -c "psql -p15433"alter user postgres with password '123456';

此时，数据库算是完全安装完毕，连接端口15433，连接密码123456 后面就不在重复这些信息了

随意安装一个插件，看看lib库什么的是否正常：

[root@centos10 ~]# su - postgres -c "psql -p15433"
psql (12.4)
Type "help" for help.postgres=# create extension 
adminpack           btree_gin           dblink              file_fdw            hstore_plperlu      insert_username     jsonb_plperl        jsonb_plpythonu     ltree_plpython3u    pg_buffercache      pgrowlocks          pg_visibility       plpython3u          postgres_fdw        tablefunc           unaccent
amcheck             btree_gist          dict_int            fuzzystrmatch       hstore_plpython2u   intagg              jsonb_plperlu       lo                  ltree_plpythonu     pgcrypto            pg_stat_statements  plperl              plpythonu           refint              tcn                 "uuid-ossp"
autoinc             citext              dict_xsyn           hstore              hstore_plpython3u   intarray            jsonb_plpython2u    ltree               moddatetime         pg_freespacemap     pgstattuple         plperlu             pltcl               seg                 tsm_system_rows     xml2
bloom               cube                earthdistance       hstore_plperl       hstore_plpythonu    isn                 jsonb_plpython3u    ltree_plpython2u    pageinspect         pg_prewarm          pg_trgm             plpython2u          pltclu              sslinfo             tsm_system_time     
postgres=# create extension pg_trgm ;
CREATE EXTENSION

二，

repmgr的基本安装部署和配置

repmgr的介绍：

Repmgr是2ndQuadrant（第二象限公司）开发的一款复制的开源工具套件，用于管理PostgreSQL服务器集群中的复制和故障转移。

最初，它主要是为了简化流副本的管理，后来发展成为一个完整的故障转移管理套件。它通过设置备用服务器，监视复制以及执行管理任务（如故障转移或手动切换操作）的工具，增强了PostgreSQL内置的热备份功能。

其中各组件功能如下：

No.1 Repmgrd 守护进程

它主动监视复制集群中的服务器并执行以下任务：

1）监控和记录集群复制性能；

2）通过检测主服务器故障并提升最合适的备用服务器来执行故障转移；

3）将有关群集中事件的通知提供给用户定义的脚本，该脚本可以执行诸如通过电子邮件发送警报等任务；

4）repmgrd 根据本地数据库角色不同，其功能也不同：

主库：repmgrd仅监控本地数据库，负责自动恢复、同异步切换；
备库：repmgrd监控本地数据库和主数据库，负责自动切换、复制槽删除。

No.2 Repmgr命令管理

用于执行管理任务的命令行工具，主要有以下方面作用：

1）设置备用服务器；

2）将备用服务器升级为主服务器；

3）切换主服务器和备用服务器；

4）显示复制群集中的服务器状态。

No.3 用户和元数据为了有效地管理复制集群，repmgr提供专用数据库存储和管理有关repmgr集群服务的相关信息。

此模式在部署repmgr服务时，由repmgr扩展自动创建，该扩展在初始化repmgr -administered集群（repmgr主寄存器）的第一步中安装，包含以下对象：

表:

repmgr.events：记录感兴趣的事件；

repmgr.nodes：复制群集中每个服务器的连接和状态信息；

repmgr.monitoring_history：repmgrd写入的历史备用监视信息。
视图

repmgr.show_nodes：基于表repmgr.nodes，另外显示服务器上游节点的名称。

repmgr.replication_status：启用repmgrd的监视时，显示每个备用数据库的当前监视状态。

repmgr元数据模式可以存储在现有的数据库或在自己的专用数据库。

请注意，repmgr元数据模式不能驻留在不属于repmgr管理的复制集群的数据库服务器上。数据库用户必须可供repmgr访问此数据库并执行必要的更改。此用户不需要是超级用户，但是某些操作（如初始安装repmgr扩展）将需要超级用户连接（可以使用命令行选项--superuser在需要时指定）。

1、

计划repmgr安装方式为yum，安装命令如下：

yum install repmgr_12-5.4.1       repmgr_12-devel-5.4.1  repmgr_12-llvmjit-5.4.1  -y

2、

主数据库装载repmgr插件

修改主数据库17的主配置文件，data/pgsql/data/postgresql.conf，修改完毕后重启服务器

主要是定义数据库的端口，以及归档命令

listen_addresses = '*'
port = 15433
max_connections = 1000	
shared_buffers = 128MB	
dynamic_shared_memory_type = posix
wal_level = logical	
max_wal_size = 1GB
min_wal_size = 80MB
archive_mode = on	
archive_command = 'test ! -f /data/pgsql/arclog/%f && cp %p /data/pgsql/arclog/%f'		# 
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'	
log_truncate_on_rotation = on		
log_rotation_age = 1d			
log_rotation_size = 0			
log_line_prefix = '%m [%p] '		
log_timezone = 'Asia/Shanghai'
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.UTF-8'							
lc_monetary = 'en_US.UTF-8'			
lc_numeric = 'en_US.UTF-8'			
lc_time = 'en_US.UTF-8'	
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'repmgr'
wal_log_hints = on

DETAIL: pg_rewind requires "wal_log_hints" to be enabled

根据以上配置，创建目录/data/pgsql/arclog并赋予postgres这个用户权限，最后重启服务器

mkdir -p /data/pgsql/arclog
chown -Rf postgres. /data/pgsql/arclog/
systemctl restart pg12
ntpdate ntp.aliyun.com

3、

主节点17repmgr的配置文件 /etc/repmgr/12/repmgr.conf 增加如下内容：

node_id=1
node_name='pg1'
conninfo='host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg1'                                  
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=100                                                   
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

4、

根据以上配置文件，数据库添加repmgr用户

[root@centos7 ~]# su - postgres -c 'psql -p15433'
psql (12.4)
Type "help" for help.postgres=# create user repmgr with password '123456';
CREATE ROLE
postgres=# create database repmgr;
CREATE DATABASEALTER USER repmgr WITH REPLICATION;

5、

切换普通用户，利用repmgr注册主节点

[root@centos7 ~]# su - postgres
Last login: Sat Mar 30 23:08:48 CST 2024 on pts/0
-bash-4.2$ repmgr primary register
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL: 
fe_sendauth: no password suppliedDETAIL: attempted to connect using:user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=

失败了，原因是没有传递密码变量，执行变量命令，并重新注册，注册的时候临时给repmgr用户超级权限，注册成功

export PGPASSWORD="123456"-bash-4.2$ repmgr primary register -S postgres
WARNING: following problems with command line parameters detected:--superuser ignored when executing PRIMARY REGISTER
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: attempting to install extension "repmgr"
DEBUG: established superuser connection as "postgres"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

可以看到，扩展插件repmgr启用了，具体路径为repmgr数据库下的名为repmgr的schema下

命令行也可以佐证确实是安装了插件repmgr：

repmgr=# set search_path to repmgr ;
SET
repmgr=# \dxList of installed extensionsName   | Version |   Schema   |            Description             
---------+---------+------------+------------------------------------plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural languagerepmgr  | 5.4     | repmgr     | Replication manager for PostgreSQL
(2 rows)

如果觉得注册的不对，可以重新注册，重新注册命令如下：

-bash-4.2$ repmgr primary unregister -S postgres --force
WARNING: following problems with command line parameters detected:--superuser ignored when executing PRIMARY UNREGISTER
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: node "pg1" (ID: 1) was successfully unregistered

看到 successfully后，再次执行注册就可以，查看repmgr集群的状态：

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                       
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------1  | pg1  | primary | * running |          | pg1      | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr

6、

从节点20的repmgr的配置文件

node_id=2
node_name='pg2'
conninfo='host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg2'
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=120
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

7、

从节点利用流复制搭建主从集群，通过repmgr命令来执行（postgres这个普通用户执行）

repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force  -U repmgr -c

HINT: this may take some time; consider using the -c/--fast-checkpoint option

日志如下：

NOTICE: destination directory "/data/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.123.17 port=15433 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
DEBUG: 1 node records returned by source node
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: upstream_node_id determined as 1
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/data/pgsql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/data/pgsql/data"
DEBUG: replication slot "repmgr_slot_2" exists but is inactive; reusing
NOTICE: starting backup (using pg_basebackup)...
INFO: executing:/usr/pgsql-12/bin/pg_basebackup -l "repmgr base backup"  -D /data/pgsql/data -h 192.168.123.17 -p 15433 -U repmgr -c fast -X stream -S repmgr_slot_2 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
could not identify current directory: No such file or directory
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/bin/systemctl start pg12
HINT: after starting the server, you need to register this standby with "repmgr standby register"

命令执行完成后，就可以启动从节点了：

systemctl start pg12

查看进程可以看到从节点正常复制了：

[root@centos10 ~]# ps aux |grep postgre
root      1637  0.0  0.0 191784  2336 pts/0    S    Mar30   0:00 su - postgres
postgres  1638  0.0  0.0 115444  2064 pts/0    S+   Mar30   0:00 -bash
postgres  1744  0.3  1.1 450504 46412 ?        Ss   00:05   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres  1745  0.0  0.0 260664  2084 ?        Ss   00:05   0:00 postgres: logger   
postgres  1746  0.0  0.0 450708  3100 ?        Ss   00:05   0:00 postgres: startup   recovering 000000010000000000000013
postgres  1747  0.0  0.0 450504  2112 ?        Ss   00:05   0:00 postgres: checkpointer   
postgres  1748  0.0  0.0 450504  2120 ?        Ss   00:05   0:00 postgres: background writer   
postgres  1749  0.0  0.0 262784  2008 ?        Ss   00:05   0:00 postgres: stats collector   
postgres  1750  0.4  0.1 457520  4700 ?        Ss   00:05   0:00 postgres: walreceiver   streaming 0/130007C8
root      1752  0.0  0.0 112712   964 pts/1    S+   00:05   0:00 grep --color=auto postgre

主节点正常发送wal日志了：

[root@centos7 ~]# ps aux |grep postgres
postgres  2205  0.0  1.1 450504 46464 ?        Ss   00:01   0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres  2207  0.0  0.0 260664  2084 ?        Ss   00:01   0:00 postgres: logger   
postgres  2209  0.0  0.1 450908  4180 ?        Ss   00:01   0:00 postgres: checkpointer   
postgres  2210  0.0  0.0 450764  3916 ?        Ss   00:01   0:00 postgres: background writer   
postgres  2211  0.0  0.1 450504  6296 ?        Ss   00:01   0:00 postgres: walwriter   
postgres  2212  0.0  0.0 451784  3352 ?        Ss   00:01   0:00 postgres: autovacuum launcher   
postgres  2213  0.0  0.0 262784  2196 ?        Ss   00:01   0:00 postgres: archiver   last was 000000010000000000000012.00000028.backup
postgres  2214  0.0  0.0 262940  2352 ?        Ss   00:01   0:00 postgres: stats collector   
postgres  2215  0.0  0.0 451624  2856 ?        Ss   00:01   0:00 postgres: logical replication launcher   
root      2216  0.0  0.0 191784  2336 pts/0    S    00:01   0:00 su - postgres
postgres  2217  0.0  0.0 115444  2004 pts/0    S+   00:01   0:00 -bash
postgres  2260  0.0  0.0 451812  3848 ?        Ss   00:05   0:00 postgres: walsender repmgr 192.168.123.20(38408) streaming 0/130008B0
postgres  2265  0.0  0.1 452464  7044 ?        Ss   00:07   0:00 postgres: postgres postgres 192.168.123.1(56048) idle
root      2294  0.0  0.0 112712   960 pts/1    S+   00:10   0:00 grep --color=auto postgres

8、

注册从节点

-bash-4.2$ repmgr standby register
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered

9、

查看集群状态

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------1  | pg1  | primary | * running |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=1234562  | pg2  | standby |   running | pg1      | default  | 100      | 1        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

三、

自动故障转移和测试

如果需要启用自动故障转移，那么，我们需要开启repmgrd服务，每个节点都开启

开启比较简单，一条命令就可以了

systemctl start repmgr-12

服务启动完毕后，可以查看进程是否正确：

[root@centos7 ~]# ps aux |grep repm
postgres  5915  0.2  0.2 453488  9716 ?        Ss   09:28   0:00 postgres: repmgr repmgr 192.168.123.17(35544) idle
postgres  5917  0.0  0.0  84276  2160 ?        S    09:28   0:00 /usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose
root      5925  0.0  0.0 112712   960 pts/3    S+   09:29   0:00 grep --color=auto repm

注意，这里问题来了，该服务是依赖于postgresql数据库的进程的，数据库必须是在线的才可以启停repmgrd服务

那么，很多人会疑惑，这个服务有什么用？其实很简单，就是故障转移用的，这些是关键配置，在发生宕机的情况下，这些命令会自动的启停数据库，例如，主节点宕机，那么，从节点的该服务将会在规定时间内检查主节点，如果一直找不到主，将会把自己提升为主，先执行这个命令：

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'

然后执行这个命令：

service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'

特别注意：这些命令全部需要绝对路径，否则会报找不到命令，配置文件修改后，重启repmgrd服务就可以生效

promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'

下面是将主节点数据库服务关闭后，从节点的repmgrd服务打印的日志/var/log/repmgr/pg2.log

[2024-03-31 09:13:29] [INFO] 2 total nodes registered
[2024-03-31 09:13:29] [INFO] primary node  "pg1" (ID: 1) and this node have the same location ("default")
[2024-03-31 09:13:29] [INFO] no other sibling nodes - we win by default
[2024-03-31 09:13:29] [DEBUG] election result: WON
[2024-03-31 09:13:29] [NOTICE] this node is the only available candidate and will now promote itself
[2024-03-31 09:13:29] [DEBUG] get_node_record():SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
[2024-03-31 09:13:29] [INFO] promote_command is:"/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file"
[2024-03-31 09:13:29] [NOTICE] redirecting logging output to "/var/log/repmgr/pg2.log"[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [NOTICE] promoting standby to primary
[2024-03-31 09:13:29] [DETAIL] promoting server "pg2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
[2024-03-31 09:13:29] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2024-03-31 09:13:29] [DEBUG] setting node 2 as primary and marking existing primary as failed
[2024-03-31 09:13:29] [NOTICE] STANDBY PROMOTE successful
[2024-03-31 09:13:29] [DETAIL] server "pg2" (ID: 2) was successfully promoted to primary
[2024-03-31 09:13:29] [DEBUG] result of promote_command: 0
[2024-03-31 09:13:29] [INFO] checking state of node 2, 1 of 6 attempts

再次启动原主节点的postgresql数据库，在主节点查看集群状态，可以发现脑裂了：

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------------1  | pg1  | primary | * running            |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=1234562  | pg2  | standby | ! running as primary |          | default  | 100      | 2        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456WARNING: following issues were detected- node "pg2" (ID: 2) is registered as standby but running as primary

在从节点，可以看到显示的不一样，不过仍然是脑裂：

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------1  | pg1  | primary | - failed  | ?        | default  | 100      |          | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=1234562  | pg2  | primary | * running |          | default  | 100      | 2        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456WARNING: following issues were detected- unable to connect to node "pg1" (ID: 1)

解决方案：

首先，在从节点强制踢掉自己，数据库服务不能关闭，命令如下：

repmgr primary unregister --force

然后关闭从节点的数据库服务，重新从现主17服务器拉取：

repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force  -U repmgr -c

同步拉取完毕后，在重新强制注册为standby节点，就恢复正常了：

-bash-4.2$ repmgr standby register --force
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr password=1234562 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered

再次查看集群状态，发现恢复正常了：

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------1  | pg1  | primary | * running |          | default  | 100      | 1        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=1234562  | pg2  | standby |   running | pg1      | default  | 100      | 1        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

总结：

脑裂的处理步骤是，首先，将异常的节点也就是 running as primary的节点踢掉，这个节点在哪，就在哪个节点踢掉，踢节点也就是unregister，踢完后standby clone，重新克隆的时候需要停止异常节点数据库服务，也就是原从节点的数据库服务，完事后在重新注册并再次启动原从节点的数据库就可以了；而脑裂产生的原因是原主节点数据库挂掉后，又重新启动数据库服务了，重新启动数据库后，原主认为自己是主节点，而原从节点已经提升为主节点，它也认为自己是主节点，造成集群内有两个主节点的信息了，那么，此时集群的主从复制就是混乱状态

如果脑裂很快发现按上述步骤没什么问题，但如果时间很长才发现，那么，需要根据pg_controlldata命令确定哪个是主，哪个是从，Latest checkpoint's oldestXID高的为主，反之为从，在从节点按上述操作即可

-bash-4.2$ pg_controldata 
pg_control version number:            1201
Catalog version number:               201909212
Database system identifier:           7351900556498275811
Database cluster state:               in production
pg_control last modified:             Sun 31 Mar 2024 11:41:05 AM CST
Latest checkpoint location:           0/2F001528
Latest checkpoint's REDO location:    0/2F0014F0
Latest checkpoint's REDO WAL file:    00000001000000000000002F
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3382
Latest checkpoint's NextOID:          16454
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        479
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  3382
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sun 31 Mar 2024 11:41:05 AM CST
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                on
max_connections setting:              1000
max_worker_processes setting:         8
max_wal_senders setting:              10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            e76af00554364729517eece5fa0fcf6f7f0662da3a154cabb730c2ab24edd267

重新加入节点：

repmgr node rejoin -h 192.168.123.20 -p 15433 -Upostgres -d repmgr --force-rewind

四，

主备手动切换

repmgr standby switchover --force -S postgres

这个命令比较方便，在从节点执行，不过需要配置免密，并且需要standby从库有上级，比如下面的，Upstream这一栏里，表明pg1的上级是pg2，pg2是主节点，pg1是从节点，

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                           
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------1  | pg1  | standby |   running | pg2      | default  | 100      | 2        | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=1234562  | pg2  | primary | * running |          | default  | 100      | 3        | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456

那么，想快速的把这两个节点调换一个位置就是上面的switchover命令了，非常的方便，比如，现在我又想把pg2切换到从，pg1切换到主，那么，应该是在pg1上执行switchover命令

在切换期间，pg2的数据库需要重启一次，重启完毕后，看pg1的日志如下：

EBUG: minimum of 1 free slots (0 for siblings) required; 10 available
NOTICE: attempting to pause repmgrd on 2 nodes
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: pausing repmgrd on node "pg1" (ID: 1)
DEBUG: pausing repmgrd on node "pg2" (ID: 2)
NOTICE: local node "pg1" (ID: 1) will be promoted to primary; current primary "pg2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "pg2" (ID: 2)
DEBUG: remote_command():ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node service --action=stop --checkpoint --superuser=postgres
NOTICE: issuing CHECKPOINT on node "pg2" (ID: 2) 
DETAIL: executing server command "/usr/bin/systemctl stop pg12"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 7 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 8 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 9 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 10 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 11 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 12 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 13 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 14 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 15 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 16 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 17 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node status --is-shutdown-cleanly
NOTICE: current primary has been cleanly shut down at location 0/3C000028
DEBUG: local node last receive LSN is 0/3C0000A0, primary shutdown checkpoint LSN is 0/3C000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg1" (ID: 1) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: setting node 1 as primary and marking existing primary as failed
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg1" (ID: 1) was successfully promoted to primary
DEBUG: executing:/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf  --no-wait -d \'user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1711860263.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf  --no-wait -d \'user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1711860263.log 2>&1 && echo "1" || echo "0"
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refusedIs the server running on host "192.168.123.20" and acceptingTCP/IP connections on port 15433?DETAIL: attempted to connect using:user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
ERROR: execution of "repmgr node rejoin" on demotion candidate "pg2" (ID: 2) failed
DETAIL: check log file "/tmp/node-rejoin.1711860263.log" on "pg2" for details
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refusedIs the server running on host "192.168.123.20" and acceptingTCP/IP connections on port 15433?DETAIL: attempted to connect using:user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
INFO: sleeping 1 second; 1 of 60 attempts ("standby_reconnect_timeout") to reconnect to demoted primary
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: replication slot "repmgr_slot_1" deleted on node 2
NOTICE: switchover was successful
DETAIL: node "pg1" is now primary and node "pg2" is attached as standby
DEBUG: unpausing repmgrd on node "pg1" (ID: 1)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: unpausing repmgrd on node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: STANDBY SWITCHOVER has completed successfully

在强调一次，需要配置免密才可以使用此功能，因为从切换到主的时候，需要ssh到主做一点事情，切换完毕后，原主也就是新从需要在注册一下，到pg2执行：

-bash-4.2$ repmgr standby register -F
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered

再看集群状态就十分正常了：

-bash-4.2$ repmgr service status
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="ID | Name | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+------+---------+--------------------1  | pg1  | primary | * running |          | running | 7631 | no      | n/a                2  | pg2  | standby |   running | pg1      | running | 5511 | no      | 0 second(s) ago

这篇关于Linux|centos7-postgresql数据库|yum安装数据库和配置repmgr高可用集群以及repmgr的日常管理工作的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！