Postgresql & Pgpool 场景问题和解决方案

2019年11月3日 400次阅读

原文地址：
https://www.tony-yin.site/201…

《Postgresql & Pgpool 场景问题和解决方案》

本文整理了一些笔者遇到的postgresql和pgpool的常见问题和解决方案。

环境

Postgresql作为数据后端，pgpool作为postgresql的中间件，通过vip对客户端提供服务，并利用自身的failover机制保证数据库HA。

Nodes：

192.168.1.1
192.168.1.2
192.168.1.3

Vip：

192.168.1.4

Version：

Postgresql: 9.4.20
Pgpool: 4.0.3

如何查看pgpool节点信息

[[email protected] ~]# psql -h 192.168.1.4 -p 9998 -U postgres postgres -c "show pool_nodes"
Password for user postgres: 
 node_id |   hostname    | port | status | lb_weight |  role   | select_cnt | load_balance_n
ode | replication_delay | last_status_change  
---------+---------------+------+--------+-----------+---------+------------+---------------
----+-------------------+---------------------
 0       | 192.168.1.1 | 5432 | up     | 0.333333  | standby | 0          | false         
    | 0                 | 2019-10-23 16:14:42
 1       | 192.168.1.2 | 5432 | up     | 0.333333  | primary | 34333174   | true          
    | 0                 | 2019-10-23 16:14:42
 2       | 192.168.1.3 | 5432 | up     | 0.333333  | standby | 0          | false         
    | 0                 | 2019-10-23 16:14:42
(3 rows)

如何将一个节点踢出集羣

[[email protected]]# pcp_dettach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

如何将一个节点重新加入集羣

[[email protected]]# pcp_attach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

如何查看watchdog信息

[[email protected]]# pcp_watchdog_info -h 192.168.1.1 -p 9898 -U postgres
Password: 
3 YES 192.168.1.1:9998 Linux host16 192.168.1.1

192.168.1.1:9998 Linux host16 192.168.1.1 9998 9000 4 MASTER
192.168.1.2:9998 Linux host17 192.168.1.2 9998 9000 7 STANDBY
192.168.1.3:9998 Linux host18 192.168.1.3 9998 9000 7 STANDBY

如何查看pg node是否在recovery

[[email protected]]# psql -h 192.168.1.1 -p 5432 -U postgres postgres -c "select pg_is_in_recovery()"
Password for user postgres: 
 pg_is_in_recovery 
-------------------
 f
(1 row)

如何指定某个节点成为新的master节点

这个命令并不会真正的将postgresql后端从standby改为master，而只是修改了pgpool的内部状态；简而言之只是修改了pgpool的状态，而postgresql对应的recovery文件并没有改变，还是需要failover脚本来改变。

pcp_promote_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

Ansible 版本改变导致部署后的数据库脑裂

一套环境上部署的数据库经常会发生脑裂问题，后经定位发现是pgpool配置文件涉及other_pgpool_id的参数项没有正确配置导致，没有递增。

这些配置项不正确是由于该环境上的ansible版本为2.7，而数据库自动化部署是基于ansible 2.4开发，ansible 2.4.2后，jinja2部分高级语法发生改变，pgpool配置文件是通过jinja2生成的，部分配置项如果还使用原有语法，那么有些配置结果不会达到预期，所以需要根据ansible版本定制配置文件模板。

下面是模板文件pgpool.conf.j2对应不同版本的对应配置：

ansible < v2.4.2

# - Other pgpool Connection Settings -
{% set other_pgpool_id = 0 %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
heartbeat_destination{{other_pgpool_id}} = '{{backend.ip}}'
heartbeat_destination_port{{other_pgpool_id}} = 9694
heartbeat_device{{other_pgpool_id}} = ''

{% set other_pgpool_id = other_pgpool_id + 1 %}
{% endif %}
{% endfor %}

# - Other pgpool Connection Settings -
{% set other_pgpool_id = 0 %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
other_pgpool_hostname{{other_pgpool_id}} = '{{backend.ip}}'
other_pgpool_port{{other_pgpool_id}} = 9998
other_wd_port{{other_pgpool_id}} = 9000

{% set other_pgpool_id = other_pgpool_id + 1 %}
{% endif %}
{% endfor %}

ansible >= v2.4.2

# - Other pgpool Connection Settings -
{% set other_pgpool_id = namespace(a=0) %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
heartbeat_destination{{other_pgpool_id.a}} = '{{backend.ip}}'
heartbeat_destination_port{{other_pgpool_id.a}} = 9694
heartbeat_device{{other_pgpool_id.a}} = ''

{% set other_pgpool_id.a = other_pgpool_id.a + 1 %}
{% endif %}
{% endfor %}

# - Other pgpool Connection Settings -
{% set other_pgpool_id = namespace(a=0) %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
other_pgpool_hostname{{other_pgpool_id.a}} = '{{backend.ip}}'
other_pgpool_port{{other_pgpool_id.a}} = 9998
other_wd_port{{other_pgpool_id.a}} = 9000

{% set other_pgpool_id.a = other_pgpool_id.a + 1 %}
{% endif %}
{% endfor %}

重启3个节点后，存在节点状态为down

问题原因

pgpool作为postgresql的中间件，当集羣内存在至少两个节点时，就会进行选举，如果此时第三个节点还没起来，当选举完成后，pgpool不会将没有参加选举的节点自动加入集羣，需要手工attach进集羣，或者同时重启pgpool进行重启选举，即pgpool本身不具有重启后能自动加入集羣并恢复的机制。

解决方案

方案1：手动attach

将掉线节点手动重新加入数据库集羣中，例如掉线节点为192.168.1.1并且node id为0，执行下面的attach命令：

pcp_attach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

方案2：重启pgpool，触发重新选举

分别在三个节点上，停止pgpool服务

systemctl stop pgpool.service

Pgpool每次选举都会读取pgpool状态文件，为了避免影响下次选举，所以需要删除该状态文件

rm -f /var/log/pgpool/pgpool_status

分别在三个节点上，启动pgpool服务

systemctl start pgpool.service

重启3个节点后，有一个节点状态为down

问题原因

NetworkManager未关闭导致。

解决方案

NetworkManager开启会影响pgpool的正常工作，需确保关闭。

重启primary节点，数据库进入只读模式

问题原因

该问题是个小概率偶现问题，即master节点断电后，新的master被选举出，新的master会将本地配置文件修改为master对应的，然后还在成为新master的过程中，这时候通过数据库VIP读取的master信息仍为旧master，这就使得本地数据库failover脚本认为新master出现了不一致，于是将之前postgresql修改为master的一系列配置文件又改回了standby对应的配置文件，其中primary info仍指向为旧master。这就导致没有新的master产生，旧的master一直为down的状态。而没有master节点，数据库则会进入只读模式。

解决方案

修改failover脚本代码逻辑，当本地配置文件与数据库角色状态不一致时，不会第一时间去修改本地recovery文件。之前再加一层判断：如果master节点postgresql服务还能正常访问，再去修改recovery文件。

数据库经常出现短暂卡顿

问题原因

客户端的数据库连接数超过pgpool配置连接数上限。

解决方案

客户端

针对代码异常没有运行到Response的情况，需要添加try-catch，在最终的finally加上返回Response的代码；
针对非web接口，即最后不走Response的情况，需要在程序最后额外添加关闭数据库连接的代码；
针对多线程使用数据库的场景，解决方案就是除了主线程的每次工作线程完成一个任务后，就把它相关的数据库连接关掉。

from django.db import connections
# 每一个线程都有专属的connections，把本线程名下的所有连接关闭。
connections.close_all()

数据库端

Pgpool配置文件中配置客户端连接空闲最大时间为300秒：

client_idle_limit = 300     # Client is disconnected after being idle for that many seconds
                            # (even inside an explicit transactions!)
                            # 0 means no disconnection

该参数表示当一个客户端在执行最后一条查询后如果空闲到了client_idle_limit 秒数，到这个客户端的连接将被断开。这里配置为300秒，防止客户端存在长时间的空闲连接占用连接数。

算法网

Postgresql & Pgpool 场景问题和解决方案

环境

如何查看pgpool节点信息

如何将一个节点踢出集羣

如何将一个节点重新加入集羣

如何查看watchdog信息

如何查看pg node是否在recovery

如何指定某个节点成为新的master节点

Ansible 版本改变导致部署后的数据库脑裂

ansible < v2.4.2

ansible >= v2.4.2

重启3个节点后，存在节点状态为down

问题原因

解决方案

方案1：手动attach

方案2：重启pgpool，触发重新选举

重启3个节点后，有一个节点状态为down

问题原因

解决方案

重启primary节点，数据库进入只读模式

问题原因

解决方案

数据库经常出现短暂卡顿

问题原因

解决方案

客户端

数据库端

Refer