PostgreSQL分区方案测试（数据插入与分区表数量关系）

2023年7月30日 304次阅读来源: 遥想公瑾当年

一前言

单表数据量太大时，数据库通过建立分区表的方式，从而达到对大表进行拆分，使每个分区的子表数据规模适中，方便数据管理与常规查询分析等，分区表是数据库比较常用的数据管理需求。
当前PostgreSQL 11版本及其社区目前分区方案整理如下：

Pg11自带分区表，目前分区类型有list,range,hash三种类型。
社区分区插件pg_pathman，目前分区类型支持range,hash类型。
citus为代表的分布式数据节点分区方案，目前支持hash（更常用更通用）与append类型（不是很了解，不怎么常用，使用需谨慎）。
这些方案都能在不同场景下支持不同的分区方案，本文选择一个方向对这些分区方案进行测试，即：某个表的分区表数量逐渐增多与数据插入性能衰减程度关系。

二测试过程

2.1 基准测试

对某个表不做任何分区，使用pgbench压测，测试插入tps。
建立表：

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
);
create index shiptrack_id_idx on shiptrack using btree(ship_id);
create index shiptrack_time_idx on shiptrack using btree(pos_time);

测试脚本 test.sql：

\set _ship_id random(1,700000)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) values(:_ship_id,clock_timestamp(),:x, :y);

pg_bench压测：

[postgres@~]$ pgbench -M prepared -n -r -P 1 -f test.sql -c 80 -j 80 -T 30 -U postgres mytest
transaction type: test.sql
scaling factor: 1
query mode: prepared
number of clients: 80
number of threads: 80
duration: 30 s
number of transactions actually processed: 1102316
latency average = 2.179 ms
latency stddev = 2.590 ms
tps = 36616.746279 (including connections establishing)
tps = 36662.855209 (excluding connections establishing)

tps:36662

2.2 pg自带分区表

2.2.1 list分区

建立表：

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
) partition by list(ship_id );
create index shiptrack_time_idx on shiptrack using btree(pos_time);

创建分区表，分区数量4：

do language plpgsql $$  
 DECLARE
    i int;
    sql text;
 BEGIN
    for i in 1..4 loop
        sql:=format('create table if not exists shiptrack_%s PARTITION OF shiptrack for values in (%s)',i,i);
        execute sql;
        raise notice '%',sql;
    end loop;
end;  
$$;

测试脚本 test.sql：

\set _ship_id random(1,4)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) values(:_ship_id,clock_timestamp(),:x, :y);

list分区数量4，tps：34492
同理：
list分区数量40，tps：21473。
list分区数量400，tps: 3220。

2.2.2 hash分区

建立表：

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
)  partition by hash(ship_id);
create index shiptrack_id_idx on shiptrack using btree(ship_id);
create index shiptrack_time_idx on shiptrack using btree(pos_time);

创建分区表，分区数量4：

do language plpgsql $$  
 DECLARE
    i int;
    sql text;
        tableCount int;
 BEGIN
        tableCount :=4;
    for i in 1..tableCount loop
        sql:=format('create table if not exists shiptrack_%s PARTITION OF shiptrack for VALUES WITH (MODULUS %s, REMAINDER %s)',i,tableCount,i-1);
        execute sql;
        raise notice '%',sql;
    end loop;
end;  
$$;

测试脚本 test.sql：

\set _ship_id random(1,700000)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) values(:_ship_id,clock_timestamp(),:x, :y);

hash分区数量4，tps：32442
同理：
hash分区数量40，tps：20611。
hash分区数量400，tps: 3167。

2.2 pg_pathman测试（仅hash分区测试）

建立表：

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
) ;
create index shiptrack_id_idx on shiptrack using btree(ship_id);
create index shiptrack_time_idx on shiptrack using btree(pos_time);

创建分区表，分区数量4：

select create_hash_partitions('shiptrack'::regclass,'ship_id',4,false);

测试脚本 test.sql：

\set _ship_id random(1,700000)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) values(:_ship_id,clock_timestamp(),:x, :y);

pg_pathman分区数量4，tps：29948
同理：
pg_pathman分区数量40，tps：28267。
pg_pathman分区数量400，tps: 23283。
pg_pathman分区数量1000，tps: 17922。

2.3 citus分布式表分区

2.3.1 citus–逐条插入

注意：citus分布式，各个环境与单机环境配置有差异，不太好同上单独比较。
建立表：

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
) ;
create index shiptrack_id_idx on shiptrack using btree(ship_id);
create index shiptrack_time_idx on shiptrack using btree(pos_time);

创建分区表，分区数量4：

--设置分区数量
set citus.shard_count=4;
--设置副本数量
set citus.shard_replication_factor=2;
--对表进行分区
select create_distributed_table('shiptrack','ship_id','hash');

测试脚本

\set _ship_id random(1,700000)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) values(:_ship_id,clock_timestamp(),:x, :y);

citus分区数量4，tps：25342
同理：
citus分区数量40，tps：26112。
citus分区数量400，tps: 26817。
citus分区数量1000，tps: 26524。
逐条sql插入性能比较稳定。

2.3.2 citus–批量插入

建表和分区同上，只需更新下test.sql脚本为批量操作，单批次数量1000：

\set _ship_id random(1,700000)
\set x random(-180,180)
\set y random(-90,90)
INSERT INTO shiptrack(ship_id,pos_time,lon ,lat) select 
:_ship_id,clock_timestamp(),:x, :y from generate_series(1,1000);

创建表:

drop table if exists shiptrack cascade;
create table shiptrack(
    ship_id int not null, --船舶id
    pos_time timestamp without time zone not null, --坐标时间
    lon numeric not null,  --经度
    lat numeric not null --纬度
) ;
create index shiptrack_id_idx on shiptrack using btree(ship_id);
create index shiptrack_time_idx on shiptrack using btree(pos_time);

--设置分区数量
set citus.shard_count=4;
--执行分区
select create_distributed_table('shiptrack','ship_id','hash');

执行压测语句：

[postgres@~]$ pgbench -M prepared -n -r -P 1 -f test.sql -c 80 -j 80 -T 30 -U postgres mytest

通过设置set citus.shard_count为4,40,400,450,550,700，1000，重试上述步骤，得到tps依次是:
citus分区数量4，tps：1563
同理：
citus分区数量40，tps：1549。
citus分区数量400，tps: 1583。
citus分区数量450，tps: 1115。
citus分区数量550，tps: 1063。
citus分区数量700，tps: 895。
citus分区数量1000，tps: 706。

三测试现象与总结

单表不分区：tps：36662
数据插入与分区表数量关系测试汇总如下表：

测试方式	分区数 4	分区数 40	分区数 400	分区数 1000
pg自带分区 list	34492	21473	3220	未测试
pg自带分区 hash	32442	20611	3167	未测试
pg_pathman hash	29948	28267	23283	17922
citus hash（单条插入）	25342	26112	26817	26524
citus hash（批量插入）	1563	1549	1583	706

现象：

随着分区表数量增加，除了citus都有性能衰减现象。
pg_pathman衰减比较平滑，pg自带的分区表指数级别衰减。
pg自带的hash分区，性能稍微弱于自带的list分区。
citus单条插入性能与分区数量递增关系几乎没有影响。
citus以1000数据量为一批次，在分区数为400之前稳定，从400之后开始平滑衰减。

个人总结：

pg11自带的分区，适用于分区数量不大的业务场景。
在单机pg分区中，如果分区数量比较大，建议采用pg_pathman，可以获取更好的性能。
在具备分布式数据节点的前提下，使用citus的分区在插入和查询都能有较好的表现。

    原文作者：遥想公瑾当年
    原文地址: https://www.jianshu.com/p/1cba77d18694
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。

一 前言

二 测试过程

2.1 基准测试

2.2 pg自带分区表

2.2.1 list分区

2.2.2 hash分区

2.2 pg_pathman测试（仅hash分区测试）

2.3 citus分布式表分区

2.3.1 citus–逐条插入

2.3.2 citus–批量插入

三 测试现象与总结

一前言

二测试过程

三测试现象与总结