发现问题

大部分开发人员习惯使用order by offset limit进行分页,使用该方法可能会导致扫描的数据放大,因为offset的行会被扫描。表现就是一般offset的行比较小的情况也,也就是翻页,是很快的,但是一旦offset的值很大,翻页的数量很大,那么一定会变慢。

看个例子:

#建表,插入数据,并分析收取统计信息
CREATE UNLOGGED TABLE data (
   id bigint GENERATED ALWAYS AS IDENTITY,
   value double precision NOT NULL,
   created timestamp with time zone NOT NULL
);
 
SELECT setseed(0.2740184);
 
INSERT INTO data (value, created)
SELECT random() * 1000, d
FROM generate_series(
        TIMESTAMP '2022-01-01 00:00:00 UTC',
        TIMESTAMP '2022-12-31 00:00:00 UTC',
        INTERVAL '1 second'
     ) AS d(d);
 

ALTER TABLE data ADD PRIMARY KEY (id);
 
VACUUM (ANALYZE) data;

下面我们翻页获取数据看看性能如何:

#如下SQL,创建一个符合索引最合适
SELECT value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created;

#创建索引
CREATE INDEX data_created_value_idx ON data
   (created, value);


#可以看到是Index Only Scan,结果很快
postgres=# explain analyze SELECT value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created
LIMIT 50;
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..126.51 rows=50 width=16) (actual time=0.022..0.195 rows=50 loops=1)
   ->  Index Only Scan using data_created_value_idx on data  (cost=0.56..801382.02 rows=318146 width=16) (actual time=0.021..0.190 rows=50 loops=1)
         Index Cond: ((value >= '0'::double precision) AND (value <= '10'::double precision))
         Heap Fetches: 0
 Planning Time: 0.084 ms
 Execution Time: 0.210 ms

#但是当我们OFFSET 200000的时候,执行计划就变了,代价比较高了,这就是我们前面所说的offset都要扫描
postgres=# explain analyze SELECT value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created
OFFSET 200000 LIMIT 50;
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=434757.47..434763.31 rows=50 width=16) (actual time=2697.793..2704.289 rows=50 loops=1)
   ->  Gather Merge  (cost=411422.51..442355.57 rows=265122 width=16) (actual time=2627.028..2695.579 rows=200050 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=410422.49..410753.89 rows=132561 width=16) (actual time=2607.976..2613.950 rows=67369 loops=3)
               Sort Key: created
               Sort Method: external merge  Disk: 2760kB
               Worker 0:  Sort Method: external merge  Disk: 2640kB
               Worker 1:  Sort Method: external merge  Disk: 2640kB
               ->  Parallel Seq Scan on data  (cost=0.00..396876.00 rows=132561 width=16) (actual time=0.042..2551.663 rows=104958 loops=3)
                     Filter: ((value >= '0'::double precision) AND (value <= '10'::double precision))
                     Rows Removed by Filter: 10378242
 Planning Time: 0.102 ms
 Execution Time: 2704.851 ms
(14 rows)

如上我们看到的,翻页越多,性能越差,唯一的好处,就是书写简单。

优化手段1:使用游标

由于普通游标只能在单个事务的上下文中工作。因此,普通游标对于分页的作用有限,因为在事务打开时进行用户交互是一个非常差的体验:长事务不仅会使表锁保持很长时间(这可能阻塞DDL或TRUNCATE语句),而且还会阻塞autovacuum的进程,从而导致表膨胀。

WITH HOLD | WITHOUT HOLD 默认值为WITHOUT HOLD, 使用WITH HOLD将CURSOR的使用范围扩大到SESSION级别,WITHOUT HOLD是TRANSACTION级别,另外,WITH HOLD将消耗更多的资源(内存或临时文件)来保持数据。

游标说明可以参考:https://www.postgresql.org/docs/13/sql-declare.html

#创建一个游标,取出满足的条件的结果集
begin;
DECLARE c SCROLL CURSOR WITH HOLD FOR
SELECT value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created;
COMMIT;

#移动游标并取出50行,和OFFSET 200000 LIMIT 50效果一样
MOVE ABSOLUTE 200000 IN c;
FETCH 50 FROM c;

#使用完游标后,必须记得关闭游标
close c;

使用游标的优点和缺点:

优点:

1.游标适用于所有分页查询,不管是第一页还是最后一页,效率一样

2.游标的结果集是稳定的

缺点:

1.当完成操作时,一定不要忘记关闭游标,否则结果集将保存在服务器上,直到数据库会话结束

2.如果游标长时间打开,数据将变的陈旧,无法获取动态的最新数据

3.游标长时间打开,相当于一个长事物,长事物的负面影响,相信大家有一定共识

优化手段2:使用位点

位点的原理很简单,就是记录上次查询出来的结果作为一个位点,查询的时候基于这个点的条件去查询。这样也就可以去掉offset了。注意要有一个pk,没有的话,需要加一个类似字段,这样位点才不会重复。

例子如下:

#通过查询记住下一页的起始位点,例子中id作为pk,标识唯一
SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created, id
LIMIT 50;
  id  |        value        |        created         
------+---------------------+------------------------
.........略.........
 4568 |   7.771510504657186 | 2022-01-01 01:16:07+08
 4586 |  1.2500308700502671 | 2022-01-01 01:16:25+08
 4607 |  3.3096537558421346 | 2022-01-01 01:16:46+08
 
#我们必须记住从页面的最后一行创建的id的值。然后我们可以取下一页
SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
  AND (created, id) > ('2022-01-01 01:16:46+08', 4607)
ORDER BY created, id
LIMIT 50;

postgres=# explain SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
AND (created, id) > ('2022-01-01 01:16:46+08', 4607)
ORDER BY created, id
LIMIT 50;
                                                                               QUERY PLAN                                                                     
          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
----------
 Limit  (cost=4.32..194.08 rows=50 width=24)
   ->  Incremental Sort  (cost=4.32..1207236.72 rows=318103 width=24)
         Sort Key: created, id
         Presorted Key: created
         ->  Index Scan using data_created_value_idx on data  (cost=0.56..1192922.08 rows=318103 width=24)
               Index Cond: ((created >= '2022-01-01 01:16:46+08'::timestamp with time zone) AND (value >= '0'::double precision) AND (value <= '10'::double pr
ecision))
               Filter: (ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607))
(7 rows)

#加一个更适合的索引,执行计划会更好一些
CREATE INDEX data_keyset_idx ON data (created, id, value);
postgres=#   explain SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
  AND (created, id) > ('2022-01-01 01:16:46+08', 4607)
ORDER BY created, id
LIMIT 50;
--------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
 Limit  (cost=0.56..160.58 rows=50 width=24)
   ->  Index Only Scan using data_keyset_idx on data  (cost=0.56..1018064.43 rows=318103 width=24)
         Index Cond: ((ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607)) AND (value >= '0'::double precision) AND (value <= '1
0'::double precision))
(3 rows)

位点优化的有点和缺点:

优点:

1.每次查询只获取我们需要的数据,不需要扫描不额外的数据,减少了相关资源代价

2.每个查询将展示最新并发数据修改的当前数据

缺点:

1.需要一个专门为查询而设计的特殊索引

2.只有事先能获取到确切的位点,查询时才有用

参考:
https://baitexiaoyuan.oss-cn-zhangjiakou.aliyuncs.com/updatecrm/shujuku/

总结

到此这篇关于PostgreSQL中offset...limit分页优化的一些常见手段的文章就介绍到这了,更多相关PostgreSQL offset...limit分页优化内容请搜索萤火虫技术以前的文章或继续浏览下面的相关文章希望大家以后多多支持萤火虫技术!

点赞(304) 打赏

评论列表 共有 0 条评论

暂无评论

微信小程序

微信扫一扫体验

立即
投稿

微信公众账号

微信扫一扫加关注

发表
评论
返回
顶部