A Few Notes on Redis AOF Persistence

May 14, 2024. Source: 窦锦帅

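Many readers probably have questions about what Redis persistence actually guarantees:

Isn't Redis an in-memory application? Why do reads and writes suffer when the disk is congested?
Doesn't Redis support persistence? Why can it still lose data?
Is data persisted by Redis safe, and to what degree? Can Redis replace a database?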

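So I put together this article. It analyzes the internals of Redis persistence and may also serve as a reference for other systems that need persistence. Because operating systems differ, everything here assumes Linux 2.6 or later.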

How Redis persistence works

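Redis offers two persistence mechanisms, RDB and AOF. RDB periodically snapshots the in-memory dataset, and the server can restore the dataset from the RDB file at startup. AOF logs every write command the server executes; when the server restarts, it replays all of those commands to rebuild the dataset. When the command log grows too large (larger than the dataset itself), Redis rewrites it. Because every RDB snapshot saves the full dataset, it is an expensive operation. To avoid the impact of the fork on the main process during an RDB save, and to minimize the amount of data lost in a failure, most deployments choose AOF as their persistence strategy.
Let's first look at how AOF organizes its data. Suppose Redis holds a string key foo with the value helloworld. After AOF persistence, the appendonly.aof file contains the following (the # comments are annotations, not part of the file):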

  

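*2         # this command has 2 arguments
$6         # the next line is 6 bytes long
SELECT     # argument
$1         # the next line is 1 byte long
0          # argument
*3         # this command has 3 arguments
$3         # the next line is 3 bytes long
set        # argument
$3         # the next line is 3 bytes long
foo        # argument
$10        # the next line is 10 bytes long
helloworld # argument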

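Parsing the content above yields two familiar Redis commands: SELECT 0 and SET foo helloworld. Running BGREWRITEAOF triggers a rewrite of the AOF file, during which Redis writes every key in memory to appendonly.aof in the format above. When Redis starts and loads the AOF file, it creates a fake client to execute the commands in the file and closes that client once loading is complete. Also, because AOF records data changes as a running log of write commands, the file keeps growing. To address this, Redis provides AOF rewriting: it exports the data currently in memory into a new AOF file that replaces the existing one, so the new file is much smaller. The details of that mechanism are beyond the scope of this article.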
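To make the layout concrete, here is a minimal sketch of an encoder that produces this format for one command. resp_encode is a hypothetical helper written for this article, not a function from the Redis source. It assumes NUL-terminated string arguments, so unlike Redis it is not binary-safe. In the real file each line ends with \r\n.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode one command (argc arguments) in the
 * "*<count>, $<length>, <payload>" layout shown above.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if the output buffer is too small. */
int resp_encode(char *out, size_t cap, int argc, const char **argv) {
    assert(out != NULL && argv != NULL);
    size_t used = 0;

    /* "*<argc>\r\n": number of arguments in this command */
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;

    for (int i = 0; i < argc; i++) {
        /* "$<len>\r\n<arg>\r\n": byte length, then the argument itself */
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it with the three arguments set, foo, helloworld yields the second command in the listing above.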

File I/O fundamentals

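Before going into Redis's implementation, let's review the relevant file I/O functions and how the system implements them. The file I/O operations involved are the following: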

open

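int open(const char *path, int oflag, ... /* mode_t mode */ );
The path argument is the name of the file to open, and oflag specifies one or more options, for example:

O_WRONLY opens the file for writing only.
O_APPEND sets the file offset to the current end of the file before each write; after a successful write, the offset advances by the number of bytes actually written.
O_CREAT creates the file first if it does not exist.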
write

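ssize_t write(int fd, const void *buf, size_t nbytes);
write writes data to an open file. The return value normally equals nbytes. A return value of -1 indicates an error, and a smaller positive value means only part of the data was written (a short write).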

ftruncate

int ftruncate(int fd, off_t length);
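ftruncate can shorten or extend a file. If length is smaller than the current size, data beyond length is no longer accessible. If it is larger, the file grows, and the region between the old end of file and length reads back as zeros, which in effect creates a hole in the file.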
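The three calls combine naturally when appending records to a log. The sketch below, under stated assumptions, appends one record and, if only part of it reached the file, uses ftruncate to cut the file back to its previous size. open_append and append_or_rollback are hypothetical names, and the caller is assumed to track the file size itself.

```c
#define _POSIX_C_SOURCE 200809L  /* expose ftruncate() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open (or create) a log file for appending, using the flags described above. */
int open_append(const char *path) {
    assert(path != NULL);
    return open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
}

/* Append len bytes; old_size is the file size before the call.
 * Returns  0 if every byte was written,
 *         -1 if write() failed or a short write was rolled back,
 *         -2 if a short write happened and the rollback also failed. */
int append_or_rollback(int fd, const void *buf, size_t len, off_t old_size) {
    assert(fd >= 0 && buf != NULL);
    ssize_t n = write(fd, buf, len);
    if (n == (ssize_t)len) return 0;
    if (n > 0) {
        /* Short write: remove the partial record so the log stays parseable. */
        if (ftruncate(fd, old_size) == -1) return -2;
    }
    return -1;
}
```

The rollback only makes sense for a single writer; with several writers appending to the same file, truncating to a remembered size could discard someone else's data.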

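When we write data to a file, traditional Unix/Linux kernels usually copy the data into a buffer first, queue it, and write it to disk later. This is called delayed write. A write to a disk file only updates the page cache in memory; because write returns without waiting for disk I/O to complete, data can be lost if the OS crashes after the write call but before the data is synced to disk. To keep the actual file system on disk consistent with the buffer contents, UNIX provides three functions, sync, fsync and fdatasync: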

sync
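Only queues all modified block buffers for writing and then returns; it does not wait for the actual disk writes to finish. A system daemon, usually called update, calls sync periodically (typically every 30 seconds), which ensures the kernel's block buffers are flushed regularly.
fsync
Affects only the single file specified by the file descriptor fd, and returns only after the disk writes have finished. fsync suits applications such as databases that need to make sure modified blocks are written to disk immediately.
fdatasync
Similar to fsync, but affects only the data portion of the file. fsync additionally syncs the file's attributes.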

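Now consider the performance of fsync. Unlike fdatasync, fsync syncs not only the modified contents of the file (dirty pages) but also its metadata (size, access time st_atime, modification time st_mtime, and so on). Because a file's data and metadata usually live in different places on disk, fsync needs at least two write I/Os. The fsync man page says so: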

Applications that access databases or log files often write a tiny data fragment (e.g., one line in a log file) and then call fsync()
immediately in order to ensure that the written data is physically
stored on the harddisk. Unfortunately, fsync() will always initiate
two write operations: one for the newly written data and another one
in order to update the modification time stored in the inode. If the
modification time is not a part of the transaction concept fdatasync()
can be used to avoid unnecessary inode disk write operations.
fdatasync skips metadata that is not needed to read the data back, which can save one write I/O. The fdatasync man page explains:
fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a subsequent
data retrieval to be correctly handled. For example, changes to
st_atime or st_mtime (respectively, time of last access and time of
last modification; see stat(2)) do not require flushing because they
are not necessary for a subsequent data read to be handled correctly.
On the other hand, a change to the file size (st_size, as made by say
ftruncate(2)), would require a metadata flush. The aim of fdatasync()
is to reduce disk activity for applications that do not require all
metadata to be synchronized with the disk.

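Specifically, a change in file size (st_size) must be synced immediately; otherwise, if the OS crashes, the modified content cannot be read back even though the data portion was synced, because the metadata was not. The last access time (atime) and modification time (mtime), on the other hand, do not need to be synced every time; unless the application has strict requirements on these two timestamps, skipping them has essentially no effect. In the Redis source file src/config.h we can see that on Linux Redis actually uses fdatasync() to flush to disk.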

Source file: src/config.h

#ifdef __linux__
#define aof_fsync fdatasync
#else
#define aof_fsync fsync
#endif
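The same platform switch can be used in any program that needs a durable append. The following is a minimal sketch, not Redis code: sketch_fsync, durable_append and durable_append_demo are hypothetical names, and the loop that retries short writes is an illustrative choice, not what Redis does.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fdatasync() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Same idea as the config.h excerpt: prefer fdatasync where it is known to exist. */
#ifdef __linux__
#define sketch_fsync fdatasync
#else
#define sketch_fsync fsync
#endif

/* Write all len bytes, then flush them to disk.
 * Returns 0 only if the data was fully written and the flush succeeded. */
int durable_append(int fd, const void *buf, size_t len) {
    assert(fd >= 0 && buf != NULL);
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR) continue;  /* interrupted: retry */
            return -1;
        }
        p += n;                            /* short write: continue with the rest */
        len -= (size_t)n;
    }
    return sketch_fsync(fd);               /* data reaches disk before we return */
}

/* Usage: append one line to path and make it durable. */
int durable_append_demo(const char *path) {
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1) return -1;
    int rc = durable_append(fd, "hello\n", 6);
    close(fd);
    return rc;
}
```

Because appending changes the file size, fdatasync still has to flush that piece of metadata; what it skips is the timestamp update.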

How Redis flushes the AOF to disk

Redis selects its flush policy through the appendfsync parameter, which has three main options:

always
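Syncs to the AOF file every time a new command is appended. Safest, but with the largest performance impact.
everysec
Syncs to the AOF file once per second; Redis performs the sync in a separate thread.
no
Leaves syncing to the operating system. Best performance, but the weakest durability.
If appendonly yes is set in the configuration file and appendfsync is not specified, everysec is used by default; it is also the option most deployments use.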
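As a compact summary of the three options and the default, here is a small sketch of how such a setting could be parsed. The enum and function names are hypothetical and this is not Redis's configuration parser; in particular the matching here is exact and case-sensitive, which is an assumption of the sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy values for this sketch. */
typedef enum {
    SKETCH_FSYNC_NO,
    SKETCH_FSYNC_ALWAYS,
    SKETCH_FSYNC_EVERYSEC,
    SKETCH_FSYNC_INVALID
} sketch_policy;

/* Map the appendfsync value to a policy; NULL means "not specified". */
sketch_policy parse_appendfsync(const char *value) {
    if (value == NULL) return SKETCH_FSYNC_EVERYSEC;  /* default */
    if (strcmp(value, "always") == 0) return SKETCH_FSYNC_ALWAYS;
    if (strcmp(value, "everysec") == 0) return SKETCH_FSYNC_EVERYSEC;
    if (strcmp(value, "no") == 0) return SKETCH_FSYNC_NO;
    return SKETCH_FSYNC_INVALID;
}

/* Who performs the sync under each policy, as described in the text. */
const char *who_syncs(sketch_policy p) {
    assert(p != SKETCH_FSYNC_INVALID);
    switch (p) {
    case SKETCH_FSYNC_ALWAYS:   return "main thread, after every write";
    case SKETCH_FSYNC_EVERYSEC: return "background thread, about once per second";
    default:                    return "operating system, on its own schedule";
    }
}
```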
Now let's analyze how the AOF flush works in the Redis code:

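When AOF is enabled with appendonly yes, startAppendOnly() is called to open a file descriptor for appendonly.aof.

server.aof_fd = open(server.aof_filename,O_WRONLY|O_APPEND|O_CREAT,0644);
Redis also creates dedicated bio threads at startup to handle AOF persistence. initServer() in src/server.c calls bioInit(), which creates two threads, one for flush jobs and one for closing files. The code is as follows: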

Source file: src/bio.h

/* Background job opcodes */
#define BIO_CLOSE_FILE    0 /* Deferred close(2) syscall. */
#define BIO_AOF_FSYNC     1 /* Deferred AOF fsync. */
#define BIO_NUM_OPS       2

Source file: src/bio.c

    for (j = 0; j < BIO_NUM_OPS; j++) {
        void *arg = (void*)(unsigned long) j;
        if (pthread_create(&thread,&attr,bioProcessBackgroundJobs,arg) != 0) {
            serverLog(LL_WARNING,"Fatal: Can't initialize Background Jobs.");
            exit(1);
        }
        bio_threads[j] = thread;
    }
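When the Redis server executes a write command such as SET foo helloworld, it not only modifies the in-memory dataset but also records the operation, using the data format described earlier. Redis appends the content to the server.aof_buf buffer, which can be thought of as a small staging area: all accumulated updates go here first and are, at specific moments, written to the file or added to server.aof_rewrite_buf_blocks. Each write operation thus lands in the buffer first and is fsynced to disk periodically. At certain points (mainly governed by the auto-aof-rewrite-percentage and auto-aof-rewrite-min-size parameters) Redis also forks a child process to perform a rewrite. To avoid losing too much data if the server crashes suddenly, Redis calls flushAppendOnlyFile to write to disk at the following moments: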

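Before entering the event loop
In the server's timer function serverCron(); while Redis is running, this is the main place flushAppendOnlyFile is called
In stopAppendOnly(), which turns AOF off
Note: all the code in serverCron runs server.hz times per second. To limit how often some of it runs, Redis uses a macro, run_with_period(milliseconds) { … }, which reduces the enclosed code to running once every milliseconds milliseconds.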
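The throttling idea in the note can be modelled as a plain function. This is a stand-alone sketch of the technique, not the macro's source: period_elapsed and count_runs are hypothetical names, and hz and the tick counter are passed in as parameters instead of being read from the server struct.

```c
#include <assert.h>

/* Returns 1 if code that should run every period_ms milliseconds is due on
 * this tick, given hz ticks per second and a counter of ticks so far. */
int period_elapsed(int period_ms, int hz, long long cronloops) {
    assert(period_ms > 0 && hz > 0 && hz <= 1000);
    int tick_ms = 1000 / hz;                 /* duration of one cron tick */
    if (period_ms <= tick_ms) return 1;      /* period shorter than a tick: always run */
    return cronloops % (period_ms / tick_ms) == 0;
}

/* Count how many times a period_ms task runs over the given number of ticks. */
int count_runs(int period_ms, int hz, int ticks) {
    int runs = 0;
    for (long long loop = 0; loop < ticks; loop++) {
        if (period_elapsed(period_ms, hz, loop)) runs++;
    }
    return runs;
}
```

With hz set to 100, a 1000 ms task is due on one tick out of every 100, so count_runs(1000, 100, 1000) covers ten seconds of ticks and returns 10.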

Source file: src/server.c

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    /* ... */
    /* AOF postponed flush: Try at every cron cycle if the slow fsync
     * completed. */
    if (server.aof_flush_postponed_start) flushAppendOnlyFile(0);

    /* AOF write errors: in this case we have a buffer to flush as well and
     * clear the AOF error in case of success to make the DB writable again,
     * however to try every second is enough in case of 'hz' is set to
     * an higher frequency. */
    run_with_period(1000) {
        if (server.aof_last_write_status == C_ERR)
            flushAppendOnlyFile(0);
    }
    /* ... */
}
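The code below shows that, after the write call, flushAppendOnlyFile applies the flush policy chosen by appendfsync: with AOF_FSYNC_ALWAYS it flushes to disk immediately, and with AOF_FSYNC_EVERYSEC it creates an asynchronous background flush job. bioCreateBackgroundJob() creates the bio background job, and bioProcessBackgroundJobs() processes it.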

Source file: src/aof.c

// Call bio's job-creation function to queue a background flush job
void aof_background_fsync(int fd) {
    bioCreateBackgroundJob(BIO_AOF_FSYNC,(void*)(long)fd,NULL,NULL);
}
 
int startAppendOnly(void) {
    char cwd[MAXPATHLEN];

    server.aof_last_fsync = server.unixtime;
    // When AOF is enabled with appendonly yes, startAppendOnly() is called
    // to open a file descriptor for appendonly.aof.
    server.aof_fd = open(server.aof_filename,O_WRONLY|O_APPEND|O_CREAT,0644);
    serverAssert(server.aof_state == AOF_OFF);
    if (server.aof_fd == -1) {
        char *cwdp = getcwd(cwd,MAXPATHLEN);

        serverLog(LL_WARNING,
            "Redis needs to enable the AOF but can't open the "
            "append only file %s (in server root dir %s): %s",
            server.aof_filename,
            cwdp ? cwdp : "unknown",
            strerror(errno));
        return C_ERR;
    }
    if (server.rdb_child_pid != -1) {
        server.aof_rewrite_scheduled = 1;
        serverLog(LL_WARNING,"AOF was enabled but there is already a child process saving an RDB file on disk. An AOF background was scheduled to start when possible.");
    } else if (rewriteAppendOnlyFileBackground() == C_ERR) {
        close(server.aof_fd);
        serverLog(LL_WARNING,"Redis needs to enable the AOF but can't trigger a background AOF rewrite operation. Check the above logs for more info about the error.");
        return C_ERR;
    }
    /* We correctly switched on AOF, now wait for the rewrite to be complete
     * in order to append data on disk. */
    server.aof_state = AOF_WAIT_REWRITE;
    return C_OK;
}
 
// Performs the write and fsync operations
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;

    // Nothing buffered, nothing to write
    if (sdslen(server.aof_buf) == 0) return;

    /* Use bio's pending-job counter to check whether a background fsync
     * is still in progress; if so, record it in sync_in_progress.
     */
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(BIO_AOF_FSYNC) != 0;

    /* Unless force is set, the write may be postponed instead of done
     * right away: on Linux, write(2) can be blocked by a background fsync,
     * so writing now would block the server's main thread in write.
     */
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                server.aof_flush_postponed_start = server.unixtime;
                return;
            // If the write has been postponed for less than 2 seconds, return
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                return;
            }
            /* A background fsync is still running and the write has been
             * postponed for >= 2 seconds, so go ahead with the write
             * (which will block). If the machine crashes at this point,
             * roughly 2 seconds of AOF log data may be lost.
             */
            server.aof_delayed_fsync++;
            serverLog(LL_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }

    // Write the AOF log data buffered in server.aof_buf to the file
    latencyStartMonitor(latency);
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    latencyEndMonitor(latency);

    // Reset the postponed-flush start time
    server.aof_flush_postponed_start = 0;

    // If the write failed, try to log what happened
    if (nwritten != (signed)sdslen(server.aof_buf)) {
        static time_t last_write_error_log = 0;
        int can_log = 0;

        if ((server.unixtime - last_write_error_log) > AOF_WRITE_LOG_ERROR_RATE) {
            can_log = 1;
            last_write_error_log = server.unixtime;
        }

        if (nwritten == -1) {
            if (can_log) {
                serverLog(LL_WARNING,"Error writing to the AOF file: %s",
                    strerror(errno));
                server.aof_last_write_errno = errno;
            }
        } else {
            if (can_log) {
                serverLog(LL_WARNING,"Short write while writing to "
                                       "the AOF file: (nwritten=%lld, "
                                       "expected=%lld)",
                                       (long long)nwritten,
                                       (long long)sdslen(server.aof_buf));
            }
            // Use ftruncate to try to remove the incomplete data just appended to the AOF
            if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
                if (can_log) {
                    serverLog(LL_WARNING, "Could not remove short write "
                             "from the append-only file.  Redis may refuse "
                             "to load the AOF the next time it starts.  "
                             "ftruncate: %s", strerror(errno));
                }
            } else {
                nwritten = -1;
            }
            server.aof_last_write_errno = ENOSPC;
        }

        // Handle the error that occurred while writing the AOF file
        if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
            serverLog(LL_WARNING,"Can't recover from AOF write error when the AOF fsync policy is 'always'. Exiting...");
            exit(1);
        } else {
            server.aof_last_write_status = C_ERR;
            // Part of the data was written and could not be undone with ftruncate;
            // use sdsrange to drop from aof_buf the part already written to disk
            if (nwritten > 0) {
                server.aof_current_size += nwritten;
                sdsrange(server.aof_buf,nwritten,-1);
            }
            return;
        }
    } else {
        if (server.aof_last_write_status == C_ERR) {
            serverLog(LL_WARNING,
                "AOF write error looks solved, Redis can write again.");
            server.aof_last_write_status = C_OK;
        }
    }

    // Update the AOF file size after the write
    server.aof_current_size += nwritten;

    /* When server.aof_buf is small enough, reuse its space to avoid
     * frequent memory allocation. When it occupies a lot of space,
     * free it instead.
     */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

    /* If the no-appendfsync-on-rewrite option is enabled and a BGSAVE
     * or BGREWRITEAOF is in progress, do not fsync.
     */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    // Perform the fsync
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}

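To wrap up, let's review how AOF data gets written to disk:

The main thread finishes updating the in-memory data, calls write, and then, depending on the configuration, runs fdatasync immediately or defers it.
At startup, Redis creates dedicated bio threads to handle AOF persistence.
With appendfsync everysec, when the time comes, an asynchronous (bio) job is created.
The bio thread polls the job queue and, once it picks up a job, runs fdatasync synchronously.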

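Conclusions:

On data reliability:

With always, every write command is followed by a flush, so the least data is lost in a failure. With everysec, about 2 seconds of data can be lost. If a background flush gets stuck while a write is postponed, serverCron checks on every cycle whether the postponement has exceeded 2 seconds, and if so goes ahead with the write at once. The cycle frequency depends on the hz parameter; we set it to 100, so there are 100 cycles per second, one every 10 ms. The maximum possible loss can therefore be roughly estimated as 2.01 seconds of data.
If a failure occurs during bgrewriteaof, even more data may be lost, because the rewrite blocks the fdatasync flush; this is hard to quantify.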
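The 2.01-second figure is the 2-second postponement limit plus one serverCron cycle. The sketch below spells out that arithmetic for other hz values; worst_case_window_ms is a hypothetical name, and the result is the article's rough estimate, not a guarantee made by Redis.

```c
#include <assert.h>

/* Rough upper bound, in milliseconds, on the data-loss window under everysec:
 * the postponement limit plus the length of one cron cycle (1000 / hz ms). */
long worst_case_window_ms(int postpone_limit_s, int hz) {
    assert(postpone_limit_s >= 0 && hz > 0);
    return postpone_limit_s * 1000L + 1000L / hz;
}
```

With a 2-second limit and hz set to 100 the function returns 2010, that is, 2.01 seconds; with hz set to 10 the cycle is 100 ms and the estimate grows to 2.1 seconds.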

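The impact of AOF on latency

On how AOF affects access latency, the author of Redis wrote a dedicated blog post, "fsync() on a different thread: apparently a useless trick". Its conclusion is that bio does not improve latency much. With appendfsync everysec, fdatasync runs in the background, and writing aof_buf, which is not large, would not normally block by itself. The problem is that write has to wait for an in-flight background fdatasync to finish: fdatasync holds on to the file, write needs the same file, and so write blocks the main thread. This is why, when the RAID on our Inspur servers had performance problems, most applications were unaffected but a latency-sensitive application like Redis suffered.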

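Can AOF be turned off?

Since enabling AOF adds access latency, can it be turned off? The answer is yes. For pure caching scenarios, for example where a cache miss automatically falls back to the database, or where the data can be rebuilt quickly from the database, AOF can be turned off entirely for the best performance. In fact, even with AOF off, a crashed shard instance does not mean that shard's data is lost. In our production environment every shard has two instances, a master and a slave, kept in sync through Redis replication. When the master crashes, an automatic failover promotes the slave to master, which preserves the data. To keep the master and slave from crashing at the same time, production deployments place them on different physical machines under different switches.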

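Can Redis persistence stand in for a database?

Not yet, and it certainly does not provide the features of a relational database. Workloads with strict data reliability requirements should be cautious; consider a RocksDB-based solution instead.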

    Original author: 窦锦帅
    Original URL: https://segmentfault.com/a/1190000016096933