As the title says.

Level Compaction

Introduction

Conditions for Compaction

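When the number of files in L0 reaches level0_file_num_compaction_trigger, a compaction from L0 to L1 is triggered.
This value is usually 4. A larger value is friendlier to writes, but a read then has to check more L0 files, which lowers read performance.
The L0 -> L1 compaction may push L1 over its size limit, in which case at least one SST in L1 is picked and merged into L2.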

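Note that switching the WAL does not trigger compaction directly. However, a WAL switch causes the MemTable to be flushed and a new SST file to be generated, which may indirectly affect whether the compaction conditions are met.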

Parallel Compaction

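Compactions from L1 downward can run in parallel.
The L0 -> L1 compaction is not parallel by default, but there is a subcompaction-based parallelization feature. With it, a file may be split by key range so that it is compacted together with multiple L1 files.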

Pick Compaction

When multiple levels satisfy the compaction condition, a score is computed for each of them, and the level with the highest score is compacted:

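For levels other than L0, the score is the current size divided by the target size. Files in the level that are already being compacted are not counted in the current size.
For L0, the score is the larger of the following two values:
- the current number of files, divided by level0_file_num_compaction_trigger
- the current total size of the files, divided by max_bytes_for_level_base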
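
The scoring rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not RocksDB's implementation: the function names are made up for this note, the default values are illustrative, and the rule that a level needs a score of at least 1 before it is picked is an assumption added here.

```python
def l0_score(num_files, total_bytes,
             level0_file_num_compaction_trigger=4,
             max_bytes_for_level_base=256 * 1024 * 1024):
    """Score for L0: the larger of the file-count ratio and the size ratio."""
    return max(num_files / level0_file_num_compaction_trigger,
               total_bytes / max_bytes_for_level_base)


def level_score(file_sizes, being_compacted, target_size):
    """Score for a non-L0 level.

    Files that are already being compacted are left out of the current size.
    """
    current = sum(size for size, busy in zip(file_sizes, being_compacted)
                  if not busy)
    return current / target_size


def pick_level(scores):
    """Return the level with the highest score, or None if none reaches 1.

    The threshold of 1 is an assumption of this sketch.
    """
    level, score = max(scores.items(), key=lambda kv: kv[1])
    return level if score >= 1.0 else None
```

For example, with six L0 files totalling 64 MB, the file-count ratio (6 / 4) dominates the size ratio (64 MB / 256 MB), so the L0 score comes from the file count.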

Conditions for Compaction

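- The number of L0 files exceeds the limit: level0_file_num_compaction_trigger
- The total size of a level exceeds the limit: max_bytes_for_level_base
- The amount of data pending compaction exceeds the limit: soft/hard_pending_compaction_bytes_limit
- A single file exceeds the size limit: target_file_size_base
- Files overlap between levels: L0 compaction
- Manual trigger
- level_compaction_dynamic_level_bytes
- Cold data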

Why does RocksDB have no seek compaction?

First, RocksDB has a patch under which a file that is already cached is no longer counted toward the seek compaction penalty.

https://github.com/facebook/rocksdb/commit/c1bb32e1ba9da94d9e40af692f60c2c0420685cd

In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, the current code counts these files before the bloom filter check is applied.
This patch counts a ‘seek’ only if the file fails the bloom filter check and has to read in data block(s) from the storage.

After this improvement, seek compaction turned out to be triggered only rarely, so it was removed to reduce code complexity.

Choose Level Compaction Files

This section describes how Level Compaction chooses which files to compact.

level_compaction_dynamic_level_bytes

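level_compaction_dynamic_level_bytes allows RocksDB to adjust the target size of each level dynamically at runtime, instead of using the fixed 10x ratio between levels that it uses otherwise.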

If level_compaction_dynamic_level_bytes is false, the target size of L1 is max_bytes_for_level_base, and each subsequent level is max_bytes_for_level_multiplier * max_bytes_for_level_multiplier_additional[n] times the size of the previous one.
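
As a worked example of the static case, the sketch below computes the target sizes level by level. The function name is hypothetical, and the choice of which element of max_bytes_for_level_multiplier_additional applies to which level is my reading of the formula rather than something verified against the source code.

```python
def static_level_targets(max_bytes_for_level_base,
                         max_bytes_for_level_multiplier,
                         additional,
                         num_levels):
    """Target sizes for L1..L(num_levels-1) with dynamic level bytes off.

    `additional` stands for max_bytes_for_level_multiplier_additional.
    Assumption of this sketch: the target of level n+1 is derived from
    level n using additional[n].
    """
    targets = {1: max_bytes_for_level_base}
    for level in range(2, num_levels):
        targets[level] = (targets[level - 1]
                          * max_bytes_for_level_multiplier
                          * additional[level - 1])
    return targets


# With a 256 MB base, a multiplier of 10 and no additional multipliers,
# every level is 10x the one above it.
targets_mb = static_level_targets(256, 10, [1] * 7, 7)
```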

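If level_compaction_dynamic_level_bytes is true, the target size of each level is adjusted dynamically. The target size of the bottommost level is its actual size, and the target size of level n-1 is the target size of level n divided by max_bytes_for_level_multiplier. If the resulting size of a level is smaller than max_bytes_for_level_base / max_bytes_for_level_multiplier, that level is not used. The LSM tree therefore appears to be built from the bottommost level upward: base_level starts at 6 instead of the default 1 and then moves to lower-numbered levels one at a time.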

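As a quick walk-through: once L6 reaches a certain threshold, base_level drops to L5, and L0 is then compacted directly into L5. When L5 reaches its threshold, it is compacted into L6, which makes L6 larger and in turn raises the threshold of L5. The process converges gradually, and once the threshold of L5 has grown large enough, L4 comes into use.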
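
The bottom-up rule can be tried out with a small sketch. It follows only the description in this note (divide by the multiplier on the way up, stop once a level would fall below max_bytes_for_level_base / max_bytes_for_level_multiplier). The function name and return shape are made up here, and the real implementation may handle more corner cases.

```python
def dynamic_level_targets(last_level_bytes,
                          max_bytes_for_level_base,
                          max_bytes_for_level_multiplier,
                          num_levels=7):
    """Return (base_level, targets) derived from the last level's actual size."""
    min_enabled = max_bytes_for_level_base / max_bytes_for_level_multiplier
    last = num_levels - 1
    targets = {last: float(last_level_bytes)}
    level = last
    while level > 1:
        candidate = targets[level] / max_bytes_for_level_multiplier
        if candidate < min_enabled:
            break  # this level would be too small, so it is not used
        level -= 1
        targets[level] = candidate
    return level, targets


# Sizes in MB, base = 1000 MB, multiplier = 10.
small_base, small_targets = dynamic_level_targets(50, 1000, 10)
large_base, large_targets = dynamic_level_targets(276_000, 1000, 10)
```

With only 50 MB in the last level, nothing above L6 is used, so base_level stays at 6. With 276,000 MB in L6, levels L5, L4 and L3 get targets and base_level moves to 3.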

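With this option enabled, the compaction score logic changes as well. The score of Ln is now Ln size / (Ln target size + total_downcompact_bytes), which adds a total_downcompact_bytes term compared with before. This term is the estimated total number of bytes that will be compacted down from L0 through Ln-1 into Ln. A heavier write load means more compaction debt, so total_downcompact_bytes is larger for higher-numbered levels, and lower-numbered levels are compacted first. [Q] This logic is fairly involved and may be worth checking against the source code later.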
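
A minimal sketch of the adjusted formula is below. How total_downcompact_bytes is estimated is not covered in this note, so the sketch simply takes it as an input. The function name is hypothetical.

```python
def dynamic_score(level_bytes, target_bytes, total_downcompact_bytes):
    """Score of Ln when level_compaction_dynamic_level_bytes is enabled.

    total_downcompact_bytes is passed in directly; estimating it is
    outside the scope of this sketch.
    """
    return level_bytes / (target_bytes + total_downcompact_bytes)


# The same level looks less urgent once a lot of data from upper levels
# is expected to arrive in it.
idle_score = dynamic_score(300, 100, 0)
busy_score = dynamic_score(300, 100, 200)
```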

Intra-L0 Compaction

This is somewhat similar to delta compaction in TiFlash.

FIFO Compaction

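FIFO Compaction is a very simple strategy. It periodically deletes old data, which makes it a good fit for time-series data. Note that this means data can be deleted.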

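In FIFO Compaction, all files are in level 0. When the total size exceeds CompactionOptionsFIFO::max_table_files_size, the oldest SST is deleted. The write amplification is therefore 1, although the write amplification of the WAL also has to be taken into account.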

Because everything is in level 0, FIFO may leave a large number of SSTs in level 0, which makes reads slow. In this case, using more Bloom bits is recommended to reduce Bloom filter false positives.

Setting CompactionOptionsFIFO.allow_compaction = true makes RocksDB take at least level0_file_num_compaction_trigger files and compact them together. Files are picked from newest to oldest.

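The compaction logic is shown in the example below. Since FIFO has no versioning problem to speak of, the purpose of compaction is simply to turn multiple SST files into one large sorted SST file.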

For example, if level0_file_num_compaction_trigger = 8 and every flushed file is 100MB. Then as soon as there is 8 files, they are compacted to one 800MB file. And after we have 8 new 100MB files, they are compacted in the second 800MB, and so on. Eventually we’ll have a list of 800MB files and no more than 8 100MB files.
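
The quoted example can be reproduced with a toy simulation. It is a deliberately simplified model: it tracks only file sizes and merges as soon as the number of freshly flushed files reaches the trigger. The function name and parameters are made up for this note.

```python
def simulate_fifo_flushes(num_flushes, flush_mb=100, trigger=8):
    """Return (compacted, fresh) file sizes in MB after num_flushes flushes.

    `compacted` holds the merged files, oldest first; `fresh` holds the
    files flushed since the last merge.
    """
    compacted = []
    fresh = []
    for _ in range(num_flushes):
        fresh.append(flush_mb)
        if len(fresh) >= trigger:
            # Merge all freshly flushed files into one large file.
            compacted.append(sum(fresh))
            fresh = []
    return compacted, fresh


compacted, fresh = simulate_fifo_flushes(20)
```

After 20 flushes there are two 800 MB files and four 100 MB files, which matches the "list of 800MB files and no more than 8 100MB files" described in the quote.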

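When compaction runs: RocksDB periodically checks whether the database size exceeds compaction_options_fifo.max_table_files_size. If it does, the oldest file is dropped, one at a time, until the size limit is satisfied again.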
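
The size-based drop rule is simple enough to sketch directly. The function and the (name, size) tuples are hypothetical stand-ins for SST metadata, not RocksDB structures.

```python
from collections import deque


def fifo_drop(files, max_table_files_size):
    """Drop the oldest files until the total size fits the limit.

    `files` is a deque of (name, size) tuples ordered oldest first and is
    modified in place. Returns the names of the dropped files.
    """
    dropped = []
    total = sum(size for _, size in files)
    while files and total > max_table_files_size:
        name, size = files.popleft()  # the oldest file goes first
        total -= size
        dropped.append(name)
    return dropped


files = deque([("a.sst", 100), ("b.sst", 100), ("c.sst", 100)])
dropped = fifo_drop(files, 150)
```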

with TTL

With TTL, compaction no longer waits for the database size to exceed some limit. Instead, all SST files older than the TTL are deleted directly.
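
A sketch of the TTL rule, under the assumption that each file carries a creation timestamp in seconds. The function name and the (name, creation_time) tuples are hypothetical.

```python
def fifo_ttl_drop(files, now, ttl_seconds):
    """Split files into (kept, dropped) names according to their age.

    `files` is a list of (name, creation_time) tuples. A file is dropped
    when its age is strictly greater than ttl_seconds.
    """
    kept, dropped = [], []
    for name, created in files:
        if now - created > ttl_seconds:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


kept, dropped_by_ttl = fifo_ttl_drop(
    [("a.sst", 0), ("b.sst", 50), ("c.sst", 90)], now=100, ttl_seconds=60)
```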

Universal Compaction

Remote Compaction

As the call flow below shows, the compaction request is still issued by the Primary, but it is actually executed by the Compaction worker.

Schedule
The first step is primary DB triggers the compaction, instead of running the compaction locally, it sends the compaction information to a callback in CompactionService. The user needs to implement the CompactionService::Schedule(), which sends the compaction information to a remote process to schedule the compaction.
Compact
On the remote Compaction Worker side, it needs to run DB::OpenAndCompact() with the compaction information sent from the primary. Based on the compaction information, the worker opens the DB in read-only mode and runs the compaction. The compaction worker cannot change the LSM tree, it outputs the compaction result to a temporary location that the user needs to set.
Return Result
Once the compaction is done, the compaction result needs to be sent back to primary, which includes the metadata about the compacted SSTs and some internal information. The same as scheduling, the user needs to implement the communication between primary and compaction workers.
Install & Purge
The primary is waiting for the result by callback CompactionService::Wait(). The result should be passed to that API and return function call. After that, the primary will install the result by renaming the result SST files in the temporary workplace to the LSM files. Then the compaction input files will be purged. As RocksDB is renaming the result SST files, make sure the temporary workplace and the DB are on the same file system. If not, the user needs to copy the file to the DB file system before returning the Wait() call.
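
The four steps can be mimicked with a toy, single-process model. Nothing below is RocksDB's API: the functions, the job dictionary and the file name are all hypothetical, and a plain function call stands in for the network round trip between the primary and the worker. The one detail it does mirror is that the result is installed with a rename, which is why the temporary location and the DB directory need to be on the same file system.

```python
import os
import tempfile


def worker_compact(job, tmp_dir):
    """Stand-in for the remote worker: merge the inputs into one sorted file.

    The worker only writes to tmp_dir and never touches the DB directory.
    """
    merged = sorted(set().union(*job["inputs"]))
    out_path = os.path.join(tmp_dir, job["output_name"])
    with open(out_path, "w") as f:
        f.write("\n".join(merged))
    return {"output_path": out_path, "num_keys": len(merged)}


def primary_run(db_dir, tmp_dir, inputs):
    """Stand-in for the primary: schedule, wait for the result, install."""
    # Schedule: describe the job (a hypothetical structure).
    job = {"inputs": inputs, "output_name": "000010.sst"}
    # Compact + Return Result: a direct call replaces the remote round trip.
    result = worker_compact(job, tmp_dir)
    # Install: rename the result file from the temporary location into the DB.
    final_path = os.path.join(db_dir, job["output_name"])
    os.rename(result["output_path"], final_path)
    return final_path, result["num_keys"]
```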

Reference

https://github.com/facebook/rocksdb/wiki/
RocksDB Wiki