Database paper part 7

Contains:

  • We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs

We Ain’t Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs

https://www.usenix.org/system/files/fast24-jun.pdf

Abstract

The primary cause of the degraded performance is not due to request splitting but stems from a significant increase in die-level collisions.

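If other writes arrive while contiguous file blocks are being written, those file blocks do not land on consecutive dies, which results in random die allocation. This happens, for example, when a file is overwritten.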

In SSDs, when other writes come between writes of neighboring file blocks, the file blocks are not placed on consecutive dies, resulting in random die allocation. This randomness escalates the chances of die-level collisions, causing deteriorated read performance later. We also reveal that this may happen when a file is overwritten.

Evaluations with commercial SSDs and an SSD emulator indicate that our approach effectively curtails the read performance drop arising from both fragmentation and overwrites, all without the need for defragmentation. Representatively, when a 162 MB SQLite database was fragmented into 10,011 pieces, our approach limited the performance drop to 3.5%, while the conventional system experienced a 40% decline.

Introduction

To prevent performance degradation caused by fragmentation, file systems utilize various techniques [35], such as delayed allocation [23] and preallocation of data blocks [2], to maintain continuity among data blocks.

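SSDs have no physical head movement, which narrows the performance gap between sequential and random reads. However, [4] reports that reading a fragmented file on an SSD still incurs a 2-5x performance loss. Works such as [13, 31, 42] simply attribute this loss to request splitting in the kernel I/O path due to fragmentation.

This paper shows that the root cause of the fragmentation-induced performance loss is actually die-level collisions, which reduce the SSD's internal parallelism.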

An SSD’s firmware allocates its flash memory pages in a round-robin manner across the flash memory dies based on the order in which they are written.

Consequently, when fragmentation occurs, the pages storing contiguous file blocks cannot be placed on contiguous dies and are instead assigned to arbitrary dies.

The paper extends the NVMe protocol so that a write command can specify the page-to-die mapping.

With these hints, the page for an appending write is mapped to the die following the die where the previous file block’s page was assigned to. In addition, the page for an overwrite operation to an existing file block, which also disrupts the page-to-die mapping pattern, is mapped to the same die where the original page was located.
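
The two hint rules above can be sketched as a toy die selector. This is only an illustration under assumptions: the class `HintedAllocator`, the four-die configuration, and the `(file, block)` keying are hypothetical and are not the paper's firmware or its NVMe command format.

```python
NUM_DIES = 4  # hypothetical die count, chosen only for illustration


class HintedAllocator:
    """Toy die selector that follows the append and overwrite hint rules."""

    def __init__(self, num_dies=NUM_DIES):
        self.num_dies = num_dies
        self.next_rr = 0   # round-robin cursor used when no hint applies
        self.die_of = {}   # (file, block index) -> die holding that block

    def write(self, file, block, hinted=True):
        key = (file, block)
        prev = (file, block - 1)
        if hinted and key in self.die_of:
            # Overwrite: stay on the die where the original page lives.
            die = self.die_of[key]
        elif hinted and prev in self.die_of:
            # Append: use the die following the previous file block's die.
            die = (self.die_of[prev] + 1) % self.num_dies
        else:
            # No usable hint: plain round-robin in arrival order.
            die = self.next_rr
            self.next_rr = (self.next_rr + 1) % self.num_dies
        self.die_of[key] = die
        return die


def file_a_dies(hinted):
    """Interleave writes to files A and B and return the dies of A's blocks."""
    alloc = HintedAllocator()
    for block in range(4):
        alloc.write("A", block, hinted)
        alloc.write("B", block, hinted)
    return [alloc.die_of[("A", b)] for b in range(4)]


print(file_a_dies(hinted=False))  # [0, 2, 0, 2]
print(file_a_dies(hinted=True))   # [0, 1, 2, 3]
```

Without hints, the interleaved writes leave file A on only two dies; with hints, its four blocks land on four distinct dies even though file B was written in between.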

Background and Motivation

Old Wisdom on File Fragmentation

In the HDD era, the primary and direct cause of performance degradation from file fragmentation was the seek time between dispersed sectors of the file.

Fragmentation hurts reads more than writes, because a read must wait for completion whereas a write can be buffered.

Fragmentation affects performance at three levels:

  • kernel I/O path
    Only a single command is required for the host to instruct the storage device to perform read or write operations on contiguous storage space.
    Thus, when a sequential read occurs for a file, the Linux kernel reads the data block mapping in the file’s inode, and for each contiguous data block region, it creates a bio (block I/O) data structure. This data structure is used to create the corresponding request data structure to be passed to the device driver, which then issues the command for the request to the device.
    Through this process, a single sequential file access may be split into multiple bios and corresponding requests to the storage device, depending on the degree of file fragmentation.
  • storage device interface
    This request splitting is known to increase I/O execution time, as it increases the number of data structure creations and calls to underlying functions, including the device driver code.
    Specifically, the frequency of fetching, decoding, translating commands into storage media operations, and queuing media access operations increases. Therefore, file fragmentation also delays the processing time of the storage device controller.
  • storage media access
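
The request splitting described for the kernel I/O path can be sketched as grouping a file's block mapping into contiguous runs, one request per run. This is a simplified model: the function `split_into_extents` and the block numbers are made up for illustration, and the real kernel applies further limits (for example on request size) that are ignored here.

```python
def split_into_extents(block_map):
    """Group physical block numbers into contiguous runs of (start, length)."""
    extents = []
    for pbn in block_map:
        if extents and extents[-1][0] + extents[-1][1] == pbn:
            # This block directly follows the current run, so extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # A gap in the mapping starts a new run, hence a new request.
            extents.append((pbn, 1))
    return extents


contiguous = [100, 101, 102, 103, 104, 105, 106, 107]
fragmented = [100, 101, 300, 301, 302, 50, 51, 900]

print(split_into_extents(contiguous))  # [(100, 8)]
print(split_into_extents(fragmented))  # [(100, 2), (300, 3), (50, 2), (900, 1)]
```

The same eight-block sequential read needs one request for the contiguous file and four for the fragmented one.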

Optimizations in ext4 that reduce fragmentation:

  • The delayed allocation technique used in the ext4 file system performs data block allocation not at the write system call handling but at the time of page flush.
  • In addition, ext4 reserves a predefined window of free data blocks for each file’s inode. These reserved free blocks will be actually allocated to the file for its successive append writes.
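
Why deferring allocation to flush time helps can be seen in a toy model. The sketch below is an assumption-laden illustration, not ext4's allocator: both functions and the single "next free block" counter are hypothetical simplifications.

```python
from collections import defaultdict


def allocate_immediately(writes):
    """Assign a block at each write call, in arrival order."""
    next_free = 0
    layout = defaultdict(list)
    for name in writes:
        layout[name].append(next_free)
        next_free += 1
    return dict(layout)


def allocate_at_flush(writes):
    """Buffer all writes, then assign each file one contiguous run at flush."""
    pending = defaultdict(int)
    for name in writes:
        pending[name] += 1
    next_free = 0
    layout = {}
    for name, count in pending.items():
        layout[name] = list(range(next_free, next_free + count))
        next_free += count
    return layout


writes = ["A", "B", "A", "B", "A", "B"]  # two files appended alternately
print(allocate_immediately(writes))  # {'A': [0, 2, 4], 'B': [1, 3, 5]}
print(allocate_at_flush(writes))     # {'A': [0, 1, 2], 'B': [3, 4, 5]}
```

With immediate allocation the two files interleave on disk; allocating at flush time lets each file receive one contiguous run.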

Defragmentation approaches:

  • 【Sato】Allocates contiguous free blocks to a temporary inode, copies the fragmented file data to the temporary inode, deletes the original file, and renames the temporary inode to the original’s.
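
A rough user-space analogy of the copy-and-rename flow above is sketched below. It is not the cited tool's implementation: the function name `defragment_by_copy` is hypothetical, and whether the new copy actually receives contiguous blocks is left entirely to the file system.

```python
import os
import shutil
import tempfile


def defragment_by_copy(path):
    """Copy a file into a fresh temporary file, then rename it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            shutil.copyfileobj(src, dst)
            dst.flush()
            os.fsync(dst.fileno())
        # Replacing the original both removes it and gives the copy its name.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that this rewrites every byte of the file, which is the extra write traffic that makes defragmentation a concern for SSD lifespan.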

File Fragmentation in SSD-Era

Many researchers and vendors claim that SSDs are unaffected by fragmentation, and that defragmentation may instead shorten an SSD's lifespan.

SSDs offer significantly higher performance than a single flash memory die (chip) because they operate multiple flash dies in parallel.

NVMe provides 65,535 command queues, each able to queue 65,536 commands. The total, 65,535 × 65,536, is very large.

Specifically, NVMe SSDs offer 65,535 command queues, each capable of queueing 65,536 commands.
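
As a quick check of the scale implied by these two figures, the product can be computed directly. The constant names are illustrative.

```python
NUM_QUEUES = 65_535    # command queues, as quoted above
QUEUE_DEPTH = 65_536   # commands per queue, as quoted above

total_commands = NUM_QUEUES * QUEUE_DEPTH
print(total_commands)                      # 4294901760
print(total_commands == 2**32 - 2**16)     # True
```

That is roughly 4.29 billion commands that could be queued at once in principle.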

Even when fragmentation leads to smaller request sizes that cannot fully utilize die-level parallelism, smaller flash operations in the command queues can still be processed out-of-order, allowing most dies to be fully utilized.

Consequently, many researchers regard request splitting in the kernel I/O path and at the storage device interface as the key factor affecting performance.

Internals of Modern Flash SSDs

A die can only process one request at a time.

The FTL spreads the pages to be written across as many dies as possible.

To prevent die-level collisions for read operations, the flash translation layer (FTL) of an SSD’s firmware must perform physical page allocation in a manner that distributes the physical pages storing contiguous logical pages across as many dies as possible.

So dies are chosen in a round-robin manner, rather than writing everything into a single die.

For this purpose, the FTL of most modern SSDs selects a die in a round-robin manner when allocating a flash page for processing an incoming page write request.

Additionally, modern FTLs perform the valid page copy within the die where the page resides during the garbage collection (GC) process if the die has a sufficient number of free pages.

For example, in Fig. 2, File A is evenly distributed across four dies since its four pages were written without interference. Thus, a sequential read of File A will be performed simultaneously on these four dies, resulting in a bandwidth of up to four times the flash die performance.

In contrast, assume that the writes to File B and File C were interleaved. As the die for storing a logical page is assigned in a round-robin manner according to the order of writes performed within the SSD, both the third and last pages of File B ended up being allocated to Die 3. As a result, the time to read File B is twice as long as that for reading an ideally-placed file of the same size, such as File A.
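
The effect can be reproduced with a small simulation. The sketch assumes four dies numbered from 0, one page read per die at a time, and a made-up interleaving of writes to File B and File C; it is not the exact write order of Fig. 2.

```python
from collections import Counter

NUM_DIES = 4  # hypothetical die count, numbered 0..3


def round_robin_placement(write_order, num_dies=NUM_DIES):
    """Assign each written page to a die purely by its arrival order."""
    placement = {}
    for i, name in enumerate(write_order):
        placement.setdefault(name, []).append(i % num_dies)
    return placement


def read_time(dies):
    """Read time in units of one die read: the busiest die serializes its pages."""
    return max(Counter(dies).values())


# File A is written alone; files B and C are then written interleaved.
order = ["A"] * 4 + ["B", "B", "C", "B", "C", "C", "C", "B"]
placement = round_robin_placement(order)

print(placement["A"], read_time(placement["A"]))  # [0, 1, 2, 3] 1
print(placement["B"], read_time(placement["B"]))  # [0, 1, 3, 3] 2
```

File A reads in one time unit because its pages sit on four different dies. In this interleaving the third and last pages of File B share die 3, so its read takes two units.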