介绍下 jemalloc 的实现。目前的实现和 4.5 及之前的实现还是有比较大的差别的。因此代码主要是看的 4.5，并介绍了下 5.2.1 的几个重要的变化。

设计理念

最大化内存分配和释放的性能
减少内部碎片和外部碎片
减少线程之间的竞争和 false sharing
支持 heap profiling

实现理念

基本情况

jemalloc 中定义了一系列 size classes，这是通过 size_classes.h 生成的。从小到大可以进行编号，例如大小在 0 到 8 之间的就是 0 号，8 到 16 之间的就是 1 号。size2index 和 index2size 可以实现转换。

基于内存大小，把申请的内存分为三类：

Small 的 size 小于 page 大小，为 8/16/32 bytes 等。会返回某个 run 中的一个 region。
Large 以 page 为单位，小于 chunk 的大小。会返回一个 run。
Huge 会使用多个 chunk。
所以 Huge 是按照 chunk 对齐的，这通常被用来区分一个地址是否属于 Huge。

因为 tcache 的存在，内存分配可能和上述相比有更多的优化。

名词解释：

page 指的是操作系统的 page，一般是 4KB。mmap 系统调用分配内存以 page 为单位，但分配的内存未必是按照 page size 对齐的，所以可能要多分配一些出来，然后 trim。
chunk 指的是 jemalloc 向 OS 申请内存的最小单位。它一定是 page size 的倍数。默认是 2MiB。
run：一个 chunk 分成多个不同大小的 run。
region：一个 run 分成多个相同大小的 region。Small 对象的内存分配和释放就是标记某个 region 被使用。
bin：由 bin 管理存储同一种大小的 Small 对象（也就是说同一种 size class）的多个 run。

chuck 的基本设计

arena_chunk_s 的结构如下所示：

extent_node_t 的 node
arena_chunk_map_bits_t 即 arena_chunk_map_bits_s 类型的 map_bits
包含了 chunk 中除了 header 使用的 page 之外的所有 page 的元信息。
arena_chunk_map_misc_t 即 arena_chunk_map_misc_s
记录了各个 run 的元信息。
这个结构并没有在 arena_chunk_s 中被表示出来。应该是 C 语言的问题。
pages，也就是 payload

这个 map_bits 的长度实际上等于 map_bias，是在 arena_boot 中计算的。这个计算有个神奇的 for (i < 3) 循环，将在后面介绍。

arena_chunk_map_bits_t 是一个 size_t。它如何设置，取决于 page 是 run 的第一个、中间的、普通的 page。以及 run 的类型是未分配、small 和 large。容易推断出，一个 run 至少对应一个 page。

在初始化时还会设置：

arena_maxrun 等于 chunk 的大小减去 header 的大小。
large_maxclass：Large 类型的 class 的最大的大小。

run 的基本设计

arena_chunk_map_misc_t 是 run 的元信息，前面提到，它也是存在 chunk 的 header 中，而不会有自己单独的 run header。

ph_link 有两个互斥的用途：

在 arena_s 中的 runs_avail 堆，用来管理 arena 中所有可用的 run
在 arena_bin_s 中的 runs 堆，用来管理分配给某个 bin 的 run

struct arena_chunk_map_misc_s {
    phn(arena_chunk_map_misc_t)     ph_link;

    union {
        /* Linkage for list of dirty runs. */
        arena_runs_dirty_link_t     rd;

        /* Profile counters, used for large object runs. */
        union {
            void            *prof_tctx_pun;
            prof_tctx_t     *prof_tctx;
        };

        /* Small region run metadata. */
        arena_run_t         run;
    };
};

通过 arena_run_split_small 函数，可以将 run 分解，得到对应 size 的一系列 run，并交给对应的 bin 管理。

通过 arena_run_coalesce 函数，run 在释放时也可以被前后合并。

对于 Small 结构，有 arena_run_t 结构用来维护

struct arena_run_s {
    szind_t     binind;

    unsigned    nfree;

    bitmap_t    bitmap[BITMAP_GROUPS_MAX];
};

bin 的基本设计

runs_avail 是一个数组，数组长度等于 runs_avail_nclasses。分配给 bin 的 run 会从 arena_s 的 runs_avail 中删除，移动到 arena_bin_s 管理。其中，runcur 记录目前正在被用来分配的 run。runs 记录目前非空、非满的 run。此外：

空的 bin 不会被任何结构记录，因为这个时候内存都已经分配给用户了，用户理应通过 free 来释放内存。
满的 bin 会被会收到 runs_avail 中

struct arena_bin_s {
    malloc_mutex_t      lock;

    arena_run_t     *runcur;

    arena_run_heap_t    runs;

    /* Bin statistics. */
    malloc_bin_stats_t  stats;
};

为什么 run 被记录在 arena 而不是 chunk 中呢？

free

这里定义 pageind 为

1	pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;

5.2.1

首先需要区分 Huge、Large 和 Small：

Huge 可以判断是否和 chunk 大小也就是 2MiB 对齐。
Large 可以通过 page 对应的 map_bits 和 CHUNK_MAP_LARGE 做 and 来实现。

Small 比较麻烦，因为还需要找到对应的 region。

在 arena_dalloc_small 中处理，主要是调用 arena_dalloc_bin，然后调用 arena_decay_tick。

步骤是：

找到对应的 run page offset，也就是 pppppppp pppppppp ppp。也就是这个 region 对应的 run 的第一个 page。
然后获得这个 run 的 misc 数据。目的是找到 run 对应的 bin 是哪个。
对对应的 bin 加锁，执行 arena_dalloc_bin_locked_impl。在 4.2.1 中这个函数有一些重复计算的地方不知道为啥。这个函数主要调用 arena_run_reg_dalloc，后面还有一些对 bin 是否为空的判断。

arena_run_reg_dalloc 的有效过程是：

arena_run_regind 获得 region id
bitmap_unset(run->bitmap, &bin_info->bitmap_info, regind); 标记这个 region 被释放。

purge

muzzy 和 dirty

purge 分为 2 个阶段：

active -> dirty
变为 dirty 后，heap profiling 是无法看到的。但是 RSS 依然较高。
dirty -> muzzy
调用 madvise(MADV_FREE)
muzzy -> cleaned
调用 madvise(MADV_DONTNEED)

何时 purge

有两个 ctl，对应了 arena.$i.{purtge,decay} 命令，相关文档：

arena_i_purge_ctl
Purge all unused dirty pages for arena <i>, or for all arenas if <i> equals MALLCTL_ARENAS_ALL.
arena_i_decay_ctl
Trigger decay-based purging of unused dirty/muzzy pages for arena <i>, or for all arenas if <i> equals MALLCTL_ARENAS_ALL.
The proportion of unused dirty/muzzy pages to be purged depends on the current time; see opt.dirty_decay_ms and opt.muzy_decay_ms for details.

static const ctl_named_node_t arena_i_node[] = {
    {NAME("purge"),     CTL(arena_i_purge)},
    {NAME("decay"),     CTL(arena_i_decay)},
    {NAME("reset"),     CTL(arena_i_reset)},
    {NAME("dss"),       CTL(arena_i_dss)},
    {NAME("lg_dirty_mult"), CTL(arena_i_lg_dirty_mult)},
    {NAME("decay_time"),    CTL(arena_i_decay_time)},
    {NAME("chunk_hooks"),   CTL(arena_i_chunk_hooks)}
};

purge 和 decay 的区别，是 arena_purge 的 all 参数。实际上导致了调用的是 arena_purge_to_limit(ndirty_limit=0) 还是 arena_maybe_purge。

arena_maybe_purge 分为两种模式，受到 opt_purge 管理：

purge_mode_ratio
purge 更少的 page，直到满足 arena->ndirty <= ndirty_limit。
这个应该是对应了 lazy 模式。
purge_mode_decay
purge 尽可能多的 page，但是不违背 arena->ndirty >= ndirty_limit。
这实际上是默认值。decay 实际上是基于时间的管理。

实际上这里 ndirty_limit 就是能够容忍的最多的 dirty page 的数量。

看看默认的 purge_mode_decay 模式，首先要检查下结构 arena_decay_s：

time
Approximate time in seconds from the creation of a set of unused dirty pages until an equivalent set of unused dirty pages is purged and/or reused.
产生的 dirty pages 会在 decay_time 时间后全部 purge。默认是 10s。
interval
time / SMOOTHSTEP_NSTEPS。相当于分这么多组去 decay，而不是一次性搞完，从而避免每次占用较多的资源。
SMOOTHSTEP_NSTEPS 等于 200。
epoch
Time at which the current decay interval logically started.
We do not actually advance to a new epoch until sometime after it starts because of scheduling and computation delays, and it is even possible to completely skip epochs.
In all cases, during epoch advancement we merge all relevant activity into the most recently recorded epoch.
结合 interval 和 deadline 的理解就是 epoch 理论上就是不断自增 interval，但实际上什么时候开始是不确定的，甚至可能跳过某些 epoch。每个 epoch 开始的时候会处理之前的所有工作。
deadline
Deadline for current epoch.
This is the sum of interval and per epoch jitter. 实际上就是 [epoch + interval, epoch + 2*interval)。
Epochs always advance by precise multiples of interval, but we but we randomize the deadline to reduce the likelihood of arenas purging in lockstep.
ndirty
epoch 开始的时候 dirty page 的数量。
backlog
Trailing log of how many unused dirty pages were generated during each of the past SMOOTHSTEP_NSTEPS decay epochs, where the last element is the most recent epoch. Corresponding epoch times are relative to epoch.

因为在 purge 的过程中，会有新的 dirty page 产生，所以将整个 purge 划分为 SMOOTHSTEP_NSTEPS 组，每组分别负责 interval 时间内产生的 dirty page 的回收，每组能保留的 dirty pages 数量根据 Smootherstep 曲线，总的能保留的 dirty page 数量为 200 组的叠加，超出的会 purge。

实现在 arena_maybe_purge_decay：

arena_decay_deadline_reached 检查 decay.deadline 是否已经到达
arena_decay_epoch_advance
- arena_decay_epoch_advance_helper
- arena_decay_epoch_advance_purge
  计算出 ndirty_limit。调用 arena_purge_to_limit(ndirty_limit)，更新 decay.dirty。

Smooth steps 曲线

SMOOTHSTEP 是一个表：

x 从 0.005 开始递增到 1.000
y 是对应 x 取值时，smooth step 函数的值。对于 jemalloc 的场景，就是在 x 的时刻，需要回收大概多少的内存了
h 是 y 乘上 2 ** SMOOTHSTEP_BFP 的结果，这样转成整数计算方便很多

#define SMOOTHSTEP_VARIANT  "smoother"
#define SMOOTHSTEP_NSTEPS   200
#define SMOOTHSTEP_BFP      24
#define SMOOTHSTEP \
 /* STEP(step, h,                            x,     y) */ \
    STEP(   1, UINT64_C(0x0000000000000014), 0.005, 0.000001240643750) \
    STEP(   2, UINT64_C(0x00000000000000a5), 0.010, 0.000009850600000) \

这个 SMOOTHSTEP 宏是计算机生成的，但也可以手算，结果如下。可以看到 20 和 0x14 相等。

1
2
3

$ dc
> 5 k 2 24 ^ 0.000001240643750 * p
20.814548172800000

5.2.1 版本的 purge 和 decay

在 4.5 中，使用 dirty_time。到了 5.2.1 中，就变成了 dirty_decay_ms 和 muzzy_decay_ms 了，但值还是一样的。

这个基于 5.2.1 的分支上有我的一些调试记录。

这里，有个结论是 5.2.1 上如果不开启 background thread 特性，那么指定 dirty_decay_ms 为非 0 值，则可能起不到预期的回收 dirty 页面的效果。我也在 tiflash 这个 binary 上进行了测试，验证过：

关闭 background thread，但是开启 dirty_decay_ms，则后续手动 purge 仍有内存释放
开启 background thread，且开启 dirty_decau_ms，则内存会被回收，稍后手动 purge 不会进一步释放内存

原因是回收的逻辑大概如下：

如果 decay_ms 为 0，则在 arena_maybe_decay 中会直接调用 arena_decay_to_limit 去 purge。这实际上是会 purge 最多 current_npages。
否则判断是否到达这个 epoch 的 deadline
- 如果是，则调用 arena_decay_epoch_advance
  在这个函数中，会判断是否开启 background thread 特性。
  如果关闭，会调用 arena_decay_try_purge，这也是 background thread 关闭后唯一释放内存的路径。如果开启，则只有当前线程是 background thread 的时候才会调用 arena_decay_try_purge。
  如果 purge，会 purge 最多 current_npages。
- 如果否，则只有当前线程是 background thread 的时候，才会调用 arena_decay_try_purge。purge 最多 arena_decay_backlog_npages_limit。

从上述逻辑发现，只要 background thread 特性被关闭，则 arena_maybe_decay 即使被调用，也不会有渐进的 decay。

再往上的调用关系：

arena_decay_ms_set 和 arena_decay_impl 会调用 arena_maybe_decay。
arena_decay_dirty 和 arena_decay_muzzy 会调用 arena_decay_impl。
- arena_extents_dirty_dalloc 和 arena_decay 会调用 arena_decay_dirty。

我在 test.cpp 中做了实验。在 free 的时候，会调用 arena_extents_dirty_dalloc。而 arena_decay 在下面被调用：

arena_decay_ticks
background_work_sleep_once
ctl 方法 arena_i_decay
ctl 方法 arena_i_destroy_ctl
tcache_destroy

4.5 如何 purge

这里以 4.5 的 arena_purge_to_limit 来讲解。

首先介绍两个结构 rd_link 和 cc_link。它们就是我们要清理的对象，都是 qr 结构的链表。

rd_link 对应了要回收的 run，cc_link 对应了要回收的 chunk。容易想到，run 和 chunk 的回收方式可能是不一样的，chunk 可能最终会被通过 munmap 返回给系统。

struct arena_runs_dirty_link_s {
    qr(arena_runs_dirty_link_t) rd_link;
};

struct extent_node_s {
    ...
    /* Linkage for arena's runs_dirty and chunks_cache rings. */
    arena_runs_dirty_link_t rd;
    qr(extent_node_t)   cc_link;
};

代码如下，可以看到，它将要释放的 chunk 和 run 分别放到 purge_chunks_sentinel 和 purge_runs_sentinel 中进行管理。

static void
arena_purge_to_limit(tsdn_t *tsdn, arena_t *arena, size_t ndirty_limit)
{
    chunk_hooks_t chunk_hooks = chunk_hooks_get(tsdn, arena);
    size_t npurge, npurged;
    arena_runs_dirty_link_t purge_runs_sentinel;
    extent_node_t purge_chunks_sentinel;

    arena->purging = true;

    /*
     * Calls to arena_dirty_count() are disabled even for debug builds
     * because overhead grows nonlinearly as memory usage increases.
     */
    if (false && config_debug) {
        size_t ndirty = arena_dirty_count(arena);
        assert(ndirty == arena->ndirty);
    }
    assert(opt_purge != purge_mode_ratio || (arena->nactive >>
        arena->lg_dirty_mult) < arena->ndirty || ndirty_limit == 0);

    qr_new(&purge_runs_sentinel, rd_link);
    extent_node_dirty_linkage_init(&purge_chunks_sentinel);

    npurge = arena_stash_dirty(tsdn, arena, &chunk_hooks, ndirty_limit,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    if (npurge == 0)
        goto label_return;
    npurged = arena_purge_stashed(tsdn, arena, &chunk_hooks,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurged == npurge);
    arena_unstash_purged(tsdn, arena, &chunk_hooks, &purge_runs_sentinel,
        &purge_chunks_sentinel);

    if (config_stats)
        arena->stats.npurge++;

label_return:
    arena->purging = false;
}

这里的脉络是：

extent_node_dirty_linkage_init
arena_stash_dirty
arena_purge_stashed
arena_unstash_purged 是真正的释放过程

chunk 的释放策略

调用 chunk_dalloc_wrapper，这里面对 chunk_hooks 有个判断，默认是走 chunk_dalloc_default_impl。最终实际上会调用 mummap。

static bool
chunk_dalloc_default_impl(void *chunk, size_t size)
{

    if (!have_dss || !chunk_in_dss(chunk))
        return (chunk_dalloc_mmap(chunk, size));
    return (true);
}

bool
chunk_dalloc_mmap(void *chunk, size_t size)
{

    if (config_munmap)
        pages_unmap(chunk, size);

    return (!config_munmap);
}

void
pages_unmap(void *addr, size_t size)
{

#ifdef _WIN32
    if (VirtualFree(addr, 0, MEM_RELEASE) == 0)
#else
    if (munmap(addr, size) == -1)
#endif
    {
...
    }
}

run 的释放策略

调用 arena_run_dalloc。

如果释放这个 run 导致 chunk 闲置，则调用 arena_chunk_dalloc。这个函数最终可能会走到 arena_maybe_purge，从而导致 chunk 被回收
如果这个 run 是 dirty bit 被 set 了，则调用 arena_maybe_purge

5.2.1 如何 purge

tcache

tcache 的主要目标是优化多线程下内存分配性能：

减少锁争用：通过缓存内存块，线程可以在自己的本地缓存中处理分配和释放操作，避免频繁访问共享的全局 arena 数据结构
提高分配速度：小对象的内存分配和释放可以直接从 tcache 完成
降低内存碎片化

tsd

tsd 的主要作用是避免线程之间的资源竞争。每个线程可以有自己独立的一份分配器状态。tsd 的实现依赖于线程局部存储（Thread-Local Storage，TLS）。tsd 中就包含了 tcache 结构。tsdn 和 tsd 的区别是 tsdn 是 nullable 的。

“分配器状态”中包含 tsd_state_t 以及 MALLOC_TSD 中的项目。

struct tsd_s {
    tsd_state_t state;
#define O(n, t)                             \
    t       n;
MALLOC_TSD
#undef O
};

MALLOC_TSD 中的项目有

#define MALLOC_TSD                          \
/*  O(name,         type) */                \
    O(tcache,           tcache_t *)             \
    O(thread_allocated,     uint64_t)               \
    O(thread_deallocated,   uint64_t)               \
    O(prof_tdata,       prof_tdata_t *)             \
    O(iarena,           arena_t *)              \
    O(arena,            arena_t *)              \
    O(arenas_tdata,     arena_tdata_t *)            \
    O(narenas_tdata,        unsigned)               \
    O(arenas_tdata_bypass,  bool)                   \
    O(tcache_enabled,       tcache_enabled_t)           \
    O(quarantine,       quarantine_t *)             \
    O(witnesses,        witness_list_t)             \
    O(witness_fork,     bool)                   \

tcache_alloc_small

以 small 对象为例，尝试分配。

首先通过 tcache_alloc_easy 从对应的 bin 中选择尝试分配。如果失败，则要调用 arena_tcache_fill_small 从 arena 中申请整块 run 进行分配，run 剩余部分也归这个 tcache。

首先了解下结构

struct tcache_bin_s {
    tcache_bin_stats_t tstats;
    int     low_water;  /* Min # cached since last GC. */
    unsigned    lg_fill_div;    /* Fill (ncached_max >> lg_fill_div). */
    unsigned    ncached;    /* # of cached objects. */
    /*
     * To make use of adjacent cacheline prefetch, the items in the avail
     * stack goes to higher address for newer allocations.  avail points
     * just above the available space, which means that
     * avail[-ncached, ... -1] are available items and the lowest item will
     * be allocated first.
     */
    void        **avail;    /* Stack of available objects. */
};

在 full 的时候，需要分配 ncached_max >> lg_fill_div 到 avail 结构中。lg_fill_div 是一个从 1 开始动态调整的值。对于 small 分配，ncached_max 是 TCACHE_NSLOTS_SMALL_MIN = 20。

arena_tcache_fill_small 的实现就是调用 arena_run_reg_alloc 在 run 中分配。如果没有 run，或者 run 用完了，就 arena_bin_malloc_hard。这个函数也会设置 runcur，所以后续的调用能够使用。

void
arena_tcache_fill_small(tsdn_t *tsdn, arena_t *arena, tcache_bin_t *tbin,
    szind_t binind, uint64_t prof_accumbytes)
{
    unsigned i, nfill;
    arena_bin_t *bin;

    assert(tbin->ncached == 0);

    if (config_prof && arena_prof_accum(tsdn, arena, prof_accumbytes))
        prof_idump(tsdn);
    bin = &arena->bins[binind];
    malloc_mutex_lock(tsdn, &bin->lock);
    for (i = 0, nfill = (tcache_bin_info[binind].ncached_max >>
        tbin->lg_fill_div); i < nfill; i++) {
        arena_run_t *run;
        void *ptr;
        if ((run = bin->runcur) != NULL && run->nfree > 0)
            ptr = arena_run_reg_alloc(run, &arena_bin_info[binind]);
        else
            ptr = arena_bin_malloc_hard(tsdn, arena, bin);
        if (ptr == NULL) {
            /*
             * OOM.  tbin->avail isn't yet filled down to its first
             * element, so the successful allocations (if any) must
             * be moved just before tbin->avail before bailing out.
             */
            if (i > 0) {
                memmove(tbin->avail - i, tbin->avail - nfill,
                    i * sizeof(void *));
            }
            break;
        }
        if (config_fill && unlikely(opt_junk_alloc)) {
            arena_alloc_junk_small(ptr, &arena_bin_info[binind],
                true);
        }
        /* Insert such that low regions get used first. */
        *(tbin->avail - nfill + i) = ptr;
    }
    if (config_stats) {
        bin->stats.nmalloc += i;
        bin->stats.nrequests += tbin->tstats.nrequests;
        bin->stats.curregs += i;
        bin->stats.nfills++;
        tbin->tstats.nrequests = 0;
    }
    malloc_mutex_unlock(tsdn, &bin->lock);
    tbin->ncached = i;
    arena_decay_tick(tsdn, arena);
}

DSS(Dynamic Storage Segments)

malloc 的简单流程

主要是调用 ialloc_body。这里 usize 是对应的 size class 的最小值，即 index2size。

JEMALLOC_EXPORT JEMALLOC_ALLOCATOR JEMALLOC_RESTRICT_RETURN
void JEMALLOC_NOTHROW *
JEMALLOC_ATTR(malloc) JEMALLOC_ALLOC_SIZE(1)
je_malloc(size_t size)
{
    void *ret;
    tsdn_t *tsdn;
    size_t usize JEMALLOC_CC_SILENCE_INIT(0);

    if (size == 0)
        size = 1;

    if (likely(!malloc_slow)) {
        ret = ialloc_body(size, false, &tsdn, &usize, false);
        ialloc_post_check(ret, tsdn, usize, "malloc", true, false);
    } else {
        ret = ialloc_body(size, false, &tsdn, &usize, true);
        ialloc_post_check(ret, tsdn, usize, "malloc", true, true);
        UTRACE(0, size, ret);
        JEMALLOC_VALGRIND_MALLOC(ret != NULL, tsdn, ret, usize, false);
    }

    return (ret);
}

ialloc_body 最终通过 ialloc -> iallocztm 走到 arena_malloc。

JEMALLOC_ALWAYS_INLINE_C void *
ialloc_body(size_t size, bool zero, tsdn_t **tsdn, size_t *usize,
    bool slow_path)
{
    tsd_t *tsd;
    szind_t ind;

    if (slow_path && unlikely(malloc_init())) {
        *tsdn = NULL;
        return (NULL);
    }

    tsd = tsd_fetch();
    *tsdn = tsd_tsdn(tsd);
    witness_assert_lockless(tsd_tsdn(tsd));

    ind = size2index(size);
    if (unlikely(ind >= NSIZES))
        return (NULL);

    if (config_stats || (config_prof && opt_prof) || (slow_path &&
        config_valgrind && unlikely(in_valgrind))) {
        *usize = index2size(ind);
        assert(*usize > 0 && *usize <= HUGE_MAXCLASS);
    }

    if (config_prof && opt_prof)
        return (ialloc_prof(tsd, *usize, ind, zero, slow_path));

    return (ialloc(tsd, size, ind, zero, slow_path));
}

arena_malloc 中，如果开启了 tcache，就尝试从 tcache 去分配。否则，直接通过 arena_malloc_hard 从 arena 分配。所以在 4.5 中，大部分情况下的 small 和 large 都是通过 tcache 走的。

JEMALLOC_ALWAYS_INLINE void *
arena_malloc(tsdn_t *tsdn, arena_t *arena, size_t size, szind_t ind, bool zero,
    tcache_t *tcache, bool slow_path)
{

    assert(!tsdn_null(tsdn) || tcache == NULL);
    assert(size != 0);

    if (likely(tcache != NULL)) {
        if (likely(size <= SMALL_MAXCLASS)) {
            return (tcache_alloc_small(tsdn_tsd(tsdn), arena,
                tcache, size, ind, zero, slow_path));
        }
        if (likely(size <= tcache_maxclass)) {
            return (tcache_alloc_large(tsdn_tsd(tsdn), arena,
                tcache, size, ind, zero, slow_path));
        }
        /* (size > tcache_maxclass) case falls through. */
        assert(size > tcache_maxclass);
    }

    return (arena_malloc_hard(tsdn, arena, size, ind, zero));
}

从实现分析设计理念

最大化内存分配和释放的性能

主要体现在：

避免大量的向系统申请内存的动作
快速定位可以被用来分配的内存

快速定位上：

给定任意地址 ptr，通过 CHUNK_ADDR2BASE 就可以找到对应的 chunk 的地址，进而就可以找到对应的几个元信息。诸如 map_bias 的是全局变量，所以也能在常数时间中计算得到 bits 和 mics 的偏移。
通过 ptr 和 chunk 的地址 c，就可以通过 ((uintptr_t)ptr - (uintptr_t)chunk)>>LG_PAGE 算出 page id。

chunk 的分配上：

使用红黑树缓存之前已经分配，但是目前空闲的 chunk。使用两种红黑树来支持两种排序方式：size、address 和 address。
使用 spare 缓存最近空闲的 chunk，减少对红黑树的访问。
使用 retain 缓存之前已经分配，但是目前被释放了的 chunk。

减少内部碎片和外部碎片

chunk 的各个 page 和各个 run 的元信息都放在了 chunk 头部，减少了内部碎片。

减少线程之间的竞争

jemalloc 会创建多个 arena，每个线程由一个 arena 负责。默认创建 CPU * 4 数量的 arena。在每个 arena 中使用 nthreads 记录负责的线程数量。

每个线程分配内存时，会基于下面的逻辑选择 arena：

若 nthreads==0 已创建的 arena，则选择该 arena
若还有未创建的 arena，则选择新创建一个 arena
选择 nthreads 最少的 arena

后续版本的一些优化

background thread

线程名是 jemalloc_bg_thd。

opt_retain

主要是在 dealloc 的时候要不要调用 pages_unmap。

在 5.2.1 中引入了 opt_retain。如果开启了 opt_retain，那么进行 munmap 可能还会导致 rss 无法快速释放。原因是 purge 的时候可能使用 MADV_FREE。【Q】但实际我好像没观察到这一部分影响有多大。

bool
pages_purge_lazy(void *addr, size_t size) {
    assert(ALIGNMENT_ADDR2BASE(addr, os_page) == addr);
    assert(PAGE_CEILING(size) == size);

    if (!pages_can_purge_lazy) {
        return true;
    }
    if (!pages_can_purge_lazy_runtime) {
        /*
         * Built with lazy purge enabled, but detected it was not
         * supported on the current system.
         */
        return true;
    }

#ifdef _WIN32
    VirtualAlloc(addr, size, MEM_RESET, PAGE_READWRITE);
    return false;
#elif defined(JEMALLOC_PURGE_MADVISE_FREE)
    return (madvise(addr, size,
#  ifdef MADV_FREE
        MADV_FREE
#  else
        JEMALLOC_MADV_FREE
#  endif
        ) != 0);
#elif defined(JEMALLOC_PURGE_MADVISE_DONTNEED) && \
    !defined(JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS)
    return (madvise(addr, size, MADV_DONTNEED) != 0);
#else
    not_reached();
#endif
}

一些写法

计算 map_bias

在 arena_boot 中计算。

变量常量介绍：

chunk_npages 表示 chunk 中 page 的总数。
LG_PAGE：一个 page 是 2 ** LG_PAGE 这么大。

每次迭代算出一个 header_size，并向上取整计算 header 需要多少 page 存放。

map_bias = 0;
for (i = 0; i < 3; i++) {
    size_t header_size = offsetof(arena_chunk_t, map_bits) +
        ( ( sizeof(arena_chunk_map_bits_t) +
            sizeof(arena_chunk_map_misc_t) ) * (chunk_npages-map_bias) );
    map_bias = (header_size + PAGE_MASK) >> LG_PAGE;
}
assert(map_bias > 0);

随便取几个值跑下这个算法，可以看出，第二个迭代中，实际分配的 header 的大小 allocated_header_size 可能小于真实需要的 header 大小 real_payload_header_size。如果 bits_size + misc_size 越大，那么这个效应越明显。

#include <cstdio>
const int bits_size = 4;
const int misc_size = 128;
const int fixed_head = 1024;
int main() {
    int chunk_npages = 4096;
    int map_bias = 0;
    for (int i = 0; i < 3; i++) {
        size_t header_size = fixed_head +
            ( ( bits_size +
                misc_size ) * (chunk_npages-map_bias) );
        map_bias = (header_size + 4096) >> 12;
        int allocated_header_size = map_bias * 4096;
        int real_payload_header_size = (bits_size + misc_size) * (chunk_npages-map_bias) + fixed_head;
        printf("map_bias %u payload %u header_size %u real_payload_header_size %u allocated_header_size %u\n", map_bias, chunk_npages-map_bias, header_size, real_payload_header_size, allocated_header_size);
    }
}

map_bias 133 payload 3963 header_size 541696 real_payload_header_size 524140 allocated_header_size 544768
map_bias 128 payload 3968 header_size 524140 real_payload_header_size 524800 allocated_header_size 524288
map_bias 129 payload 3967 header_size 524800 real_payload_header_size 524668 allocated_header_size 528384

可见：

第一个迭代假设每个 page 都有一个 map_bit 和 map_misc，这种情况下算出来的 map_bias 偏大，payload 就少了。
第二个迭代因为 payload 少了，所以 header 就会偏小。

一些结构

qr

是一个 deque，或者 List 的实现。

rb

提供了红黑树相关的方法，例如 _remove 等。

红黑树被广泛使用，例如：

extent_tree_szad_
extent_tree_ad_

extent

extent 是一个基于红黑树的，用来分配内存的结构。

一般调用类似 extent_tree_szad_remove 的方法，表示从 extent 中分配一块内存。此时，会从 extent 的 szad 中删除对应内存的标记信息。

bitmap

bitmap 中记录了一些比较有趣的实现。比如 USE_TREE 提出了一个有意思的结构。

bitmap 初始都是 1，set 一个位会将它变为 0。然后有一系列的比如 full、get、set、unset 之类的方法。包括 sfu 这个方法可以查询第一个 bit 0 的位置，然后将其设置为 1。

phn

这是一个堆的实现。

base

这是 base allocator，是 jemalloc 初始化时候使用的一个低级分配方式。

void *
base_alloc(tsdn_t *tsdn, size_t size)
{
...
    usize = s2u(csize);
    extent_node_init(&key, NULL, NULL, usize, false, false);
    malloc_mutex_lock(tsdn, &base_mtx);
    node = extent_tree_szad_nsearch(&base_avail_szad, &key);
    if (node != NULL) {
        /* Use existing space. */
        extent_tree_szad_remove(&base_avail_szad, node);
    } else {
        /* Try to allocate more space. */
        node = base_chunk_alloc(tsdn, csize);
    }

一些知识

Linux 的 overcommit 机制

如果关闭 overcommit 机制，那么大部分的 OOM 现象将会消失，因为系统分配内存会更加保守
如果开启 overcommit 机制，会根据物理内存、swap 和 overcommit_ratio 判断是否能够分配出内存，依然可能会分配失败
1
CommitLimit = [swap size] + [RAM size] * vm.overcommit_ratio / 100

madvise

MADV_NORMAL
默认行为，内核根据普通访问模式处理页面。
MADV_RANDOM
随机访问，可能减少 prefetch 操作。
MADV_SEQUENTIAL
顺序访问，可以增加 prefetch。
MADV_WILLNEED
表示该内存区域将在未来使用，内核可以提前加载到内存。
MADV_DONTNEED / MADV_FREE / munmap
三个都是释放内存。后面详细描述。
MADV_REMOVE
从文件映射中移除内存区域，释放物理内存，同时在映射的文件中移除相应内容（需要文件映射）。
MADV_DONTFORK
子进程不会继承该内存区域。
MADV_DOFORK
MADV_MERGEABLE
启用内存合并（KSM，Kernel Samepage Merging），允许内核将具有相同内容的内存页面合并以节省内存。
MADV_UNMERGEABLE
MADV_HUGEPAGE
HugePages
MADV_NOHUGEPAGE
MADV_SOFT_OFFLINE
将指定区域中的坏内存页标记为不可用，但不杀死当前进程。
MADV_HWPOISON
强制将页面标记为硬件错误（仅管理员权限可用）。

三个释放内存：

munmap
即时释放物理内存(RSS)以及虚拟地址。
MADV_DONTNEED
即时释放物理内存(RSS)。保留虚拟地址。
MADV_FREE
延迟释放。

一些方法

如何编译

1
2
3

./configure --enable-prof --enable-log
make -j40 && g++ test.cpp -Iinclude -ljemalloc -o test2 && ./test2
LD_LIBRARY_PATH=./lib/ ./test2

如何 debug

指定 log 这个隐藏 conf，就可以打印日志。注意，需要 --enable-log 编译选项。

1	export MALLOC_CONF="log:."

在需要的地方可以使用

1	LOG("module_name", "formatter", ...)

然后可以指定 log:module_name，从而只打印自己想要的一部分日志。

例如可以 export MALLOC_CONF="prof:true,lg_prof_interval:1,prof_active:true,log:." 测试 dump。

Reference

https://youjiali1995.github.io/allocator/jemalloc/
基于 4.5 之前的版本
https://yaoguais.github.io/article/jemalloc/structures.html
https://github.com/leebaok/jemalloc-4.2.1-readcode/blob/readcode/readcode/more.md
https://youjiali1995.github.io/allocator/jemalloc-purge/
5.0.1 版本 purge 的改进