An Attempt at Profiling the Jemalloc Code

A Short Introduction

Jemalloc is a general-purpose malloc implementation. It first came into use as the FreeBSD libc allocator in 2005, and since then it has found its way into numerous applications that rely on its predictable behavior. Jemalloc is now used in thousands of real-world applications to speed up and strengthen memory allocation. It performs more efficiently than glibc malloc (a.k.a. ptmalloc) in multi-threaded situations and offers higher scalability.

API

  • POSIX API

jemalloc provides the common functions for allocating and deallocating memory, including malloc, calloc, realloc, free, memalign, and aligned_alloc.

  • Non-Standard API
    void *mallocx(size_t size, int flags);
    void *rallocx(void *ptr, size_t size, int flags);
    size_t xallocx(void *ptr, size_t size, size_t extra, int flags);
    size_t sallocx(void *ptr, int flags);
    void dallocx(void *ptr, int flags);
    void sdallocx(void *ptr, size_t size, int flags);
    size_t nallocx(size_t size, int flags);
    int mallctl(const char *name, void *oldp, size_t *oldlenp,
                void *newp, size_t newlen); // This one provides control over allocator configuration
    int mallctlnametomib(const char *name, size_t *mibp, size_t *miblenp);
    int mallctlbymib(const size_t *mib, size_t miblen, void *oldp,
                     size_t *oldlenp, void *newp, size_t newlen);
    void malloc_stats_print(void (*write_cb)(void *, const char *),
                            void *cbopaque, const char *opts); // For debugging code
    size_t malloc_usable_size(const void *ptr);
    void (*malloc_message)(void *cbopaque, const char *s);
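As a sketch of how these fit together (assuming jemalloc is installed and the program is linked with -ljemalloc), one can allocate with mallocx and query allocator statistics through mallctl:

```c
#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    /* MALLOCX_ZERO asks for zero-initialized memory */
    void *p = mallocx(1024, MALLOCX_ZERO);
    if (p == NULL) return 1;

    /* sallocx reports the real (possibly rounded-up) size of the allocation */
    printf("usable size: %zu\n", sallocx(p, 0));

    /* Statistics are cached; writing "epoch" refreshes them */
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    mallctl("epoch", &epoch, &sz, &epoch, sz);

    /* Read the "stats.allocated" counter */
    size_t allocated;
    sz = sizeof(allocated);
    if (mallctl("stats.allocated", &allocated, &sz, NULL, 0) == 0)
        printf("bytes allocated: %zu\n", allocated);

    /* sdallocx lets the allocator skip a size lookup on free */
    sdallocx(p, 1024, 0);
    return 0;
}
```

The "epoch" write before reading statistics is required because jemalloc snapshots its counters rather than updating them on every call.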

Compiling

  • You can build the jemalloc library in several ways; please refer to this link for more details
  • My procedure for integrating jemalloc into an application:
    1. Use the jemalloc-config script in the bin directory to locate your jemalloc installation directories
    2. Run ./configure and then run make in bash
    3. Use the LD_PRELOAD environment variable to load libjemalloc ahead of the system allocator
    4. Compile the program like this: cc ex_stats_print.c -o ex_stats_print -I`jemalloc-config --includedir` -L`jemalloc-config --libdir` -Wl,-rpath,`jemalloc-config --libdir` -ljemalloc `jemalloc-config --libs`
  • If you want to change some details or use some of jemalloc's less obvious functions:
    1. You can use the MALLOC_CONF environment variable to control/tune jemalloc, like this: export MALLOC_CONF="prof:true,lg_prof_sample:1,prof_accum:false,prof_prefix:jeprof.out"; there is a link describing how to use these options
    2. Alternatively, you can change the configuration before building the jemalloc project. These configure options are enabled in libc's built-in jemalloc: --enable-dss, --enable-experimental, --enable-fill, --enable-lazy-lock, --enable-munmap, --enable-stats, --enable-tcache, --enable-tls, --enable-utrace, and --enable-xmalloc.
    3. Attention: building jemalloc with the --with-jemalloc-prefix option gives the API a prefix, which makes the library less convenient to use. Every call to XXX must be replaced with je_XXX, which means reworking the whole program.

Implementation Details

  • How does jemalloc get memory from the OS?

    • Jemalloc uses mmap or sbrk to obtain memory from the lower-level API. However, the user can choose a different strategy to control whether memory is obtained from mmap, sbrk, or both.
    • Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
  • Small size allocation and large size allocation

  • Other details

This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.

In addition to multiple arenas, this allocator supports thread-specific caching, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.

Memory is conceptually broken into extents. Extents are always aligned to multiples of the page size. This alignment makes it possible to find metadata for user objects quickly. User objects are broken into two categories according to size: small and large. Contiguous small objects comprise a slab, which resides within a single extent, whereas large objects each have their own extents backing them.

Small objects are managed in groups by slabs. Each slab maintains a bitmap to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other object size classes are multiples of the quantum, spaced such that there are four size classes for each doubling in size, which limits internal fragmentation to approximately 20% for all but the smallest size classes. Small size classes are smaller than four times the page size, and large size classes extend from four times the page size up to the largest size class that does not exceed PTRDIFF_MAX.

Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.

Reference