My Understanding about Hierarchy of Needs
How to Build a Non-Volatile Memory DBMS (Survey)
This is a summary of a paper and a related talk from SIGMOD '17. The paper surveys the three main ways to make use of NVM's special features when building a DBMS.
NVM properties:
- Byte addressable: accessed with loads and stores, unlike SSD/HDD
- High random write throughput: orders of magnitude higher than SSD/HDD, with a smaller gap between sequential and random write throughput
- Read-write asymmetry & wear-leveling: writes may take longer to complete than reads, and excessive writes to a single NVM cell can destroy it
Evaluating NVM
- Compare HDD / SSD / emulated NVM
- Emulate NVM by tuning (increasing) DRAM latency
- Use the CLWB (cache-line write back) instruction
Three levels in NVM DBMS:
1) Access Interfaces,
2) Storage Manager
3) Execution Engine.
Allocation interface
Standard functions
- Dynamic memory allocation
- Meta-data management
- Durability mechanism
- Two-step implementation
- Allocate first, persisting with CLWB
- Issue an SFENCE to make sure the changes are visible
- Naming mechanism
- Map NVM to a well-known base address, so that
Absolute pointer = Base address + Relative pointer
- Recovery mechanism
- Consider a crash before the chunk is linked into the data structure: the chunk is leaked, since neither the allocator nor the application can reclaim the space.
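The naming mechanism above can be sketched as follows (an illustration, not the paper's code): pointers stored in NVM are kept relative to the mapped base address, so they remain valid when the region is mapped at a different address after a restart.

```cpp
#include <cassert>
#include <cstdint>

// Illustration: data structures on NVM store *relative* pointers
// (offsets from the mapped base address) rather than raw addresses.
struct RelPtr {
    std::uintptr_t offset;  // what actually gets persisted in NVM
};

inline RelPtr to_relative(void *base, void *abs) {
    return RelPtr{static_cast<std::uintptr_t>(
        static_cast<char *>(abs) - static_cast<char *>(base))};
}

inline void *to_absolute(void *base, RelPtr p) {
    return static_cast<char *>(base) + p.offset;
}
```

On recovery, the allocator maps the NVM region and only needs the new base address to turn every stored offset back into a usable pointer.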
File system interface
- Device -> page cache -> app buffer => Device -> app buffer
- 64-byte cache lines are written atomically
Storage Manager
- Logging & Recovery
- Write-ahead log ("Let's Talk About Storage and Recovery Methods for Non-Volatile Memory Database Systems", SIGMOD 2015)
- Avoid duplicating data in the log and the checkpoint
- Log the meta-data, not the data updates: WAL + FS API vs. WA(meta-data)L + Allocation API
- Write-behind log ("Write-Behind Logging", VLDB 2017)
- Pros: write-behind logging enables instant recovery, improves device utilization by reducing data duplication, and extends the device lifetime
- Data placement
- Three-tiers: DRAM (meta-data + data) + NVM (log + data) + SSD (data)
- DRAM -> hot data (cache), NVM -> warm data, SSD -> cold data
Access Methods
- Read-write asymmetry & wear-leveling: writes may take longer to complete than reads, and excessive writes to a single NVM cell can destroy it
- Write-limited access methods: NVM-aware B+Tree and hash table
- NVM-aware B+Tree:
- In leaf nodes, use a linear scan over unsorted data (fewer writes) instead of a binary search over sorted data (more writes)
- Use an unbalanced tree (fewer writes)
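The unsorted-leaf idea can be sketched like this (illustrative only, not the paper's implementation): inserting into a sorted leaf shifts O(n) entries (many NVM writes), while appending to an unsorted leaf writes a single slot; lookups pay with a linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Sketch of an unsorted B+Tree leaf node that trades scan cost for writes.
struct Leaf {
    static constexpr std::size_t kCap = 16;
    int keys[kCap];
    long vals[kCap];
    std::size_t count = 0;
    std::size_t writes = 0;  // counts slot writes, to show the saving

    bool insert(int k, long v) {
        if (count == kCap) return false;
        keys[count] = k;      // single-slot write, no shifting of entries
        vals[count] = v;
        ++count;
        ++writes;
        return true;
    }

    std::optional<long> find(int k) const {
        for (std::size_t i = 0; i < count; ++i)  // linear scan, unsorted
            if (keys[i] == k) return vals[i];
        return std::nullopt;
    }
};
```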
Execution engine
- Plan executor
- Reduce the number of writes in sorting and join algorithms by using 1) a tunable write-intensity knob and 2) write-limited algorithms
- Hybrid join/sort
- Segment sort => adjust x (x% merge sort + (1-x)% selection sort)
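A minimal sketch of the segment-sort knob (illustrative, with `std::stable_sort` standing in for merge sort): the first x fraction of the input is sorted with merge sort (fast but write-heavy), the remainder with selection sort, which performs at most n swaps (few NVM writes); the two runs are then merged.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Write-limited part: selection sort does at most one swap per position.
void selection_sort(std::vector<int>::iterator b,
                    std::vector<int>::iterator e) {
    for (auto i = b; i != e; ++i) {
        auto m = std::min_element(i, e);
        if (m != i) std::iter_swap(i, m);
    }
}

// x in [0,1] is the write-intensity knob from the text.
void segment_sort(std::vector<int> &v, double x) {
    auto mid = v.begin() + static_cast<long>(x * v.size());
    std::stable_sort(v.begin(), mid);  // merge-sort stand-in (write-heavy)
    selection_sort(mid, v.end());      // write-limited remainder
    std::inplace_merge(v.begin(), mid, v.end());
}
```

Raising x trades more NVM writes for faster sorting; lowering it does the opposite.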
A Try about Profiling Jemalloc Code
A Short Introduction
jemalloc is a classic general-purpose memory allocator. It first came into use as the FreeBSD libc allocator in 2005, and since then it has found its way into numerous applications that rely on its predictable behavior; thousands of real-world applications already use it to speed up and harden allocation. jemalloc performs better than glibc malloc (a.k.a. ptmalloc) in multi-threaded situations and scales better.
API
- POSIX API
jemalloc provides the common functions for allocating and deallocating memory, including malloc, calloc, realloc, free, memalign, and aligned_alloc.
- Non-standard API

```c
void *mallocx(size_t size, int flags);
void *rallocx(void *ptr, size_t size, int flags);
size_t xallocx(void *ptr, size_t size, size_t extra, int flags);
size_t sallocx(void *ptr, int flags);
void dallocx(void *ptr, int flags);
void sdallocx(void *ptr, size_t size, int flags);
size_t nallocx(size_t size, int flags);
int mallctl(const char *name, void *oldp, size_t *oldlenp,
            void *newp, size_t newlen); /* controls allocator configuration */
int mallctlnametomib(const char *name, size_t *mibp, size_t *miblenp);
int mallctlbymib(const size_t *mib, size_t miblen, void *oldp,
                 size_t *oldlenp, void *newp, size_t newlen);
void malloc_stats_print(void (*write_cb)(void *, const char *),
                        void *cbopaque, const char *opts); /* for debugging */
size_t malloc_usable_size(const void *ptr);
void (*malloc_message)(void *cbopaque, const char *s);
```
Compiling
- You can build the jemalloc library in several ways; please refer to this link for more details.
- My procedure for integrating jemalloc into an application:
  - Use the `jemalloc-config` script in the `bin` directory to find your build root directory
  - Run `./configure` and then `make`
  - Use the `LD_PRELOAD` environment variable to add libjemalloc to the loader environment
  - Compile the executable like this:
    cc ex_stats_print.c -o ex_stats_print -I`jemalloc-config --includedir` -L`jemalloc-config --libdir` -Wl,-rpath,`jemalloc-config --libdir` -ljemalloc `jemalloc-config --libs`
- If you want to change some details and use some less common functions in the jemalloc library:
  - You can use the `MALLOC_CONF` environment variable to control/tune jemalloc, e.g. `export MALLOC_CONF="prof:true,lg_prof_sample:1,prof_accum:false,prof_prefix:jeprof.out"`; there is a link about how to use these options
  - Or you can change the configuration before building jemalloc; these configure options are enabled in libc's built-in jemalloc: `--enable-dss, --enable-experimental, --enable-fill, --enable-lazy-lock, --enable-munmap, --enable-stats, --enable-tcache, --enable-tls, --enable-utrace, and --enable-xmalloc`.
  - Attention: compiling jemalloc with `--with-jemalloc-prefix` adds an API prefix, which makes the library harder to use as a drop-in: you have to replace each `XXX` function with `je_XXX` and rebuild the whole program.
Implementation Details
How does jemalloc get memory from the OS?
- jemalloc uses `mmap` or `sbrk` to obtain memory from the lower-level API; users can choose a strategy that obtains memory via `mmap`, `sbrk`, or both.
- Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
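A minimal sketch of the mmap(2) path (illustrative; jemalloc's real chunk management is far more involved): an anonymous, private mapping yields page-aligned memory that the allocator then carves into chunks/extents on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Obtain raw, page-aligned memory directly from the OS, the way an
// allocator does before carving it up for user requests.
inline void *os_alloc(std::size_t len) {
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

inline void os_free(void *p, std::size_t len) { munmap(p, len); }
```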
Small size allocation and large size allocation
Other details
This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.
In addition to multiple arenas, this allocator supports thread-specific caching, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.
Memory is conceptually broken into extents. Extents are always aligned to multiples of the page size. This alignment makes it possible to find metadata for user objects quickly. User objects are broken into two categories according to size: small and large. Contiguous small objects comprise a slab, which resides within a single extent, whereas large objects each have their own extents backing them.
Small objects are managed in groups by slabs. Each slab maintains a bitmap to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other object size classes are multiples of the quantum, spaced such that there are four size classes for each doubling in size, which limits internal fragmentation to approximately 20% for all but the smallest size classes. Small size classes are smaller than four times the page size, and large size classes extend from four times the page size up to the largest size class that does not exceed PTRDIFF_MAX.
Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.
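The advice above can be followed like this (a sketch; 64 bytes is an assumed cacheline size, so query your platform if it matters): round the request up to a multiple of the cacheline size and ask for cacheline alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kCacheline = 64;  // assumption, not queried

inline std::size_t round_up_to_cacheline(std::size_t n) {
    return (n + kCacheline - 1) / kCacheline * kCacheline;
}

inline void *cacheline_alloc(std::size_t n) {
    // aligned_alloc (C++17) requires the size to be a multiple of alignment.
    return std::aligned_alloc(kCacheline, round_up_to_cacheline(n));
}
```

This guarantees that no two separately allocated objects share a cacheline, at the cost of some internal fragmentation.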
Paper Reading (Allocation of NVRAM)
Introduction to NVRAM
- Non-volatile byte-addressable memory
- Faster and more durable than flash
- Higher capacity, cheaper, and less energy consumption than DRAM
- Comparison between different kinds of storage medium.
Types | Latency |
---|---|
DRAM | 0.01us |
NVRAM | 0.05~0.1us |
Flash | 100us |
Disk | 10000us |
NVRAM and DRAM hybrid memory system
- Combine advantages of NVRAM and DRAM
- Use DRAM as a cache to accelerate memory access
- Use NVRAM to persist data
Some challenges in using NVM
- NVRAM may end up in inconsistent states if stores are reordered (compiler-level reordering, CPU out-of-order execution)
- Volatile CPU caches hold data that is not yet on NVRAM
- Incomplete updates need to be rolled back (restoring/recovering)
- NVRAM memory has to be dynamically managed (reducing writes)
- Pointers to NVRAM objects need to remain valid after a restart (how to recover allocation meta-data)
NVM_malloc
How to get memory from the system
- Use `mmap` to get a memory region in the application's virtual address space; instead of creating one giant file, multiple files are mapped into contiguous, anonymous mmap space
- All translations are done by the MMU without kernel involvement or other software overhead (bypass feature)
Two-step allocation
- Reserve -> (prepare data) -> Activate
- Reserve: take the resource from the heap memory region; after a crash it is recovered as "Free"
- Activate: establish the links from other objects; afterwards it is recovered as "In use"
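A toy model of the two-step protocol (illustration only; the names `reserve`/`activate` mirror the text, not NVM_malloc's actual API): a block is first reserved in volatile bookkeeping, so a crash before activation leaves no leak, because recovery still sees the block as Free.

```cpp
#include <cassert>

enum class State { Free, InUse };

struct NvmBlock {
    State persisted = State::Free;  // what a post-crash recovery would read

    void reserve() { /* DRAM-only bookkeeping; nothing persisted yet */ }
    void activate() { persisted = State::InUse; }  // atomic flush to NVM

    State recover() const { return persisted; }
};
```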
Faster execution
- Use DRAM as a cache for meta-data (VHeader: a volatile copy)
- The reserve step can be executed entirely in DRAM
- 64 B (cacheline-sized) headers ensure atomic flushes to NVRAM

```
DRAM         |  NVRAM
VHeader <----|---->  Header --> Block
             |
```
Pmem (a.k.a, NVML)
Makalu
PAllocator
Learning About Memory Allocation
- Basic POSIX Allocation API:
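A quick, illustrative tour of the basic POSIX/libc allocation calls:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Exercises malloc, realloc, calloc, posix_memalign, and free.
void *posix_alloc_demo() {
    void *p = std::malloc(100);                  // uninitialized bytes
    p = std::realloc(p, 200);                    // resize; block may move
    int *q = static_cast<int *>(std::calloc(10, sizeof(int)));  // zeroed
    assert(q[9] == 0);
    void *a = nullptr;
    posix_memalign(&a, 64, 256);                 // 64-byte-aligned block
    assert(reinterpret_cast<std::uintptr_t>(a) % 64 == 0);
    std::free(q);
    std::free(a);
    return p;  // caller frees
}
```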
Catalog
- PtMalloc (a.k.a, glibc malloc)
- TCMalloc (By Google)
- JeMalloc (By FreeBSD)
- Hoard (By MIT)
- To be continued.
Paper Reading (3)
Data Tiering in Heterogeneous Memory Systems
Heterogeneous memory architectures couple DRAM with high-capacity, low-cost, but lower-performance NVM; that is, NVM is paired with a smaller amount of DRAM used as a cache.
Contributions:
- High performance -> place some of the application's data structures in the small DRAM
- Different memory access patterns (sequential, random, pointer chasing)
- Pointer chasing: every operation incurs a cache miss when traversing a list
- X-Mem -> maps objects to data structures
- Automatically places data structures in the appropriate memory type
- According to the access pattern, it automates the task of mapping the memory regions of a data structure to DRAM or NVM to optimize performance
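The difference between streaming and pointer chasing can be illustrated like this (an assumed example, not code from the paper): an array sum streams through memory predictably, while a linked-list sum must load each next address from memory before it can continue, so every hop can miss the cache.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct Node { long val; Node *next; };

// Streaming access: addresses are predictable, prefetchers work well.
long sum_sequential(const std::vector<long> &v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

// Pointer chasing: each load depends on the previous one completing.
long sum_chasing(const Node *head) {
    long s = 0;
    for (const Node *n = head; n != nullptr; n = n->next)
        s += n->val;
    return s;
}
```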
Design (features)
- Use a unique tag (Hash, Tree, Buffer) for each data structure
- Bind NVM and DRAM to NUMA nodes
- Select the solution type:
 | Dependent | Independent |
---|---|---|
Sequential | N/A | Streaming |
Non-sequential | Pointer chasing | Random |
A Write-friendly Hashing Scheme for Non-volatile Memory Systems
This is a naive paper, and the idea is trivial
This paper introduces a new hashing scheme that reduces the frequency of write operations on NVM.
Existing Hashing Schemes:
- Chained Hash
- Linear Hash
- 2-choice Hash
- Cuckoo Hash
Hash Scheme Structure
```
| |---| |---| |---| |---| |---| |---| |---| |   Layer-1
 \ /   \ /         \ /   \ /
| |---------| |---------| |---------| |        Layer-2
   \ /           \ /
| |         | |                                Layer-3
...........                                    Layer-L
```

- Every two neighboring table items share one item in the next layer
- Cache coherence
- Remove some levels at the bottom -> shorter lookup paths
Atomic, SpinLock, Mutex, Thread_Lock, Which Is The Fastest One?
When we write a concurrent program, we have various methods to guarantee that the program is linearizable and serializable, meaning that different threads write or read the same area of memory or the same variable sequentially. But the efficiency of these methods varies widely. Below I introduce four methods: atomic, spin lock, mutex, and thread lock, and compare them to find out which one is the fastest (maybe not the best one).
I use `time ./exec` to estimate the running time of the programs.
Atomic
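The original snippet did not survive, so the following is a plausible reconstruction of the atomic test: two threads each perform repeated increments on a `std::atomic<long>` counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two threads increment a shared atomic counter without any lock.
long atomic_count(int iters) {
    std::atomic<long> counter{0};
    auto work = [&] {
        for (int i = 0; i < iters; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```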
Spin Lock
I use a C++ atomic variable to implement a spin-lock-like structure, because C++11 doesn't provide a spin lock implementation.
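Since the original code block was lost, here is a plausible version of the described structure: a spin lock built from `std::atomic_flag`, busy-waiting until the flag is acquired.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A minimal test-and-set spin lock built on std::atomic_flag.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }
};

long spin_count(int iters) {
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            lock.lock();
            ++counter;            // critical section
            lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

Under contention the busy-wait burns CPU cycles, which is consistent with it measuring slower than the single atomic increment above.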
Mutex
I use the C++11 mutex, not the mutex in the Unix API, for this test.
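A plausible reconstruction of the `std::mutex` variant: the same increment loop, with each increment guarded by a `std::lock_guard`.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Two threads increment a plain counter under a C++11 std::mutex.
long mutex_count(int iters) {
    std::mutex m;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            std::lock_guard<std::mutex> g(m);
            ++counter;
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```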
Thread lock
There is no thread-lock method in C++11, so I use `pthread_mutex_lock` and `pthread_mutex_unlock` from the pthread library instead. But the thread library in C++11 is based on pthreads, so I think this is a fair and valid test.
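A reconstruction of the pthread variant: the same loop, but guarded by `pthread_mutex_lock`/`pthread_mutex_unlock` rather than C++11 primitives.

```cpp
#include <cassert>
#include <pthread.h>
#include <thread>

// Two threads increment a counter guarded by a raw pthread mutex.
long pthread_count(int iters) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            pthread_mutex_lock(&m);
            ++counter;
            pthread_mutex_unlock(&m);
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    pthread_mutex_destroy(&m);
    return counter;
}
```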
Evaluation
I use two threads in one process, and every thread performs 10M increment operations.
methods | time |
---|---|
without any lock | 0.24s |
atomic | 1.08s |
spin lock | 2.90s |
mutex | 5.82s |
thread lock | 6.81s |
In a nutshell, making the variable atomic is the fastest method.
Here is the link to all the code.
All Kinds of Allocations
memalign
jemalloc
valloc
malloc / realloc / calloc
kmalloc / vmalloc (kernel level)