Teng's Blog

A coder and systemer



Some Useful Technologies About RDMA(2)

Posted on 2017-10-19

ODP

Tag

TOS

Dynamically Connected Transport

My Understanding about Hierarchy of Needs

Posted on 2017-10-09
Level 5
Level 4
Level 3
Level 2
Level 1

How to Build a Non-Volatile Memory DBMS (Survey)

Posted on 2017-10-05

This is a summary of a paper and a related talk from SIGMOD '17; the paper surveys the three main ways to exploit NVM's special features when building a DBMS.


NVM properties:

    1. Byte addressable: loads and stores, unlike SSD/HDD
    2. High random write throughput: orders of magnitude higher than SSD/HDD, with a smaller gap between sequential and random write throughput
    3. Read-write asymmetry & wear-leveling: writes may take longer to complete than reads, and excessive writes to a single NVM cell can destroy it

Evaluating NVM

  • Compare HDD/SSD against emulated NVM:
    • Tune DRAM latency upward to emulate NVM
    • CLWB (cache-line write back) instruction

Three levels in NVM DBMS:

1) Access interfaces
2) Storage manager
3) Execution engine

Allocation interface

  • Standard functions

    • Dynamic memory allocation
    • Meta-data management
    • Durability mechanism
      • Two-step implementation (see the sketch after this list)
        • First write back the dirty cache lines with CLWB
        • Then issue an SFENCE to make sure the changes are durable
    • Naming mechanism
      • Map NVM to a well-known base address; absolute pointer = base address + relative pointer
    • Recovery mechanism
      • Consider a crash before the chunk is linked into the data structure: the chunk is lost, since neither the allocator nor the application can reclaim the space.
  • File system interface

    • Device -> page cache -> app buffer becomes Device -> app buffer (the page cache is bypassed)
    • 64-byte cache-line atomicity
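To make the durability and naming mechanisms concrete, here is a minimal sketch of the flush-then-fence idiom, assuming x86 with the CLWB extension; persist and to_abs are my own helper names, not part of any allocator discussed here.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Write back every cache line covering [addr, addr + len), then fence so
// the stores are durable before the caller links the chunk in.
// Assumes 64-byte cache lines and a CPU with CLWB (compile with -mclwb).
void persist(const void *addr, std::size_t len) {
    auto p = reinterpret_cast<std::uintptr_t>(addr) & ~std::uintptr_t(63);
    for (; p < reinterpret_cast<std::uintptr_t>(addr) + len; p += 64)
        _mm_clwb(reinterpret_cast<void *>(p));
    _mm_sfence();
}

// Naming mechanism: keep relative pointers on NVM and rebase after restart.
inline void *to_abs(void *base, std::uintptr_t rel) {
    return static_cast<char *>(base) + rel;
}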

Storage Manager

  • Logging & Recovery
    • Write-ahead log (Let's Talk About Storage and Recovery Methods for Non-Volatile Memory Database Systems, SIGMOD '15)
      • Avoid duplicating data in the log and the checkpoint
        • logging the meta-data not the data updates
        • WAL + FS API vs. WA(meta-data)L + Allocation API
    • Write-behind log (Write-Behind Logging, VLDB '17)
      • pros
        1. Write-behind logging enables instant recovery
        2. Improves device utilization by reducing data duplication
        3. Extends the device lifetime
  • Data placement
    • Three-tiers: DRAM (meta-data + data) + NVM (log + data) + SSD (data)
    • DRAM -> hot data (cache), NVM -> warm data, SSD -> cold data

Access Methods

  • Read-write asymmetry & wear-leveling
    • Writes might take longer to complete compared to reads
    • Excessive writes to a single NVM cell can destroy it
  • Write-limited access methods
    • NVM-aware B+Tree, hash table
      • NVM-aware B+Tree (sketch below):
        • In leaf nodes, keep entries unsorted and use linear scan; this costs fewer writes than maintaining sorted order for binary search.
        • Use an unbalanced tree (fewer writes for rebalancing)
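A minimal sketch of the unsorted-leaf idea, using my own simplified layout rather than the paper's actual structure: inserts append a slot instead of shifting entries, and lookups scan linearly.

#include <cstdint>
#include <optional>

// Simplified write-limited leaf: an append-only slot array plus a count.
// Inserting writes one slot and one counter instead of shifting entries
// to maintain sorted order, at the cost of O(n) lookups.
struct Leaf {
    static constexpr int kSlots = 32;
    std::uint64_t keys[kSlots];
    std::uint64_t vals[kSlots];
    int count = 0;

    bool insert(std::uint64_t k, std::uint64_t v) {
        if (count == kSlots) return false;   // full: the caller splits
        keys[count] = k;                     // a single slot write
        vals[count] = v;
        ++count;                             // a persist barrier would go here
        return true;
    }

    std::optional<std::uint64_t> find(std::uint64_t k) const {
        for (int i = 0; i < count; ++i)      // linear scan over unsorted data
            if (keys[i] == k) return vals[i];
        return std::nullopt;
    }
};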

Execution engine

  • Plan executor
    • Reduce the number of writes in sorting and join algorithms by 1) adjusting a write-intensity knob and 2) using write-limited algorithms (sketch after this list).
      • Hybrid join/sort
      • Segment sort => adjust x (x% merge sort + (1-x)% selection sort)
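A hedged reconstruction of the segment-sort knob (my sketch, not the paper's code): a fraction x of the input goes through the sort-and-merge path, which is fast but writes intermediate results, while the rest is emitted by a selection-style pass that trades extra reads for fewer writes. A heap stands in for the selection pass here.

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

std::vector<int> segment_sort(std::vector<int> in, double x) {
    // Merge-sort path: sort the first x fraction of the input up front.
    std::size_t cut = static_cast<std::size_t>(x * in.size());
    std::vector<int> sorted(in.begin(), in.begin() + cut);
    std::sort(sorted.begin(), sorted.end());

    // Selection path: repeatedly extract the minimum of the unsorted tail.
    std::priority_queue<int, std::vector<int>, std::greater<int>>
        tail(in.begin() + cut, in.end());

    std::vector<int> out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < sorted.size() || !tail.empty()) {   // final two-way merge
        if (!tail.empty() && (i == sorted.size() || tail.top() < sorted[i])) {
            out.push_back(tail.top());
            tail.pop();
        } else {
            out.push_back(sorted[i++]);
        }
    }
    return out;
}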

A Try about Profiling Jemalloc Code

Posted on 2017-09-27

A Short Introduction

Jemalloc is a general-purpose implementation of malloc. It first came into use as the FreeBSD libc allocator in 2005, and since then it has found its way into numerous applications that rely on its predictable behavior. Jemalloc is used in thousands of real-world applications to speed up and harden allocation. It performs better than glibc malloc (a.k.a. ptmalloc) in multi-threaded situations and scales better.

API

  • POSIX API

jemalloc provides the common functions for allocating and deallocating memory, including malloc, calloc, realloc, free, memalign and aligned_alloc.

  • Non-standard API
    void *mallocx(size_t size, int flags);
    void *rallocx(void *ptr, size_t size, int flags);
    size_t xallocx(void *ptr, size_t size, size_t extra, int flags);
    size_t sallocx(void *ptr, int flags);
    void dallocx(void *ptr, int flags);
    void sdallocx(void *ptr, size_t size, int flags);
    size_t nallocx(size_t size, int flags);
    int mallctl(const char *name, void *oldp, size_t *oldlenp,
                void *newp, size_t newlen); // controls allocator configuration
    int mallctlnametomib(const char *name, size_t *mibp, size_t *miblenp);
    int mallctlbymib(const size_t *mib, size_t miblen, void *oldp,
                     size_t *oldlenp, void *newp, size_t newlen);
    void malloc_stats_print(void (*write_cb)(void *, const char *),
                            void *cbopaque, const char *opts); // for debugging
    size_t malloc_usable_size(const void *ptr);
    void (*malloc_message)(void *cbopaque, const char *s);
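A small usage sketch for the non-standard API (my example, assuming jemalloc was built with --enable-stats; "epoch" and "stats.allocated" are real mallctl names from the jemalloc manual):

#include <jemalloc/jemalloc.h>
#include <cstdint>
#include <cstdio>

int main() {
    void *p = malloc(1024);

    // Advance the epoch so the statistics counters are refreshed.
    std::uint64_t epoch = 1;
    std::size_t sz = sizeof(epoch);
    mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

    // Read the total number of bytes allocated by the application.
    std::size_t allocated, len = sizeof(allocated);
    if (mallctl("stats.allocated", &allocated, &len, nullptr, 0) == 0)
        std::printf("stats.allocated = %zu\n", allocated);

    malloc_stats_print(nullptr, nullptr, nullptr);  // dump stats to stderr
    free(p);
    return 0;
}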

Compiling

  • You can build the jemalloc library in several ways; please refer to this link for more details
  • My procedure for integrating jemalloc into an application:
    1. Use the jemalloc-config script in the bin directory to locate your build root directory
    2. Run ./configure and then run make
    3. Use the LD_PRELOAD environment variable to preload libjemalloc
    4. Compile the program like this: cc ex_stats_print.c -o ex_stats_print -I`jemalloc-config --includedir` -L`jemalloc-config --libdir` -Wl,-rpath,`jemalloc-config --libdir` -ljemalloc `jemalloc-config --libs`
  • If you want to change some details and use some of the less common functions in the jemalloc library:
    1. You can use the MALLOC_CONF environment variable to control/tune jemalloc, like this: export MALLOC_CONF="prof:true,lg_prof_sample:1,prof_accum:false,prof_prefix:jeprof.out"; there is a link about how to use these options
    2. Or you can change the configuration before building the jemalloc project; these configuration options are enabled in libc's built-in jemalloc: --enable-dss, --enable-experimental, --enable-fill, --enable-lazy-lock, --enable-munmap, --enable-stats, --enable-tcache, --enable-tls, --enable-utrace, and --enable-xmalloc.
    3. Attention: compiling jemalloc with --with-jemalloc-prefix adds an API prefix and makes the library less convenient to use, since you have to replace every XXX call with je_XXX and rework the whole program.

Implementation Details

  • How does jemalloc get memory from the OS?

    • Jemalloc uses mmap or sbrk to obtain memory from the lower-level API. The user can choose a different strategy to obtain memory from mmap, sbrk, or both of them.
    • Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
  • Small size allocation and large size allocation

  • Other details

This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.

In addition to multiple arenas, this allocator supports thread-specific caching, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.

Memory is conceptually broken into extents. Extents are always aligned to multiples of the page size. This alignment makes it possible to find metadata for user objects quickly. User objects are broken into two categories according to size: small and large. Contiguous small objects comprise a slab, which resides within a single extent, whereas large objects each have their own extents backing them.

Small objects are managed in groups by slabs. Each slab maintains a bitmap to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other object size classes are multiples of the quantum, spaced such that there are four size classes for each doubling in size, which limits internal fragmentation to approximately 20% for all but the smallest size classes. Small size classes are smaller than four times the page size, and large size classes extend from four times the page size up to the largest size class that does not exceed PTRDIFF_MAX.

Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.
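Following that advice, a minimal sketch of cache-line-padded allocation (cacheline_alloc is my helper name, assuming 64-byte cache lines):

#include <cstdlib>

// Round the request up to a multiple of the cache-line size and align it,
// so objects handed to different threads never share a line.
void *cacheline_alloc(std::size_t size) {
    const std::size_t line = 64;                  // assumed line size
    std::size_t rounded = (size + line - 1) & ~(line - 1);
    void *p = nullptr;
    if (posix_memalign(&p, line, rounded) != 0) return nullptr;
    return p;                                     // release with free()
}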

Reference

  • Jemalloc official website
  • Jemalloc in Github
  • FreeBSD man page about Jemalloc
  • Malloc Lab

Paper Reading (Allocation of NVRAM)

Posted on 2017-09-18

Introduction about NVRAM

  • Non-volatile byte-addressable memory
  • Faster and more durable than flash
  • Higher capacity, cheaper, and less energy consumption than DRAM
  • Comparison between different kinds of storage media:

    Type    Latency
    DRAM    0.01 us
    NVRAM   0.05~0.1 us
    Flash   100 us
    Disk    10000 us

NVRAM and DRAM hybrid memory system

  • Combine advantages of NVRAM and DRAM
  • Using DRAM as Cache to accelerate memory access
  • Using NVRAM to persist data

Some challenges about using NVM

  • NVRAM may have inconsistent states if stores are reordered (sketch after this list)
    • Compiler-Level Reordering, CPU Out-Of-Order Execution
  • Volatile CPU caches hold data that is not on NVRAM
  • Incomplete updates need to be rolled back (restoring/recovering)
  • NVRAM memory has to be dynamically managed (reducing writes)
  • Pointers to NVRAM objects need to be valid after a restart (How to recover meta-data of allocation)
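To make the reordering hazard concrete, a minimal sketch (my example, not from the papers): without ordering, the valid flag can become durable before the data it guards.

#include <atomic>

struct Record {
    int data;
    std::atomic<bool> valid{false};
};

// Release ordering keeps the data store before the flag store. On real
// NVRAM you would additionally flush both cache lines (e.g. with CLWB)
// and fence, or recovery may see valid == true with stale data.
void publish(Record &r, int v) {
    r.data = v;
    r.valid.store(true, std::memory_order_release);
}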

NVM_malloc

  • How to get memory from system

    • Use mmap to obtain a memory region in the application's virtual address space
    • Instead of creating one giant file, multiple files are mapped into a contiguous (initially anonymous) mmap space
    • All translations are done by the MMU without kernel involvement or other software overhead (bypass feature)
  • Two-step allocation (sketch after the diagram below)

    • Reserve -> (Prepare Data) -> Activate
    • Reserve: take a block from the heap memory region; after a crash it is recovered as "free"
    • Activate: establish links from other objects; after a crash it is recovered as "in use"
  • Faster execution

    • Use DRAM as a cache for meta-data (VHeader: a volatile copy)
    • The reserve step can be executed entirely in DRAM
    • 64 B (cache-line-sized) headers ensure atomic flushes to NVRAM
DRAM        |  NVRAM
VHeader <---|--> Header --> Block
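A hedged sketch of the two-step protocol (the names and layout are mine, not NVM_malloc's actual code): reserve touches only the volatile copy of the header, and activate persists the final state, so a crash between the two steps simply recovers the block as free.

#include <cstdint>

enum class State : std::uint8_t { FREE, INITIALIZING, IN_USE };

// Cache-line-sized header so the state flip can be flushed atomically.
struct alignas(64) Header {
    State state;
    std::uint64_t size;
    std::uint64_t next_rel;   // relative pointer, rebased after restart
};

// Step 1: reserve. Only the volatile DRAM copy (VHeader) changes, so a
// crash here leaves the NVRAM header FREE and nothing is leaked.
Header *reserve(Header *vheader) {
    vheader->state = State::INITIALIZING;
    return vheader;
}

// Step 2: activate. Copy the prepared header to NVRAM, mark it IN_USE,
// and persist it (flush + fence, as sketched earlier on this page).
void activate(Header *nv, const Header *v) {
    *nv = *v;
    nv->state = State::IN_USE;
    // persist(nv, sizeof *nv);
}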

Pmem (a.k.a, NVML)

Makalu

PAllocator

Learning About Memory Allocation

Posted on 2017-09-18
  • Basic POSIX allocation API (a usage sketch follows this list):
void *malloc(size_t size);
void *calloc(size_t number, size_t size);
int posix_memalign(void **ptr, size_t alignment, size_t size);
void *aligned_alloc(size_t alignment, size_t size);
void *realloc(void *ptr, size_t size);
void free(void *ptr);
  • Memory Allocator Introduction (In Chinese)

  • Jemalloc Learning (In Chinese)

  • Catalog

    • PtMalloc (a.k.a, glibc malloc)
    • TCMalloc (By Google)
    • JeMalloc (By FreeBSD)
    • Hoard (By MIT)
    • To be continued.
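A small usage sketch for the POSIX API above (my example; note the C11 rule that aligned_alloc's size should be a multiple of the alignment):

#include <cstdio>
#include <cstdlib>

int main() {
    // Two ways to get a 64-byte-aligned 4 KiB buffer.
    void *p = nullptr;
    if (posix_memalign(&p, 64, 4096) != 0) return 1;

    void *q = std::aligned_alloc(64, 4096);  // size is a multiple of alignment

    std::printf("%p %p\n", p, q);
    std::free(p);
    std::free(q);
    return 0;
}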

Paper Reading (3)

Posted on 2017-08-31

Data Tiering in Heterogeneous Memory Systems

  • Heterogeneous memory architectures couple DRAM with high-capacity, low-cost, but lower-performance NVM, pairing the NVM with a smaller amount of DRAM that serves as a cache.

  • Contributions:

    1. High performance -> place some of the application's data structures in the small DRAM
    2. Different memory access patterns (sequential, random, pointer chasing)
      • Pointer chasing: every operation incurs a cache miss when traversing a list
    3. X-Mem -> map objects to data structures
      • Automatically place data structures in the appropriate memory type
    4. Profiles the access pattern and automates the task of mapping that data structure's memory regions to DRAM or NVM to optimize performance
  • Design (features)

    1. Use a unique tag (Hash, Tree, Buffer) for each data structure
    2. Bind NVM and DRAM to NUMA nodes (sketch after this list)
    3. Select the solution type:
                      Dependent         Independent
      Sequential      N/A               Streaming
      Non-sequential  Pointer chasing   Random
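A hedged sketch of the tag-to-NUMA-node binding as I read it (tag_alloc and the node numbers are my own illustration, not X-Mem's real API; numa_alloc_onnode comes from libnuma, so link with -lnuma):

#include <numa.h>
#include <cstddef>
#include <string>
#include <unordered_map>

// Assumed layout for illustration: node 0 backs DRAM, node 1 backs NVM.
static std::unordered_map<std::string, int> tag_to_node = {
    {"Hash", 0}, {"Tree", 0}, {"Buffer", 1},
};

// Allocate a data structure's memory on the node its tag is bound to.
void *tag_alloc(const std::string &tag, std::size_t size) {
    auto it = tag_to_node.find(tag);
    int node = (it == tag_to_node.end()) ? 0 : it->second;
    return numa_alloc_onnode(size, node);   // free with numa_free(p, size)
}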

A Write-friendly Hashing Scheme for Non-volatile Memory Systems

  • This is a naive paper, and the idea is trivial

  • This paper introduces a new hashing scheme that reduces the frequency of write operations to NVM

  • Existing Hashing Schemes:

    • Chained Hash
    • Linear Hash
    • 2-choice Hash
    • Cuckoo Hash
  • Hash Scheme Structure

    | |---| |---| |---| |---| |---| |---| |---| | Layer-1
    \ / \ / \ / \ /
    | |---------| |---------| |---------| | Layer-2
    \ / \ /
    \ / \ /
    | | | | Layer-3
    ........... Layer-L
    • Every two adjacent table slots share one slot in the next layer (sketch after this list)
    • Cache coherence
    • Removing some of the bottom levels shortens the search path
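My reconstruction of the layered scheme above (a sketch, not the paper's code): layer k+1 has half the slots of layer k, and two adjacent slots share one parent slot, so slot i at one layer falls through to slot i/2 at the next.

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct LevelHash {
    std::vector<std::vector<std::optional<std::string>>> layers;

    // Assumes top_slots > 0; each deeper layer has half the slots.
    LevelHash(std::size_t top_slots, std::size_t depth) {
        for (std::size_t d = 0; d < depth && top_slots > 0; ++d, top_slots /= 2)
            layers.emplace_back(top_slots);
    }

    bool insert(const std::string &key) {
        std::size_t slot = std::hash<std::string>{}(key) % layers[0].size();
        for (auto &layer : layers) {          // descend through shared slots
            slot %= layer.size();             // guard against odd layer sizes
            if (!layer[slot]) { layer[slot] = key; return true; }
            slot /= 2;                        // two neighbors share this parent
        }
        return false;                         // every candidate slot was taken
    }
};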

Atomic, SpinLock, Mutex, Thread_Lock, Which Is The Fastest One?

Posted on 2017-08-31

When we write a concurrent program, we have various methods to guarantee that the program is linearizable and serializable, meaning that different threads write or read the same memory area or variable sequentially. But the efficiency of these methods differs greatly. Below I introduce four methods: atomic, spin lock, mutex and thread lock, and compare them to analyze which one is the fastest (which may not make it the best one).

I use time ./exec to estimate the running time of programs.

Atomic

#include <atomic>
#include <iostream>
#include <thread>
using namespace std;

atomic<int> counter(0);
const int thread_num = 2;
const int limit_iter = 10000000;

void f() {
    for (int i = 0; i < limit_iter; i++) {
        counter++;
    }
}

int main() {
    thread a[thread_num];
    for (int i = 0; i < thread_num; i++) {
        a[i] = thread(f);
    }
    for (int i = 0; i < thread_num; i++) {
        a[i].join();
    }
    cout << counter << endl;
}

Spin Lock

I use a C++ atomic flag to implement a spin-lock-like structure, because C++11 doesn't provide an implementation of a spin lock.

#include <atomic>
#include <iostream>
#include <thread>
using namespace std;

atomic_flag locked = ATOMIC_FLAG_INIT;

inline void lock() {
    // Spin until test_and_set returns false, i.e. this thread took the flag.
    while (locked.test_and_set(memory_order_acquire))
        ;
}

inline void unlock() {
    locked.clear(memory_order_release);
}

int counter(0);
const int thread_num = 2;
const int limit_iter = 10000000;

void f() {
    for (int i = 0; i < limit_iter; i++) {
        lock();
        counter++;
        unlock();
    }
}

int main() {
    thread a[thread_num];
    for (int i = 0; i < thread_num; i++) {
        a[i] = thread(f);
    }
    for (int i = 0; i < thread_num; i++) {
        a[i].join();
    }
    cout << counter << endl;
}

Mutex

I use the C++11 mutex, not the Unix API mutex, for this test.

#include <iostream>
#include <mutex>
#include <thread>
using namespace std;

int counter(0);
mutex mu;
const int thread_num = 2;
const int limit_iter = 10000000;

void f() {
    for (int i = 0; i < limit_iter; i++) {
        // lock_guard<mutex> lock(mu); // RAII alternative with the same effect
        mu.lock();
        counter++;
        mu.unlock();
    }
}

int main() {
    thread a[thread_num];
    for (int i = 0; i < thread_num; i++) {
        a[i] = thread(f);
    }
    for (int i = 0; i < thread_num; i++) {
        a[i].join();
    }
    cout << counter << endl;
}

Thread Lock

There is no separate "thread lock" in C++11, so I use pthread_mutex_lock and pthread_mutex_unlock from the pthread library instead. Since the thread library in C++11 is built on top of pthreads, I consider this a fair and valid test.

41
#include <atomic>
#include <iostream>
#include <thread>
#include <pthread.h>
using namespace std;

atomic<int> counter(0);
const int thread_num = 2;
const int limit_iter = 10000000;
pthread_mutex_t locked;

void f() {
    for (int i = 0; i < limit_iter; i++) {
        pthread_mutex_lock(&locked);
        counter++;
        pthread_mutex_unlock(&locked);
    }
}

int main() {
    pthread_mutex_init(&locked, NULL);
    thread a[thread_num];
    for (int i = 0; i < thread_num; i++) {
        a[i] = thread(f);
    }
    for (int i = 0; i < thread_num; i++) {
        a[i].join();
    }
    cout << counter << endl;
}

Evaluation

I use two threads in one process, and each thread performs 10M increment operations.

Method            Time
without any lock  0.24 s
atomic            1.08 s
spin lock         2.90 s
mutex             5.82 s
thread lock       6.81 s
  • In a nutshell, making the variable atomic is the fastest method.

  • This is the link of all codes.

All Kinds of Allocations

Posted on 2017-08-30
  • memalign

  • jemalloc

  • valloc

  • malloc / realloc / calloc

  • kmalloc / vmalloc (kernel level)
