Teng's Blog

A coder and systemer



Some Useful Technologies About RDMA (1)

Posted on 2017-08-21
  • SGL (scatter / gather list)

An array of S/G elements embedded in a Work Request (WR). Depending on the opcode used, it either gathers data from multiple buffers and sends it as a single stream, or takes a single incoming stream and scatters it into multiple buffers. A combined libibverbs sketch of these concepts appears after the list below.

  • Unsignaled completions (posting WRs without requesting a completion for every one; requires the QP to be created with sq_sig_all = 0)

  • Batching multiple WRs into a single ibv_post_send call

  • Fence (a fenced WR waits for earlier RDMA reads and atomics on the same QP to complete before it starts)
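
Below is a minimal libibverbs sketch tying these concepts together: a two-element SGL, an unsignaled WR, two WRs batched into a single post, and a fenced WR. It assumes an already-connected RC QP qp created with sq_sig_all = 0, a registered memory region mr, and two buffers buf_a/buf_b; these names are placeholders and connection setup is omitted.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: qp, mr, buf_a and buf_b are assumed to be set up elsewhere. */
int post_batched_send(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *buf_a, void *buf_b, uint32_t len)
{
    /* A 2-element scatter/gather list: both buffers go out as one message. */
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)buf_a, .length = len, .lkey = mr->lkey },
        { .addr = (uintptr_t)buf_b, .length = len, .lkey = mr->lkey },
    };

    struct ibv_send_wr wr[2], *bad_wr = NULL;
    memset(wr, 0, sizeof(wr));

    /* WR 0: unsignaled (no IBV_SEND_SIGNALED), so it produces no CQE. */
    wr[0].wr_id   = 0;
    wr[0].sg_list = sgl;
    wr[0].num_sge = 2;
    wr[0].opcode  = IBV_WR_SEND;
    wr[0].next    = &wr[1];          /* batching: link the next WR */

    /* WR 1: signaled and fenced; the fence makes it wait for earlier RDMA
       reads/atomics on this QP to complete before it starts executing. */
    wr[1].wr_id      = 1;
    wr[1].sg_list    = &sgl[0];
    wr[1].num_sge    = 1;
    wr[1].opcode     = IBV_WR_SEND;
    wr[1].send_flags = IBV_SEND_SIGNALED | IBV_SEND_FENCE;

    /* One ibv_post_send (one doorbell) covers the whole chain. */
    return ibv_post_send(qp, wr, &bad_wr);
}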

Paper Reading (2)

Posted on 2017-08-21

NOVA

  • Conclusion:
    Adapts conventional log-structured file system techniques to exploit the fast random access that NVMs provide.

    • separate logs for each inode to improve concurrency
    • stores file data outside the log to minimize log size and reduce garbage collection costs
    • evaluated on a hardware-based NVMM emulator
  • NOVA’s logs provide metadata, data, and mmap atomicity and focus on simplicity and reliability, keeping complex metadata structures in DRAM to accelerate lookup operations

  • Contributions:

    1. It extends existing log-structured file system techniques to exploit the characteristics of hybrid memory systems.
    2. It describes atomic mmap, a simplified interface for exposing NVMM directly to applications with a strong consistency guarantee.
    3. It demonstrates that NOVA outperforms existing journaling, shadow paging, and log-structured file systems running on hybrid memory systems.
    4. It shows that NOVA provides these benefits across a range of proposed NVMM technologies.
  • Challenges:

    1. Performance: NVMM has much lower latency than disk -> Direct Access (DAX) avoids the extra copy between NVMM and DRAM.
    2. Write reordering: modern processors' cache hierarchies may reorder store operations -> no guarantee of when an update actually reaches NVMM.
      Ordering must be enforced by explicitly flushing cache lines (the clflush instruction) and issuing memory barriers (fences).
    3. Atomicity: POSIX-style file systems require atomic operations (e.g. rename) -> the hardware only guarantees 8-byte atomic writes to NVMM.
      • three common techniques for providing atomicity: journaling, shadow paging, and log-structuring
  • Three observations:

    1. Data structures that support fast search are complex and hard to implement correctly in NVMM
    2. Much of the complexity of log cleaning stems from the need for contiguous free space; NVMM's cheap random access makes this unnecessary
    3. Using a single log limits concurrency
  • Design Decisions:

    1. Keep logs in NVMM and indexes in DRAM (radix tree)
      • persistent log is simple and efficient
      • volatile tree structure has no consistency overhead
    2. Give each inode its own log: allows concurrent updates to different files without synchronization (high concurrency/scalability)
      • Per-CPU NVMM free list, journal and inode table
      • concurrent transactions and (de)allocation
    3. Use logging and lightweight journaling for complex atomic updates, so every update is atomic (a sketch of this append-and-commit protocol appears at the end of these NOVA notes)
      • append new entries to the log
      • update the log tail pointer to commit the update (avoiding duplicate writes)
    4. Implement the log as a singly linked list (4 KB NVMM pages)

      • allocating log space is easy
      • log cleaning can be done at a fine (page) granularity
      • reclaiming log pages requires only a few pointer assignments
    5. Do not log file data

    6. Use copy-on-write for file data
      • the log contains only metadata
      • so logs stay short
  • Fast garbage collection

    • Log is a linked list and only contains meta-data -> fast GC deletes dead log pages from list (No copying)
    • Two complementary techniques:
      1. Fast GC: unlink dead log pages from the linked list
      2. Thorough GC: if less than 50% of a log's entries are live, copy the live entries into a new log and atomically replace the old one -> only metadata is copied
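
To make design decision 3 concrete, here is a minimal sketch of the append-then-commit idea (not NOVA's actual code): the new log entry is written and flushed first, and only then is the 8-byte log tail updated, so a crash can never expose a half-written entry. The clflush/sfence intrinsics stand in for whatever persistence primitives the platform provides.

#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

struct log_entry { uint64_t ino; uint64_t size; /* metadata only */ };

static void pmem_flush(const void *p, uint64_t len) {
    for (uint64_t i = 0; i < len; i += 64)         /* flush each cache line */
        _mm_clflush((const char *)p + i);
    _mm_sfence();                                  /* order flushes before the next step */
}

/* tail is an 8-byte index stored in NVMM; updating it is the commit point. */
void log_append(struct log_entry *log, uint64_t *tail, const struct log_entry *e)
{
    log[*tail] = *e;                               /* 1. write the new entry        */
    pmem_flush(&log[*tail], sizeof *e);            /* 2. make the entry durable     */
    uint64_t new_tail = *tail + 1;
    *tail = new_tail;                              /* 3. 8-byte atomic tail update  */
    pmem_flush(tail, sizeof *tail);                /* 4. persist the commit point   */
}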

NV-Tree

  • A consistent and cache-optimized B+Tree variant that reduces CPU cacheline flushes

  • Motivations:

    • Enforcing consistency greatly amplifies CPU cacheline invalidations and flushes, which significantly increases cache misses
    • the consistency cost due to flushes mostly comes from flushing shifted entries in order to keep leaf nodes (LN) sorted
    • not all data needs to be kept consistent for the entire data structure to remain consistent
  • Overview:

    • Design
      1. Selectively enforce data consistency (only leaf nodes are enforced consistent; internal nodes can be rebuilt after a crash)
      2. Keep entries in LN unsorted, so inserts are append-only and no shifted entries need flushing (see the sketch after this list)
      3. Organize internal nodes (IN) in a cache-optimized format
    • Node layout: internal nodes (IN) -> parents of leaf nodes (PLN) -> leaf nodes (LN)
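
A rough sketch of the unsorted-leaf insert (illustrative structures, not the paper's actual layout): a new entry is appended to the next free slot and flushed, and only then is the entry counter bumped, so no existing entries are ever shifted or re-flushed.

#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define LN_FANOUT 56

/* Leaf node (LN) with unsorted entries; nr_entries acts as the commit point. */
struct ln_entry { uint64_t key; uint64_t val; };
struct leaf_node {
    uint64_t        nr_entries;
    struct ln_entry slots[LN_FANOUT];
};

static void flush_line(const void *p) { _mm_clflush(p); _mm_sfence(); }

int ln_insert(struct leaf_node *ln, uint64_t key, uint64_t val)
{
    if (ln->nr_entries == LN_FANOUT)
        return -1;                          /* caller must split the leaf */
    struct ln_entry *e = &ln->slots[ln->nr_entries];
    e->key = key;                           /* append: no shifting of entries */
    e->val = val;
    flush_line(e);                          /* persist the new slot first */
    ln->nr_entries++;                       /* then publish it */
    flush_line(&ln->nr_entries);
    return 0;
}

/* Lookup scans the unsorted slots linearly; they sit in a few cache lines. */
int ln_lookup(const struct leaf_node *ln, uint64_t key, uint64_t *val)
{
    for (uint64_t i = 0; i < ln->nr_entries; i++)
        if (ln->slots[i].key == key) { *val = ln->slots[i].val; return 0; }
    return -1;
}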

Some Key Concepts in Transaction System

Posted on 2017-08-16

Journaling File System

A journaling file system is a file system that keeps track of changes not yet committed to the file system’s main part by recording the intentions of such changes in a data structure known as a “journal”, which is usually a circular log.

There are two kinds of journals, physical and logical, for maintaining the log in a file system. Alternatives to journaling include (1) soft updates, (2) log-structured file systems, and (3) copy-on-write file systems.

In my view, a journaling file system is a log-based file system: like a transaction system, it replays the log when recovering. The cost is a double-write penalty, since every change is written twice (once to the journal, once to the file itself).
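
A toy write-ahead sketch (an in-memory "disk" and made-up helpers, not any real file system's code) that makes the double write visible: the change goes to the journal first and is only then applied in place; recovery would replay journal records whose commit marker is set.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NUM_BLOCKS 1024

/* Toy "disk": an in-memory block array, so the sketch is self-contained. */
static char disk[NUM_BLOCKS][BLOCK_SIZE];

static void write_block(uint64_t blkno, const void *data, uint64_t len) {
    memcpy(disk[blkno], data, len);
}
static void flush_disk(void) { /* stands in for a real durability barrier */ }

struct journal_rec {
    uint64_t target_blk;            /* where the data finally belongs  */
    uint64_t committed;             /* set last: the commit marker     */
    char     data[512];
};

/* Write-ahead: journal first, then in place, hence the double write. */
static void journaled_write(uint64_t journal_blk, uint64_t target_blk,
                            const void *data, uint64_t len)
{
    struct journal_rec rec = { .target_blk = target_blk, .committed = 0 };
    memcpy(rec.data, data, len);

    write_block(journal_blk, &rec, sizeof rec);   /* 1st write: journal   */
    flush_disk();
    rec.committed = 1;
    write_block(journal_blk, &rec, sizeof rec);   /* commit the record    */
    flush_disk();

    write_block(target_blk, data, len);           /* 2nd write: in place  */
    flush_disk();
    /* Once checkpointed, the journal slot can be reused. */
}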

Shadow Paging

In computer science, shadow paging is a technique for providing atomicity and durability (two of the ACID properties) in database systems. A page in this context refers to a unit of physical storage (probably on a hard disk), typically of the order of 1 to 64 KiB.
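
A toy copy-on-write sketch of the idea (hypothetical structures, not any real DBMS's code): the modified page is written to a fresh location, and the update becomes visible only when the page-table pointer is switched to the new copy, which serves as the commit point.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* The "current" page table maps logical page numbers to physical pages. */
struct page_table { char *pages[64]; };

/* Update logical page lpn via shadowing: never modify the old copy in place. */
void shadow_update(struct page_table *pt, int lpn,
                   const void *new_data, size_t len)
{
    char *shadow = malloc(PAGE_SIZE);          /* fresh physical page          */
    memcpy(shadow, pt->pages[lpn], PAGE_SIZE); /* copy the old contents        */
    memcpy(shadow, new_data, len);             /* apply the change to the copy */

    char *old = pt->pages[lpn];
    pt->pages[lpn] = shadow;                   /* pointer switch = commit point */
    free(old);                                 /* old version reclaimed after   */
                                               /* commit (or kept for undo)     */
}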

Soft Update

Soft updates is an approach to maintaining file system meta-data integrity in the event of a crash or power outage. Soft updates work by tracking and enforcing dependencies among updates to file system meta-data. Soft updates are an alternative to the more commonly used approach of journaling file systems.

Soft Updates increases file system efficiency by ordering the disk writes so that every write creates a consistent state.
This allows all writes to happen asynchronously and to avoid seeks wherever possible.

There is not much documentation about soft updates, and I do not yet understand the details of the procedure. Emmm, I will complete this chapter in the future.

Log-structured file systems

A log-structured file system (LSFS) is a file system in which data and metadata are written sequentially to a circular buffer, called a log. The concept is similar to journaling, but the difference is that in an LSFS the data itself is stored in the log: the file system appends data to the head of the log and treats its storage as a circular buffer.

  • Pros and Cons:
    1. Write throughput on optical and magnetic disks is improved because they can be batched into large sequential runs and costly seeks are kept to a minimum.
    2. Writes create multiple, chronologically-advancing versions of both file data and meta-data. Some implementations make these old file versions nameable and accessible, a feature sometimes called time-travel or snapshotting. This is very similar to a versioning file system.
    3. Recovery from crashes is simpler. Upon its next mount, the file system does not need to walk all its data structures to fix any inconsistencies, but can reconstruct its state from the last consistent point in the log.

Reference

  • [1]. Wikipedia
  • [2]. Xu, Jian, and Steven Swanson. "NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories." USENIX Conference on File and Storage Technologies (FAST) (2016): 323-338.
  • [3]. Josephson, William, et al. "DFS: A File System for Virtualized Flash Storage." USENIX Conference on File and Storage Technologies (FAST) 6.3 (2010).
  • [4]. Balmau, Oana Maria, et al. "TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores." USENIX ATC '17 (2017).
  • [5]. Ganger, Gregory R., et al. "Soft Updates: A Solution to the Metadata Update Problem in File Systems." ACM Transactions on Computer Systems 18.2 (2000): 127-153.

Paper Reading (1)

Posted on 2017-08-14

HiKV

  • A persistent key-value store that builds a hybrid index in hybrid memory
    1. good scalability
    2. guaranteed crash consistency
  • Hybrid memory (DRAM and NVM) systems are byte-addressable, providing similar performance for sequential and random access
  • Issue:
    1. Neither a hash index nor a B+Tree alone can efficiently support all operations (Put/Get/Update/Delete/Scan).
    2. Scan is becoming important, but it is slow on a hash index because entries are not sorted.
  • Features:
    1. A hybrid index, consisting of a hash index in NVM and a B+Tree index in DRAM, which supports rich KV operations
    2. Concurrency schemes => high scalability
    3. Ordered-write consistency + a specific hash design => atomic writes => crash consistency with fewer NVM writes
  • Operation procedure for Put (see the sketch after this list):
    1. write the KV item to NVM
    2. write the newly-added index entry into the hash index (NVM)
    3. insert the Put request into the updating queue in DRAM
    4. a back-end thread picks up the request and updates the B+Tree in the background
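
A minimal, self-contained sketch of this Put handoff (illustrative only; the arrays below merely stand in for HiKV's NVM hash index and DRAM B+Tree): the foreground thread performs steps 1-3 and a background thread performs step 4.

#include <stdint.h>
#include <pthread.h>

#define QCAP 128

struct req { uint64_t key; uint64_t val; };

static struct req queue[QCAP];              /* updating queue (no overflow handling here) */
static int qhead, qtail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

static uint64_t nvm_hash_index[QCAP];       /* stand-in for the NVM hash index */
static uint64_t dram_btree[QCAP];           /* stand-in for the DRAM B+Tree    */

/* Foreground Put: steps 1-3 of the procedure above. */
static void hikv_put(uint64_t key, uint64_t val)
{
    nvm_hash_index[key % QCAP] = val;        /* 1+2: item and hash entry to NVM */
    pthread_mutex_lock(&qlock);              /* 3: enqueue the index update     */
    queue[qtail++ % QCAP] = (struct req){ key, val };
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* Background thread: step 4, applies updates to the ordered DRAM index. */
static void *bg_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qhead == qtail)
            pthread_cond_wait(&qcond, &qlock);
        struct req r = queue[qhead++ % QCAP];
        pthread_mutex_unlock(&qlock);
        dram_btree[r.key % QCAP] = r.val;    /* ordered-index update, off the critical path */
    }
    return NULL;
}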

TRIAD

  • a persistent, single-machine key-value store
  • based on the Log-Structured Merge tree (LSM tree)
  • Three components: memory component (C) / disk component (L) / commit log (CL)
  • Three motivations:
    1. Data-skew unawareness
    2. Premature and iterative compaction: data skew increases the probability that multiple L files contain the same key
    3. Duplicate writes: the same data is written to disk twice, CL => L and C => L
  • Two operations:
    1. Flushing: when C or CL is full, the memory component is written to disk as an SSTable
    2. Compaction: SSTables with overlapping key ranges are merged in the background
  • Fits different workloads
    1. Skewed => Triad-Mem => Flushing and Compaction
    2. In-Between => Triad-Disk => Compaction
    3. Uniform => Triad-Log => Flushing
  • Optimization techniques (a toy sketch of the first one follows this list):
    1. Keep hot keys in memory / flush only cold keys / keep hot keys in the commit log (hot keys => many duplicate updates)
    2. Defer file compaction until the overlap between files becomes large enough, so that many duplicate keys are merged at once
    3. Convert the CL into a special L0 SSTable (CL-SSTable), avoiding a separate flush of the memory component; commit logs are indexed and used in a manner similar to SSTables
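
A toy sketch of the first technique (hypothetical names and threshold, not TRIAD's implementation): at flush time, keys with many accumulated updates are treated as hot and stay in the memtable and commit log, while only cold keys are written out to the new SSTable.

#include <stdint.h>
#include <stdio.h>

#define MEMTABLE_CAP 1024
#define HOT_THRESHOLD 8          /* illustrative cutoff, not TRIAD's policy */

struct entry { uint64_t key; uint64_t val; uint32_t update_count; int valid; };

static struct entry memtable[MEMTABLE_CAP];

/* Stand-in for writing one entry into the SSTable being built. */
static void sstable_append(const struct entry *e)
{
    printf("flush cold key %llu\n", (unsigned long long)e->key);
}

/* Flush only cold keys; hot keys stay in memory (and in the commit log),
   so they keep absorbing updates instead of being rewritten level by level. */
static void triad_flush(void)
{
    for (int i = 0; i < MEMTABLE_CAP; i++) {
        struct entry *e = &memtable[i];
        if (!e->valid)
            continue;
        if (e->update_count >= HOT_THRESHOLD)
            continue;                 /* hot: keep in memtable + commit log */
        sstable_append(e);            /* cold: goes to the new SSTable      */
        e->valid = 0;
    }
}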

Soft Updates Made Simple and Fast on Non-volatile Memory

NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems

Octopus

  • Design
    1. Shared persistent memory pool => reduce data copy
    2. self-identified meta-data RPC => reduce response latency
    3. Client-active data I/O => rebalance CPU/network overhead
    4. Collect-dispatch transactions for efficient concurrency control (distributed locking via RDMA CAS)
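
The last point refers to distributed locks taken with a one-sided RDMA compare-and-swap. A minimal libibverbs sketch of such a CAS (assuming an RC QP qp, a registered local 8-byte buffer local_buf/mr, and the remote lock word's address and rkey obtained out of band; not Octopus's actual code):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Try to acquire a remote 8-byte lock word with RDMA CAS: if it still holds
   `expected` (e.g. 0 = unlocked), swap in `desired` (e.g. our node id).
   The previous value lands in local_buf, so after polling the CQ the caller
   can tell whether the CAS won. Connection setup is omitted. */
int rdma_cas_lock(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *local_buf,
                  uint64_t remote_addr, uint32_t rkey,
                  uint64_t expected, uint64_t desired)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* receives the previous value */
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;  /* 8-byte aligned lock word */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = expected;     /* compare value            */
    wr.wr.atomic.swap        = desired;      /* value to install         */

    return ibv_post_send(qp, &wr, &bad_wr);
    /* The lock was acquired iff *local_buf == expected after completion. */
}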

REWIND: Recovery Write-Ahead System for In-Memory Non-Volatile Data-Structures

Review About CC (Concurrency Control)

Posted on 2017-08-03

As we know, concurrency control plays a vital role in many fields of computer science, especially operating systems, multiprocessors, and databases. A good concurrency control strategy ensures correct results when multiple processes access the same data concurrently.

If every process executed serially, we would not need any mechanism to control simultaneous access, but we would also lose the overlap that reduces execution time. And if concurrent processes are not coordinated, we face the A-B-A problem, the lost update problem, the dirty read problem, and so on.

There are three main categories of concurrency control mechanisms:

  1. Optimistic: assume every transaction will succeed and run it without blocking. If the transaction turns out to conflict, abort it and re-execute it. If aborts are frequent, it is easy to see that the re-execution overhead becomes large.

  2. Pessimistic: block a transaction's operations until it is certain that they cannot cause a conflict (and thus an abort).

  3. Semi-optimistic: a hybrid that blocks in some situations and delays conflict checks to commit time in others.

There are four major methods to do concurrency control, and these methods may overlap or be combined in some cases:

  1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned to the data. Access of a transaction to a data item (database object) locked by another transaction may be blocked (depending on lock type and access operation type) until lock release. A minimal two-phase-locking sketch appears after this list.
  2. Serialization graph checking (also called Serializability, or Conflict, or Precedence graph checking) - Checking for cycles in the schedule’s graph and breaking them by aborts.
  3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or checking access to data by timestamp order.
  4. Commitment ordering (or Commit ordering; CO) - Controlling or checking transactions’ chronological order of commit events to be compatible with their respective precedence order.
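
As a concrete instance of method 1, here is a minimal strict two-phase-locking sketch in C with pthreads (hypothetical, not taken from any particular system): locks are only acquired while the transaction runs (growing phase) and are all released together at commit (shrinking phase). Deadlock handling is omitted.

#include <pthread.h>
#include <stdint.h>

#define MAX_LOCKS 16

/* One lock per data item; a real system would use a lock manager with
   shared/exclusive modes, deadlock detection, etc. */
struct txn {
    pthread_mutex_t *held[MAX_LOCKS];
    int              nheld;
};

/* Growing phase: acquire and remember; never release mid-transaction. */
static void txn_lock(struct txn *t, pthread_mutex_t *item_lock)
{
    pthread_mutex_lock(item_lock);
    t->held[t->nheld++] = item_lock;
}

/* Shrinking phase: release everything only at commit (or abort). */
static void txn_commit(struct txn *t)
{
    /* ... make the transaction's writes durable/visible here ... */
    for (int i = t->nheld - 1; i >= 0; i--)
        pthread_mutex_unlock(t->held[i]);
    t->nheld = 0;
}

/* Example: a transfer between two accounts, each guarded by its own lock. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int64_t balance_a = 100, balance_b = 0;

static void transfer(int64_t amount)
{
    struct txn t = { .nheld = 0 };
    txn_lock(&t, &lock_a);       /* growing phase   */
    txn_lock(&t, &lock_b);
    balance_a -= amount;
    balance_b += amount;
    txn_commit(&t);              /* shrinking phase */
}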

In addition, three other methods can be used to support the methods above:

  1. Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations of several last relevant versions (of each object) depending on scheduling method. A toy sketch follows this list.
  2. Index concurrency control - Synchronizing access operations to indexes, rather than to user data. Specialized methods provide substantial performance gains.
  3. Private workspace model (Deferred update) - Each transaction maintains a private workspace for its accessed data, and its changed data become visible outside the transaction only upon its commit (e.g., Weikum and Vossen 2001). This model provides a different concurrency control behavior with benefits in many cases.
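
A toy MVCC sketch (illustrative only): every write prepends a new version tagged with its commit timestamp, and a reader with snapshot timestamp ts sees the newest version whose commit timestamp is at most ts, so readers never block writers. Garbage collection of old versions is omitted.

#include <stdint.h>
#include <stdlib.h>

struct version {
    uint64_t        commit_ts;
    int64_t         value;
    struct version *prev;      /* older versions */
};

struct mvcc_record { struct version *latest; };

/* Writer: prepend a new version; ongoing readers keep seeing their snapshot. */
void mvcc_write(struct mvcc_record *rec, uint64_t commit_ts, int64_t value)
{
    struct version *v = malloc(sizeof(*v));
    v->commit_ts = commit_ts;
    v->value     = value;
    v->prev      = rec->latest;
    rec->latest  = v;
}

/* Reader: walk back to the newest version visible at snapshot ts. */
int mvcc_read(const struct mvcc_record *rec, uint64_t ts, int64_t *out)
{
    for (const struct version *v = rec->latest; v; v = v->prev) {
        if (v->commit_ts <= ts) { *out = v->value; return 0; }
    }
    return -1;   /* no version visible at this snapshot */
}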

In conclusion, concurrency control is an important technology for preserving the ACID properties of transactions. Different application scenarios call for different concurrency control methods to keep the system in a correct state.

Tips About Using CMake

Posted on 2017-07-25

I have used Makefiles for more than 3 years, and I like them because they are easy to use. But I have to learn CMake (which I dislike) because my project is based on OpenCV, which uses CMake for its build. So I will write down some tips/hints for myself:

  1. Basic functions

    cmake_minimum_required(VERSION 2.8) # minimum required CMake version
    project(XXX)                        # project name
    add_executable(XXX main.cpp)        # build executable XXX from the listed sources (main.cpp is a placeholder)
  2. Multi-files & multi-dirs

    # TODO
  3. Define compile options

    # TODO
  4. Add dependencies and testers

    # TODO
  5. Miscellaneous

    # TODO

The First Step to CUDA

Posted on 2017-07-08

Introduction

As we all know, GPUs are especially suitable for massively data-parallel computation. GPUs now play a vital role in machine learning (e.g. deep learning), HPC, and graphics. We can use CUDA, a widely used library developed by Nvidia that accesses the GPU hardware directly, to solve many complex computational problems more efficiently than on a CPU. But in order to write good CUDA programs, there are some important concepts we need to understand, such as blocks, on-chip caches, shared memory on the GPU, and so on. Below, I will list these key concepts and give some explanations and code examples.

A Tour of Different NVIDIA GPUs

// TODO: some comparison between different Nvidia GPU models.

A hello world example

This is a CUDA program that prints "Hello world" from different threads in different blocks. The grid is split two-dimensionally into blocks, and each block is split three-dimensionally into threads.

#include <stdio.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// helper functions and utilities to work with CUDA (from the CUDA samples)
#include <helper_functions.h>
#include <helper_cuda.h>

#ifndef MAX
#define MAX(a,b) (a > b ? a : b)
#endif

__global__ void testKernel()
{
    // Flatten the 2-D block index and the 3-D thread index into scalars.
    printf("[%d, %d]: Hello world\n",
           blockIdx.y * gridDim.x + blockIdx.x,
           threadIdx.z * blockDim.x * blockDim.y
               + threadIdx.y * blockDim.x + threadIdx.x);
}

int main(int argc, char **argv)
{
    int devID;
    cudaDeviceProp props;

    // This will pick the best possible CUDA capable device
    devID = findCudaDevice(argc, (const char **)argv);

    // Get GPU information
    checkCudaErrors(cudaGetDevice(&devID));
    checkCudaErrors(cudaGetDeviceProperties(&props, devID));
    printf("Device %d: \"%s\" with Compute %d.%d capability\n",
           devID, props.name, props.major, props.minor);

    printf("printf() is called. Output:\n\n");

    // Kernel configuration, where a two-dimensional grid and
    // three-dimensional blocks are configured.
    dim3 dimGrid(2, 2);
    dim3 dimBlock(2, 2, 2);
    testKernel<<<dimGrid, dimBlock>>>();
    cudaDeviceSynchronize();

    // cudaDeviceReset causes the driver to clean up all state. While
    // not mandatory in normal operation, it is good practice. It is also
    // needed to ensure correct operation when the application is being
    // profiled. Calling cudaDeviceReset causes all profile data to be
    // flushed before the application exits
    cudaDeviceReset();
    return EXIT_SUCCESS;
}

What’s a Kernel?

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel must be declared with the __global__ qualifier, which marks the function as one executed on the device by CUDA threads. To launch a kernel from host code, you use the <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Get More example source codes

You can run cuda-install-samples-8.0.sh [installing path] on the command line to get a lot of useful example source code for teaching yourself how to make your CUDA programs better.

How to config and look up your RDMA device

Posted on 2017-06-29

RDMA technology is built on InfiniBand NICs, and the OFED package is the basic software stack for RDMA. I will show some ways to inspect and test your InfiniBand NIC and make sure your RDMA library works well.

  1. Use lspci to check that your IB NIC is visible on the PCI bus.

    lspci | grep Mellanox
  2. Start your opensm service (the InfiniBand subnet manager; the fabric needs a running subnet manager).

    /etc/init.d/opensmd start[restart...]
  3. Get information of the whole Infiniband network.

    ibhosts
    iblinkinfo
    ibswitches
  4. Useful tools to check your IB device's status.

    ibstat
    ibstatus
    ibv_devinfo
    ibv_devices
    ifconfig ibX
    cat /sys/class/infiniband/mlx4_0/*
  5. Tools to benchmark the throughput (Mpps), latency (us), and bandwidth (Gb/s) of the RNIC.

    ib_[read|write|atomic|send]_[bw|lat]
  6. pingpong test

    ib_[rc|uc|srq|task|ud|xsrq|cc]_pingpong

Hello World

Posted on 2017-06-26

Hello everyone! I'm Teng Ma. I am currently a PhD student at Tsinghua University. My main topics of interest include distributed databases, key-value stores, stream computing, and RDMA technology.
