Teng's Blog

A coder and systemer



Use QEMU to Install a Fedora VM and Set Up a Virtual RDMA NIC

Posted on 2019-03-24

Goal: I need to add live migration support for the PVRDMA device in QEMU, so I need to build QEMU from source and make sure migration over RDMA is supported.

First step: build QEMU from source. Note that only the x86_64 targets need to be compiled. To get a graphical display, enable SDL support; the required headers can be installed with apt install libsdl2-dev.

git clone git@github.com:qemu/qemu.git
cd qemu && mkdir build && cd build
../configure --target-list=x86_64-linux-user,x86_64-softmmu --enable-sdl
make -j
sudo make install

Second step: build a Fedora VM from an installation image. Choose the candidate image carefully; a Workstation edition of the OS image is recommended. After downloading, attach the ISO to the VM and run a standard installation.

sudo proxychains wget https://mirrors.tuna.tsinghua.edu.cn/fedora/releases/27/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso
sudo qemu-img create ubuntu.img 10G
sudo qemu-system-x86_64 --enable-kvm -m 4096 -smp 4 ubuntu.img \
    -drive file=Fedora-Workstation-Live-x86_64-27-1.6.iso,media=cdrom,index=1 \
    -drive file=ks.iso,media=cdrom,index=2 \
    -append "console=ttyS0 ks=cdrom:/ks.cfg" \
    -kernel tmp/isolinux/vmlinuz -initrd tmp/isolinux/initrd.img

TODO

The Usage of Intel RTM

Posted on 2019-03-04

The emerging HTM (Hardware Transactional Memory) technology can reduce the extra synchronization overhead (i.e., locking) when executing transactions. In recent years, Intel has made its hardware solution, RTM (Restricted Transactional Memory), available as a mature commercial product. Three main instructions support HTM: XBEGIN, XEND, and XABORT. The Intel Intrinsics library natively exposes these three interfaces in C/C++.

To start/end a transactional code region, the programmer uses XBEGIN/XEND. The XABORT instruction allows programmers to abort the execution of an RTM region explicitly. XABORT takes an 8-bit immediate argument that is loaded into the EAX register and becomes available to software following an RTM abort.

A simple example looks like the following code.

// requires #include <immintrin.h> and compiling with -mrtm
while (1) { // keep retrying until the transaction commits
    unsigned status = _xbegin(); // returns _XBEGIN_STARTED when the transaction starts
    if (status == _XBEGIN_STARTED) {
        (*g)++;  // non-atomic increment of a shared global variable, protected by the transaction
        _xend(); // commit the transaction
        break;   // break on success
    } else {
        // the transaction aborted and status holds the abort code; simply retry
        // (a production fallback would usually take a lock after repeated aborts)
    }
}

First, check whether your CPU supports RTM: run cat /proc/cpuinfo, and the answer is yes if rtm appears in the flags field. To compile RTM code, include the C header immintrin.h and add the extra compiler flag -mrtm.
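
Besides reading /proc/cpuinfo, the RTM feature bit can also be queried programmatically. Below is a minimal sketch of my own (not from the original post), assuming a GCC/Clang toolchain on x86 that provides <cpuid.h>; RTM is reported by CPUID leaf 7, sub-leaf 0, EBX bit 11.

#include <cpuid.h>   // GCC/Clang helper for the CPUID instruction (x86 only)
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // CPUID leaf 7, sub-leaf 0: EBX bit 11 is the RTM feature flag
    bool rtm = __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 11));
    std::printf("RTM %s\n", rtm ? "supported" : "not supported");
    return 0;
}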

A full description of RTM’s APIs is available here

Garbage Collection

Posted on 2019-02-07

Abstraction of Garbage Collection

  • Manual Memory Management: C/C++
    • Problems: 1. Memory Leak; 2. Double Free; 3. Use-after-frees
  • What’s Garbage: Objects that won’t be used again

  • Types of Garbage Collectors:

    • Incremental vs. stop-the-world:
      • An incremental collector is one that runs concurrently with the program.
      • A stop-the-world collector pauses program execution to look for garbage.
    • Compacting vs. non-compacting:
      • A compacting collector is one that moves objects around in memory.
      • A non-compacting collector is one that leaves all objects where they originated.

Algorithms

  • The simplest framework: reference counting
    • Each object should have a reference count (refcount)
    • Creating/deleting a reference increases/decreases the refcount
    • When the refcount drops to zero, the object can be reclaimed
    • Caveat: a reference cycle is a set of objects that cyclically refer to one another
    • Because all of these objects are still referenced, they all have nonzero refcounts and are never reclaimed
    • An implementation in C++: shared_ptr (see the sketch after this list)
  • Mark-and-Sweep: The Intuition
    • Objects reachable from the root set are live; objects not reachable are garbage
    • Marking Phase: Find reachable objects. + Sweeping phase: Reclaim free memory
    • Four States: Marked, Enqueued, Unknown, Deallocated
  • Baker’s Algorithm

    Move all of the root set to the enqueued list.
    While the enqueued list is not empty:
        Move the first object from the enqueued list to the marked list.
        For each unknown object it references, add that object to the enqueued list.
    At this point, everything reachable is in marked and everything unreachable is in unknown.
    Concatenate the unknown and deallocated lists.
        This deallocates all garbage in O(1).
    Move everything from the marked list to the unknown list.
        This can be done in O(1).
        It indicates that objects must again be proven reachable on the next scan.
  • Stop-and-Copy:

    • Improve Locality => compaction
    • Increasing allocation speed:
      • General free-list allocation: 10-20 assembly instructions
      • Stack-style (bump-pointer) allocation: one instruction
    • Advantages:
      1. Implementation simplicity (compared to mark-and-sweep).
      2. Fast memory allocation; using OS-level tricks, can allocate in a single assembly instruction.
      3. Excellent locality; depth-first ordering of copied objects places similar objects near each other.
    • Disadvantages:
      1. Requires half of memory to be free at all times.
      2. Collection time proportional to number of bytes used by objects.
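
To make the reference-cycle caveat from the refcounting bullets above concrete, here is a small C++ sketch of my own (not part of the original notes): two shared_ptr-managed nodes that point at each other are never reclaimed, whereas switching the member to weak_ptr would break the cycle.

#include <memory>

struct Node {
    std::shared_ptr<Node> next; // strong reference: participates in the refcount
    // using std::weak_ptr<Node> here instead would break the cycle below
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b; // b's refcount becomes 2
    b->next = a; // a's refcount becomes 2
    // When a and b go out of scope, each object still holds a strong reference
    // to the other, so both refcounts stay at 1 and neither is ever reclaimed.
    return 0;
}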

Hybrid Approaches

  • Motto: most objects die young (short-lived objects)
  • Several generation layers:
    1. Eden: stop-and-copy strategy
    2. Survivor space: objects are moved here when Eden fills up
    3. Tenured space: objects that survive long enough are promoted here

TO BE CONTINUED

RDMA Troubleshooting Tools

Posted on 2018-12-19
  • Monitor Traffic

    1. add options mlx4_core log_num_mgm_entry_size=-1 to /etc/modprobe.d/mlx4.conf
    2. restart the driver via /etc/init.d/openibd restart
    3. use ibdump
  • Check the global pause (flow control) configuration

    • ethtool -a eth2 to show it, or ethtool -A eth2 to change it
  • Diagnose

    • ibdiagnet
  • Check Connections and Switch

    • ibswitches
    • ibdev2netdev
    • iblinkinfo

Coroutines Make Systems More Efficient

Posted on 2018-11-04

Stackless v.s. Stackful

SeqLock: An Optimistic and Write-Preferring Lock

Posted on 2018-11-04

Foreword

Locks are a crucial component of computer systems, including databases, file systems, parallel computation systems, etc. A lock prevents different threads from accessing the same resource simultaneously and can be seen as a mutual-exclusion concurrency-control policy. Concurrent accesses can be divided into three kinds of conflict: write-write, write-read, and read-read (which is not a real conflict). A read-write lock (a.k.a. exclusive lock & shared lock) keeps multiple writers away from the same resource and naturally supports single-writer, multi-reader (SWMR) access. Read-write locks also come in two main flavors: read-preferring and write-preferring. Unfortunately, this lock mechanism always assumes a conflict may happen (it is pessimistic); it is unfriendly to the processes involved, since it blocks them and can even lead to deadlock.

SeqLock: An Optimistic Lock

The sequential lock (seqlock) is composed of a sequence counter (the sequencer) for readers and a write lock that prevents multiple writers from entering at the same time.
The mechanism is also known as a retry lock: readers read the sequence number before and after reading the shared data. If the sequence number is odd on either occasion, a writer had taken the lock while the data was being read and it may have changed. If the two sequence numbers differ, a writer changed the data while it was being read. In either case readers simply retry (in a loop) until they read the same even sequence number before and after.

The following paragraph is from Wikipedia.

The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read-write lock. Also, writers do not wait for readers, whereas with traditional read-write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read-write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).

The pseudocode is below. (Note that multiple writers are allowed here.)

Lock write_lock

Write_Lock():
    write_lock.lock()
    atomic_increment(SN)

Write_Unlock():
    atomic_increment(SN)
    write_lock.unlock()

Reader_Lock():
    do
        ret = atomic_read(SN)
    while ret % 2 != 0        // spin while SN is odd, i.e., a write is in progress
    start_sn = ret

Reader_Unlock():
    return start_sn != atomic_read(SN)   // true means a writer intervened: retry

// Usage: to read the shared data
do
    Reader_Lock()
    // do the read
while Reader_Unlock()

A Simple Implementation

  • Here is a toy implementation. In my code, I use STL atomic variables to implement the spin-lock (test-and-set) and the sequencer; a rough sketch follows this list. In a distributed system with RDMA, RDMA atomic verbs can replace the compiler's CAS and FAA operations to build a distributed seqlock.
  • Another simple implementation can be found here
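
As a rough single-node illustration of the first bullet above, here is a sketch of a seqlock built from STL atomics; this is my own example (not the author's linked code) and the memory orderings are kept deliberately simple.

#include <atomic>
#include <cstdint>

struct SeqLock {
    std::atomic<uint64_t> seq{0};        // sequence number: odd while a write is in progress
    std::atomic<bool> write_lock{false}; // test-and-set spin-lock that serializes writers

    void write_begin() {
        while (write_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
        seq.fetch_add(1, std::memory_order_release);   // SN becomes odd
    }
    void write_end() {
        seq.fetch_add(1, std::memory_order_release);   // SN becomes even again
        write_lock.store(false, std::memory_order_release);
    }
    uint64_t read_begin() const {
        uint64_t s;
        do { s = seq.load(std::memory_order_acquire); } while (s % 2 != 0); // wait out writers
        return s;
    }
    bool read_retry(uint64_t start) const {
        return seq.load(std::memory_order_acquire) != start; // true: a writer intervened, retry
    }
};

// Usage: readers loop until the sequence number is the same even value before and after the read.
// SeqLock lk; uint64_t s;
// do { s = lk.read_begin(); /* read the shared data */ } while (lk.read_retry(s));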

Concepts of SkipList: Implementation, Concurrency and Lock-free

Posted on 2018-11-04

Concept

  • Ordered Data Structure
  • It combines multiple layers of linked lists; the number of nodes per layer declines exponentially by a factor K, where K = 1 / r and r is the promotion probability (threshold) of the skiplist (a C++ sketch of the node layout and level generation follows this list)
  • Each node has N pointers to its successors at the different levels; N is known as the degree of the node
  • Support: O(log n) CRUD (Create/Read/Update/Delete)
  • Find Operation

    Find(K):
        ptr = head
        for i in max_level - 1 down to 0:
            // at level i, stop at the last node whose key is less than K
            while ptr->next_list[i] != null and ptr->next_list[i]->key < K:
                ptr = ptr->next_list[i]
            updates[i] = ptr              // remember the predecessor at each level
        ptr = ptr->next_list[0]
        if ptr != null and ptr->key == K:
            return ptr
        return nullptr                    // also return the updates array if an insert follows
  • Insert Operation

    Insert(K, V):
        ptr, updates = Find(K)
        if ptr == nullptr:                        // key not present yet
            new_degree = skiplist_rand()          // randomly pick the new node's level
            new_node = NewSkipListNode(K, V, new_degree)
            for i in 0 to new_degree - 1:         // splice the node into each of its levels
                new_node->next_list[i] = updates[i]->next_list[i]
                updates[i]->next_list[i] = new_node
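
As a concrete illustration of the node layout and the exponential level distribution described in the Concept bullets, here is a small C++ sketch of my own, with the promotion probability r fixed at 1/2 as an assumption (so K = 2):

#include <cstdlib>
#include <vector>

constexpr int kMaxLevel = 16;  // assumed cap on the number of layers
constexpr double kR = 0.5;     // promotion probability r; roughly n * r^i nodes reach level i

struct SkipListNode {
    int key;
    int value;
    std::vector<SkipListNode*> next_list; // one forward pointer per level (the node's degree)
    SkipListNode(int k, int v, int level) : key(k), value(v), next_list(level, nullptr) {}
};

// Randomly pick a node's degree: each extra level is kept with probability r.
int skiplist_rand() {
    int level = 1;
    while (level < kMaxLevel && (std::rand() / (double)RAND_MAX) < kR)
        ++level;
    return level;
}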

Skiplist

  • A fine blog post describes more details.

Various Implementations

  • My Implementation is here
  • Google nbds is here

Research Areas

  • NVM based data structure
    • How to design skiplist with less write overhead
    • How to persist skiplist in NVM
    • How to recover skiplist in NVM
  • Distributed SkipList
    • Partition/Across-Layers?
    • How to gain a higher concurrency
  • Write Optimization
    • Write-Optimized Skip Lists [PODS ’17]
      • Cache the higher levels of the data structure (closest to the root)
      • TO DO ….

Issues with Concurrent Access

  • Write-Write Conflict
  • Dirty Read
  • Fine-grained lock for every node
  • COW (copy-on-write) incurs O(n) copying overhead

Lock-Free

  • Lock-Free Linked Lists and Skip Lists [PODC ’04]
  • Practical lock-freedom

Distributed Concurrency Control

Posted on 2018-10-14

Read Phenomena

  • Dirty Read
  • Non-repeatable Read
  • Phantom Read

Isolation Level

  • Serializable
  • Repeatable Read
  • Read Committed
  • Read Uncommitted

CMU 15-645 Distributed OLTP

Posted on 2018-10-05

OLTP v.s. OLAP

  • On-line Transaction Processing
    • Short-lived TX
    • Small footprint
    • Repetitive operation
  • On-line Analytical processing
    • Long running queries
    • Complex joins
    • Exploratory queries

Architectures:

  • Shared Everything -> Shared Memory -> Shared Disk -> Shared Nothing
    • Shared Nothing (nodes communicate only via the network): easy to increase capacity, but hard to ensure consistency at the same time

Data Transparency

  • Partitioned (Sharding) v.s. Replicated
  • Horizontal Partitioning (Round-robin/Hash/Range)

Distributed Transactions

  • Centralized v.s. Decentralized

Distributed CC

  • Need to allow multiple TXs to execute simultaneously across multiple nodes
    • Factors: Replication, Network Communication Overhead, Node Failure, Clock skew
    • Atomic Commit Protocol
    • 2PC, 3PC, Paxos, Raft, ZAB

CMU 15-645 Recovery

Posted on 2018-10-05
  • When discussing recovery, we have to mention the ARIES system (full name: Algorithms for Recovery and Isolation Exploiting Semantics)

LSN (Log Sequence Number)

  • Types
    | Name | Where | Definition |
    | --- | --- | --- |
    | flushedLSN | RAM | Last LSN flushed to the log on disk |
    | pageLSN | @page | Newest update to the page |
    | recLSN | @page | Oldest update to the page since it was last flushed |
    | lastLSN | Tj | Last action of TX Tj |
    | MasterRecord | Disk | LSN of the latest checkpoint |