Teng's Blog

A coder and systemer



[GSoC 03] Decouple Data Structures from Libcds

Posted on 2018-06-01

Replace Components in Libcds

Build up the source trees

TODO

[GSoC 02] Deployment and Configuration

Posted on 2018-05-27

Hardware Environment

  1. My personal laptop, a 15-inch MacBook Pro (2015).
  2. A Dell PowerEdge server running Ubuntu 14.04, equipped with dual 8-core CPUs (Intel Xeon E5-2640 v2, 2.0 GHz), 96 GB of memory, and a Mellanox ConnectX-3 InfiniBand NIC (MT27500, 40 Gbps).

Software Environment

  1. Boost Library 1.60
    • On macOS I compile and install it from source; on Ubuntu I install libboost directly via the apt-get tool.
  2. Hwloc Library (hierarchical view of the machine)
  3. GCC 5.4
    • The server ships with the old GCC 4.8, so C++17 cannot be used in that environment.
    • I upgrade GCC and use update-alternatives to switch to the new version.
  4. CMake
    • I update my CMake from a third-party source list.

HPX deployment

  1. Add my_test/, ctags, and CMakeFiles/ to .gitignore so that unnecessary files are ignored.
  2. Create a new directory named my_test, run cmake in it, and point it at the hwloc and libboost installations.
  3. Run make -j16 to build and generate the static library.
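The steps above boil down to something like the following command sketch (the install paths are placeholders for my local setup, not canonical values):

```shell
# Out-of-source HPX build; BOOST_ROOT and HWLOC_ROOT point at local installs.
mkdir my_test && cd my_test
cmake .. -DBOOST_ROOT=/path/to/boost_1_60_0 \
         -DHWLOC_ROOT=/path/to/hwloc
make -j16   # build and generate the static library
```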

Libcds

  1. Create a test directory and run CMake (add the -DCDS_DISABLE_128BIT_ATOMIC flag for a 64-bit build).
  2. Run make and make install for libcds.
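As a command sketch (an illustrative out-of-source build; the directory name is a placeholder, and I assume the flag is a boolean CMake option as its name suggests):

```shell
# Out-of-source libcds build with 128-bit atomics disabled.
mkdir build && cd build
cmake .. -DCDS_DISABLE_128BIT_ATOMIC=ON
make -j16
sudo make install
```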

Libcds overview

The core functions are all in the cds directory.

cds/
├── algo
│   └── flat_combining
├── compiler
│   ├── clang
│   ├── gcc
│   ├── icl
│   └── vc
├── container
│   ├── details
│   ├── impl
│   ├── striped_map
│   └── striped_set
├── details
├── gc
│   └── details
├── intrusive
│   ├── details
│   ├── impl
│   └── striped_set
├── lock
├── memory
├── opt
├── os
├── sync
├── threading
├── urcu
└── user_setup

In algo, there are some basic algorithms used by the data structures, such as flat combining (a technique in which one combiner thread acquires a global lock and applies the operations published by all the other threads).

cds/algo
├── atomic.h
├── backoff_strategy.h
├── base.h
├── bitop.h
├── bit_reversal.h
├── elimination.h
├── elimination_opt.h
├── elimination_tls.h
├── flat_combining
│   ├── defs.h
│   ├── kernel.h
│   └── wait_strategy.h
├── flat_combining.h
├── int_algo.h
└── split_bitstring.h

In most cases, libcds uses RCU (read-copy-update) or HP (hazard pointers) to handle safe memory reclamation under concurrency. In summary, the main container types include:

  1. Queue
  2. Stack
  3. HashMap/HashSet
  4. List/KVList
  5. Binary Tree/AVL Tree
    There are specialized algorithms (e.g., MichaelList, the Feldman method) that make these different data structures lock-free.

Test Libcds

  1. Create a new directory and run make test to build and run the benchmarks.
  2. Some of the test cases do not pass, so I opened an issue on the libcds GitHub page. (https://github.com/khizmax/libcds/issues/121)
  3. First, I tried changing the allocator (the default is the glibc allocator) to TCMalloc. Then I built libcds with ASan instrumentation and reran the tests.
  4. I also ran some examples to test the correctness of these algorithms.

[GSoC 01] HPX and libcds

Posted on 2018-05-15


I am glad that my proposal was accepted by the STE||AR Group, and I am eager to contribute to this community. To keep myself focused on my GSoC project, I have decided to write a series of articles giving a brief introduction to it.

First, here is my understanding of HPX: it is a library that provides everything you want for HPC (high-performance computing). HPX includes the most commonly used parallel algorithms and containers; in particular, counterparts of some standard STL containers are available in HPX.

STL containers such as vectors, maps, and sets are not thread safe. One cannot safely add or remove elements in one thread while iterating or adding/removing in another without potentially catastrophic consequences (usually segmentation faults leading to eventual program failure), so these containers must be made to support concurrent access. Some work has begun on implementing concurrent structures in HPX: a concurrent unordered map with a reader/writer lock and a partial implementation of a concurrent vector exist, but they are incomplete, lack unit tests, and need to be unified into an hpx::concurrent namespace. Moreover, these lock-based implementations are inefficient; we hope to use lock-free techniques to complete the library.

A number of libraries implementing thread-safe (sometimes lock-free) containers already exist that can be used for ideas, and where the code uses a Boost-compatible license it can be integrated into HPX. The aim of the project is to collect as much information and as many implementations of thread-safe containers as possible, and to create or integrate them into the HPX library.

Fortunately, we found an open-source library named libcds, and we obtained the author's permission to migrate these containers into our namespace.

libcds is a data structure library including queues, stacks, priority queues, binary trees, lists, hash tables, hash maps, etc. All these data structures are lock-free, and most have more than one implementation based on different algorithms.

[References]

  1. http://stellar-group.org/libraries/hpx/
  2. https://github.com/khizmax/libcds
  3. https://www.boost.org/
  4. http://stellar-group.org/2018/04/gsoc-2018-participants-announced/

Algorithms Behind Modern Storage Systems [Reproduced]

Posted on 2018-05-11

The original article is here

A brief introduction:

As we know, there are two kinds of workloads in a database: write-heavy and read-heavy. For a read-heavy workload, a B-Tree/B+Tree is read-optimized: it is 1) sorted, 2) self-balancing, 3) logarithmic in lookup time, and 4) mutable. But modification operations (update, insert, delete) are expensive, because we need to merge/split nodes or move items within nodes. For a write-heavy workload, the LSM-tree (log-structured merge-tree) is an immutable, disk-resident, write-optimized data structure. It turns most modification operations into log entries stored in SSTables (Sorted String Tables). The upper layer of an LSM-tree is an in-memory skip list or binary search tree that accelerates reads of recent data; when its size reaches a threshold, it is flushed to disk and later merged by compaction. An LSM-tree is good for writes and scans, but less friendly to point lookups, since a single read may have to consult several SSTables.

In order to batch these operations together and write them to disk in a single IO, we also need a WAL (write-ahead log) to guarantee atomicity and durability.

In conclusion, no data structure conveniently achieves high performance for both write and read operations, so we need to choose our data structures carefully.

Survey of NVM Allocator

Posted on 2018-04-20

Goals

  • A good allocator should optimize for:

    • Memory usage (fragmentation/overhead)
    • Performance (throughput)
    • Scalability (ideally linear)
    • Correctness
    • Robustness
  • A good NVM allocator should additionally provide:

    • Fast recovery (low time overhead)
    • Programmability (easy to use)
    • TODO…

Recovery

  • Essentially, a bitmap is the general method for maintaining the persistent status of an allocator: each bit indicates whether the corresponding block is in use.

Defragmentation

State of the art

  • nvm_malloc (VLDB ’16)

    • Small sizes => segregated fit; larger sizes => best fit
    • Three-tier allocation strategy
  • Makalu (OOPSLA ’16)

    • Keeps volatile pointers valid by mapping the pool at a fixed address (the MAP_FIXED flag of mmap)
  • NVML (Intel pmem)

    • Concurrency support => thread-local caches
    • Fragmentation minimization => a chunk (256 KB) is divided into blocks of 8× the class size
  • PAllocator

    • Small/Large/Huge allocators

Comparison

Gprof, Valgrind, Gperf, etc

Posted on 2018-04-10

There are several analysis tools that measure our programs and suggest performance improvements.
Here we introduce two commonly used tools: gprof and valgrind.

  • Gprof
    Gprof is a profiler that collects and arranges statistics about your program. It inserts instrumentation at the entry and exit of each function and samples the program counter to collect runtime information. When the program finishes, it creates a file named gmon.out, which the gprof tool can then analyze to report the time spent in each function.
  1. Add -pg to your compile flags (e.g., in CMake or a Makefile)
  2. Run your program exe
  3. Type gprof exe on the command line
    There is another easy-to-use tool, gprof2dot, to visualize the results: gprof exe | gprof2dot | dot -Tpdf -o result.pdf generates a PDF displaying the call graph. gprof2dot is a Python tool, and you can install it with pip.
  • Valgrind
    Valgrind was originally designed as a free memory-debugging tool for Linux on x86, but it has since evolved into a generic framework for building dynamic analysis tools such as checkers and profilers. Callgrind is a component of Valgrind for profiling CPU usage; with it you can look up the running time of each function. Another tool, KCachegrind, provides a visual interface that is handy for beginners. Fortunately, you don't need any extra compile flags; just follow three steps:
  1. Compile your project as usual
  2. Run Callgrind with valgrind --tool=callgrind ./cpuload
  3. kcachegrind callgrind.out.<pid> (Callgrind writes its output to a file named callgrind.out.<pid>)

Using these out-of-the-box tools helps point us to where our code can be optimized. By contrast, performance libraries like libpapi require us to insert code into our programs, which hurts programmability. Personally, I prefer these profiling tools for examining my code.

What's a good paper?

Posted on 2018-04-10

A good paper will:

  • Motivate a significant problem
  • Propose an interesting, compelling solution
  • Demonstrate the practicality and benefits of the solution
  • Draw appropriate conclusions
  • Clearly describe the paper’s contributions
  • Clearly articulate the advances beyond previous work

Academic Writing Tips

Posted on 2018-03-04
  • Paper Roadmap
    • Problem -> Ideas -> Methods -> Algorithm -> Implementation -> Experiments -> Conclusion
  • Logic
    • Transfer words (link sentences)
    • Connection between points and points
    • Claim the whole section in the head
  • Writing Paradigm
    • Problem X is important (in theory or application)
    • Work A, B, C… are the state-of-the-art methods for X
    • A, B, C… have weaknesses that remain unsolved
    • Analyze the weaknesses and propose a new method D
    • Experimentally compare D with A, B, C
    • Why does D work? Why do E, F, G not work?
    • Strengths and weaknesses of D
    • Future work of D
  • Rejection. Why?

    1. Not enough innovation
      • Has been done before
      • Too straightforward/naive (simple is good, but straightforward is bad)
      • Limited modifications/improvements
      • Intuitive technique without rigorous theoretical proof
    2. The experiments cannot convince readers, or more experiments are needed
      • Lack of comparison (with state-of-the-art methods)
      • Unfair comparison
      • Unbelievable results
      • The improvement is limited
    3. No clear motivation is claimed
      • Which problem will be solved? And why?
        • Ignored problems
        • Existing problems without a good solution
    4. The discussion or conclusion is not rigorous
      • The discussion of previous work is not appropriate
        • One-sided
        • Inaccurate about advantages/disadvantages
        • The causal logic is unreasonable
    5. Others:
      • Chaotic architecture and lack of logic
      • Low-level mistakes
      • Hard for readers to understand / not enough background provided
      • Lack of citations
      • Conflicts with reviewers
  • Fluent writing

    • From lower level to higher level (you know your work well, but the readers do not)
    • From old to new information
      • Never introduce a new concept at the beginning of a sentence
      • Always explain an abbreviation before using it
    • Section/Paragraph
      • Summarize the contents before introducing a new topic
  • Abundant Figures and Tables

    • Figure
      • Self-explanatory
      • Easy to understand
    • Table
      • Sorted!
      • Put the most important items at the head or tail
  • Tips

    1. One paper, one central message/contribution
    2. The story must be top-down
    3. Outline -> Section -> Subsection -> Paragraph
    4. First => Introduction + Related Work
    5. Last but most important => Conclusion + Abstract + refined Title
    6. Figures => generally, the more the better
    7. TODO
  • Reference:

    • Experience and lessons from writing academic papers (撰写学术论文的经验和教训)

Different Levels of Write Read Lock

Posted on 2017-12-17

A read-write lock is a synchronization technique for the multiple-readers/single-writer problem. There are three priority policies: read-preferring RW locks, write-preferring RW locks, and RW locks with unspecified priority.

Several libraries provide read-write lock facilities, for example:

  • The POSIX rwlock pthread_rwlock_t (which supports different priorities)
  • std::shared_mutex in C++17
  • boost::shared_mutex and boost::upgrade_mutex in the Boost library

Read-copy-update (RCU) is another method for the multiple-readers/single-writer problem. Yet another solution is the seqlock, which suits read-mostly data with infrequent writers.

A simple pseudo-implementation (read-preferring, built from a mutex protecting a reader counter):

void begin_read() {
    lock_r();
    counter++;
    if (counter == 1) lock_w();   // the first reader blocks writers
    unlock_r();
}
void end_read() {
    lock_r();
    counter--;
    if (counter == 0) unlock_w(); // the last reader lets writers in
    unlock_r();
}
void begin_write() {
    lock_w();
}
void end_write() {
    unlock_w();
}

For more details about an implementation in C++11 (using an atomic variable as a spinlock), please see this.

BTW, this simple RW-lock implementation does not guarantee that writers get higher priority.

C++20 Transaction Memory

Posted on 2017-10-20

Transactional memory is a concurrency mechanism that provides atomicity and isolation compared with normal memory access. It is easy to use and simplifies concurrent programming without locks or other concurrency-control methods.

Transactional memory is being standardized for C++ (currently as the Transactional Memory TS), and I think this is the inevitable trend given the rapid development of multi-core programming.

In GCC 7, programmers have more than three ways to use transactional memory (enabled with the -fgnu-tm flag), and I will introduce how to write code that takes advantage of it.

Below is an example that uses the synchronized keyword to synchronize multiple threads.

#include <iostream>
#include <vector>
#include <thread>

int f()
{
    static int i = 0;
    synchronized {              // begin synchronized block
        std::cout << i << " -> ";
        ++i;                    // each call to f() obtains a unique value of i
        std::cout << i << '\n';
        return i;
    }                           // end synchronized block
}

int main()
{
    std::vector<std::thread> v(10);
    for (auto& t : v)
        t = std::thread([]{ for (int n = 0; n < 10; ++n) f(); });
    for (auto& t : v)
        t.join();
}

The output on the screen looks like this:

0 -> 1
1 -> 2
...
99 -> 100

Only one thread can access this variable i at any time.

© 2019 Teng Ma