Teng's Blog

A coder and systemer



CMU 15-645 Logging Schemes

Posted on 2018-10-05

Recovery Algorithm

  • Today: ensure DBMS can recover from a failure (during normal TX processing)
  • Idea: Recover to a state with ACID guarantee

Classifications

  • Storage Types: Volatile Storage (DRAM/SRAM) => Non-volatile Storage (HDD, SSD, NVM) => Stable Storage (multiple storage devices)
  • Failure Types: Transaction / System / Storage Media
    • Transaction Errors: Logical errors / Internal State Errors
    • System Errors: Software (Exceptions, e.g., divide by zero) / hardware (Power down)
    • Fail-Stop Assumption: Non-volatile Storage will not be corrupted
    • Storage Media Failure: game over; the DBMS alone cannot recover (needs archived backups)

Use volatile memory for faster access:

  • Procedures (Buffer Pool):
    1. Copy target record into memory
    2. Perform the writes in memory
    3. Write Dirty records back to disks
  • Issue: There are two transactions T1 and T2. T1 is still uncommitted while T2 has already committed, and the dirty records have been written back to the non-volatile device. If T1 now needs to roll back, what happens?
    • No-Steal + Force:
      1. No-Steal: never have to undo the changes of an aborted TX, because its changes were never written to disk
      2. Force: never have to redo the changes of a committed TX, because all of its changes are guaranteed to be on disk at commit time

Shadow Paging

  • On commit, atomically switch the shadow page table to become the new master
  • Undo: Remove the shadow pages, switch to the master
  • Cons:
    1. Copying the entire page table is expensive (organize it as an index structure like a B+Tree and use path copying to avoid a full copy)
    2. Commit Overhead is high

WAL (Write Ahead Log)

  • Format <Tid, Object Id, Previous Value, Current Value> (Undo + Redo)
  • Trade-off:
    1. When should we write log entries to disk?
    2. When should we write dirty records to disk?
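The trade-offs above can be made concrete with a toy sketch: append an undo+redo record before each in-place write, so an aborted TX can be rolled back. The `<Tid, Object Id, Previous Value, Current Value>` format follows the post; the dict-as-database and function names are illustrative, not from the course:

```python
# Minimal WAL sketch: log BEFORE modifying, so we can undo (or redo) later.
# In a real system the log record must be flushed to disk before the write.

db = {"A": 100}
log = []  # append-only list of (tid, object_id, before, after)

def wal_write(tid, obj, new_value):
    log.append((tid, obj, db.get(obj), new_value))  # record before-image first
    db[obj] = new_value

def undo(tid):
    # walk the log backwards, restoring before-images of this transaction
    for t, obj, before, _after in reversed(log):
        if t == tid:
            db[obj] = before

wal_write("T1", "A", 50)
undo("T1")       # T1 aborts
print(db["A"])   # 100
```

The before-image makes undo possible (for a steal policy) and the after-image makes redo possible (for a no-force policy), which is why WAL records carry both.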

Log Styles

  • Physical Log: Record the data changes in the database
  • Logical Log: Higher level operations (e.g., UPDATE, INSERT)
    • Cons: recovery takes longer, since the logged operations must be re-executed
    • Pros: needs less log space

Checkpoint

  • Actions:
    1. Output onto stable storage all log records currently residing in main memory
    2. Output to the disk all modified blocks
    3. Write a <CHECKPOINT> entry to the log and flush to stable storage
  • Recovery: Redo committed TXs after checkpoint + Undo uncommitted TXs after checkpoint
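The recovery rule above can be sketched over a flat list of log records; the record format (`CHECKPOINT`, `WRITE`, `COMMIT` tuples) is an assumption for illustration:

```python
# Checkpoint recovery sketch: only records after the last checkpoint matter.
# Record formats (illustrative): ("CHECKPOINT",),
# ("WRITE", tid, obj, before, after), ("COMMIT", tid).

def recover(log_records, db):
    # 1. find the last checkpoint
    start = 0
    for i, rec in enumerate(log_records):
        if rec[0] == "CHECKPOINT":
            start = i + 1
    tail = log_records[start:]
    committed = {rec[1] for rec in tail if rec[0] == "COMMIT"}
    # 2. redo committed transactions, in log order (after-images)
    for rec in tail:
        if rec[0] == "WRITE" and rec[1] in committed:
            db[rec[2]] = rec[4]
    # 3. undo uncommitted transactions, in reverse order (before-images)
    for rec in reversed(tail):
        if rec[0] == "WRITE" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    return db

log = [("CHECKPOINT",),
       ("WRITE", "T1", "A", 0, 10), ("COMMIT", "T1"),
       ("WRITE", "T2", "B", 5, 99)]          # T2 never committed
print(recover(log, {"A": 0, "B": 5}))        # {'A': 10, 'B': 5}
```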

CMU 15-645 Concurrency Control TimeStamp

Posted on 2018-10-04

Principle

  • If TS(Ti) < TS(Tj), the schedule must be equivalent to a serial schedule in which Ti appears before Tj

Implementation:

  • System Clock/Logical Counter/Hybrid
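A hybrid scheme can be sketched by packing a physical clock into the high bits and a logical counter into the low bits, so timestamps stay strictly monotonic even within one clock tick. The 16-bit split and class name are illustrative assumptions:

```python
import threading
import time

# Hybrid timestamp sketch: wall-clock milliseconds in the high bits,
# a logical counter in the low 16 bits to break ties within one tick.

class HybridClock:
    def __init__(self):
        self._last = 0
        self._counter = 0
        self._lock = threading.Lock()

    def now(self):
        with self._lock:
            wall = int(time.time() * 1000)  # milliseconds
            if wall > self._last:
                self._last, self._counter = wall, 0
            else:
                self._counter += 1          # same tick: bump logical part
            return (self._last << 16) | self._counter

clock = HybridClock()
ts = [clock.now() for _ in range(3)]
print(ts[0] < ts[1] < ts[2])  # True: strictly monotonic
```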

CMU 15-645 Concurrency Control summary

Posted on 2018-10-04

Before reading this article, you should know:

  • Transaction: a sequence of read and write operations
    • processing: BEGIN -> (executing) -> COMMIT/ABORT
    • ACID:
      • Atomicity: All-or-nothing
      • Consistency: It looks correct to me
      • Isolation: every TX is isolated from the others (as if running alone)
      • Durability: changes are stored on non-volatile devices and survive failures

How to make sure a transaction is atomic

  • Approach 1: Logging
  • Approach 2: Shadow Paging

The essence of concurrency control

  • Serial Schedule

The definition of conflict

  • They are by different transactions
  • They are on the same object and at least one of them is a write
  • RW/WR/WW conflict:
    • Unrepeatable reads
    • Reading uncommitted data (a.k.a., Dirty Reads)
    • Overwriting uncommitted data (a.k.a., Lost Updates)

Dependency Graphs (Precedence Graph)

  • a schedule is conflict serializable iff the graph is acyclic
  • blind write: writing a value without reading it first

2PL

  • Phase 1: Growing => acquire all required locks from the lock manager (which grants/denies requests)
  • Phase 2: Shrinking => release locks; no new locks may be acquired
  • Strict 2PL: release all locks at the end of TX

2PL Deadlock Prevention

  • Wait-Die (non-preemptive): an older requester waits for a younger holder; a younger requester dies (aborts)
  • Wound-Wait (preemptive): an older requester wounds (aborts) a younger holder; a younger requester waits
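The two policies can be sketched as decision functions, assuming a smaller timestamp means an older transaction (helper names are illustrative, not from the course):

```python
# Deadlock-prevention decision sketch. Smaller timestamp = older TX.

def wait_die(requester_ts, holder_ts):
    # Non-preemptive: an older requester waits; a younger one dies
    # (aborts and restarts later with its ORIGINAL timestamp).
    return "wait" if requester_ts < holder_ts else "die"

def wound_wait(requester_ts, holder_ts):
    # Preemptive: an older requester wounds (aborts) the holder;
    # a younger requester waits.
    return "wound holder" if requester_ts < holder_ts else "wait"

print(wait_die(1, 5))    # older asks younger  -> wait
print(wait_die(5, 1))    # younger asks older  -> die
print(wound_wait(1, 5))  # older asks younger  -> wound holder
print(wound_wait(5, 1))  # younger asks older  -> wait
```

In both schemes only the younger transaction can be aborted, which guarantees no cycle of waiting can form.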

Peer-to-Peer Evolution

Posted on 2018-09-20

GPUDirect

  • Eliminate the need to make a redundant copy in CUDA host memory
  • Eliminate CPU bandwidth and latency bottlenecks

PeerDirect

  • Eliminate the need to make a redundant copy in host memory
  • Direct path for data exchange

PeerDirect Async

  • Control RDMA device from the GPU
  • Reduce CPU utilization

Transaction Protocols

Posted on 2018-09-19

Catalog

  • Concurrency Control
    • Two-phase Locking (2PL)
    • NO_WAIT
    • WAIT_DIE
    • Timestamp Ordering (TIMESTAMP)
    • Multi-version concurrency control (MVCC)
    • Optimistic concurrency control (OCC)
    • Deterministic (CALVIN)
  • Commitment Protocols
    • Two-phase Commit (2PC)

Classify from two views

  • Pessimistic vs. Optimistic
  • Lock-based vs. Timestamp-based

Comparison from another view

  • Pessimistic
    • Validate -> Read -> Compute -> Write
  • Optimistic
    • Read -> Compute -> Validate -> Write

2-PL

  • Basic procedures:
    • Expanding (Growing) Phase: locks are acquired
    • Shrinking (Contracting) Phase: locks are released
  • Centralized 2PL: only one scheduler
  • Distributed 2PL: a scheduler at each site handles the lock requests for that site

Timestamp Ordering based algorithm

  • In a distributed environment, timestamps must be globally unique and monotonically increasing
  • A global counter is not easy to maintain, and local clocks alone are unsuitable, so a unique site identifier is combined in (e.g., {local time, site identifier})
  • Basic TO:
    • rts(x): the largest TS of any read on x; wts(x): the largest TS of any write on x

      for Ri(x):
          if ts(Ti) < wts(x):
              reject Ri(x)
          else:
              accept Ri(x)
              rts(x) <- max(rts(x), ts(Ti))

      for Wi(x):
          if ts(Ti) < rts(x) or ts(Ti) < wts(x):
              reject Wi(x)
          else:
              accept Wi(x)
              wts(x) <- ts(Ti)
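A runnable sketch of the basic TO rules, keeping rts/wts per object in plain dicts (the data layout and function names are illustrative):

```python
# Basic timestamp ordering sketch: reject any operation that would
# violate timestamp order, otherwise advance rts/wts.

rts, wts, values = {}, {}, {}

def read(ts, x):
    if ts < wts.get(x, 0):
        return False                      # x was written by a later TX: reject
    rts[x] = max(rts.get(x, 0), ts)
    return True

def write(ts, x, v):
    if ts < rts.get(x, 0) or ts < wts.get(x, 0):
        return False                      # a later TX already read or wrote x
    wts[x] = ts
    values[x] = v
    return True

print(write(10, "A", "a1"))  # True
print(read(5, "A"))          # False: T5 is older than the writer T10
print(read(20, "A"))         # True, and rts[A] becomes 20
print(write(15, "A", "a2"))  # False: T20 already read A
```

A rejected operation means the transaction restarts with a new timestamp; no transaction ever waits, which is why TO cannot deadlock.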

Pitfall

  • Lock vs. Latch
    • Locks protect transactions (ACID properties) and are held for transaction duration
    • Latches protect in-memory data structures and are held only briefly
  • Lock-based algorithms can run into a deadlock situation (several transactions wait for each other); TO (timestamp ordering) never blocks, so it is deadlock-free
    • An intervention from outside is needed to resolve a deadlock
    • Detection: Wait-for Graph (has a cycle => deadlock)
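Cycle detection over the wait-for graph can be sketched with a DFS; the graph encoding (adjacency dict, edge Ti -> Tj meaning Ti waits for a lock Tj holds) is an assumption for illustration:

```python
# Deadlock detection sketch: DFS over the waits-for graph, reporting
# a deadlock when a back edge (gray node) is found.

def has_cycle(waits_for):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}

    def dfs(t):
        color[t] = GRAY                       # on the current DFS path
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK                      # fully explored
        return False

    return any(color[t] == WHITE and dfs(t) for t in waits_for)

print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True: deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))      # False
```

On detecting a cycle, the system picks a victim transaction in the cycle and aborts it, which is the "outside intervention" mentioned above.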

Survey of Distributed Resource Management

Posted on 2018-08-17

Foreword

Recently, a hot trend in rack-scale architecture called disaggregation was proposed to improve the data center. In industry, Intel RSD and HP's The Machine will soon be available as real-world DC solutions. Research systems such as LegoOS and Infiniswap aim to find the right deployment guidelines and ways to optimize the disaggregation architecture. Disaggregation divides machines into working ones and resource ones (blade servers). This deployment makes full use of decoupling, and hence gains higher resource utilization and releases the working machines from the heavy tasks of resource management.
However, just like a standalone architecture, a disaggregated architecture also needs resource management. Moreover, in a distributed environment, keeping the allocation state consistent and persistent is harder, and doing so brings a large network overhead.
The simplest design uses an RPC to serve each resource allocation/release request, which naturally leads to poor performance due to the frequent network communication.

Related works

Infiniswap (NSDI '17) exposes a block device IO interface to the virtual memory subsystem. It divides the entire address space into fixed-size slabs. Slabs from the same device can be mapped to multiple remote machines' memory for performance and load balancing.
The INFINISWAP daemon runs in the user space and only participates in control plane activities. Specifically, it responds to slab-mapping requests from INFINISWAP block devices, preallocates its local memory when possible to minimize time overheads in slab-mapping initialization, and proactively evicts slabs, when necessary, to ensure minimal impact on local applications. All control plane communications take place using RDMA SEND/RECV.
LegoOS (OSDI '18) adopts a two-level resource management mechanism: the disaggregated back-end provides coarse-grained management, while the front-end server is responsible for fine-grained allocation.

Trade-off

Intuitively, the two methods above only fit infrequent-allocation scenarios, especially when the resource is a block device. When the disaggregated resource is byte-addressable memory or non-volatile memory, frequent allocation/release is unavoidable, and allocation becomes a performance bottleneck.

Start Stream Join from SoccerJoin to SplitJoin

Posted on 2018-08-17

Stream Join

  • Dealing with unbounded ("infinite") input; the output must be produced in real time.
  • time-based (an expiration time for tuples) vs. count-based (a fixed number of tuples)
  • meeting multi-core => parallelism

Kang's three-step join algorithm

  • assuming tuple r arrives from input stream R:
  1. Scan stream S’s window to find tuple matching r.
  2. Insert new tuple r into window for stream R.
  3. Invalidate all expired tuples in stream R’s window.
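The three steps can be sketched for one arriving tuple, using count-based windows held in deques (the window size and key-value tuple layout are illustrative assumptions):

```python
from collections import deque

# Kang's three-step sketch for one tuple arriving on stream R,
# joining on equal keys against stream S's window.

WINDOW_SIZE = 3
window_r, window_s = deque(), deque()

def on_arrival_r(r):
    # 1. scan the opposite window (S) for matches
    matches = [(r, s) for s in window_s if r[0] == s[0]]
    # 2. insert the new tuple r into R's own window
    window_r.append(r)
    # 3. invalidate expired tuples (count-based: drop the oldest)
    while len(window_r) > WINDOW_SIZE:
        window_r.popleft()
    return matches

window_s.extend([("k1", "s1"), ("k2", "s2")])
print(on_arrival_r(("k1", "r1")))  # [(('k1', 'r1'), ('k1', 's1'))]
```

A tuple arriving on S runs the mirror image of the same three steps against R's window.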

Stream Join RoadMap

Kang's algorithm (Naive)
=> Cell Join
=> Soccer (HandShake) Join
=> Low Latency HandShake Join
=> Split Join

Cell Join

  • For heterogeneous architecture

Handshake Join

  • Apply the classical three-step procedure in small windows.
  • Point-to-point links between processing cores.
  • Missed-join pair problem => Two-phase forwarding (asymmetric core synchronization protocol): whenever the right core C_right places a tuple s into its left send queue, it keeps a copy of s in the local window but marks it as forwarded. The forwarded tuple remains available for joining on C_right until the second phase of tuple forwarding, which is initiated by an ack message from C_left.
  • Autonomic load balancing =>

    if R-window is larger than that of our right neighbor:
        place oldest ri into rightSendQueue;
        mark ri as forwarded;


Low latency handshake join

  • observation: handshake join has higher latency because tuples may have to queue for long periods of time before they encounter matching partners.

Split Join

  • Different work-flow: Top-down data flow
  • Don’t need to rely on central coordinator
  • A distributed network (Distributor), a set of independent join cores (JoinCore/JC), a result gathering network (Combiner/Merger)
  • Overall:
  1. Distributed Tree: K-ary Tree
  2. Parallelism: Sub-window
    • Join core: the left region stores tuple r in the sliding window; the right region processes the replicated copy of tuple r
    • Case: when a new tuple r arrives, compare r with all the tuples in the S-stream sub-windows, then assign r to a storage sub-window chosen by round-robin selection.
  3. Expired Policy:

    • Time-based: a lifespan l

      Expiration Process(t, sub-window X) begin
          i = the end of sub-window X;
          while ti.timestamp - t.timestamp > Time Window Size do
              omit ti from sub-window X;
              i = i - 1;
    • Count-based: every sub-window is fixed size

  4. Particular join algorithm:

    Processing Core(t, sub-window X) begin
        forall ti-tuple in sub-window X do
            compare ti-tuple with t;
            if match then
                emit the matched result;
        if i = 0 (mod ordering precision) then
            emit punctuation star;
  5. RAP (relaxed adjustable punctuation) strategy

    • Punctuation-based ordering
    • outer ordering (strict/relaxed)
    • inner ordering (the results of a newly inserted tuple are in time order)
    • Employing a combiner rather than a merger
    • Tuning to produce a punctuation after joining one newly inserted tuple (strict outer ordering) or after N tuples (relaxed outer ordering)
    • Data-flow when merging
      N-ary Merger(t) begin
          foreach right(or left)-region of core1..N in sequence do
              while a resulting tuple (t) is available in the output buffer till the first star do
                  pop t from the join core's output buffer;
                  push t to the Merger's output buffer;
              push out the end-of-result star;

New Features in OFED

Posted on 2018-07-18

In Mellanox OFED version 4.3, a number of new features have been released. These techniques are useful for developing RDMA-based applications and improving their performance. I will introduce some of the core techniques below.

Advanced Transport

  • Dynamically Connected Transport: Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in smaller memory footprint, less overhead to set connections and higher on-chip cache utilization and hence increased performance.

Optimized Memory Access

  • Contiguous Pages: Contiguous Pages improves performance by allocating user memory regions over physical contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

    Possible Value | Description
    ANON           | Use current ANON small pages.
    HUGE           | Force huge pages.
    CONTIG         | Force contiguous pages.
    PREFER_CONTIG  | Try contiguous; fall back to ANON small pages. (Default)
    PREFER_HUGE    | Try huge; fall back to ANON small pages.
    ALL            | Try huge; if that fails, fall back to contiguous; if that fails, fall back to ANON small pages.
  • Memory Window: Memory Window allows the application to have a more flexible control over remote access to its memory. Memory Windows are intended for situations where the application wants to:

    • grant and revoke remote access rights to a registered region in a dynamic fashion with less of a performance penalty
    • grant different remote access rights to different remote agents and/or grant those rights over different ranges within registered region
  • Inline Receive: The inline optimization is only available for RDMA Send/Write. When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list; it therefore improves performance for short receive messages. Usage: Inline-Receive on the requestor side is possible only if the user chooses IB(V)_SIGNAL_ALL_WR.
  • ODP (On Demand Page): On-Demand-Paging (ODP) is a technique to alleviate much of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space, and track the validity of the mappings. Rather, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations which are no longer valid due to either non-present pages or mapping changes. ODP does not support contiguous pages.
    ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.
    • Explicit ODP: In Explicit ODP, applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin-down the pages. ODP Memory Region (MR) does not need to have valid mappings at registration time.
    • Implicit ODP: In Implicit ODP, applications are provided with a special memory key that represents their complete address space. Thus, IO accesses referencing this key (subject to the access rights associated with the key) do not require registering any virtual address range.

Hyper's Rules for Parallelization

Posted on 2018-06-27
  • Rule #1: No random writes to non-local memory
    • Chunk the data, redistribute, and then each core sorts/works on local data.
  • Rule #2: Only perform sequential reads on non-local memory
    • This allows the hardware prefetcher to hide remote access latency.
  • Rule #3: No core should ever wait for another
    • Avoid fine-grained latching or sync barriers.
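Rules #1 and #2 can be sketched as partition-then-redistribute: each core first hashes its chunk into per-destination buckets, then every core concatenates only the buckets addressed to it, so all writes stay local and remote reads are sequential. The sequential simulation below is illustrative, not Hyper's code:

```python
# Partition-then-redistribute sketch (sequential stand-in for cores).

def partition(data, n_cores):
    # each input chunk is split by hash into one bucket per destination core
    buckets = [[] for _ in range(n_cores)]
    for x in data:
        buckets[hash(x) % n_cores].append(x)
    return buckets

def redistribute(all_buckets, n_cores):
    # core i concatenates bucket i from every producer: sequential reads
    # of non-local memory (Rule #2), local-only writes (Rule #1)
    return [[x for b in all_buckets for x in b[i]] for i in range(n_cores)]

chunks = [[1, 2, 3, 4], [5, 6, 7, 8]]          # one input chunk per "core"
locals_ = redistribute([partition(c, 2) for c in chunks], 2)
print(sorted(x for part in locals_ for x in part))  # all 8 values survive
```

Because every core finishes its partition pass before the exchange, no core ever waits on another mid-phase (Rule #3 holds between, not within, phases).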

Survey of DB Join Algorithm

Posted on 2018-06-14

A tour of Join

Join is a popular operation in database systems. The standard join concept is to combine columns from one or more tables in a relational database. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS. The join operation produces a temporary table containing the results that meet the conditions. In the following, we first introduce the different join operation types. We assume two tables named R and S.

1. CROSS JOIN: CROSS JOIN can be seen as the Cartesian product of two tables. The number of result rows is |R| * |S|.
2. Equi-join & Natural Join: Natural Join is a special case of Equi-join. We can write it in SQL as:

SELECT *
FROM employee JOIN department
ON employee.DepartmentID = department.DepartmentID;

We need one or more comparators (usually just one) to express the join condition. Natural Join uses equality comparisons between `R` and `S`.

3. Outer Join: Outer Join includes left outer joins, right outer joins, and full outer joins, depending on which table's rows are retained (left, right, or both).
4. Self Join: a table joined with itself.

Join Algorithms

We can divide join algorithms into three main types: loop join, sort-merge join, and hash join.

1. Nested loop join: the most naive one, with O(n^2) complexity (like a brute-force algorithm).

for each tuple r in R do
    for each tuple s in S do
        if r and s satisfy the join condition
            then output the tuple <r, s>

2. Block-nested loop (BNL) & Hash Join: load R into memory as a hash table, and scan S.

  • Raw hash join
  • Radix join
    • Radix Partition: Translation look-aside buffers (TLBs) => multi-pass radix partitioning join
      Foreach input tuple t do
        k = hash(t)
        p[k][pos[k]] = t
        pos[k] ++
      
  • No-partition hash join: all worker threads populate a shared hash table with all tuples of R.

3. Sort-merge join: sort R and S in the same order, then use a two-pointer algorithm to merge and produce the results.
  • Raw sort merge join
    • Multi-way method
    • Multi-pass method
  • Massive parallel sort merge join (MPSM)
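The radix partitioning idea above can be sketched as a single-pass partition followed by a per-partition hash join; real radix joins use multiple passes so that each pass's fan-out fits the TLB. The partition bit count and key-value tuple layout are illustrative assumptions:

```python
# Single-pass radix partition + per-partition hash join sketch.

BITS = 2  # fan-out of 2**BITS partitions (kept small, like one radix pass)

def radix_partition(tuples, key=lambda t: t[0]):
    parts = [[] for _ in range(1 << BITS)]
    for t in tuples:
        # take the low BITS bits of the hash as the partition number
        parts[hash(key(t)) & ((1 << BITS) - 1)].append(t)
    return parts

def radix_join(R, S):
    out = []
    # matching keys always land in the same partition pair
    for pr, ps in zip(radix_partition(R), radix_partition(S)):
        table = {}
        for r in pr:                      # build on the R partition
            table.setdefault(r[0], []).append(r)
        for s in ps:                      # probe with the S partition
            out.extend((r, s) for r in table.get(s[0], []))
    return out

R = [(1, "r1"), (2, "r2")]
S = [(1, "s1"), (3, "s3")]
print(radix_join(R, S))  # [((1, 'r1'), (1, 's1'))]
```

Each per-partition hash table is much smaller than a single global one, which is what keeps probes cache- and TLB-friendly.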

Sort vs. Hash

Distributed Hash Join

© 2019 Teng Ma