Teng's Blog

A coder and systemer



CMU 15-645 Logging Schemes

Posted on 2018-10-05

Recovery Algorithm

  • Today: ensure DBMS can recover from a failure (during normal TX processing)
  • Idea: Recover to a state with ACID guarantee

Classifications

  • Storage Types: Volatile Storage (DRAM/SRAM) => Non-volatile Storage (HDD, SSD, NVM) => Stable Storage (multiple storage devices)
  • Failure Types: Transaction / System / Storage Media
    • Transaction Errors: Logical errors / Internal State Errors
    • System Errors: Software (Exceptions, e.g., divide by zero) / hardware (Power down)
    • Fail-Stop Assumption: Non-volatile Storage will not be corrupted
    • Storage Media Failure: game over; the DBMS alone cannot recover (needs archived backups)

Use volatile memory for faster access:

  • Procedures (Buffer Pool):
    1. Copy target record into memory
    2. Perform the writes in memory
    3. Write Dirty records back to disks
  • Issue: There are two transactions T1 and T2. T1 is still uncommitted while T2 has already committed, and the dirty records have been written back to the non-volatile device. If T1 now needs to roll back, what happens?
    • No-Steal + Force:
      1. No-Steal: never have to undo the changes of an aborted TX, because its changes were never written to disk
      2. Force: never have to redo the changes of a committed TX, because all of its changes are guaranteed to be on disk at commit time

Shadow Paging

  • On commit, atomically switch the shadow page table to become the new master
  • Undo: Remove the shadow pages, switch to the master
  • Cons:
    1. Copying the entire page table is expensive (organize it as an index structure like a B+Tree and use path copying to avoid a full copy)
    2. Commit Overhead is high

WAL (Write Ahead Log)

  • Format <Tid, Object Id, Previous Value, Current Value> (Undo + Redo)
  • Trade-off:
    1. When should we write log entries to disk?
    2. When should we write dirty records to disk?
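The trade-offs above can be made concrete with a toy sketch: append an undo+redo record before each in-place write, so an aborted TX can be rolled back. The `<Tid, Object Id, Previous Value, Current Value>` format follows the post; the dict-as-database and function names are illustrative, not from the course:

```python
# Minimal WAL sketch: log BEFORE modifying, so we can undo (or redo) later.
# In a real system the log record must be flushed to disk before the write.

db = {"A": 100}
log = []  # append-only list of (tid, object_id, before, after)

def wal_write(tid, obj, new_value):
    log.append((tid, obj, db.get(obj), new_value))  # record before-image first
    db[obj] = new_value

def undo(tid):
    # walk the log backwards, restoring before-images of this transaction
    for t, obj, before, _after in reversed(log):
        if t == tid:
            db[obj] = before

wal_write("T1", "A", 50)
undo("T1")       # T1 aborts
print(db["A"])   # 100
```

The before-image makes undo possible (for a steal policy) and the after-image makes redo possible (for a no-force policy), which is why WAL records carry both.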

Log Styles

  • Physical Log: Record the data changes in the database
  • Logical Log: Higher level operations (e.g., UPDATE, INSERT)
    • Cons: recovery takes longer, since the logged operations must be re-executed
    • Pros: needs less log space

Checkpoint

  • Actions:
    1. Output onto stable storage all log records currently residing in main memory
    2. Output to the disk all modified blocks
    3. Write a <CHECKPOINT> entry to the log and flush to stable storage
  • Recovery: Redo committed TXs after checkpoint + Undo uncommitted TXs after checkpoint
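The recovery rule above can be sketched over a flat list of log records; the record format (`CHECKPOINT`, `WRITE`, `COMMIT` tuples) is an assumption for illustration:

```python
# Checkpoint recovery sketch: only records after the last checkpoint matter.
# Record formats (illustrative): ("CHECKPOINT",),
# ("WRITE", tid, obj, before, after), ("COMMIT", tid).

def recover(log_records, db):
    # 1. find the last checkpoint
    start = 0
    for i, rec in enumerate(log_records):
        if rec[0] == "CHECKPOINT":
            start = i + 1
    tail = log_records[start:]
    committed = {rec[1] for rec in tail if rec[0] == "COMMIT"}
    # 2. redo committed transactions, in log order (after-images)
    for rec in tail:
        if rec[0] == "WRITE" and rec[1] in committed:
            db[rec[2]] = rec[4]
    # 3. undo uncommitted transactions, in reverse order (before-images)
    for rec in reversed(tail):
        if rec[0] == "WRITE" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    return db

log = [("CHECKPOINT",),
       ("WRITE", "T1", "A", 0, 10), ("COMMIT", "T1"),
       ("WRITE", "T2", "B", 5, 99)]          # T2 never committed
print(recover(log, {"A": 0, "B": 5}))        # {'A': 10, 'B': 5}
```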

CMU 15-645 Concurrency Control TimeStamp

Posted on 2018-10-04

Principle

  • If TS(Ti) < TS(Tj), the schedule must be equivalent to a serial schedule in which Ti appears before Tj

Implementation:

  • System Clock/Logical Counter/Hybrid
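A hybrid scheme can be sketched by packing a physical clock into the high bits and a logical counter into the low bits, so timestamps stay strictly monotonic even within one clock tick. The 16-bit split and class name are illustrative assumptions:

```python
import threading
import time

# Hybrid timestamp sketch: wall-clock milliseconds in the high bits,
# a logical counter in the low 16 bits to break ties within one tick.

class HybridClock:
    def __init__(self):
        self._last = 0
        self._counter = 0
        self._lock = threading.Lock()

    def now(self):
        with self._lock:
            wall = int(time.time() * 1000)  # milliseconds
            if wall > self._last:
                self._last, self._counter = wall, 0
            else:
                self._counter += 1          # same tick: bump logical part
            return (self._last << 16) | self._counter

clock = HybridClock()
ts = [clock.now() for _ in range(3)]
print(ts[0] < ts[1] < ts[2])  # True: strictly monotonic
```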

CMU 15-645 Concurrency Control summary

Posted on 2018-10-04

Before reading this article, you should know:

  • Transaction: a sequence of read and write operations
    • processing: BEGIN -> (executing) -> COMMIT/ABORT
    • ACID:
      • Atomicity: All-or-nothing
      • Consistency: It looks correct to me
      • Isolation: every TX is isolated from the others (as if running alone)
      • Durability: changes are stored on non-volatile devices and survive failures

How to make sure a transaction is atomic

  • Approach 1: Logging
  • Approach 2: Shadow Paging

The essence of concurrency control

  • Serial Schedule

The definition of conflict

  • They are by different transactions
  • They are on the same object and at least one of them is a write
  • RW/WR/WW conflict:
    • Unrepeatable reads
    • Reading uncommitted data (a.k.a., Dirty Reads)
    • Overwriting uncommitted data (a.k.a., Lost Updates)

Dependency Graphs (Precedence Graph)

  • a schedule is conflict serializable iff the graph is acyclic
  • blind write: writing a value without reading it first

2PL

  • Phase 1: Growing => acquire all required locks from the lock manager (which grants/denies requests)
  • Phase 2: Shrinking => release locks; no new locks may be acquired
  • Strict 2PL: release all locks at the end of TX

2PL Deadlock Prevention

  • Wait-Die (non-preemptive): an older requester waits for a younger holder; a younger requester dies (aborts)
  • Wound-Wait (preemptive): an older requester wounds (aborts) a younger holder; a younger requester waits
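The two policies can be sketched as decision functions, assuming a smaller timestamp means an older transaction (helper names are illustrative, not from the course):

```python
# Deadlock-prevention decision sketch. Smaller timestamp = older TX.

def wait_die(requester_ts, holder_ts):
    # Non-preemptive: an older requester waits; a younger one dies
    # (aborts and restarts later with its ORIGINAL timestamp).
    return "wait" if requester_ts < holder_ts else "die"

def wound_wait(requester_ts, holder_ts):
    # Preemptive: an older requester wounds (aborts) the holder;
    # a younger requester waits.
    return "wound holder" if requester_ts < holder_ts else "wait"

print(wait_die(1, 5))    # older asks younger  -> wait
print(wait_die(5, 1))    # younger asks older  -> die
print(wound_wait(1, 5))  # older asks younger  -> wound holder
print(wound_wait(5, 1))  # younger asks older  -> wait
```

In both schemes only the younger transaction can be aborted, which guarantees no cycle of waiting can form.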

Peer-to-Peer Evolution

Posted on 2018-09-20

GPUDirect

  • Eliminate the need to make a redundant copy in CUDA host memory
  • Eliminate CPU bandwidth and latency bottlenecks

PeerDirect

  • Eliminate the need to make a redundant copy in host memory
  • Direct path for data exchange

PeerDirect Async

  • Control RDMA device from the GPU
  • Reduce CPU utilization

Transaction Protocols

Posted on 2018-09-19

Catalog

  • Concurrency Control
    • Two-phase Locking (2PL)
    • NO_WAIT
    • WAIT_DIE
    • Timestamp Ordering (TIMESTAMP)
    • Multi-version concurrency control (MVCC)
    • Optimistic concurrency control (OCC)
    • Deterministic (CALVIN)
  • Commitment Protocols
    • Two-phase Commit (2PC)

Classify from two views

  • Pessimistic vs. Optimistic
  • Lock-based vs. Timestamp-based

Comparison from another view

  • Pessimistic
    • Validate -> Read -> Compute -> Write
  • Optimistic
    • Read -> Compute -> Validate -> Write

2-PL

  • Basic procedures:
    • Expanding (Growing) Phase: locks are acquired
    • Shrinking (Contracting) Phase: locks are released
  • Centralized 2PL: only one scheduler
  • Distributed 2PL: a scheduler at each site handles the lock requests for that site

Timestamp Ordering based algorithm

  • In a distributed environment, timestamps must be globally unique and monotonically increasing
  • A global counter is not easy to maintain, and local clocks alone are unsuitable, so a unique site identifier is combined in (e.g., {local time, site identifier})
  • Basic TO:
    • rts(x): the largest TS of any read on x; wts(x): the largest TS of any write on x

      for Ri(x):
          if ts(Ti) < wts(x):
              reject Ri(x)
          else:
              accept Ri(x)
              rts(x) <- max(rts(x), ts(Ti))

      for Wi(x):
          if ts(Ti) < rts(x) or ts(Ti) < wts(x):
              reject Wi(x)
          else:
              accept Wi(x)
              wts(x) <- ts(Ti)
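A runnable sketch of the basic TO rules, keeping rts/wts per object in plain dicts (the data layout and function names are illustrative):

```python
# Basic timestamp ordering sketch: reject any operation that would
# violate timestamp order, otherwise advance rts/wts.

rts, wts, values = {}, {}, {}

def read(ts, x):
    if ts < wts.get(x, 0):
        return False                      # x was written by a later TX: reject
    rts[x] = max(rts.get(x, 0), ts)
    return True

def write(ts, x, v):
    if ts < rts.get(x, 0) or ts < wts.get(x, 0):
        return False                      # a later TX already read or wrote x
    wts[x] = ts
    values[x] = v
    return True

print(write(10, "A", "a1"))  # True
print(read(5, "A"))          # False: T5 is older than the writer T10
print(read(20, "A"))         # True, and rts[A] becomes 20
print(write(15, "A", "a2"))  # False: T20 already read A
```

A rejected operation means the transaction restarts with a new timestamp; no transaction ever waits, which is why TO cannot deadlock.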

Pitfall

  • Lock vs. Latch
    • Locks protect transactions (ACID properties) and are held for transaction duration
    • Latches protect in-memory data structures and are held only briefly
  • Lock-based algorithms can run into a deadlock situation (several transactions wait for each other); TO (timestamp ordering) never blocks, so it is deadlock-free
    • An intervention from outside is needed to resolve a deadlock
    • Detection: Wait-for Graph (has a cycle => deadlock)
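Cycle detection over the wait-for graph can be sketched with a DFS; the graph encoding (adjacency dict, edge Ti -> Tj meaning Ti waits for a lock Tj holds) is an assumption for illustration:

```python
# Deadlock detection sketch: DFS over the waits-for graph, reporting
# a deadlock when a back edge (gray node) is found.

def has_cycle(waits_for):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in waits_for}

    def dfs(t):
        color[t] = GRAY                       # on the current DFS path
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK                      # fully explored
        return False

    return any(color[t] == WHITE and dfs(t) for t in waits_for)

print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True: deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))      # False
```

On detecting a cycle, the system picks a victim transaction in the cycle and aborts it, which is the "outside intervention" mentioned above.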

Survey of Distributed Resource Management

Posted on 2018-08-17

Foreword

Recently, a hot trend in rack-scale architecture called disaggregation was proposed to improve the data center. In industry, Intel RSD and HP's The Machine will soon be available as real-world DC solutions. Research systems such as LegoOS and Infiniswap aim to find the right deployment guidelines and ways to optimize the disaggregation architecture. Disaggregation divides machines into working ones and resource ones (blade servers). This deployment makes full use of decoupling, and hence gains higher resource utilization and releases the working machines from the heavy tasks of resource management.
However, just like a standalone architecture, a disaggregated architecture also needs resource management. Moreover, in a distributed environment, keeping the allocation state consistent and persistent is harder, and doing so brings a large network overhead.
The simplest design uses an RPC to serve each resource allocation/release request, which naturally leads to poor performance due to the frequent network communication.

Related works

Infiniswap (NSDI '17) exposes a block device IO interface to the virtual memory subsystem. It divides the entire address space into fixed-size slabs. Slabs from the same device can be mapped to multiple remote machines' memory for performance and load balancing.
The INFINISWAP daemon runs in the user space and only participates in control plane activities. Specifically, it responds to slab-mapping requests from INFINISWAP block devices, preallocates its local memory when possible to minimize time overheads in slab-mapping initialization, and proactively evicts slabs, when necessary, to ensure minimal impact on local applications. All control plane communications take place using RDMA SEND/RECV.
LegoOS (OSDI '18) adopts a two-level resource management mechanism: the disaggregated back-end provides coarse-grained management, while the front-end server is responsible for fine-grained allocation.

Trade-off

Intuitively, the two methods above only fit infrequent-allocation scenarios, especially when the resource is a block device. When the disaggregated resource is byte-addressable memory or non-volatile memory, frequent allocation/release is unavoidable, and allocation becomes a performance bottleneck.

Start Stream Join from SoccerJoin to SplitJoin

Posted on 2018-08-17

Stream Join

  • Dealing with unbounded ("infinite") input; the output must be produced in real time.
  • time-based (an expiration time for tuples) vs. count-based (a fixed number of tuples)
  • meeting multi-core => parallelism

Kang's three-step join algorithm

  • assuming tuple r arrives from input stream R:
  1. Scan stream S’s window to find tuple matching r.
  2. Insert new tuple r into window for stream R.
  3. Invalidate all expired tuples in stream R’s window.
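The three steps can be sketched for one arriving tuple, using count-based windows held in deques (the window size and key-value tuple layout are illustrative assumptions):

```python
from collections import deque

# Kang's three-step sketch for one tuple arriving on stream R,
# joining on equal keys against stream S's window.

WINDOW_SIZE = 3
window_r, window_s = deque(), deque()

def on_arrival_r(r):
    # 1. scan the opposite window (S) for matches
    matches = [(r, s) for s in window_s if r[0] == s[0]]
    # 2. insert the new tuple r into R's own window
    window_r.append(r)
    # 3. invalidate expired tuples (count-based: drop the oldest)
    while len(window_r) > WINDOW_SIZE:
        window_r.popleft()
    return matches

window_s.extend([("k1", "s1"), ("k2", "s2")])
print(on_arrival_r(("k1", "r1")))  # [(('k1', 'r1'), ('k1', 's1'))]
```

A tuple arriving on S runs the mirror image of the same three steps against R's window.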

Stream Join RoadMap

Kang's algorithm (Naive)
=> Cell Join
=> Soccer (HandShake) Join
=> Low Latency HandShake Join
=> Split Join

Cell Join

  • For heterogeneous architecture

Handshake Join

  • Apply the classical three-step procedure in small windows.
  • Point-to-point links between processing cores.
  • Missed-join pair problem => Two-phase forwarding (asymmetric core synchronization protocol): whenever the right core C_right places a tuple s into its left send queue, it keeps a copy of s in the local window but marks it as forwarded. The forwarded tuple remains available for joining on C_right until the second phase of tuple forwarding, which is initiated by an ack message from C_left.
  • Autonomic load balancing =>

    if R-window is larger than that of our right neighbor:
        place oldest ri into rightSendQueue;
        mark ri as forwarded;


Low latency handshake join

  • observation: handshake join has higher latency because tuples may have to queue for long periods of time before they encounter matching partners.

Split Join

  • Different work-flow: Top-down data flow
  • Don’t need to rely on central coordinator
  • A distributed network (Distributor), a set of independent join cores (JoinCore/JC), a result gathering network (Combiner/Merger)
  • Overall:
  1. Distributed Tree: K-ary Tree
  2. Parallelism: Sub-window
    • Join core: the left region stores tuple r in the sliding window; the right region processes the replicated copy of tuple r
    • Case: when a new tuple r arrives, compare r with all the tuples in the S-stream sub-windows, then assign r to a storage sub-window chosen by round-robin selection.
  3. Expired Policy:

    • Time-based: a lifespan l

      Expiration Process(t, sub-window X) begin
          i = the end of sub-window X;
          while ti.timestamp - t.timestamp > Time Window Size do
              omit ti from sub-window X;
              i = i - 1;
    • Count-based: every sub-window is fixed size

  4. Particular join algorithm:

    Processing Core(t, sub-window X) begin
        forall ti-tuple in sub-window X do
            compare ti-tuple with t;
            if match then
                emit the matched result;
        if i = 0 (mod ordering precision) then
            emit punctuation star;
  5. RAP (relaxed adjustable punctuation) strategy

    • Punctuation-based ordering
    • outer ordering (strict/relaxed)
    • inner ordering (the results of a newly inserted tuple are in time order)
    • Employing a combiner rather than a merger
    • Tuning to produce a punctuation after joining one newly inserted tuple (strict outer ordering) or after N tuples (relaxed outer ordering)
    • Data-flow when merging
      N-ary Merger(t) begin
          foreach right(or left)-region of core1..N in sequence do
              while a resulting tuple (t) is available in the output buffer till the first star do
                  pop t from the join core's output buffer;
                  push t to the Merger's output buffer;
              push out the end-of-result star;

New Features in OFED

Posted on 2018-07-18

In Mellanox OFED version 4.3, a number of new features have been released. These techniques are useful for developing RDMA-based applications and improving their performance. I will introduce some of the core techniques below.

Advanced Transport

  • Dynamically Connected Transport: Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in smaller memory footprint, less overhead to set connections and higher on-chip cache utilization and hence increased performance.

Optimized Memory Access

  • Contiguous Pages: Contiguous Pages improves performance by allocating user memory regions over physical contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

    Possible Value | Description
    ANON           | Use current ANON small pages.
    HUGE           | Force huge pages.
    CONTIG         | Force contiguous pages.
    PREFER_CONTIG  | Try contiguous; fall back to ANON small pages. (Default)
    PREFER_HUGE    | Try huge; fall back to ANON small pages.
    ALL            | Try huge; if that fails, fall back to contiguous; if that fails, fall back to ANON small pages.
  • Memory Window: Memory Window allows the application to have a more flexible control over remote access to its memory. Memory Windows are intended for situations where the application wants to:

    • grant and revoke remote access rights to a registered region in a dynamic fashion with less of a performance penalty
    • grant different remote access rights to different remote agents and/or grant those rights over different ranges within registered region
  • Inline Receive: The inline optimization is only available for RDMA Send/Write. When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list; it therefore improves performance for short receive messages. Usage: Inline-Receive on the requestor side is possible only if the user chooses IB(V)_SIGNAL_ALL_WR.
  • ODP (On Demand Page): On-Demand-Paging (ODP) is a technique to alleviate much of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space, and track the validity of the mappings. Rather, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations which are no longer valid due to either non-present pages or mapping changes. ODP does not support contiguous pages.
    ODP can be further divided into 2 subclasses: Explicit and Implicit ODP.
    • Explicit ODP: In Explicit ODP, applications still register memory buffers for communication, but this operation is used to define access control for IO rather than pin-down the pages. ODP Memory Region (MR) does not need to have valid mappings at registration time.
    • Implicit ODP: In Implicit ODP, applications are provided with a special memory key that represents their complete address space. Thus, IO accesses referencing this key (subject to the access rights associated with the key) do not require registering any virtual address range.

Hyper's Rules for Parallelization

Posted on 2018-06-27
  • Rule #1: No random writes to non-local memory
    • Chunk the data, redistribute, and then each core sorts/works on local data.
  • Rule #2: Only perform sequential reads on non-local memory
    • This allows the hardware prefetcher to hide remote access latency.
  • Rule #3: No core should ever wait for another
    • Avoid fine-grained latching or sync barriers.
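Rules #1 and #2 can be sketched as partition-then-redistribute: each core first hashes its chunk into per-destination buckets, then every core concatenates only the buckets addressed to it, so all writes stay local and remote reads are sequential. The sequential simulation below is illustrative, not Hyper's code:

```python
# Partition-then-redistribute sketch (sequential stand-in for cores).

def partition(data, n_cores):
    # each input chunk is split by hash into one bucket per destination core
    buckets = [[] for _ in range(n_cores)]
    for x in data:
        buckets[hash(x) % n_cores].append(x)
    return buckets

def redistribute(all_buckets, n_cores):
    # core i concatenates bucket i from every producer: sequential reads
    # of non-local memory (Rule #2), local-only writes (Rule #1)
    return [[x for b in all_buckets for x in b[i]] for i in range(n_cores)]

chunks = [[1, 2, 3, 4], [5, 6, 7, 8]]          # one input chunk per "core"
locals_ = redistribute([partition(c, 2) for c in chunks], 2)
print(sorted(x for part in locals_ for x in part))  # all 8 values survive
```

Because every core finishes its partition pass before the exchange, no core ever waits on another mid-phase (Rule #3 holds between, not within, phases).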

Survey of DB Join Algorithm

Posted on 2018-06-14

A tour of Join

Join is a popular operation in database systems. The standard join concept is to combine columns from one or more tables in a relational database. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS. The join operation produces a temporary table containing the results that meet the conditions. In the following, we first introduce the different join operation types. We assume two tables named R and S.

1. CROSS JOIN: CROSS JOIN can be seen as the Cartesian product of two tables. The number of result rows is |R| * |S|.
2. Equi-join & Natural Join: Natural Join is a special case of Equi-join. We can write it in SQL as:

SELECT *
FROM employee JOIN department
ON employee.DepartmentID = department.DepartmentID;

We need one or more comparators (usually just one) to express the join condition. Natural Join uses equality comparisons between `R` and `S`.

3. Outer Join: Outer Join includes left outer joins, right outer joins, and full outer joins, depending on which table's rows are retained (left, right, or both).
4. Self Join: a table joined with itself.

Join Algorithms

We can divide join algorithms into three main types: loop join, sort-merge join, and hash join.

1. Nested loop join: the most naive one, with O(n^2) complexity (like a brute-force algorithm).

for each tuple r in R do
    for each tuple s in S do
        if r and s satisfy the join condition
            then output the tuple <r, s>

2. Block-nested loop (BNL) & Hash Join: load R into memory as a hash table, and scan S.

  • Raw hash join
  • Radix join
    • Radix Partition: Translation look-aside buffers (TLBs) => multi-pass radix partitioning join
      Foreach input tuple t do
        k = hash(t)
        p[k][pos[k]] = t
        pos[k] ++
      
  • No-partition hash join: all worker threads populate a shared hash table with all tuples of R.

3. Sort-merge join: sort R and S in the same order, then use a two-pointer algorithm to merge and produce the results.
  • Raw sort merge join
    • Multi-way method
    • Multi-pass method
  • Massive parallel sort merge join (MPSM)
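The radix partitioning idea above can be sketched as a single-pass partition followed by a per-partition hash join; real radix joins use multiple passes so that each pass's fan-out fits the TLB. The partition bit count and key-value tuple layout are illustrative assumptions:

```python
# Single-pass radix partition + per-partition hash join sketch.

BITS = 2  # fan-out of 2**BITS partitions (kept small, like one radix pass)

def radix_partition(tuples, key=lambda t: t[0]):
    parts = [[] for _ in range(1 << BITS)]
    for t in tuples:
        # take the low BITS bits of the hash as the partition number
        parts[hash(key(t)) & ((1 << BITS) - 1)].append(t)
    return parts

def radix_join(R, S):
    out = []
    # matching keys always land in the same partition pair
    for pr, ps in zip(radix_partition(R), radix_partition(S)):
        table = {}
        for r in pr:                      # build on the R partition
            table.setdefault(r[0], []).append(r)
        for s in ps:                      # probe with the S partition
            out.extend((r, s) for r in table.get(s[0], []))
    return out

R = [(1, "r1"), (2, "r2")]
S = [(1, "s1"), (3, "s3")]
print(radix_join(R, S))  # [((1, 'r1'), (1, 's1'))]
```

Each per-partition hash table is much smaller than a single global one, which is what keeps probes cache- and TLB-friendly.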

Sort vs. Hash

Distributed Hash Join

© 2019 Teng Ma