std::bodun::blog

Four Years into PhD

Sun, 20 Apr 2025 00:00:00 +0000

I just submitted another paper to SOSP 2025, and it’s hard to believe it’s been nearly four years since I started my PhD. A lot has changed since my last post about my PhD journey—looking back, I seemed pretty desperate then.

So here I am, reflecting on the past few years. I feel far more confident now—not just in my research decisions but in navigating the space of SysML in general.

When I started, I wasn’t sure about pretty much anything. But one thing I was certain about was ML inference. Admittedly, I didn’t grasp its full complexity or what compelling research directions existed, if at all. But I remembered reading in INFaaS that inference workloads account for 90% of ML infrastructure costs in AWS. That fact alone gave me hope—if inference drives such high traffic, it must be, or will become, important in the future.

Yet I kept wondering: Is there anything I can do at the model level? Many systems ML papers treated models as fixed-sized black boxes with deterministic execution latency and resource consumption. This assumption felt limiting.

That period was rough. Tons of methods for dynamic DNNs had already been proposed—early exit strategies, Mixture of Experts (MoE), model ensembles—but there just wasn’t a clear justification for designing systems specifically to optimize these approaches.

Honestly, I’m not sure how I made it through besides furiously searching “dynamic neural network”, hoping to find something worth pursuing. Of course, the rise of ChatGPT changed everything, but that’s another story.

Everything shifted after my first paper was accepted to EMNLP. That was the moment I realized publishing isn’t as impossible as it seemed. Before, I kept trying to build an end-to-end system, a process that consumed far too much time.

Instead, I learned that starting with a clear motivation and simply writing it down first is a much better approach. Writing helps untangle confusion—it forces clarity.

The next few projects moved at a much faster pace. What changed? Honestly, the biggest shift was that I stopped fixating on the future. The long-term uncertainty used to kill my productivity; it felt overwhelming. But I realized that instead of worrying about whether a proposed method might fail, it’s far more productive to just write the next paragraph in Overleaf and move forward. After all, if you write a paper, there exists a conference that is willing to accept it.

I’ll stop here and revisit this topic once I fully recover from the SOSP grind. For now, time to rest.

Stitchable Neural Networks

Sun, 01 Sep 2024 00:00:00 +0000

TL;DR: the Stitchable Neural Networks paper includes some interesting concepts. It allows the creation of multiple neural networks with varying complexity and performance trade-offs from a family of pretrained models.

Key Principles

How to choose anchors from well-performing pretrained models in a model family
The design of stitching layers
The stitching direction and strategy
A simple but effective training strategy

A key question about combining sub-networks from different pretrained models is how to maintain accuracy. The paper concludes that the final performance of these combinations is nearly predictable due to an interpolation-like performance curve between anchors. This predictability allows for selective pre-training of stitches based on various deployment scenarios.

The Choice of Anchors

Anchors that are pretrained on different tasks can learn very different representations due to the large distribution gap of different domains. Therefore, the selected anchors should be consistent in terms of the pretrained domain.

The Stitching Layer and its Initialization

SN-Net is built upon pretrained models. Therefore, the anchors have already learned good representations, which allows us to directly obtain an accurate transformation matrix by solving the least squares problem:

$$\|AM_o - B\|_F = \min_M \|AM - B\|_F$$

where $A \in \mathbb{R}^{N \times D_1}$ and $B \in \mathbb{R}^{N \times D_2}$ are two feature maps of the same spatial size but with different numbers of hidden dimensions.

This problem has a closed-form solution based on singular value decomposition, in which the optimal solution is obtained through an orthogonal projection in the space of matrices:

$$M_o = A^\dagger B$$

where $A^\dagger$ denotes the Moore-Penrose pseudoinverse of $A$.
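As a minimal sketch of this initialization, the snippet below solves the same least-squares problem with NumPy on random matrices. The sizes `N`, `D1`, `D2` and all variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D1, D2 = 64, 16, 24            # hypothetical sizes: N positions, two hidden widths
A = rng.standard_normal((N, D1))  # stand-in for features from the first anchor
B = rng.standard_normal((N, D2))  # stand-in for features from the second anchor

# Closed-form least-squares solution via the Moore-Penrose pseudoinverse.
M_o = np.linalg.pinv(A) @ B       # shape (D1, D2)

# Cross-check against NumPy's generic least-squares solver.
M_lstsq, *_ = np.linalg.lstsq(A, B, rcond=None)

# Frobenius norm of the remaining mismatch after the transformation.
residual = np.linalg.norm(A @ M_o - B)
```

The resulting `M_o` maps features of width `D1` to width `D2` and would serve as the starting weights of a stitching layer before fine-tuning.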

Where to Stitch

SN-Net takes Fast-to-Slow as the default stitching direction, meaning it stitches a bigger and slower network after a smaller and faster one to achieve better model performance. It also proposes a nearest stitching strategy, which limits stitching to two anchors of the nearest model complexity/performance.

Way to Stitch

Prior work shows that neighboring layers dealing with feature maps of the same scale share similar representations. Therefore, SN-Net uses a sliding window, where each window shares a common stitching layer.

Stitching Space

The stitching space is controlled by configuring the sliding window kernel size $k$ and step size $s$.
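To get a feel for how $k$ and $s$ shape the stitching space, here is a small sketch that groups layer indices into windows. The helper `sliding_windows` is a name introduced here for illustration; how SN-Net maps each window onto an actual stitching layer is not covered by this toy.

```python
def sliding_windows(num_layers, k, s):
    """Group layer indices into windows of size k, taken every s layers."""
    windows = []
    for start in range(0, num_layers - k + 1, s):
        windows.append(list(range(start, start + k)))
    return windows

# Overlapping windows: more windows, hence more candidate stitching positions.
overlapping = sliding_windows(6, k=2, s=1)
# Non-overlapping windows: fewer windows for the same number of layers.
disjoint = sliding_windows(6, k=2, s=2)
```

A smaller step size yields more windows, while a larger kernel size makes more layers share one stitching layer.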

Training Strategy

The training algorithm of SN-Net can be described as:


First, define a configuration set that contains all possible stitches
Initialize all stitching layers with least-squares matching
At each training iteration, randomly sample a stitch and follow the standard training process

Blog Archive

Wed, 28 Aug 2024 00:00:00 +0000

This is an archive including blogs I find useful or interesting. Hopefully the updates will keep coming.

Hardware

ServeTheHome

ML

Security

null program

Datacenter

SemiAnalysis

TensorIR Transformation

Tue, 30 Aug 2022 00:00:00 +0000

In the previous post, we explored how to write primitive functions in TensorIR. Here, we will see how to transform TensorIR into other (potentially more performant) variants. The content is derived from the MLC course taught by Tianqi Chen.

Batched BMM ReLu

A batched matrix multiplication followed by a ReLu operation can be expressed using numpy as:

import numpy as np

def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((16, 128, 128), dtype="float32")
    for n in range(16):
        for i in range(128):
            for j in range(128):
                for k in range(128):
                    if k == 0:
                        Y[n, i, j] = 0
                    Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]
    for n in range(16):
        for i in range(128):
            for j in range(128):
                C[n, i, j] = max(Y[n, i, j], 0)
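The explicit loops are slow in pure Python, so a vectorized reference is handy for checking any later result. The sketch below is one way to write it; `bmm_relu_reference` is a name introduced here and the inputs are random placeholders.

```python
import numpy as np

def bmm_relu_reference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Batched matrix multiplication followed by ReLU, vectorized."""
    # `@` multiplies the last two dimensions and broadcasts over the batch.
    return np.maximum(A @ B, 0)

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 128, 128)).astype("float32")
B = rng.standard_normal((16, 128, 128)).astype("float32")
C_ref = bmm_relu_reference(A, B)
```

Comparing the output of a scheduled TensorIR function against `C_ref` with `np.testing.assert_allclose` is a simple way to confirm that a transformation did not change the computation.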

Translating the numpy code into TensorIR we get:

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyBmmRule:
    @T.prim_func
    def bmm_relu(A: T.Buffer[(16, 128, 128), "float32"],
                 W: T.Buffer[(16, 128, 128), "float32"],
                 Y: T.Buffer[(16, 128, 128), "float32"]):
        T.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        # we must allocate the intermediate buffer here!
        Y_ = T.alloc_buffer([16, 128, 128], dtype="float32")
        for n, i, j, k in T.grid(16, 128, 128, 128):
            with T.block("M"):
                vn = T.axis.spatial(16, n)
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y_[vn, vi, vj] = T.float32(0)
                Y_[vn, vi, vj] += A[vn, vi, vk] * W[vn, vk, vj]
        for n, i, j in T.grid(16, 128, 128):
            with T.block("R"):
                vn = T.axis.spatial(16, n)
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                Y[vn, vi, vj] = T.max(Y_[vn, vi, vj], T.float32(0))

Our ultimate goal is to transform the TensorIR above to the following form:

@tvm.script.ir_module
class TargetModule:
    @T.prim_func
    def bmm_relu(A: T.Buffer[(16, 128, 128), "float32"], B: T.Buffer[(16, 128, 128), "float32"], C: T.Buffer[(16, 128, 128), "float32"]) -> None:
        T.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        Y = T.alloc_buffer([16, 128, 128], dtype="float32")
        for i0 in T.parallel(16):
            for i1, i2_0 in T.grid(128, 16):
                for ax0_init in T.vectorized(8):
                    with T.block("M_init"):
                        n, i = T.axis.remap("SS", [i0, i1])
                        j = T.axis.spatial(128, i2_0 * 8 + ax0_init)
                        Y[n, i, j] = T.float32(0)
                for ax1_0 in T.serial(32):
                    for ax1_1 in T.unroll(4):
                        for ax0 in T.serial(8):
                            with T.block("M_update"):
                                n, i = T.axis.remap("SS", [i0, i1])
                                j = T.axis.spatial(128, i2_0 * 8 + ax0)
                                k = T.axis.reduce(128, ax1_0 * 4 + ax1_1)
                                Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]
                for i2_1 in T.vectorized(8):
                    with T.block("R"):
                        n, i = T.axis.remap("SS", [i0, i1])
                        j = T.axis.spatial(128, i2_0 * 8 + i2_1)
                        C[n, i, j] = T.max(Y[n, i, j], T.float32(0))

Before we perform the transformation, let’s understand what the transformed TensorIR is doing by looking at several loops here.

First, taking a look at

for i1, i2_0 in T.grid(128, 16):
    for ax0_init in T.vectorized(8):
        with T.block("M_init"):
            n, i = T.axis.remap("SS", [i0, i1])
            j = T.axis.spatial(128, i2_0 * 8 + ax0_init)
            Y[n, i, j] = T.float32(0)

The code block is initializing the Y matrix to be 0. But it does so by initializing every 8 consecutive elements in each row of Y using a vectorized operation (which might be faster).

The next loop is a bit tricky:

for ax1_0 in T.serial(32):
    for ax1_1 in T.unroll(4):
        for ax0 in T.serial(8):
            with T.block("M_update"):
                n, i = T.axis.remap("SS", [i0, i1])
                j = T.axis.spatial(128, i2_0 * 8 + ax0)
                k = T.axis.reduce(128, ax1_0 * 4 + ax1_1)
                Y[n, i, j] = Y[n, i, j] + A[n, i, k] * B[n, k, j]

This loop is actually performing the matrix multiplication of A and B. We multiply a row in A with a column in B and sum up the result into a single number.

Here, i is mapped to i1, which means we access A one row at a time. k = T.axis.reduce(128, ax1_0 * 4 + ax1_1) means we access one row in matrix A and one column in matrix B sequentially during multiplication, while applying unrolling in the hope of better access efficiency ($128 = 32 \times 4$). j = T.axis.spatial(128, i2_0 * 8 + ax0) really just means accessing each column sequentially, nothing special.

Perform Transformation

To perform a transformation on any TensorIR, it’s very important to follow the steps listed below:

1. Get block
2. Get loops
3. Organize loops by split, reorder, compute_at/reverse_compute_at
4. Decompose reduction
5. vectorize/unroll/parallel

Applying step 1, 2, and 3, we first get the block from the original TensorIR:

sch = tvm.tir.Schedule(MyBmmRule)
# Step 1. Get blocks
block_M = sch.get_block("M", func_name="bmm_relu")

# Step 2. Get loops
n, i, j, k = sch.get_loops(block_M)

# Step 3. Organize loops
k0, k1 = sch.split(k, factors=[32, 4])
j0, j1 = sch.split(j, factors=[16, 8])

The reason we split k and j this way is: as mentioned, the k dimension is accessed sequentially but with unrolling (4) applied; when matrix Y is initialized, a vectorized operation over 8 elements is applied to dimension j, that is, to every 8 consecutive elements in one row (TVM buffers are row-major, so this might be faster).

But the next question is: how do we reorder the split loops? I spent a lot of time trying to figure that out. It turns out the simplest way is to write out the implementation in numpy and proceed from there. Remember, we’ve already split k and j, which are used during matrix multiplication, so our new matrix multiplication in numpy would be:

for j0 in range(16):
    for k0 in range(32):
        for k1 in range(4):
            for j1 in range(8):
                Y[i, 8*j0+j1] += A[i, 4*k0 + k1] * B[4*k0+k1, 8*j0+j1]

Because we move to the next column in B after traversing the previous column, we put j1 in the innermost loop. Therefore, the transformation for TensorIR would be:

sch.reorder(j0, k0, k1, j1)
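A quick way to convince yourself that the split and reordered loop nest still computes a plain matrix multiplication is to run it in numpy and compare against `A @ B`. This is only a sanity-check sketch: the number of rows is reduced to 4 (an assumption made here purely to keep the Python loops fast), and the inputs are random.

```python
import numpy as np

rng = np.random.default_rng(0)
rows = 4                          # fewer rows than 128, only to keep this quick
A = rng.standard_normal((rows, 128))
B = rng.standard_normal((128, 128))
Y = np.zeros((rows, 128))

for i in range(rows):
    for j0 in range(16):          # j = 8 * j0 + j1
        for k0 in range(32):      # k = 4 * k0 + k1
            for k1 in range(4):
                for j1 in range(8):
                    j, k = 8 * j0 + j1, 4 * k0 + k1
                    Y[i, j] += A[i, k] * B[k, j]

# The loop order changes, but every (i, j, k) triple is still visited once.
matches = np.allclose(Y, A @ B)
```

Reordering is safe here because each output element accumulates the same set of products, just in a different order.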

We can print out the transformed TensorIR with print(sch.mod.script()):

@tvm.script.ir_module
class Module:
    @tir.prim_func
    def bmm_relu(A: tir.Buffer[(16, 128, 128), "float32"], B: tir.Buffer[(16, 128, 128), "float32"], C: tir.Buffer[(16, 128, 128), "float32"]) -> None:
        tir.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        Y = tir.alloc_buffer([16, 128, 128], dtype="float32")
        for n in tir.parallel(16):
            for i, j_0, k_0, k_1, j_1 in tir.grid(128, 16, 32, 4, 8):
                with tir.block("M"):
                    vn, vi = tir.axis.remap("SS", [n, i])
                    vj = tir.axis.spatial(128, j_0 * 8 + j_1)
                    vk = tir.axis.reduce(128, k_0 * 4 + k_1)
                    tir.reads(A[vn, vi, vk], B[vn, vk, vj])
                    tir.writes(Y[vn, vi, vj])
                    with tir.init():
                        Y[vn, vi, vj] = tir.float32(0)
                    Y[vn, vi, vj] = Y[vn, vi, vj] + A[vn, vi, vk] * B[vn, vk, vj]
        for n, i, j in tir.grid(16, 128, 128):
            with tir.block("R"):
                vn, vi, vj = tir.axis.remap("SSS", [n, i, j])
                tir.reads(Y[vn, vi, vj])
                tir.writes(C[vn, vi, vj])
                C[vn, vi, vj] = tir.max(Y[vn, vi, vj], tir.float32(0))

Now, we just need to move the ReLu operation (for n, i, j in tir.grid(16, 128, 128):) into the loop above:

# the ReLU block "R" is the one being moved under loop j0
block_R = sch.get_block("R", func_name="bmm_relu")
sch.reverse_compute_at(block_R, j0)

Step 4 involves separating initialization from matrix multiplication, so we use M_init = sch.decompose_reduction(block_M, k0), which results in:

@tvm.script.ir_module
class Module:
    @tir.prim_func
    def bmm_relu(A: tir.Buffer[(16, 128, 128), "float32"], B: tir.Buffer[(16, 128, 128), "float32"], C: tir.Buffer[(16, 128, 128), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "bmm_relu", "tir.noalias": True})
        # body
        # with tir.block("root")
        Y = tir.alloc_buffer([16, 128, 128], dtype="float32")
        for n in tir.parallel(16):
            for i, j_0 in tir.grid(128, 16):
                for j_1_init in tir.serial(8):
                    with tir.block("M_init"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + j_1_init)
                        tir.reads()
                        tir.writes(Y[vn, vi, vj])
                        Y[vn, vi, vj] = tir.float32(0)
                for k_0, k_1, j_1 in tir.grid(32, 4, 8):
                    with tir.block("M_update"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + j_1)
                        vk = tir.axis.reduce(128, k_0 * 4 + k_1)
                        tir.reads(Y[vn, vi, vj], A[vn, vi, vk], B[vn, vk, vj])
                        tir.writes(Y[vn, vi, vj])
                        Y[vn, vi, vj] = Y[vn, vi, vj] + A[vn, vi, vk] * B[vn, vk, vj]
                for ax0 in tir.serial(8):
                    with tir.block("R"):
                        vn, vi = tir.axis.remap("SS", [n, i])
                        vj = tir.axis.spatial(128, j_0 * 8 + ax0)
                        tir.reads(Y[vn, vi, vj])
                        tir.writes(C[vn, vi, vj])
                        C[vn, vi, vj] = tir.max(Y[vn, vi, vj], tir.float32(0))

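The final step is easy: just apply vectorize/parallel/unroll to the corresponding loops:

# vectorize the initialization of every 8 consecutive elements
n, i, j_0, j_1_init = sch.get_loops(M_init)
sch.vectorize(j_1_init)

# vectorize the ReLU over the same 8 elements
block_R = sch.get_block("R", func_name="bmm_relu")
n, i, j_0, i2_1 = sch.get_loops(block_R)
sch.vectorize(i2_1)

# unroll the inner reduction loop and parallelize over the batch, as in TargetModule
block_M_update = sch.get_block("M_update", func_name="bmm_relu")
n, i, j_0, k_0, k_1, j_1 = sch.get_loops(block_M_update)
sch.unroll(k_1)
sch.parallel(n)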

Print out the final TensorIR to find out its final form ( ͡❛ ͜ʖ ͡❛).

Dive into TensorIR

Sun, 28 Aug 2022 00:00:00 +0000

TensorIR is a compiler abstraction for optimizing programs with tensor computation primitives in TVM. Imagine a DNN task as a graph, where each node represents a tensor computation. TensorIR explains how each node/tensor computation primitive in the graph is carried out. This post explains my attempt to implement 2D convolution using TensorIR. It is derived from the Machine Learning Compilation course offered by Tianqi Chen.

Implement 2D Convolution

2D convolution is a common operation in image processing. The image below captures how 2D convolution operates. I won’t go into details here, but you can find plenty of information online regarding convolution.

First, we initialize both the input matrix and the weight matrix:

import numpy as np

# batch, input_channel_dim, image_height, image_width, output_channel_dim, kernel_width & height
N, CI, H, W, CO, K = 1, 1, 8, 8, 2, 3
# output_height, output_width, assuming kernel has stride=1 and padding=0
OUT_H, OUT_W = H - K + 1, W - K + 1
data = np.arange(N*CI*H*W).reshape(N, CI, H, W)
weight = np.arange(CO*CI*K*K).reshape(CO, CI, K, K)

We can validate the results using torch.nn.functional.conv2d() from PyTorch.

One thing Tianqi recommended for starters is to write the implementation first in numpy, and then translate the numpy implementation to TensorIR. I started my implementation directly from TensorIR, before totally getting confused. So here’s how I approach the problem.

First, and perhaps most importantly, you should figure out the access pattern of the output matrix, and gradually fill in the compute rules for each element in the output matrix. We know the output matrix has a shape of (N, CO, OUT_H, OUT_W) (which corresponds to batch, number of output channels, output height, and output width). The numpy loop will look like:

# allocate the output matrix before filling it in
Y = np.empty((N, CO, OUT_H, OUT_W), dtype=data.dtype)
for b in np.arange(0, N):
    for co in np.arange(0, CO):
        for h in np.arange(0, OUT_H):
            for w in np.arange(0, OUT_W):
                Y[b, co, h, w] = 0

Here, we access the elements in the output matrix one by one and initialize each of them to 0. Next, we will try to figure out how to compute each element. We know each element in the output matrix is just the sum of the element-wise multiplication of the 2D convolutional kernel (1 by 3 by 3) and the corresponding area in the input matrix (1 by 3 by 3):

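# allocate the output matrix before filling it in
Y = np.empty((N, CO, OUT_H, OUT_W), dtype=data.dtype)
for b in np.arange(0, N):
    for co in np.arange(0, CO):
        for h in np.arange(0, OUT_H):
            for w in np.arange(0, OUT_W):
                # init to 0
                Y[b, co, h, w] = 0
                # 2d conv kernel
                for ci in np.arange(0, CI):
                    for kh in np.arange(0, K):
                        for kw in np.arange(0, K):
                            # reduction over the data and weight arrays initialized above
                            Y[b, co, h, w] += data[b, ci, h+kh, w+kw] * weight[co, ci, kh, kw]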

We can verify the function has the same output as torch.nn.functional.conv2d() from PyTorch.
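If PyTorch is not at hand, a NumPy-only cross-check works too. The sketch below is a stand-in for the `torch.nn.functional.conv2d()` comparison, not a replacement for it: it compares the explicit loops against an `einsum` over sliding windows. The helper name `conv2d_loops` is introduced here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

N, CI, H, W, CO, K = 1, 1, 8, 8, 2, 3
OUT_H, OUT_W = H - K + 1, W - K + 1
data = np.arange(N * CI * H * W).reshape(N, CI, H, W)
weight = np.arange(CO * CI * K * K).reshape(CO, CI, K, K)

def conv2d_loops(data, weight):
    """Stride-1, no-padding 2D convolution written as explicit loops."""
    Y = np.zeros((N, CO, OUT_H, OUT_W), dtype=data.dtype)
    for b in range(N):
        for co in range(CO):
            for h in range(OUT_H):
                for w in range(OUT_W):
                    for ci in range(CI):
                        for kh in range(K):
                            for kw in range(K):
                                Y[b, co, h, w] += data[b, ci, h + kh, w + kw] * weight[co, ci, kh, kw]
    return Y

# windows has shape (N, CI, OUT_H, OUT_W, K, K): one K x K patch per output pixel.
windows = sliding_window_view(data, (K, K), axis=(2, 3))
# Multiply each patch by the kernel and reduce over channel and kernel axes.
Y_ref = np.einsum("bchwij,ocij->bohw", windows, weight)
Y_loops = conv2d_loops(data, weight)
```

Both results have shape `(N, CO, OUT_H, OUT_W)` and should agree element for element, since the inputs are integers.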

The next part is to translate the numpy code into TensorIR. I won’t go into the details of every single line here, but you can find all explanations in this note.

The nested loop can be encapsulated using T.grid() like this:

@tvm.script.ir_module
class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # loop through each elem in the output matrix
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            # kernel access pattern
            for kc, kh, kw in T.grid(CI, K, K):
                # the block definition goes here (see the next step)

Next, we define the block (a basic unit of computation in TensorIR). A block contains a set of block axes (vi, vj, vk) and computations defined around them. Here, we define the properties of each block axis:

@tvm.script.ir_module
class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # impl
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            for kc, kh, kw in T.grid(CI, K, K):
                with T.block("A"):
                    vb = T.axis.spatial(N, b)
                    vc_o = T.axis.spatial(CO, o)
                    vh = T.axis.spatial(OUT_H, h)
                    vw = T.axis.spatial(OUT_W, w)
                    vc_i = T.axis.reduce(CI, kc)
                    vw_h = T.axis.reduce(K, kh)
                    vw_w = T.axis.reduce(K, kw)

The outer loops all receive T.axis.spatial(), because we access each element in the output matrix element by element (spatially), without doing anything else. On the other hand, the parameters in the inner loops receive T.axis.reduce(). Remember, each element in the output matrix is just the sum of the element-wise multiplication of the 2D convolutional kernel (1 by 3 by 3) and the corresponding area in the input matrix (1 by 3 by 3). Therefore, after the element-wise multiplication finishes, we need to perform a reduction operation over all three axes. More concretely, we sum up all elements along the row (K), column (K), and channel (CI) axes: (1, 3, 3) -> (1)

@tvm.script.ir_module
class MyConv:
    @T.prim_func
    def conv2d(data: T.Buffer[(N, CI, H, W), "int64"],
               weight: T.Buffer[(CO, CI, K, K), "int64"],
               result: T.Buffer[(N, CO, OUT_H, OUT_W), "int64"]):
        T.func_attr({"global_symbol": "conv2d", "tir.noalias": True})
        # impl
        for b, o, h, w in T.grid(N, CO, OUT_H, OUT_W):
            for kc, kh, kw in T.grid(CI, K, K):
                with T.block("A"):
                    vb = T.axis.spatial(N, b)
                    vc_o = T.axis.spatial(CO, o)
                    vh = T.axis.spatial(OUT_H, h)
                    vw = T.axis.spatial(OUT_W, w)
                    vc_i = T.axis.reduce(CI, kc)
                    vw_h = T.axis.reduce(K, kh)
                    vw_w = T.axis.reduce(K, kw)

                    with T.init():
                        result[vb, vc_o, vh, vw] = T.int64(0)
                    # compute rule
                    result[vb, vc_o, vh, vw] += data[vb, vc_i, vh+vw_h, vw+vw_w] * weight[vc_o, vc_i, vw_h, vw_w]

Pathways: Google's New ML System

Thu, 31 Mar 2022 00:00:00 +0000

Google recently released a paper about its new ML system called Pathways. I was a bit surprised, since I expected it to introduce a brand new model architecture. In fact, this paper is not easy to digest at all. I feel like it’s written for people who have spent many years developing ML frameworks. Anyway, we will try to understand why it was developed and how it works. Also, you should check this post (in Chinese), which explains many concepts in Pathways much more clearly. Much of the content here is credited to that post.

This paper spends a long time discussing single-controller and multi-controller systems. It’s really confusing to keep track of all the SPMD, MPMD, single-controller, and multi-controller terminology. Pathways claims that future ML frameworks should go back to single-controller. By “back” I mean that ML frameworks were originally single-controller, then adopted multi-controller, and now we are going back to single-controller again.

Single-Controller

TensorFlow v1 is a classic example of a single-controller system. The high-level idea is that the user defines a dataflow graph through a Python client. This graph is then submitted to the runtime system via session.run. The system consists of a single master and many workers. The master compiles the dataflow graph submitted by the client, divides it into sub-graphs, and submits those sub-graphs to the workers.

In this case, each worker computes its own share of sub-graph. The client + master are the controller.

Fig. control messages (oranges lines) need to go through slow DCN between Ctrlr and hosts

As the paper suggests, dispatching computations in a single-controller system requires communication across the data center network (DCN). All the orange lines are control messages flowing through the DCN. We can see the workers are idle for a long time between steps, even though there’s no gap between adjacent steps on the controller.

The controller submits jobs to all workers in each step, then waits for all workers to finish computing their own sub-graphs. The problems are: 1) waiting for all workers to finish computation in a lock-step fashion is inefficient; 2) sending and waiting for control messages (orange lines) is costly, since these messages go through the slow DCN.

Multi-Controller Systems

Contrary to single-controller systems, multi-controller systems like JAX adopt a different philosophy. Under multi-controller systems, each worker runs the same code and executes a different stage/branch of it. This is why they are called SPMD (single-program-multiple-data) systems.

Fig. Dispatching jobs only happens locally on hosts without going through DCN

Take MPI as an example: every MPI process is an entry point (client) to the program, whereas in single-controller systems only the client-master can be the entry point.

Since multi-controller systems don’t have a centralized coordinator, all workers can initiate communication with each other using much faster channels such as PCIe or NVLink. In the multi-controller figure, the black dotted lines represent messages between hosts and devices (through PCIe); the communication between devices happens through fast NVLink. So we don’t have the big overhead introduced by the DCN.

If you want to get a taste of what the PyTorch vs TensorFlow v1 (multi-controller vs single-controller) programming styles feel like, here are two examples: Writing Distributed Applications with PyTorch and End-to-End Tutorial for Distributed TensorFlow 1.x.

Going Back to Single-Controller

We could stick with multi-controller systems forever. If every worker node shares symmetric workloads and communications (like all-reduce, all-gather, etc.), then there’s nothing to be worried about. After all, multi-controller seems much more efficient than single-controller based on what we’ve discussed so far.

However, pipeline parallelism changes the story. Under pipeline parallelism, different workers in the pipeline execute different programs. Thus we have MPMD (multiple-program-multiple-data). For example, we can have one worker doing convolution for batch 1 while another worker is doing encoding work on batch 2. At each stage of the pipeline, the worker is doing a different job on a different data batch (think of a CPU pipeline where each stage is executing different instructions).

Take the above graph as an example, assume we have three workers 1, 2, 3 from top to bottom. Each worker is performing asymmetric workloads and doing irregular point-to-point communications (instead of symmetric communications like all-gather). Obviously, multi-controller doesn’t fit into this kind of workload. How do you write a single copy of code that does all these irregular communications under multi-process scenarios?

Thus, Pathways proposes we should go back to single-controller, so that we can let the master node handle all these nasty communication patterns.

Deadlock

Single-controller brings back gang-scheduling and a centralized coordinator. The reason to use them is to help prevent deadlocks. However, the rationale behind this design decision is hard to glean from the paper. I’m going to use the post from Jinhui Yan (the developer behind OneFlow) to explain why gang-scheduling and a centralized coordinator prevent deadlocks.

Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order.

We can think of a computing device as a FIFO task queue (e.g. CUDA streams, TPU, or CPU…). Each FIFO task queue essentially has a stream of tasks to process.

Src. Jinhui Yan

The paper emphasizes that TPUs are single-threaded and only run non-preemptible kernels. That means we can think of each TPU as a single FIFO task queue. Once we enqueue a task, it can not be preempted from the queue. We need to wait until this task finishes its computation before we can execute the next task in the queue. This is a problem!

Src. Jinhui Yan

Imagine we have two devices (1 and 2), represented as two FIFO queues. Device 1 chooses to enqueue task A first and then B; device 2 decides to enqueue task B first and then A. Both tasks A and B are performing an all-scatter operation. Therefore, task A on device 1 needs to wait for messages from task A on device 2. Similarly, task B on device 2 needs to wait for messages from task B on device 1.

This is a classical example of deadlock in operating systems.

Solutions to Deadlock

Using gang-scheduling helps prevent deadlocks, because it enforces a global enqueueing order across multiple FIFO queues, instead of letting each queue handle tasks separately.
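The two-device scenario can be reproduced with a toy simulation. This is a simplified model assumed here for illustration, not how Pathways or any real runtime is implemented: each device is a non-preemptible FIFO queue, and a collective task runs only when it is at the head of every queue that contains it. The function name `run_collectives` is made up for this sketch.

```python
from collections import deque

def run_collectives(queues):
    """Simulate non-preemptible FIFO devices running collective tasks.

    queues maps a device name to its enqueue order. A task can only run
    when it sits at the head of every device queue that contains it.
    Returns (tasks completed in order, whether the system deadlocked).
    """
    pending = {dev: deque(tasks) for dev, tasks in queues.items()}
    done = []
    while any(pending.values()):
        heads = {q[0] for q in pending.values() if q}
        runnable = [
            t for t in sorted(heads)
            if all(q[0] == t for q in pending.values() if t in q)
        ]
        if not runnable:
            return done, True     # every device waits on a peer: deadlock
        task = runnable[0]
        for q in pending.values():
            if q and q[0] == task:
                q.popleft()
        done.append(task)
    return done, False

# Each device picks its own order: A waits for B's peer and vice versa.
inconsistent = run_collectives({"dev1": ["A", "B"], "dev2": ["B", "A"]})
# A single global order, as gang-scheduling would enforce.
consistent = run_collectives({"dev1": ["A", "B"], "dev2": ["A", "B"]})
```

With inconsistent orders nothing can start and the simulation reports a deadlock; with one global order both tasks complete.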

The paper also mentions that allowing devices (e.g. GPUs) to execute tasks concurrently can prevent deadlocks. This is because concurrency eliminates the non-preemption property, which is required for deadlocks to happen.

Src. Jinhui Yan

If each device allows concurrent execution (each device has multiple queues), then the task on one queue can be preempted to let the other task start executing, thus no deadlock (this is not strictly the case: the post explains an interesting scenario in NCCL where deadlocks can still happen if there are too many communications).

FlexFlow

Tue, 22 Feb 2022 00:00:00 +0000

FlexFlow is a deep learning framework that discovers a fast parallelization strategy for distributed DNN training. It uses the SOAP (Sample-Operation-Attribute-Parameter) search space of parallelization strategies. In short, FlexFlow automates the parallelization of model training.

The four elements in SOAP search space represent something that can be sliced into smaller chunks. For example, sample and parameter can be thought of as slicing training data and model parameters. Operation describes how operations (e.g. matmul, add, etc.) can be parallelized. Attribute further describes how to partition a sample.

Problem Inputs

Since FlexFlow is about searching for solutions, the framework is given two inputs: an operator graph $\mathcal{G}$, which includes all operations and state in a DNN model, and a device topology $\mathcal{D}$. Both are described as graphs.

Each node $o_i \in \mathcal{G}$ is an operation (e.g. matmul). Each edge $(o_i, o_j) \in \mathcal{G}$ is a tensor. In contrast, each node $d_i \in \mathcal{D}$ is a computing device, and each edge $(d_i, d_j) \in \mathcal{D}$ is a hardware connection (e.g. NVLink, network link, etc.). Each edge is also labeled with its bandwidth and latency.

The FlexFlow optimizer uses the operator graph $\mathcal{G}$ and the device topology graph $\mathcal{D}$ to discover a strategy, which is then passed to a distributed runtime.

How to search for parallelization strategies

Ultimately, FlexFlow is trying to achieve two things: find a parallelization configuration for the operator graph $\mathcal{G}$, and map the result onto the device topology $\mathcal{D}$.

For an operation $o_i$, it is given parallelizable dimensions $\mathcal{P}_i$, which is the set of all divisible dimensions in its output tensor. The paper provides a 1D convolution example:

For data parallelism, we can see the input data is split into smaller micro-batches. In model parallelism, the batch dimension remains the same, while the model is split and each part handles the same input data. The intuition is that for a given tensor, there exist many ways to divide it.

A parallelization configuration $c_i$ specifies a degree of parallelism for each dimension in $\mathcal{P}_i$. The product of these degrees, written $|c_i|$, is the total number of pieces the output tensor is divided into.

Each parallelization configuration $c_i$ partitions the operation $o_i$ into $|c_i|$ tasks (denoted as $t_{i:1}, \ldots, t_{i:|c_i|}$). Each task represents a divided operation and is assigned to a device. The paper claims that, given the output tensor of a task and its operation type, we can infer the input tensors needed to execute each task. It gives an example of dividing the matmul operation:

Given that the output tensor is split across its sample (batch) dimension and feature dimension, and the task type is matmul, we can use this information to infer the input tensors $X$ and $W$.
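A small NumPy sketch makes the inference concrete. The shapes and the 2-by-2 partitioning below are assumptions chosen for illustration: for $Y = XW$, the task that owns an output tile only needs the matching rows of $X$ and the matching columns of $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))     # (sample, in_features)
W = rng.standard_normal((6, 4))     # (in_features, out_features)
Y = X @ W                           # full, unpartitioned output

sample_parts, feature_parts = 2, 2  # degree of parallelism per output dimension
num_tasks = sample_parts * feature_parts  # the product of the degrees, |c_i|

tasks = {}
for s, rows in enumerate(np.array_split(np.arange(8), sample_parts)):
    for f, cols in enumerate(np.array_split(np.arange(4), feature_parts)):
        # The output tile Y[rows, cols] needs only these slices of the inputs.
        tasks[(s, f)] = X[rows, :] @ W[:, cols]
```

Each of the four tasks reproduces one tile of `Y`, which is why knowing the output partition and the operation type is enough to decide which input slices a task must receive.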

graph TD; Operator-Graph-->Parallelization-Strategy; Device-Topology-->Parallelization-Strategy;

The parallelization configurations $c_i$ for each operation $o_i$ are combined into a final strategy $\mathcal{S}$.

Building Task Graph

Now that we have the operator graph $\mathcal{G}$, the device topology graph $\mathcal{D}$, and the parallelization strategy $\mathcal{S}$, we can construct the task graph.

graph TD; Operator-Graph-->Task-Graph; Device-Topology-->Task-Graph; Parallelization-Strategy-->Task-Graph;

In essence, the task graph specifies the dependencies between each computation and communication task. The task graph is denoted as $\mathcal{T} = (\mathcal{T}_N , \mathcal{T}_E)$. If two tasks are assigned to the same computation device (e.g. same GPU), no communication task is required. Otherwise, we add a communication task to $\mathcal{T}_E$. For example, given an operator graph with a set of configurations $\mathcal{S}$:

The task graph will reflect the logical dependency between each task:

Each computation task is also marked with its average execution time exeTime (from running on the real device multiple times). A communication task’s exeTime is calculated by dividing the tensor size by the bandwidth.

Use Simulation to Estimate Execution Overhead

Now that we have the task graph with all dependencies specified, it’s time to evaluate (or simulate) the execution time of the whole task graph.

In essence, we know how a model is partitioned and placed in a cluster; what remains is to figure out how to schedule the execution.

The simplest way to simulate the task graph execution is as follows:

Find the task nodes in the task graph that have no inputs. These tasks represent the first layers of the neural network, so we put them into a ready queue to wait for execution.
Next, we dequeue tasks from the ready queue in order of ready time (the time a task was enqueued) and start each one at its ready time or at the finish time of the task previously executed on the same device, whichever is later.
After a task finishes its (simulated) execution, we look at the tasks that depend on it. If all of a dependent task’s predecessors have finished execution, that task is put into the ready queue.
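The three steps above can be sketched as a small event-driven loop. This is a minimal sketch, not FlexFlow’s simulator: the `tasks` dictionary layout, the `simulate` and `comm_time` names, and the toy graph are all assumptions made for illustration. Communication links are modeled as devices too, so a transfer is just another task whose execution time is the tensor size divided by the bandwidth.

```python
import heapq

def comm_time(tensor_bytes, bandwidth_bytes_per_s):
    # exeTime of a communication task: tensor size divided by bandwidth.
    return tensor_bytes / bandwidth_bytes_per_s

def simulate(tasks):
    """tasks: {name: (device, exe_time, [predecessor names])}.

    Returns (finish_times, makespan). Each device runs one task at a time.
    """
    remaining = {n: len(deps) for n, (_, _, deps) in tasks.items()}
    successors = {n: [] for n in tasks}
    for n, (_, _, deps) in tasks.items():
        for d in deps:
            successors[d].append(n)

    # Ready queue ordered by ready time; tasks with no inputs are ready at t=0.
    ready = [(0.0, n) for n, r in remaining.items() if r == 0]
    heapq.heapify(ready)
    device_free = {}   # device -> time at which it becomes idle
    finish = {}

    while ready:
        ready_time, name = heapq.heappop(ready)
        device, exe_time, _ = tasks[name]
        # Start once the task is ready and its device has finished its previous task.
        start = max(ready_time, device_free.get(device, 0.0))
        finish[name] = start + exe_time
        device_free[device] = finish[name]
        for succ in successors[name]:
            remaining[succ] -= 1
            if remaining[succ] == 0:
                # A task becomes ready when its last predecessor finishes.
                succ_ready = max(finish[p] for p in tasks[succ][2])
                heapq.heappush(ready, (succ_ready, succ))
    return finish, max(finish.values())

# Hypothetical task graph: "a" runs on gpu0, "b" and "c" on gpu1,
# and the output of "a" must be transferred before "c" can start.
tasks = {
    "a": ("gpu0", 2.0, []),
    "b": ("gpu1", 3.0, []),
    "a->c": ("link0-1", comm_time(4e9, 8e9), ["a"]),
    "c": ("gpu1", 1.0, ["a->c", "b"]),
}
finish, makespan = simulate(tasks)
```

In this toy graph the transfer finishes at 2.5, but `c` still has to wait for `b` on the same GPU, so the simulated makespan is 4.0.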

However, we haven’t seen how the task graph $\mathcal{T}$ might change once we update the configuration of an operation node $o_i$. FlexFlow only proposes a new parallelization strategy by changing the configuration of a single operation $o_i$ at a time. Therefore, whenever we generate a new configuration for an operator, we only need to re-simulate the tasks involved in the portion of the execution timeline that changes. This means we can derive a new task graph from the previous one, which speeds up the simulation process.

Execution Optimizer

Previously, we assumed the parallelization strategy was generated by some black-box function. In fact, the execution optimizer is in charge of taking an operator graph and a device topology as inputs and finding an efficient parallelization strategy.

The optimizer uses a Markov chain Monte Carlo (MCMC) method to sample parallelization strategies. It uses the simulated cost as an oracle, so that strategies with a lower simulated execution time are more likely to be sampled. This method is very greedy, but the authors argue it can still escape from local minima.
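A minimal sketch of such a search, assuming the standard Metropolis acceptance rule: a proposal that lowers the cost is always accepted, and a worse one is accepted with probability $\exp(\beta \cdot (cost(\mathcal{S}) - cost(\mathcal{S}^*)))$. The cost function below is a hypothetical stand-in for the simulator, and `mcmc_search`, `toy_cost`, `propose`, `DEGREES`, and `WORK` are illustrative names, not FlexFlow’s API.

```python
import math
import random

def mcmc_search(initial, propose, cost, beta=1.0, steps=1000, seed=0):
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_cost = cost(candidate)
        # Metropolis rule: always accept improvements; accept a worse
        # candidate with probability exp(beta * (current - candidate)).
        if (candidate_cost <= current_cost
                or rng.random() < math.exp(beta * (current_cost - candidate_cost))):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

DEGREES = (1, 2, 4, 8)     # hypothetical parallelism degrees per operation
WORK = (8.0, 4.0, 16.0)    # hypothetical compute cost per operation

def toy_cost(strategy):
    # Stand-in for the simulator: compute time shrinks with the degree of
    # parallelism while communication overhead grows with it.
    return sum(w / d + 0.5 * (d - 1) for w, d in zip(WORK, strategy))

def propose(strategy, rng):
    # Change the configuration of a single operation at a time.
    i = rng.randrange(len(strategy))
    new = list(strategy)
    new[i] = rng.choice(DEGREES)
    return tuple(new)

best, best_cost = mcmc_search((1, 1, 1), propose, toy_cost)
```

Occasionally accepting a worse strategy is what lets the search leave a local minimum, and a larger `beta` makes it greedier.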

Add Mermaid to Hugo with Dark Mode

Tue, 15 Feb 2022 00:00:00 +0000

Recently, I was revisiting materials in Deep Learning, and I needed a tool that generates diagrams easily. Drawing graphs from scratch and uploading them one by one to an image hosting platform is a daunting process. This is where Mermaid comes to the rescue. Now I can generate diagrams directly using Markdown. Here’s how to do it inside a Hugo site.

I use the etch theme, but this process should apply to all sites using Hugo. First, we create a new file /layouts/shortcodes/mermaid.html. We fill up mermaid.html with:

<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
<script>
 let isDark = window.matchMedia('(prefers-color-scheme: dark)').matches;
 let mermaidTheme = (isDark) ? 'dark' : 'default';
 let mermaidConfig = {
 theme: mermaidTheme,
 logLevel: 'fatal',
 securityLevel: 'strict',
 startOnLoad: true,
 arrowMarkerAbsolute: false,

 er: {
 diagramPadding: 20,
 layoutDirection: 'TB',
 minEntityWidth: 100,
 minEntityHeight: 75,
 entityPadding: 15,
 stroke: 'gray',
 fill: 'honeydew',
 fontSize: 12,
 useMaxWidth: true,
 },
 flowchart: {
 diagramPadding: 8,
 htmlLabels: true,
 curve: 'basis',
 },
 sequence: {
 diagramMarginX: 50,
 diagramMarginY: 10,
 actorMargin: 50,
 width: 150,
 height: 65,
 boxMargin: 10,
 boxTextMargin: 5,
 noteMargin: 10,
 messageMargin: 35,
 messageAlign: 'center',
 mirrorActors: true,
 bottomMarginAdj: 1,
 useMaxWidth: true,
 rightAngles: false,
 showSequenceNumbers: false,
 },
 gantt: {
 titleTopMargin: 25,
 barHeight: 20,
 barGap: 4,
 topPadding: 50,
 leftPadding: 75,
 gridLineStartPadding: 35,
 fontSize: 11,
 fontFamily: '"Open-Sans", "sans-serif"',
 numberSectionStyles: 4,
 axisFormat: '%Y-%m-%d',
 topAxis: false,
 },
 };
 mermaid.initialize(mermaidConfig);
</script>

This setup allows us to change Mermaid-generated diagrams’ theme based on the website’s current (light/dark) theme. This configuration is borrowed from the Setup.md from mermaid-js (except the theme part). You can find more information there about configuring mermaid.

You can also do this in /partials, but it will slow down the loading time because the mermaid js file is always loaded, regardless of whether you are actually using mermaid.

Next, we add the following lines to the file /layouts/shortcodes/mermaid.html:

<center>
<div class="mermaid">
 {{.Inner}}
</div>
</center>

Feel free to remove the <center> tag if you want to customize the diagram’s layout. And… we are done!

Here is an example sequenceDiagram. You should see that this diagram will adjust its theme accordingly based on light/dark mode. We use the example code from mermaid doc (just uncomment mermaid in the shortcode {{/*< mermaid >*/}}):

{{/*< mermaid >*/}}
sequenceDiagram
 participant Alice
 participant Bob
 Alice->>John: Hello John, how are you?
 loop Healthcheck
 John->>John: Fight against hypochondria
 end
 Note right of John: Rational thoughts <br/>prevail!
 John-->>Alice: Great!
 John->>Bob: How about you?
 Bob-->>John: Jolly good!
{{/*< /mermaid >*/}}


This diagram will adjust its theme based on light/dark theme. You can find more features from the Mermaid website.

Cross Entropy Loss

Sun, 13 Feb 2022 00:00:00 +0000

Many deep learning tasks involve classification, where a model outputs a series of probabilities for their corresponding labels. The goal is to correctly predict a given input’s label. Mathematically, it means generating max probabilities for the correct label. The probabilities are generated through a process called softmax.

The softmax function outputs a vector $\hat{y}$, which represents the estimated conditional probabilities of each class given an input $x$. For example, $\hat{y}_1 = P(y=\textrm{car}\ |\ x)$. Assume we have many features $x^{(i)}$ and their corresponding labels $y^{(i)}$. Then the outputs of the model can be expressed succinctly as

\[ P(Y\ |\ X) = \prod^{k}_{i=1} P(y^{(i)} | \ x^{(i)}) \]

Our goal is to maximize $P(Y | X)$. This is equivalent to minimizing the negative log-likelihood $ -\textrm{log} P(Y\ |\ X) = \sum^{k}_{i=1} -\textrm{log} P(y^{(i)} | \ x^{(i)}) $.

This loss function is called the cross-entropy loss. It is widely used in many classification tasks. Our objective is to reduce the value of this loss function. This is equivalent to maximizing the predicted probability for the correct label.

To see why this works, let’s take a toy example. Suppose we have three classes. Our model produces a vector with three probabilities for each input.

import numpy as np

# two probability vectors, one for each of two inputs
y_hat = np.array([[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]])

Each label is represented as an index into the corresponding probability vector in y_hat, which gives us the predicted probability for the correct label.

y = np.array([0, 2])

Then, we implement the cross-entropy loss function as:

def cross_entropy(y_hat, y):
 return - np.log(y_hat[range(len(y_hat)), y])

Finally, we calculate the loss value for our given probability vectors:

cross_entropy(y_hat, y)

The result is array([2.30258509, 0.69314718]). In the first output [0.1, 0.3, 0.6], the label is at index 0. But our model gives max probability to index 2, and only $0.1$ to the label, thus the greater loss value. In the second probability vector [0.2, 0.3, 0.5], we made the right prediction as we give the max probability to index 2 corresponding to the label, thus the smaller loss value.
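To connect this back to softmax, here is a small sketch that starts from raw scores instead of ready-made probabilities. The `logits` values are made up for illustration, and `cross_entropy_one_hot` is an extra helper (not in the post above) that writes the loss in its usual one-hot form, $-\sum_j y_j \log \hat{y}_j$, to show it matches the indexing trick.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y):
    # Pick the predicted probability of the correct label for each row.
    return -np.log(y_hat[np.arange(len(y_hat)), y])

def cross_entropy_one_hot(y_hat, y, num_classes):
    # Same loss written as -sum_j y_j * log(y_hat_j) with one-hot labels.
    one_hot = np.eye(num_classes)[y]
    return -(one_hot * np.log(y_hat)).sum(axis=1)

logits = np.array([[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]])  # hypothetical raw scores
y = np.array([0, 2])

y_hat = softmax(logits)               # each row sums to 1
per_example = cross_entropy(y_hat, y)
batch_loss = per_example.mean()       # a single scalar to minimize
```

In training we usually average the per-example losses over the batch, which is what `batch_loss` does here.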

Maximum Likelihood for Classification

Mon, 24 Jan 2022 00:00:00 +0000

Let’s say we want to classify an input text $y$ and give it a label $x$. Formally, we want to find:

\[ \textrm{argmax} P(x | y) \]

By Bayes’ rule this is the same as

\[ \textrm{argmax} \frac{P(y|x)P(x)}{P(y)} \]

Suppose we have five documents as training data and one document as the input as testing data. Our objective is to give a label to the test sentence.

Credit: Eunsol Choi

Let’s define the probability of a class as ($count(x)$ is the number of training documents labeled $x$ and $N$ is the total number of training documents)

\[ p(x) = \frac{count(x)}{N} \]

and the probability of a word appearing given a class label (here $count(x)$ is the total number of words in class $x$, and $|V|$ is the size of the vocabulary)

\[ p(w_i|x) = \frac{count(w_i,x) + 1}{count(x) + |V|} \]

The conditional probabilities $p(w_i|x)$ are:

Now, we want to find out which language label we should assign to the sentence “Chinese Chinese Chinese Tokyo Japan”. This is the same as asking which label $x$ we should pick so that $P(W|x)P(x)$ yields the greatest value. Because the set of labels is finite, we can simply evaluate $P(W|x)P(x)$ for each label and keep the largest one.

If we label the sentence as j (Japanese), we have $P(j | d_5) \propto \frac{1}{4}\cdot \left(\frac{2}{9}\right)^3\cdot \frac{2}{9}\cdot \frac{2}{9} \approx 0.0001$. If we calculate $P(c|d_5)$, we get approximately 0.0003, which is the larger value, so we label the sentence as c (Chinese).
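The whole calculation can be reproduced in a few lines of Python. The training set below is an assumption: it is the classic four-document textbook example, chosen because it reproduces the $\frac{1}{4}$ and $\frac{2}{9}$ terms above, and it may differ from the documents in the original figure. The helper names are hypothetical.

```python
from collections import Counter

# Assumed training set (not taken from the figure): four labeled documents.
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def prior(label):
    # p(x): fraction of training documents with this label.
    return doc_counts[label] / len(train)

def likelihood(word, label):
    # p(w|x) with add-one smoothing, as in the formula above.
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))

def score(doc, label):
    # Unnormalized posterior: P(x) times the product of P(w|x).
    result = prior(label)
    for word in doc.split():
        result *= likelihood(word, label)
    return result

scores = {label: score(test_doc, label) for label in doc_counts}
prediction = max(scores, key=scores.get)
```

With this data, `scores` comes out at roughly 0.0003 for c and 0.0001 for j, so `prediction` is c. For longer documents it is safer to sum log-probabilities instead of multiplying, to avoid underflow.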

Machine Learning System Resources

Sat, 08 Jan 2022 00:00:00 +0000

This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think there’s something worth mentioning. I will try to update this page frequently to include the most recent work in MLSys.

Resources

Facebook’s external large-scale work
NGC Container Doc: great for development, without having to manually install CUDA, pytorch, and other dependencies.
Awesome-System-for-Machine-Learning: A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
Mastering LLM Techniques: Inference Optimization: summary of techniques used for LLM deployment
LLM Visualization

Courses

Deep Learning Systems: Algorithms and Implementation
Machine Learning Compilation: offered by Tianqi Chen, intro to ML compiler. Open to all.
15-884: Machine Learning Systems: offered by Tianqi Chen at CMU.
CSE 291F: Advanced Data Analytics and ML Systems: offered by Arun Kumar at UCSD.
Tinyflow: tutorial code on how to build your own Deep Learning System in 2k Lines.
CSE 599W: Systems for ML: offered by Tianqi Chen at UW.
CS8803-SMR: Special Topics: Systems for Machine Learning: offered by Alexey Tumanov at Georgia Tech. [schedule]
CS 744: Big Data Systems: offered by Aditya Akella back at UW-Madison.
CS 329S: Machine Learning Systems Design: offered by Stanford.
EECS 598: Systems for AI: offered by Mosharaf Chowdhury at UMich.
CS 378: Systems for Machine Learning: offered by Aditya Akella at UT.
Machine Learning Systems (Fall 2019): from UCB
ECE 382V SysML: Computer Systems and Machine Learning Interplay: taught by Neeraja Yadwadkar at UT.

Labs & Faculties

Tutorials

The Illustrated Transformer: best introduction for transformer
Dive into Deep Learning: interactive deep learning book with code, math, and discussions.
CS231n: Convolutional Neural Networks for Visual Recognition
Pytorch-Internals: must-read for PyTorch basics.
Differential Programming with JAX
Getting started with JAX (MLPs, CNNs & RNNs)
Physics-based Deep Learning
机器学习系统：设计和实现 (Machine Learning Systems: Design and Implementation)
Dive into Deep Learning Compiler
Autodidax: JAX core from scratch: really really good resource for learning Jax internals.
Extending JAX with custom C++ and CUDA code
Vector, Matrix, and Tensor Derivatives
ML Memory Optimization: slides from UW. Visualization of dataflow graph helps understand how to optimize memory.
Pytorch模型加速系列（一）——新的Torch-TensorRT以及TorchScript/FX/dynamo (PyTorch Model Acceleration Series, Part 1: The New Torch-TensorRT and TorchScript/FX/dynamo)
Semianalysis: many good posts
How to Load PyTorch Models 340 Times Faster with Ray
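分布式深度学习系统 (Distributed Deep Learning Systems)
MLsys各方向综述 (A Survey of MLSys Research Directions)
金雪锋 (Jin Xuefeng): MindSpore technical lead
解读谷歌Pathways架构（一）：Single-controller与Multi-controller (Understanding Google’s Pathways Architecture, Part 1: Single-controller vs. Multi-controller)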
Why are ML Compilers so Hard?
Alpa: Automated Model-Parallel Deep Learning
Data Transfer Speed Comparison: Ray Plasma Store vs. S3
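从零开始学深度学习编译器 (Learning Deep Learning Compilers from Scratch)
一文搞懂 TorchDynamo 原理 (Understanding How TorchDynamo Works in One Article)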

Seminars

Stanford MLSys Seminar

Papers

This section could potentially be extremely long..

Training

Really broad topic…

The Llama 3 Herd of Models: really great paper explaining how SOTA models are trained in real world!

LLM

You can also refer to Awesome-LLM

Large Transformer Model Inference Optimization
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: use LLM as the controller to coordinate with external models for complicated tasks
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: load as sparse, compute as dense
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism: Overlapping LLM forward with KV caching computation.
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache: managing KV cache in distributed settings.
MatFormer: Nested Transformer for Elastic Inference: adaptive blocks during inference.

Inference

INFaaS: Automated Model-less Inference Serving: developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: transformer-specific inference optimization done by DeepSpeed.
Clipper: A Low-Latency Online Prediction Serving System: a nice overview of inference system. Not SOTA but a good starter.
Orloj: Predictably Serving Unpredictable DNNs: shares similarity to Clipper, but targeting models that may yield unpredictable performance.
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: how to multiplex devices to serve multiple models while meeting latency constraints.
MOSEL: Inference Serving Using Dynamic Modality Selection: dynamic modality selections for accuracy and SLO tradeoff
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models: finer-grained LLM serving
Orca: A Distributed Serving System for Transformer-Based Generative Models: iteration-based LLM inference
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: inference on heterogeneous accelerators using max-flow
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: automatic configuration of inter and intra-node parallelism for model for both prefill and decoding.

Multitenancy

MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
A Survey of Multi-Tenant Deep Learning Inference on GPU: efficient resource management for multi-tenant inference.
Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction: overlap DNN operators of different models in an online fashion
THEMIS : Fair and Efficient GPU Cluster Scheduling: minimize the maximum finish time fairness across all ML apps while efficiently utilizing cluster GPUs

Dynamic Neural Network

Dynamic Neural Networks: A Survey
Dynamic Multimodal Fusion
Hash Layers For Large Sparse Models: Using hashing for MoE gating.
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems: use evolution algorithm to update model structure during the training phase.
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping: dynamically skip FFN layers in Transformers using cosine similarity.
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: monotonically decrease LLM layers to ease KV cache management

Auto Placement

Federated Learning

A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Switch & ML

Memory Management

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

System Design

Pathways: Asynchronous Distributed Dataflow for ML: Google’s new DL systems, specifically designed for TPU.
Ray: A Distributed Framework for Emerging AI Applications: RiseLab’s new distributed system. Using shared memory for data communication.
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch: shared many similarities to Google’s Pathways.

Trade-off

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training: find trade-offs between DNN training performance optimization and energy consumption, by configuring batch size and GPU power limit.

Structured LLM Generation

Async Training

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models: training individual LMs before merging
Large Scale Distributed Deep Networks: parameter server

Self-play

O1 Replication Journey: A strategic Progress Report

Costs

RouteLLM: Learning to Route LLMs with Preference Data

RAG

From Local to Global: A Graph RAG Approach to Query-Focused Summarization: GraphRAG, compressing information into graphs so retrieval may contain more information
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: SpeculativeRAG: use smaller models with a subset of retrieved documents for faster draft generation, and use the larger LM for verification.
Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: summarize retrieval documents instead of merely just retrieving documents as is.
Seven Failure Points When Engineering a Retrieval Augmented Generation System: a study showing how different configurations in different retrieval stages may affect generation quality.
Active Retrieval Augmented Generation: determine when to perform retrieval, and how to induce the query for retrieval
MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery

Megatron with FastMoE

Wed, 01 Dec 2021 00:00:00 +0000

This is a guide on setting up Megatron-LM with FastMoE. Megatron is a transformer developed by the Applied Deep Learning Research team at NVIDIA. FastMoE enables PyTorch support for the Mixture of Experts (MoE) models. We use the FastMoE layer to replace the MLP layers in the transformer language model.

Prerequisites

Docker

We recommend using one of NGC’s recent PyTorch containers. The Megatron-LM repo uses pytorch:20.12-py3. We pull the image with:

docker pull nvcr.io/nvidia/pytorch:20.12-py3

Note: it’s possible to use the official PyTorch image. However, a few dependencies are missing and require manual installation. Also, PyTorch versions greater than 1.8 seem to have problems during the forward pass, so we don’t use the official PyTorch image here.

After the image is pulled successfully, we want to start a container. The NGC site contains instructions on how to start a docker image. We use the following script:

docker run --gpus all -it --rm --ipc=host -v /home/edwardhu/:/home/edwardhu/ --name pytorch-moe <image_id>

Note: we might encounter problems before starting up the docker container. Make sure we set the GPG and remote repo for the nvidia-docker2 package on the host and install required packages:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
 && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
 && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Set up FastMoE

After we spin up the container, we clone the fastmoe repo and enter the project directory. There is a setup.py file in the root of the project. Then we execute:

USE_NCCL=1 python setup.py install

to install FastMoE. For some reason, there is a compilation error saying that the definition of broadcastUniqueNCCLID(&ncclID) cannot be found. There is a condition check right above the failing call:

#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
 (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))

For some reason, the check failed even though the container has PyTorch version 1.8.0a0+1606899. According to the author, the if macro is there to deal with PyTorch’s API variance between v1.7.x and v1.8.x. For now, we simply comment out the if check and force broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND, "fastmoe_nccl_comm", rank); to be used instead of broadcastUniqueNCCLID(&ncclID):

//#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
// (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))
 broadcastUniqueNCCLID(&ncclID,
 c10d::OpType::SEND,
 "fastmoe_nccl_comm",
 rank);
//#else
 //broadcastUniqueNCCLID(&ncclID);
//#endif
 ncclComm_t comm;
 NCCL_SAFE_CALL(ncclCommInitRank(&comm, getSize(), ncclID, rank));
 return comm;
 }
};

Finally, we need to download a vocab file for later use since the Megatron repo doesn’t have one. Here, we use the vocab file from the SDNet repo. Feel free to use something else.

Megatron-LM Setup

After we set up FastMoE, we clone the Megatron-LM repo into the container. The FastMoE’s example guide on Megatron uses Megatron v2.2 release, so we need to choose the v2.2 tag in the Megatron repo.

Next, we follow FastMoE’s guide on Megatron and apply clip-grad-v2.2.patch and fmoefy-v2.2.patch accordingly. Instructions on how to apply patches in Linux are easy to find, for example, here is one.

RACE Dataset

After setting up Megatron-LM, we download the RACE dataset for fine-tuning downstream tasks (RACE is used for BERT evaluation; the Megatron repo also has several other examples using GPT, but here we stick to BERT). The Megatron repo also provides instructions on how to acquire these datasets for evaluation. For now, we just want to get the fine-tuning process up and running, without caring so much about the accuracy. Therefore, we don’t need to pre-train the BERT model just yet. After the dataset finishes downloading, we simply need to decompress it.

Summary

The most important change needed to convert a model to FastMoE style is:

# Initialize FastMoE
if args.fmoefy:
    from fmoe.megatron import patch_forward_step, patch_model_provider

    forward_step_func = patch_forward_step(forward_step_func)
    model_provider = patch_model_provider(model_provider)

More information can be found in the fmoefy patch file.

Set up Slurm across Multiple Machines

Tue, 16 Nov 2021 00:00:00 +0000

To install Slurm, we need to have admin access to the machine. This post explains how I got Slurm running on multiple Linux servers. All servers are running Ubuntu 18.04 LTS.

Setup Munge

First, we need to make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster. We need to create two users, slurm and munge, on all servers.

Then, we install Munge for authentication:

$ apt install munge libmunge2 libmunge-dev

To test if munge is installed successfully:

$ munge -n | unmunge | grep STATUS
STATUS: Success (0)

Next, we create a munge authentication key on one of the servers:

$ /usr/sbin/create-munge-key

After we generate munge authentication key, we copy the key /etc/munge/munge.key on that server to all other servers (overwrite the /etc/munge/munge.key on all other servers).

We need to set up the permissions for munge accordingly on every server:

$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ chmod 0755 /run/munge/

Then, we enable and start the munge service with (remember to not use sudo when running munge):

$ systemctl enable munge
$ systemctl start munge

You can then test whether munge works properly by executing:

munge -n # Generate a credential on stdout
munge -n | unmunge # Displays information about the MUNGE key 
munge -n | ssh somehost unmunge

If everything is set up properly, you shouldn’t see any error messages.

Setup Slurm

Use apt to install slurm in Ubuntu systems (make sure all nodes have the same slurm versions):

$ apt install slurm-wlm

Next, we need to configure slurm. Since we used the package manager to install slurm, the version is lower than the latest release. Thus, it’s preferable not to use the official Slurm Configuration Tool. Instead, we can find the corresponding version’s configuration tool at /usr/share/doc/slurmctld/slurm-wlm-configurator.html.

After filling up the required fields in the form, we copy the generated file into /etc/slurm-llnl/slurm.conf on all nodes. Then, you can execute sinfo to check all nodes status. You can also launch jobs to see if it actually works, for example:

srun -N2 -l /bin/hostname

This should print out the hostname for all the nodes in the cluster.

Add GPU support

To add GPU support, we first create a file gres.conf in /etc/slurm-llnl/. Here is an example on one node:

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2

Then, we add GresTypes=gpu into /etc/slurm-llnl/slurm.conf. Next, we add the GPU information to slurm.conf:

NodeName=node1 Gres=gpu:3 State=UNKNOWN

Paper Review - Dynamic Tensor Rematerialization

Tue, 09 Nov 2021 00:00:00 +0000

Dynamic Tensor Rematerialization (DTR) treats GPU memory as a large cache, where tensors can be evicted to save memory, and recomputed if needed later.

DTR’s eviction policy relies on the heuristic $h$. The heuristic assigns a value $h(t)$ to each resident tensor $t$, approximating the cost of evicting the tensor. DTR evicts the tensor with the lowest cost based on the value of $h$. $h$ can factor in arbitrary metadata.

During every operator call in PyTorch, DTR intercepts the call and performs the following tasks:

In short, whenever we perform an operation, we first recursively recompute all the non-resident tensors the current operation depends on, while evicting tensors we don’t need until there is enough GPU memory left. To decide which tensor to evict, DTR picks the one with the lowest value of $h$:

The heuristic $h$ evicts tensors based on three properties: staleness, size, and compute cost. It prefers to evict tensors that are least recently used, take up a large amount of GPU memory, and are cheap to recompute. $h_{DTR}$ is computed as:

\[ h_{DTR}(s, m, c)(t) := \frac{c(t)}{m(t) \cdot s(t)} \]

Recomputing an evicted tensor $t$ may result in recomputing many more tensors that $t$ recursively depends on. Thus, the paper proposes an improved heuristic that takes these recursive recomputations into account (at a higher maintenance cost). These tensors are called the evicted neighborhood $e^{*}(t)$.

\[ h_{DTR-improved}(s, m, c)(t) := \frac{c(t) + \sum_{u \in e^{*}(t)} c(u)}{m(t) \cdot s(t)} \]

This heuristic captures the recomputation costs of all evicted tensors that $t$ recursively depends on.
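Both heuristics are easy to write down in code. The sketch below is only an illustration of the two formulas: the `TensorInfo` record, its field names, and the example numbers are hypothetical and do not come from the DTR implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TensorInfo:
    name: str
    cost: float        # c(t): time to recompute t from its parents
    memory: float      # m(t): memory held by t
    staleness: float   # s(t): time since t was last accessed
    evicted_neighborhood: list = field(default_factory=list)  # e*(t)

def h_dtr(t):
    # Cheap, large, and stale tensors get a low score and are evicted first.
    return t.cost / (t.memory * t.staleness)

def h_dtr_improved(t):
    # Also charge the cost of evicted tensors that t would need recomputed.
    total_cost = t.cost + sum(u.cost for u in t.evicted_neighborhood)
    return total_cost / (t.memory * t.staleness)

def pick_victim(resident, heuristic=h_dtr):
    # DTR evicts the resident tensor with the lowest heuristic value.
    return min(resident, key=heuristic)

parent = TensorInfo("p", cost=10.0, memory=1.0, staleness=1.0)  # already evicted
a = TensorInfo("a", cost=1.0, memory=4.0, staleness=2.0,
               evicted_neighborhood=[parent])
b = TensorInfo("b", cost=1.0, memory=1.0, staleness=1.0)

basic_victim = pick_victim([a, b], h_dtr)              # a: 1 / (4 * 2) = 0.125
improved_victim = pick_victim([a, b], h_dtr_improved)  # b: a now scores 11 / 8
```

The basic heuristic evicts `a` because it looks cheap to recompute, while the improved one notices that bringing `a` back would also require recomputing its evicted parent and evicts `b` instead.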

Paper Review - Capuchin: Tensor-based GPU Memory Management for Deep Learning

Sun, 07 Nov 2021 00:00:00 +0000

This paper aims to reduce GPU memory usage during DNN training. Capuchin achieves this goal through swapping and recomputation, using the tensor as the unit of operation. The major question is how to balance swapping and recomputation to achieve maximum resource utilization.

Swap and Recomputation Benefit

The ultimate goal of swapping and recomputation is to hide the overhead as much as possible to minimize the wait time of back-access (a tensor evicted earlier being accessed again). For swapping, we should increase the overlap between swapping and computing; for recomputation, we should use cheap operations.

Determining Tensor Re-generation Cost

For swapping, it is usually not optimal to swap a tensor back in only when we access it. The reason is that copying a tensor from CPU memory to GPU memory usually introduces an overhead greater than the computation itself. It’s thus better to swap in a tensor earlier, that is, proactively.

The paper calls this an in-trigger. It means we use another tensor access between the evicted-access (the tensor access that triggers the tensor’s own eviction after it is used in the computation) and the back-access to bring back an evicted tensor a little bit earlier.

Of course, this may raise two questions:

How do we know when in-trigger should happen?
How to deal with PCIe lane interferences? E.g. one swap-in may happen later than in-trigger due to a previous swap-in still not finished.

The answer is quite simple. We use the runtime feedback at the back-access of a tensor. If the tensor is still being swapped in, it means the in-trigger time should be adjusted earlier. Note, this is based on the assumption of regular tensor access pattern in deep learning training, as illustrated in the paper.

Recomputation, on the other hand, is performed only in an on-demand manner. No in-trigger is used for recomputation.

Capuchin relies on the principle that swapping can be largely overlapped with computation, while recomputation will certainly incur a performance penalty. Thus, it chooses swapping as the first choice until we can no longer choose an in-trigger that perfectly hides the prefetching overhead.

One thing to note here: when we select a tensor $T$ to be recomputed but $T$ relies on another tensor that has itself been evicted, we need to recompute from the parent of that evicted tensor instead. This could potentially happen multiple times if more recomputations target tensor $T$. In short, recomputation and swapping cannot occur at the same time.

For more information, please refer to the original paper.

Starting Out PhD

Fri, 05 Nov 2021 00:00:00 +0000

Today marks the third month of my PhD life. Things finally start to become a little bit clearer. I finally have some potentially concrete ideas to work on.

Finding a research topic was the most difficult part. For several months, I was wandering around like a headless chicken, reading paper after paper: serverless, ML inference, compiler, pathlet routing, RDMA, you name it. The feeling of not having a topic was suffocating.

Talking to other people, especially people not from my own research areas, is extremely beneficial. In fact, I was able to narrow down what I want to work on after discussions with a friend of mine who was working on NLP, an area completely outside of networking. Chatting with lab mates and collaborators is also extremely helpful. They usually ask questions I would never have thought of, and save me from spending countless hours exploring cluelessly.

To me, current systems research feels application-driven. Many research projects are designed to address a very specific challenge faced at the application level. Thus, it is very likely that we can find an interesting systems problem in a non-systems conference like KDD or even ICLR.

Handle GitHub Password Authentication Deprecation

Tue, 19 Oct 2021 00:00:00 +0000

Update: using an SSH key to access the repo is strongly recommended.

Recently, GitHub deprecated the use of password for repos. You will have to generate GitHub tokens to access repos. It’s difficult for me to memorize the token without serious efforts. Fortunately, it’s easy to mitigate the problem.

After a repo is cloned, simply execute

git remote remove origin

to remove the old remote. Then, execute the following command:

git remote add origin https://<TOKEN>@github.com/<GITHUB_USERNAME>/<REPO>.git

Finally, execute the following command to set up the upstream:

git push --set-upstream origin main

After this, no password is needed for git operations inside this repo. Beware you need to perform this operation for every new repo.

Consensus Problem in Distributed Systems

Mon, 18 Oct 2021 00:00:00 +0000

In a distributed system, processes commonly need to reach consensus. When all non-faulty processes terminate, we must guarantee that everyone agrees on a specific value. Unfortunately, FLP 1985 ¹ proved that no deterministic asynchronous algorithm can guarantee consensus if even one process may fail.

Why is that an issue? The key lies in the fact that asynchronous communication doesn’t preserve the order of message arrivals. To fully answer that question, we must introduce the concepts of bi-valent state and $v$-valent state. The state of a system consists of the states of all processes and all message queues. A bi-valent state means the system can still reach either decision depending on message arrival times; a $v$-valent state means the system can only reach one decision $v$ (uni-valent).

Initial State Must be Bi-valent

We will use these facts to show why an algorithm can run forever under asynchronous assumptions. First, we claim that some initial state is bi-valent. We prove this claim by contradiction. Suppose all initial states are uni-valent. We say two states are adjacent if the processes’ values are the same in both states except for one process. Each of two adjacent states must then be either 0-valent or 1-valent (the agreed global value is either 0 or 1).

For example, let’s say we have five processes in an initial uni-valent state (0-valent). We modify only one process’s value from 0 to 1 between two adjacent states. At some point in time, the system states changes from 0-valent to 1-valent (we refer to this moment as the crossover point).

We see at the crossover point the system state becomes 1-valent. Assume $p_2$ fails before the crossover point, then according to adjacent states, some decision must be made, which contradicts with our assumption that all initial states are uni-valent. Therefore, we need the initial state to be bi-valent.

Always Bi-valent

Second, given the initial state is bi-valent, we then show that for any bi-valent state, we can always deliver the next message to process $p$ while staying bi-valent. This means we can’t be sure of reaching a decision no matter how much time has passed. More precisely, we will prove by contradiction that no message can force the system into a decided state.

Suppose we have a system. Initially, this system state is 0-valent. We execute an event $e = (p,m)$ (message $m$ is delivered to process $p$). This event can transform the system state $C$ to either 0-valent or 1-valent at some point, while system state $C$ itself can be either 0-valent or 1-valent.

At some point, the system state will change from $C_0$ to $C _1$. Let’s assume a different event $e’ = (p’, m’)$ causes this transition. We know from before that event $e$ can transform the system to 0-valent or 1-valent. Now we have two facts: (1) $e$ transforms $C_0$ to 0-valent; (2) $e’$ transforms $C _0$ to $C _1$, and $e$ transforms $C _1$ to 1-valent. If $p \neq p’$, then $e$ and $e’$ commute, so applying $e$ and then $e’$ to $C_0$ reaches the same state as applying $e’$ and then $e$. A successor of a 0-valent state would then be 1-valent, which is a contradiction.

To expand on this further, suppose event $e$ leads to a uni-valent state. Then a $p$-free run (one in which $p$ takes no steps, so $e$ never occurs) will lead $C _0$ to a decided state. However, we’ve shown $e$ and $e’$ commute, and since the $p$-free run, $e$, and $e’$ all commute, the $p$-free run in fact leads the system to a ‘‘decided’’ state that is actually not decided, meaning we can’t reach consensus. This scenario is shown below:

So, given the proof that we can’t reach consensus under asynchronous assumptions, what is the best we can do in practice?

Paxos Algorithm

We will back up a little from the fully asynchronous case and compromise: (1) keep consistency/agreement even under asynchrony; (2) settle for the lack of guaranteed termination.

In the Paxos algorithm, we assume the majority of processes are non-faulty, that is, $N > 2F$. Processes have three roles: proposer, acceptor, and learner. Proposers broadcast their proposals to all acceptors; acceptors accept a proposal (e.g. based on arrival times of proposals) and broadcast accept messages to all learners; learners decide which value to adopt (e.g. the value accepted by a majority).

The problem is: even if one acceptor fails, the total number of non-faulty processes will be $N-1$ and we can’t say $N-1 > 2F$ must hold.

To fix this problem, we can reduce the number of proposers to one. At any moment, only one proposer is the leader. If the leader fails, a new leader takes over and starts sending out proposals.

However, this solution still suffers from the same problem: when a new leader is elected, it cannot know whether the old leader’s proposal will be accepted.

Therefore, it is necessary for the new leader to establish a majority. The new leader must first broadcast a $prepare$ message to all acceptors. Acceptors then reply to the new leader with a $prepared$ message. Once a majority is prepared, the new leader knows whether the old leader’s proposal might have been accepted.

To see why this is the case, suppose an old leader dies at some point, then a new leader comes in. This new leader broadcasts a $prepare$ message to acceptors. Some acceptors reply back. We argue that if the old leader’s proposal is accepted by the majority of acceptors, then this new leader will preserve the old proposal by evaluating the replies from acceptors and re-proposing the latest accepted value in the prepare phase.

The reason is: if a majority of acceptors accepted the proposal in the previous round, then at least one acceptor must belong to both the old majority and the current one, because the sum of two majorities exceeds the total number of processes. Therefore, there must be at least one acceptor that can send its accepted value back to the new leader. Since there is only one leader at any time, we can safely say the old value will be sent to the new leader.
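The overlap argument can be checked by brute force for small systems. The following is only a sketch: the helper names `majorities` and `all_majorities_intersect` are made up for illustration and are not part of Paxos itself.

```python
from itertools import combinations

def majorities(n):
    """Yield every subset of acceptors {0, ..., n-1} that is a strict majority."""
    for size in range(n // 2 + 1, n + 1):
        for quorum in combinations(range(n), size):
            yield set(quorum)

def all_majorities_intersect(n):
    """Return True if every pair of majorities shares at least one acceptor."""
    quorums = list(majorities(n))
    return all(a & b for a in quorums for b in quorums)

# Two majorities of five acceptors always overlap in at least one acceptor.
overlap_holds_for_five = all_majorities_intersect(5)
```

Because each majority contains more than half of the acceptors, two of them cannot be disjoint, which is what the exhaustive check confirms for small `n`.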

In summary, if the new leader sees a $prepared$ message with a value, then that value must come from the previous round and the new leader can re-propose that value; otherwise, the new leader can safely propose a new value.
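The rule in this summary can be sketched as a small function. This is a simplified illustration rather than a full Paxos implementation: the `Prepared` record, its field names, and `choose_value` are hypothetical, and replies are assumed to come from distinct acceptors.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Prepared:
    """A hypothetical `prepared` reply from one acceptor."""
    acceptor_id: int
    accepted_round: Optional[int] = None   # round of the last accepted value, if any
    accepted_value: Optional[str] = None   # the value accepted in that round, if any

def choose_value(replies: List[Prepared], n_acceptors: int, own_value: str) -> Optional[str]:
    """Pick the value a new leader should propose after the prepare phase.

    Returns None when fewer than a majority of acceptors have replied.
    """
    if len(replies) <= n_acceptors // 2:
        return None  # no majority is prepared yet
    with_value = [r for r in replies if r.accepted_round is not None]
    if not with_value:
        return own_value  # nothing was accepted before: free to propose a new value
    # Re-propose the value accepted in the latest round.
    return max(with_value, key=lambda r: r.accepted_round).accepted_value
```

For instance, with five acceptors and three replies of which one carries an accepted value, the function returns that old value instead of the leader's own.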

¹ Impossibility of Distributed Consensus with One Faulty Process

Fault Tolerance in Distributed Systems

Tue, 05 Oct 2021 00:00:00 +0000

No system can guarantee to be fault-free, and distributed systems are no exception. However, failures in distributed systems are largely independent: typically only a subset of processes fails at once. We can exploit this property to provide some degree of fault tolerance. The problem is, fault tolerance makes everything else much more difficult.

The most common fault model is fail-stop, in which a process completely “bricks”. When a process fail-stops, no more messages can emerge from it. We also don’t know if this process will ever restart. In addition, we must account for all possible states of the faulty process, including unsent messages in the process’s queue. It’s also important to point out that a process that takes a long time to respond is indistinguishable from one that has fail-stopped: in both cases, an unknown amount of time may pass before a message emerges.

We use an example here to illustrate how and why a system fails to provide fault tolerance. We take a system that replicates data by broadcasting operations using logical timestamps. This system uses a Lamport clock to update local clocks (our previous post on Lamport Distributed Mutual Exclusion explains how the Lamport clock works). In short, we overwrite the value stored in a replica if an incoming message has a later timestamp $ts$.

Source: Ken McMillan

This system is not fault-tolerant. Imagine when one replica receives a write (marked by red incoming “write”) request and then tries to write the value to other replicas. This replica then fail-stops right after it writes to one replica and never writes to the other replica. In this case, not all replicas see the writes, thus violating consistency.
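The failure scenario can be reproduced with a toy model. This is a minimal sketch under assumptions: the `Replica` class, the `replicate` function, and the `crash_after` parameter are invented for illustration, and message passing is replaced by direct method calls.

```python
class Replica:
    """Stores one value and overwrites it only on a later timestamp."""

    def __init__(self):
        self.value = None
        self.ts = 0

    def apply(self, value, ts):
        if ts > self.ts:
            self.value, self.ts = value, ts

def replicate(sender, peers, value, ts, crash_after=None):
    """Apply a write locally, then forward it to each peer in turn.

    If crash_after is set, the sender fail-stops after forwarding to that
    many peers, so the remaining peers never see the write.
    """
    sender.apply(value, ts)
    for sent, peer in enumerate(peers):
        if crash_after is not None and sent >= crash_after:
            return  # fail-stop in the middle of the broadcast
        peer.apply(value, ts)

# The sender crashes after reaching only the first of its two peers.
sender, peer_a, peer_b = Replica(), Replica(), Replica()
replicate(sender, [peer_a, peer_b], "x", ts=1, crash_after=1)
surviving_values = [peer_a.value, peer_b.value]
```

After the run, the two surviving replicas disagree: one holds the new value and the other still holds nothing.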

The solution to this problem is quite simple: reliable atomic broadcast. Under atomic broadcast, all writes are all-or-nothing. It eliminates the possibility for a process to fail-stop amid broadcasting.

Now let’s take the above example and update it with additional requirements. Instead of overwriting existing values, we append writes to an array and want to ensure every replica has the same array eventually. The major difference is that each replica needs to wait for acks with higher timestamps before it can append to its array.

This system is also not fault-tolerant. If one replica fail-stops, the others will wait forever on a write, because every replica relies on acks with higher timestamps before committing the append.

Thus, we want to extend the atomic broadcast so that updates become ordered. Under ordered atomic broadcast, writes are all-or-nothing and everyone agrees on the order of updates. If we assume the system to be fully asynchronous, ordered atomic broadcasts are not possible: (1) we can’t guarantee termination under asynchrony; (2) we could lose order. Thus, we rely on the synchronous approach.

Under the synchronous assumption, we can safely say a process fails after waiting for time $t _{fail}$, where $t _{fail} = 3 t _{msg} + 2 t _{turn}$. Here, $t _{msg}$ is the message transfer latency, and $t _{turn}$ is the max time to respond to a message.

To see why $t _{fail}$ is calculated this way, we use the following example to explain the process:

Source: Ken McMillan

Imagine process $p$ sends a message to $q$ and waits for an ack from $q$. Before $q$ is able to respond with the ack, it somehow crashes. The maximum time for $p$ to see an ack from $q$ would be two message transfer times plus the processing time at $q$, which is $2t _{msg} + t _{turn}$. Taking $t$ as the time at which $p$ sends the request, the ack must arrive by $t + 2t _{msg} + t _{turn}$.

Now imagine $q$ is able to broadcast a request right before it fails. Later, another process is able to forward this request back to $p$. Then $p$ needs to wait for three message transfer times plus two message processing times before it can assume that it will no longer receive any message from $q$.
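The two bounds can be written out as plain arithmetic. The function names and the sample latencies below are illustrative assumptions; only the formulas come from the text.

```python
def ack_deadline(t_msg, t_turn):
    """Longest wait for a direct ack: p -> q, q processes, q -> p."""
    return 2 * t_msg + t_turn

def fail_timeout(t_msg, t_turn):
    """Longest wait before p can rule out any further message from q.

    p -> q (t_msg), q processes and broadcasts (t_turn), q -> r (t_msg),
    r processes and forwards (t_turn), r -> p (t_msg).
    """
    return 3 * t_msg + 2 * t_turn

# Example with a 10 ms message latency and a 5 ms turnaround time.
example_ack = ack_deadline(10, 5)
example_fail = fail_timeout(10, 5)
```

With these sample numbers the direct ack is due within 25 ms, while declaring $q$ failed requires waiting 40 ms.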

Under the synchronous assumption, ordered reliable atomic broadcast works as follows:

When a client sends a request to process $p$, the process records logical time $r _l(p,p)$ and physical time $r _{t}(p,p)$. Then it broadcasts the request.
When process $p$ receives a message $m$, and if message $m$ contains a previously unseen timestamp $t_m$, then we record the logical time at which we see message $m$ at $p$, denoted as $r_l(p, p_m)$, as well as the physical time $r _t(p, p_m)$. Then we broadcast the request. Finally, we send an ack back to the originator $p_m$ without updated timestamp.
When process $p$ receives a message $m$, $p$ updates $t _l(p, p_m)$. It means we update $p$’s notion of the latest timestamp of another process who just acked.
Process $p$ sets another process $q$ as “failed” (denoted by $f(p, q)$) if $t _l(p, q) \leq r _l(p, p’) < +\infty$ and $t _t > r _t(p, p’) + t _{fail}$, where $t _t$ is the current physical time. In short, it means that if we broadcast a message and don’t receive any response within time $t _{fail}$, and our last recorded logical time of process $q$ is not later than our broadcast, then we conclude that process $q$ must have failed.
Then we perform updates when, for all $q \neq p$, $r _l(p,p) < r _l (p, q)$ (meaning everyone else’s request time is later than mine) and $t _l(p,q) > r _l(p,p) \lor f(p, q)$. Intuitively, it means we only perform an update when every other process’s request is later than ours, and each of them has either sent us a message after our broadcast or has failed.

Consistency Models Explained

Thu, 23 Sep 2021 00:00:00 +0000

In a distributed system, eventual consistency provides a weak guarantee that data updates will be reflected in all nodes eventually. However, the downside of eventual consistency is that clients could potentially observe awkward intermediate states. For example, a client appending numbers to a list may observe states like [10], [10,13], [10,12,13].

Therefore, we need stronger consistency guarantees, which are easier to reason about. These consistency models provide various degrees of consistency guarantees. However, it’s not always feasible to provide the strongest consistency guarantee. Usually, one needs to trade off consistency for availability and partition resilience (CAP theorem). Much of the content here is attributed to Prof. Ken McMillan.

Global Consistency

The most primitive notion of consistency is global consistency. It means we have some history of events that is totally ordered.

It is (globally) consistent if it respects the intended sequential semantics of the events.

– Kenneth McMillan

Take read/write as an example: we have a sequence of write operations to the same memory location $12$ at various timestamps, and we want to read values from this location:

\[ \texttt{write}(0, 12, 2) \rightarrow \texttt{write}(1, 12, 3) \rightarrow \texttt{read}(2, 12, 3) \]

If we have global consistency, every read at a given memory location $x$ will yield the most recently written value to $x$.
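This rule is easy to check mechanically on a totally ordered history. The sketch below assumes events are tuples `(kind, tag, address, value)` mirroring the notation above, and it treats a read from a never-written location as inconsistent; both choices are assumptions made for illustration.

```python
def globally_consistent(history):
    """Check that every read returns the most recent write to its address.

    `history` is a totally ordered list of (kind, tag, address, value)
    tuples, where kind is "write" or "read".
    """
    memory = {}
    for kind, _tag, addr, value in history:
        if kind == "write":
            memory[addr] = value
        elif memory.get(addr) != value:
            return False  # the read saw something other than the latest write
    return True

# The history from the text: two writes to location 12, then a read of 3.
history = [("write", 0, 12, 2), ("write", 1, 12, 3), ("read", 2, 12, 3)]
history_ok = globally_consistent(history)
```

Changing the final read to return 2 would make the same history inconsistent, because 3 is the most recently written value.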

In reality, it’s impossible to implement a global clock in a distributed system. Therefore, no node can observe the entire history of ordered events, making global consistency hard to implement.

Linearizability

In essence, linearizability is an approximation of global consistency. The difference is that linearizability is based on logical time, as opposed to the physical time used in global consistency. It means we don’t care in what order the events occur physically. We just want the ordering of events to be consistent with what we know about time, and what we know about time is based on causality.

Linearizability has two assumptions:

Clients don’t have a shared clock, but are able to send messages to each other. In other words, we want to create an illusion of global consistency by causality.
If an event $e_1$ ends before another event $e_2$ begins in physical time, then $e_1$ happens-before $e_2$. We define the happens-before relationship as $hb(e_1, e_2)$. In simpler terms, is it possible for us to assume a causal connection between two events?

Take the following scenario as an example. We can say $hb(e_1, e_2)$ holds because we can establish a causality relation between these two events. We are able to assume $P_1$ sent a message $m_1$ to $P_2$, which caused the execution of $e_2$.

Here is another example. Here, we cannot establish a causal connection between $e_1$ and $e_2$ because we cannot assume $m_1$ caused the execution of $e_2$.

To say a set of events, denoted as $E$, is linearizable, the following conditions must be met:

There exists a total order $<_{lin}$ over events $E$ s.t.
- $(E,\ <_{lin})$ is globally consistent.
- $hb(e_1, e_2) \rightarrow e_1 <_{lin} e_2$

In other words, a set of events $E$ is linearizable if there is a total order that respects the happens-before relationship and is globally consistent.

Let’s look at one example. Suppose we have two processes $P_1$ and $P_2$. $P_1$ writes 1 to location 0, and later reads from 0 and gets 1. $P_2$ writes 2 to location 0 before $P_1$ finishes its write.

Source: Ken McMillan

These events are linearizable. We can order these three events as:

\[ \texttt{write}(1, 0, 2) \rightarrow \texttt{write}(0, 0, 1) \rightarrow \texttt{read}(0, 0, 1) \]

We know $\texttt{write}(0, 0, 1)$ happens before $\texttt{read}(0, 0, 1)$. The read gets the most recently written (not physical time, but causality) value, satisfying global consistency. Therefore, these events are linearizable.

Here is another example showing events that are not linearizable.

Source: Ken McMillan

No matter how you order these four events, there will always be a contradiction. For example, $rd(0,0,1)$ happens after $wr(1,0,2)$ and $wr(0,0,1)$. In order to satisfy global consistency requirement (reading the most recently written value), we must order these three events as

\[ wr(1,0,2) \rightarrow wr(0,0,1)\rightarrow rd(0,0,1)\rightarrow\ ? \]

However, $wr(0,0,1)$ happens before $rd(1,0,2)$, so $rd(1,0,2)$ must be put after $wr(0,0,1)$, but that way the most recently written value would be 1 and it would be impossible to read value 2, thus violating global consistency.
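For small histories, the definition can be tested by trying every total order. The sketch below is a brute-force illustration, not an efficient checker. Events are tuples `(kind, process, address, value, start, end)`, and since the original figures are not reproduced here, the start and end times are hypothetical values chosen to match the happens-before relations described in the text.

```python
from itertools import permutations

def globally_consistent(order):
    """Every read returns the most recent write to the same address."""
    memory = {}
    for kind, _proc, addr, value, _start, _end in order:
        if kind == "wr":
            memory[addr] = value
        elif memory.get(addr) != value:
            return False
    return True

def happens_before(a, b):
    """a happens-before b if a ends before b begins in physical time."""
    return a[5] < b[4]

def linearizable(events):
    """Search for a total order that respects hb and is globally consistent."""
    for order in permutations(events):
        position = {event: i for i, event in enumerate(order)}
        respects_hb = all(
            position[a] < position[b]
            for a in events for b in events
            if happens_before(a, b)
        )
        if respects_hb and globally_consistent(order):
            return True
    return False

# First example: the two writes overlap, and the read follows both.
first_example = [
    ("wr", 0, 0, 1, 0, 4),
    ("wr", 1, 0, 2, 1, 3),
    ("rd", 0, 0, 1, 5, 6),
]

# Second example: both writes finish before both reads start.
second_example = [
    ("wr", 1, 0, 2, 0, 2),
    ("wr", 0, 0, 1, 1, 3),
    ("rd", 0, 0, 1, 4, 6),
    ("rd", 1, 0, 2, 4, 6),
]
```

Under these assumed intervals, `linearizable(first_example)` finds the order given in the text, while `linearizable(second_example)` exhausts all 24 orders without success.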

Commit Points

A different and perhaps easier way of thinking about linearizability is using commit points. We say a set of events $E$ is linearizable if every event can be assigned a physical commit time $t_e$ such that:

$t_e$ occurs during the execution of an event $e$.
$(E,\lt _{lin})$ is globally consistent, where $e < _{lin}\ d$ iff $t_e < t_d$

The following picture presents a scenario where we set three commit points on three write/read operations.

Source: Ken McMillan

We know these events are linearizable because the three commit points we picked respect the $\lt_{\textrm{lin}}$ relationship. The commit point for $wr(0,0,1)$ is set after the commit point for $wr(1,0,2)$ and we know $wr(1,0,2)< _{lin}wr(0,0,1)$.

Sequential Consistency

We can relax the requirement of linearizability even more, which leads us to sequential consistency. Sequential consistency (Lamport) is based on slightly different assumptions compared to linearizability:

Assume clients don’t send messages to each other
$hb _{sc}(e_1, e_2)$ only holds if $e _1$ executed before $e _2$ in the same process.

These assumptions indicate that each process doesn’t know the relative order of operations happening on other processes. Thus, we don’t have happens-before arcs between processes.

Take the following example:

Source: Ken McMillan

These events are sequentially consistent under the order given below. The reason is that we can’t say $hb_{sc}(wr(0,0,1), wr(1, 0, 2))$ must hold. This example would not be linearizable because $wr(0,0,1)$ happens before $wr(1,0,2)$ in physical time.

\[ wr(1, 0, 2) \rightarrow wr(0,0,1) \rightarrow rd(0,0,1) \]

Take another example that is not sequentially consistent:

Source: Ken McMillan

For $rd(0,0,2)$ to be true, $wr(0,0,1)$ must be ordered before $wr(1,0,2)$; for $rd(1,0,1)$ to be true, $wr(1,0,2)$ must be ordered before $wr(0,0,1)$. Now we have a cycle of ordering constraints, thus reaching a contradiction.
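Sequential consistency can also be checked by brute force on small examples, by searching for an interleaving that preserves each process’s own order. The sketch below is illustrative: the function name and the event encoding `(kind, process, address, value)` are assumptions, and the two inputs encode the per-process orders implied by the examples above.

```python
def sequentially_consistent(programs):
    """Search for an interleaving of per-process event lists in which
    every read returns the most recent write to its address."""

    def search(indices, memory):
        if all(i == len(prog) for i, prog in zip(indices, programs)):
            return True  # every event has been placed
        for k, prog in enumerate(programs):
            i = indices[k]
            if i == len(prog):
                continue
            kind, _proc, addr, value = prog[i]
            if kind == "rd" and memory.get(addr) != value:
                continue  # this read cannot be scheduled next
            next_memory = dict(memory)
            if kind == "wr":
                next_memory[addr] = value
            next_indices = indices[:k] + (i + 1,) + indices[k + 1:]
            if search(next_indices, next_memory):
                return True
        return False

    return search(tuple(0 for _ in programs), {})

# Process 0 writes 1 and reads 1; process 1 writes 2.
consistent_case = [
    [("wr", 0, 0, 1), ("rd", 0, 0, 1)],
    [("wr", 1, 0, 2)],
]

# Each process writes its own value and then reads the other's.
inconsistent_case = [
    [("wr", 0, 0, 1), ("rd", 0, 0, 2)],
    [("wr", 1, 0, 2), ("rd", 1, 0, 1)],
]
```

The first case succeeds with the interleaving shown earlier; the second fails because the two reads demand opposite orders of the two writes.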

Causal Consistency

Causal consistency is an even weaker consistency model compared to sequential consistency. However, unlike all the consistency models we discussed before, causal consistency only applies to read/write operations. In the causal consistency model, we define a causal order on those read/write operations such that read operations must see writes in an order that respects causality.

Precisely, we define a reads-from map, denoted as $RF$. $RF$ of a read event $e$ produces the write operation that supplied the value read by $e$ (there will be ambiguity if there are two writes writing the same value). For example, $RF(rd(1,0,2))$ will produce a write operation such as $wr(0,0,2)$, which wrote the value $2$ to the same address. Putting $RF$ in formal terms:

\[ RF(rd(p,a,v)) = wr(p',a',v') \rightarrow a=a' \land v = v' \]

In addition, $hb_{RF}$ is the least transitive relation such that:

$hb_{SC}(e, e’) \rightarrow hb_{RF}(e,e’)$
$RF(e’) = e \rightarrow hb_{RF}(e,e’)$. It means whoever gave me the value must happen before me, which represents a notion of causality.

We say a set of events $E$ is causally consistent if there exists a $RF$ map for $E$ such that:

For all reads $e\in E$, there is no write $e’ \in E$ to the same address as $e$ such that $hb_{RF}(RF(e),e’)$ and $hb_{RF}(e’,e)$.

In layman’s terms, it says that if a write operation $w$ supplies the value $x$ that we read, there can’t be another write operation $e’$ that happens after $w$, happens before the read, and writes some value to the same address. If there were such a write operation, then $RF$ would produce $e’$ instead of $w$.

Take a previous example here:

Source: Ken McMillan

These events are causally consistent because $RF(rd(0,0,1)) = wr(0,0,1)$ and $RF(rd(1,0,2)) = wr(1,0,2)$. Thus $hb_{RF}(wr(0,0,1), rd(0,0,1))$ and $hb _{RF}(wr(1,0,2), rd(1,0,2))$. We also know we can’t say $hb _{SC}(wr(0,0,1), wr(1,0,2))$ because sequential consistency assumes no communication between processes. Therefore, $hb _{RF}(wr(0,0,1), wr(1,0,2))$ doesn’t hold, and we can safely say these events are causally consistent.

Let’s look at another example:

Source: Ken McMillan

These events are not causally consistent. This is because $RF(rd(2,0,3)) = wr(0,0,3)$. However, there is a write operation $wr(0,0,1)$ happening after $wr(0,0,3)$ that writes 1 to location 0. Therefore, it can’t be that $wr(0,0,3)$ causes $rd(2,0,3)$ because $wr(0,0,1)$ interferes and creates a contradiction. The easiest way to detect whether a set of events is causally consistent is to see if there is a cycle of dependencies.

S3 Strong Consistency

A consistency model widely used in production systems is the S3 consistency model used in the Amazon S3 storage service. Under S3 consistency, the following hold:

if $hb(w_1, w_2)$ and $hb(w_2, r)$, then $RF(r) \neq w_1$
Two reads must agree on the order of writes that happen before both reads.

Here is an example that is causally consistent, but not S3 consistent:

Source: Ken McMillan

The reason is that $rd(1,0,2)$ sees the write operations as $wr(0,0,1) \rightarrow wr(1,0,2)$ and $rd(0,0,1)$ sees $wr(1,0,2)\rightarrow wr(0,0,1)$.

However, with a slight adjustment to the example, we have S3 consistency.

Source: Ken McMillan

The reason is that $hb(wr(1,0,2),rd(0,0,2))$ doesn’t hold. So even if $rd(1,0,2)$ sees the write operations as $wr(0,0,1) \rightarrow wr(1,0,2)$ and $rd(0,0,1)$ sees $wr(1,0,2)\rightarrow wr(0,0,1)$, only $wr(0,0,1)$ happens before both reads, thus they would agree on the ordering of writes.

Summary

Source: Ken McMillan

The consistency models discussed are only the tip of the iceberg. In fact, different storage service providers usually provide different consistency models. This may result in vendor lock-in because applications designed for one storage system may fall apart when deployed to another due to varying consistency implications.

Lamport Distributed Mutual Exclusion

Tue, 21 Sep 2021 00:00:00 +0000

Normally, having consistent event ordering in a distributed system is hard because we have no common clock. Since we don’t have a common clock to measure with, we rely on logical properties of time in the absence of a clock. Here we use the causality relation between events.

In essence, causality requires that a clock $C$ be a map from events to times satisfying: $e\rightarrow e’$ implies $C(e) < C(e’)$.

We can synthesize a clock with a simple protocol, usually referred to as a scalar clock or Lamport clock:

Each process $p$ has a local clock $C(p)$.
A message sent by a process is stamped with the sender’s local clock.
On receiving $M$, set the process’s local clock to be $max(C(p), C(M)) + 1$.

This gives us an ordering of events that is consistent with causality; breaking ties by process ID turns it into a total order.
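The protocol fits in a few lines. This is a minimal sketch: the class and method names are made up for illustration, and advancing the clock before each send is an assumption that mirrors the `ts := ts.next` step in the Ivy code later in this post.

```python
class LamportClock:
    """A scalar (Lamport) clock for one process."""

    def __init__(self):
        self.time = 0

    def send(self):
        """Advance the clock for a send event and return the timestamp to attach."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Merge an incoming timestamp: max(C(p), C(M)) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time

# A message from p1 to p0: the receive event gets a later timestamp.
p0, p1 = LamportClock(), LamportClock()
t_send = p1.send()
t_recv = p0.receive(t_send)
```

Since a receive always lands strictly after the attached timestamp, a send and its matching receive satisfy $C(e) < C(e’)$.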

Let’s take Lamport distributed mutual exclusion (DME) as an example. We use a scalar clock to agree on the order of access to critical sections. Each process broadcasts a request with its local clock time. Each receiver stores the request time and responds with its updated local time ($max(C(p), C(M)) + 1$).

A process can only enter the critical section when the condition $W$ is met: $W \equiv \forall q \neq p,\ t(p,q) > r(p,p) \land r(p,p) < r(p,q)$. $t(p, q)$ represents the latest time received by $p$ from $q$. $r(p, q)$ is the request time received by $p$ from $q$ or $+\infty$. Intuitively, it says that if a process’s request time is smaller than all response times and smaller than all the other request times, then this process is the first one to send out the request and thus should enter the critical section.
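The condition $W$ translates directly into a predicate. In this sketch, the function name `may_enter` and the dictionary-based representation of $r(p, \cdot)$ and $t(p, \cdot)$ are illustrative assumptions; `math.inf` stands in for $+\infty$ when no request from a process is pending.

```python
import math

def may_enter(p, processes, request_time, latest_time):
    """Evaluate W for process p.

    request_time[q] is r(p, q): the pending request time of q as seen by p,
    or math.inf if q has no pending request.
    latest_time[q] is t(p, q): the latest timestamp p has received from q.
    """
    own = request_time[p]
    return all(
        latest_time[q] > own and own < request_time[q]
        for q in processes
        if q != p
    )

# Process 1 requested at time 2; process 0 has no request, process 2 asked at 5.
processes = [0, 1, 2]
request_time = {0: math.inf, 1: 2, 2: 5}
latest_time = {0: 3, 2: 6}
can_enter = may_enter(1, processes, request_time, latest_time)
```

Here process 1 may enter: both other processes have replied with timestamps later than 2, and neither has an earlier request.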

The reason why this protocol works is illustrated below:

When $p_1$ sends a request at timestamp 2 and gets a response with timestamp 3, we know $p_1$ has the greatest clock value and $p_0$ will update its own clock based on the timestamp sent from $p_1$. Once $p_1$ sees the response message from $p_0$ with timestamp 3, it knows any earlier request from $p_0$ must have already been received, because the network channel is ordered and any request sent by $p_0$ before the response with timestamp 3 has already arrived.

To see Lamport DME in action, we use Ivy to specify the protocol. The source file is borrowed from Ken’s presentation. The code is annotated and self-explanatory:

#lang ivy1.8

# This is an implementation of Lamport's distributed mutual exclusion
# (DME) algorithm.

include order
include network

# We start the module with a 'global' section. This contains the
# declarations of any resources that are used in common by all
# processes. These usually include:
#
# - Data types
# - Services, such as network services
# - Immutable global parameters, such as network addresses
#
# We can't have mutable global variables, since processes, being
# distributed, don't have shared memory.
#

global {

 # Our first global data type is the type of host identifiers. We
 # will have one process for each value of this type. Host
 # identifiers take on integer values from `0` to `host_id.max`.
 # We create the host identifier type by instantiating the
 # `iterable` module. It has a parameter `max` that gives the
 # maximum value of the type (and is supplied at run time).

 instance host_id : iterable

 # Since we have three kinds of messages in our protocol, we define
 # an enumerated type for the message kind with three symbolic
 # values.

 type msg_kind = {request_kind,reply_kind,release_kind}

 # In addition, we use a sequence type to represent timestamps. The
 # `unbounded_sequence` template in the `order` library gives a
 # discrete totally ordered type with a least value `0` and a
 # `next` operator.

 instance timestamp : unbounded_sequence

 # Our messages are structs with three fields: the message kind, the
 # host identifier of the sender, and a timestamp. We order messages
 # according to the timestamp. This ordering is useful in the proof
 # of correctness.

 class msg_t = {
 field kind : msg_kind
 field sender_id : host_id
 field ts : timestamp
 # definition (M1:msg_t < M2:msg_t) = ts(M1) < ts(M2)
 }

 # A useful enumerated type to describe node state:

 type state_t = {idle,waiting,critical}

 # Finally we instantiate a network service via which our processes
 # will communicate. Here, `tcp.net` is a template defined in the
 # `network` library that we included above. The template takes one
 # parameter, which is the type of messages to be sent. Our instance
 # of this template is an object called `net`.

 instance net : tcp.net(msg_t)
}


# After the global section, we introduce some distributed processes.
# A process with parameters has one instance for each value of the
# parameters. In this case we have one parameter of type `host_id`
# which means there is one process in the system for each value of
# `host_id` in the range `0..host_id.max`. The parameter is named `self`.
# This means that the process can refer to its own host identifier by
# the name `self`.

process node(self:host_id) = {

 # A process usually begins by declaring an *interface*. This
 # consists of a set of *actions* that are either calls in from the
 # environment (exports) or calls out to the environment (imports).

 # Our action is an export `request`, which our client uses to
 # request to enter the critical section. It takes no parameters.

 export action request

 # Our second action is an import `grant`. This is a callback to
 # the client indicating that it is safe to enter the critical
 # section.

 import action grant

 # Our third action is an export `release`. This is called by the
 # client when exiting the critical section, indicating it is safe for
 # another process to enter.

 export action release



 common {
 specification {

 var client_state(H:host_id) : state_t

 after init {
 client_state(H) := idle;
 }

 before request(self:host_id) {
 require client_state(self) = idle;
 client_state(self) := waiting;
 }

 before grant(self:host_id) {
 require client_state(self) = waiting;
 require client_state(X) ~= critical;
 client_state(self) := critical;
 }

 before release(self:host_id) {
 require client_state(self) = critical;
 client_state(self) := idle;
 }

 }
 }

 implementation {

 # Next we declare per-process objects. Each process needs a socket
 # on network `net` in order to communicate. We declare the socket
 # here. The socket `sock` is an instance of the template `socket`
 # declared by the network service `net`.

 instance sock : net.socket

 # We also declare some local (per-process) types and variables.

 var state : state_t

 # We also keep track of the current timestamp

 var ts : timestamp

 # Each process maintains a 'request queue', which is a map from host_ids to
 # the timestamp of the current request from that host, or `0` if none.

 var request_ts(X:host_id) : timestamp

 # This map records the highest timestamp of a reply received from
 # each host.

 var reply_ts(X:host_id) : timestamp

 # Having declared our variables, we initialize them. Code in an
 # `after init` section runs on initialization of the process. You
 # aren't allowed to do much here, just assign values to local
 # variables.

 after init {
 state := idle;
 ts := 0;
 request_ts(X) := 0;
 reply_ts(X) := 0;
 }

 # Now we come to the implementation code. Here we implement our
 # exported actions, if any, and also any callback actions from the
 # services we use (i.e., actions that these services import from
 # us).

 # We start with the `request` action. This builds a request message,
 # appends it to the request queue, and broadcasts it. The action `broadcast` is
 # a local action (i.e., a subroutine) and is defined later.

 implement request {
 ts := ts.next;
 var outgoing : msg_t;
 outgoing.kind := request_kind;
 outgoing.sender_id := self;
 outgoing.ts := ts;
 broadcast(outgoing);
 request_ts(self) := ts;
 state := waiting;
 # BUG: should check waiting condition here, if host_id.max = 0
 }

 # Next we implement the callback `recv` from our network socket,
 # indicating we have an incoming message. This is called
 # `sock.recv`. It gives us as input parameters the network address
 # of the sending socket (not useful here) and the incoming
 # message.


 implement sock.recv(src:tcp.endpoint,incoming:msg_t) {

 # debug "recv" with self = self, src = src, msg = incoming;

 # First, we update our timestamp to reflect the incoming
 # message.

 ts := timestamp.max2(incoming.ts,ts).next;

 # We partly construct an outgoing message

 var outgoing : msg_t;
 outgoing.sender_id := self;
 outgoing.ts := ts;

 # What we do here depends on the kind of message.

 # When we receive a `request` message, we put it on our request queue,
 # and return a reply message to the sender.

 if incoming.kind = request_kind {
 outgoing.kind := reply_kind;
 request_ts(incoming.sender_id) := incoming.ts;
 unicast(outgoing,incoming.sender_id);
 }

 # When we receive a `release` message, the sender's request
 # must be at the head of our queue. We dequeue it.

 else if incoming.kind = release_kind {
 request_ts(incoming.sender_id) := 0;

 }

 # On a reply, we update the highest timestamp received from
 # this sender. Because of in-order delivery, the timestamps
 # are received in increasing order, so the incoming one must
 # be the greatest so far.

 else if incoming.kind = reply_kind {
 reply_ts(incoming.sender_id) := incoming.ts;
 }

 # Having processed the incoming message, we might now be able
 # to enter our critical section. We do this if:
 #
 # - We are in the waiting state
 # - Our request message has the least timestamp in lexicographic order
 # - Every host has sent a reply later than our request

 # debug "waiting" with self = self, rq = request_ts(X), ts = reply_ts(X);

 if state = waiting
 & forall X. X ~= self ->
 (request_ts(X) = 0 | lexord(request_ts(self),self,request_ts(X),X))
 & reply_ts(X) > request_ts(self)
 {
 state := critical;
 grant;
 }
 }

 implement release {
 ts := ts.next;
 request_ts(self) := 0;
 var outgoing : msg_t;
 outgoing.sender_id := self;
 outgoing.ts := ts;
 outgoing.kind := release_kind;
 broadcast(outgoing);
 state := idle;
 }

 # At the end, we have definitions of internal (non-interface)
 # actions (in other words, subroutines) and functions (i.e., pure
 # functions).

 # This function takes two timestamp-host_id pairs and determines
 # whether (X1,Y1) < (X2,Y2) in lexicographic order.

 function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
 X1 < X2 | X1 = X2 & Y1 < Y2


 # The action `unicast` sends a message to just one process.
 # To actually send a message to a socket, we call the `send` action
 # of our socket, giving it the receiving socket's network address
 # and the message to be sent. Notice we can get the network
 # address of process with identifier `idx` with the expression
 # `node(idx).sock.id`. This might seem odd, as we are asking for
 # the local state of an object in another process. This is allowed
 # because the network addresses of the sockets are immutable
 # parameters that are determined at initialization and are
 # provided to all processes.

 action unicast(outgoing:msg_t, dst_id : host_id) = {
 # debug "send" with dst = dst_id, msg = outgoing;
 sock.send(node(dst_id).sock.id,outgoing);
 }

 # Action `broadcast` sends a message to all processes with
 # identifiers not equal to `self`. We use a 'for' loop to
 # iterate over the type `host_id`. The 'for' construct defines
 # two variables:
 #
 # - `it` is an 'iterator' of type `host_id.iter`
 # - `dst_id` is the value of the type the iterator refers to
 #
 # The reason we do it this way is that the finite subrange type
 # `host_id` has no value that is 'past the end' of the type, so
 # you can't write a traditional 'for' loop over this type. The
 # iterator type, however, does have a value corresponding to
 # 'past the end'.

 action broadcast(outgoing:msg_t) = {
 for it,dst_id in host_id.iter {
 # do not send to self!
 if dst_id ~= self {
 unicast(outgoing, dst_id);
 }
 }
 }
 }
}

# To compile and run with 3 nodes:
#
# $ ivyc lamport_mutex.ivy
# $ ivy_launch host_id.max=3
#
# To test:
#
# $ ivyc target=test lamport_mutex.ivy
# $ ivy_launch host_id.max=3
#
# Bounded model checking:
#
# TODO: As usual, we need the assumption that all endpoint id's are
# distinct.

axiom node(X).sock.id = node(Y).sock.id -> X = Y

# This says to try bounded model checking up to 20 steps (but Ivy
# won't actually get that far). The second parameter says to unroll the
# loops three times. This means that BMC ignores all executions in
# which a loop is executed more than three times. We need this because of
# the loop in `node.broadcast`

attribute method = bmc[20][3]

# Try adding a bug and see if you can find it with testing and bmc. Change
# this definition above:
#
# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
# X1 < X2 | X1 = X2 & Y1 < Y2
#
# to this:
#
# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
# X1 <= X2 | X1 = X2 & Y1 < Y2
#
# This mistake could allow two nodes with requests with the same timestamp
# to enter the CS at the same time. Here's a counter-example produced
# by BMC (it takes a while!):
#
# > node.request(1)
# > node.request(0)
# > node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:1,msg_t.ts:1})
# > node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:0,msg_t.ts:1})
# > node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:0,msg_t.ts:2})
# < node.enter_cs(1)
# > node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:1,msg_t.ts:2})
# < node.enter_cs(0)
# lamport_mutex_save.ivy: line 137: error: assertion failed
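
To see why the `<=` variant is a bug, here is a minimal Python sketch (not Ivy) of both definitions. The function names and the sample values are assumptions made for illustration; the two requests share timestamp 1, as in the counterexample trace above. With the buggy version, each request claims to precede the other, so both hosts can pass the "least timestamp" check.

```python
def lexord(x1, y1, x2, y2):
    """Correct order: (x1, y1) < (x2, y2) lexicographically."""
    return x1 < x2 or (x1 == x2 and y1 < y2)


def lexord_buggy(x1, y1, x2, y2):
    """The buggy variant from the text, with `<=` on the timestamps."""
    return x1 <= x2 or (x1 == x2 and y1 < y2)


# Two requests with the same timestamp, sent by hosts 0 and 1.
ts, host_a, host_b = 1, 0, 1

# Correct version: exactly one of the two requests is ordered first.
print(lexord(ts, host_a, ts, host_b), lexord(ts, host_b, ts, host_a))

# Buggy version: each request is considered smaller than the other.
print(lexord_buggy(ts, host_a, ts, host_b), lexord_buggy(ts, host_b, ts, host_a))
```

The correct definition agrees with Python's built-in tuple comparison, `(x1, y1) < (x2, y2)`, which is a convenient cross-check.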

Specifying Token Ring for Mutual Exclusion

Sat, 11 Sep 2021 00:00:00 +0000

Mutual exclusion is a term that appears frequently in computer science. In essence, it’s a concurrency control mechanism that allows exclusive access to some resource (or “critical region”). Token passing is an algorithm for distributed mutual exclusion (DME) and will be our focus in this post.

DME specifications usually make the following assumptions:

The network delivers messages in order, e.g. TCP (sometimes)
Every message is eventually delivered (usually)
Messages are never duplicated. Duplication may result in granting a resource to multiple clients, which is exactly what mutual exclusion forbids (usually)

Things we might want to guarantee for DME specifications are:

Mutual exclusion. At most one client is in a critical section (always)
Non-starvation. A requesting client eventually enters its critical section (usually)
Non-overtaking. A client cannot enter its critical section more than once while another client waits (usually)

In addition, we need to analyze DME algorithms’ performance metrics, which usually include:

Message complexity, e.g. the number of messages sent for each client served
Response time, or the time between a request and entering the CS
Throughput, or the rate of processing CS requests

Let’s take a token ring as an example. In a token ring, a client holds a token and then sends it to the next one after exiting its critical section. When we make assumptions about a token ring, we

do not need the network to deliver messages in order, because at any given time in a token ring there is at most one message in transit.
need every message to be delivered eventually. Otherwise, the system won’t make progress, and we lose the non-starvation guarantee.
need messages to be non-duplicated. Otherwise, a duplicated token violates the fundamental property of this protocol, mutual exclusion.
need clients not to release spuriously. This will become clear later when we demonstrate what happens if a client releases multiple times.

We want to guarantee that

mutual exclusion holds.
non-starvation holds.
non-overtaking holds, because the token passes through every other client in the ring before it returns to the same client.

To analyze token ring performance, we use the performance metrics above (message complexity, response time, and throughput):

Message complexity: when the system is under low load, the message complexity is unbounded, because the token may be passed around the ring an arbitrary number of times while no one enters the critical section. When the system is under high load, the message complexity is 1.
Response time: when the system is under low load, the response time can be up to $N$ message times (where $N$ is the total number of clients). Under high load, the response time is 1 message time.
Throughput: the maximum throughput is 1/(message time + CS time).
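
As a worked illustration with made-up numbers (none of them come from a measurement), assume $N = 10$ clients, a message time of 1 ms, and a CS time of 4 ms. Plugging these into the formulas above gives, roughly:

\[\text{worst-case response time under low load} \approx N \times 1\,\text{ms} = 10\,\text{ms}\]

\[\text{maximum throughput} = \frac{1}{1\,\text{ms} + 4\,\text{ms}} = 200\ \text{CS entries per second}\]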

A naive specification for mutex in Ivy would be:

action grant(v:id)
action release(v:id)

specification {
 var lock(X:id):bool
 after init {
 lock(X) := false;
 }

 before grant {
 require ~lock(X);
 lock(v) := true
 }

 before release {
 require lock(v);
 lock(v) := false
 }
}

To see token ring in action, we use the demo from Ken’s presentation:

#lang ivy1.8

include network
include numbers

global {
 type host_id = {0..max}
 alias token = uint[1]
 instance net : tcp.net(token)
}

process host(self:host_id) = {
 import action grant
 export action release

 specification {
 var lock : bool

 after init {
 lock := false;
 }

 before grant {
 require forall X. ~host(X).lock;
 lock := true;
 }

 before release {
 require lock;
 lock := false;
 }
 }

 implementation {
 instance sock : net.socket

 after init {
 if self = 0 {
 pass
 }
 }

 action pass = {
 var tok : token;
 var next := 0 if self = max else self + 1;
 sock.send(host(next).sock.id, tok);
 }

 implement sock.recv(src:tcp.endpoint, val:token) {
 grant;
 }

 implement release {
 pass;
 }
 }
}

We put the above code into a file called token_ring.ivy and compile it using ivyc token_ring.ivy. Then we launch the program using ivy_launch max=2 token_ring.ivy, which opens three terminal windows.

If we type host.release in the host(1) terminal, host(2) outputs host.grant, which seems to show that the token works properly. However, if we type host.release again in host(1), host.grant shows up again in host(2): a second token has been created, which violates the requirement that there is at most one token in the ring at any given time.

If we execute ivyc target=test token_ring.ivy && ivy_launch max=2 token_ring.ivy, then we see the token ring work properly. The reason is that we have specified requirements for grant and release (require forall X. ~host(X).lock for grant and require lock for release).

grant is an action imported from the environment, so its requirement is a guarantee: whenever grant happens, no client in the network holds the lock. On the other hand, release is an action exported from the system, so its requirement is an assumption about the environment: the tester may only perform release when the host holds the lock. The tester therefore won’t perform release multiple times like we did above, because it cannot violate the require lock requirement.
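
The difference between the manual run and the tester can be reproduced with a small Python model. This is only a sketch under stated assumptions: the `TokenRing` class, the `MutexViolation` exception, and the `obey_assumption` flag are hypothetical names, and the network is reduced to a list of in-flight tokens. It is not how Ivy executes the program.

```python
class MutexViolation(Exception):
    """Raised when a grant would break the `before grant` requirement."""


class TokenRing:
    """Toy model of the token ring, with the specification as a monitor."""

    def __init__(self, n):
        self.n = n
        self.lock = [False] * n   # specification state: host(X).lock
        self.in_flight = []       # destination host of each token on the wire
        self.pass_from(0)         # mirrors `after init`: host 0 passes first

    def pass_from(self, host):
        # `pass`: send a token to the next host, wrapping around at the end
        self.in_flight.append((host + 1) % self.n)

    def deliver(self):
        """Deliver the oldest token; receiving a token triggers `grant`."""
        dst = self.in_flight.pop(0)
        # before grant: require forall X. ~host(X).lock
        if any(self.lock):
            raise MutexViolation(f"grant at host {dst} while a lock is held")
        self.lock[dst] = True
        return dst

    def release(self, host, obey_assumption=True):
        # before release: require lock (an assumption on the environment)
        if obey_assumption and not self.lock[host]:
            raise ValueError(f"host {host} may not release: it holds no lock")
        self.lock[host] = False
        self.pass_from(host)


ring = TokenRing(3)
print("granted:", ring.deliver())        # token from host 0 reaches host 1
ring.release(1)
print("granted:", ring.deliver())        # host 1 passes the token to host 2
ring.release(1, obey_assumption=False)   # spurious second release by host 1
try:
    ring.deliver()                       # a second token arrives at host 2
except MutexViolation as err:
    print("violation:", err)
```

With `obey_assumption=True` (the default), the second `release(1)` is rejected before any extra token is created, which is the role the tester plays in the paragraph above.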

Writing Specifications for a Distributed System using Ivy

Wed, 08 Sep 2021 00:00:00 +0000

Before we jump into writing specifications in a distributed setting, we first define what a specification is. I take the definition from the magnificent Ken McMillan: a specification is a statement.

A statement describes an abstract view of a program. The view itself is often at an interface, which hides or abstracts internal states. A specification is stated in terms of two elements:

Assumption: properties of the environment the system relies on
Guarantee: properties that must hold if the assumption(s) are met

The way we write specifications is through an abstract program that observes or monitors all program events. This abstract program is able to remember the execution history of the program being monitored, and decides, at any given moment, whether an action is allowed according to the specification.

One way to implement this abstract monitor program is to use guarded command form:

Let $A$ be a set of program actions
An event $e(x_1,\ …,\ x_n)$ is an action $e\in A$ with parameter values $x_1,\ …,\ x_n$ of the right types for $e$.
Let $S$ be a set of states and $s_0 \in S$ be the initial state.
Guarded command set $G$ is specified as:

\[e(V):\ \gamma (S,V) \rightarrow \{S := \tau(S, V)\}\]

It means that if the guard $\gamma$ determines that a given event $e$ with parameters $V$ is allowed in state $S$, then we accept the event and deterministically update the state with the function $\tau$.

An observation $E$ of the system is a finite sequence of events, which corresponds to the system behavior, denoted as $e_0(V_0)…e_{n-1}(V_{n-1})$. A run of $E$ is a state sequence $s_0\ …s_n$ such that for $i\in 0\ … n- 1$, $\gamma(s_i, V_i)$ is true and $s_{i+1} = \tau(s_i, V_i)$. Observation $E$ is accepted by the specification iff it has a run. We can test whether an observation is accepted by just executing the guarded commands. In layman’s terms, if every event in the sequence passes the guard of its guarded command at the moment it occurs, then the sequence of events satisfies our specification and is accepted.
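
The acceptance test described above can be sketched in a few lines of Python. The `accepts` function and the toy lock specification below are assumptions made for illustration (they are not part of Ivy): each action maps to a `(guard, update)` pair that plays the roles of $\gamma$ and $\tau$.

```python
def accepts(observation, s0, commands):
    """Return True iff the observation has a run starting from s0.

    `commands` maps an action name to a (guard, update) pair, playing the
    roles of gamma and tau in the guarded command form.
    """
    state = s0
    for action, params in observation:
        guard, update = commands[action]
        if not guard(state, params):
            return False              # no run exists: the event is rejected
        state = update(state, params)
    return True


# Hypothetical specification of a lock; the state is the set of current holders.
commands = {
    "grant": (lambda s, v: len(s) == 0, lambda s, v: s | {v}),
    "release": (lambda s, v: v in s, lambda s, v: s - {v}),
}

ok = [("grant", 1), ("release", 1), ("grant", 2)]
bad = [("grant", 1), ("grant", 2)]
print(accepts(ok, frozenset(), commands))   # True
print(accepts(bad, frozenset(), commands))  # False
```

Because the update is deterministic, an observation has at most one candidate run, so a single left-to-right pass is enough to decide acceptance.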

Now let’s take a replicated file as an example. Our first informal attempt at a specification for the “append” operation would be:

Assumption: the network is ordered and non-duplicating
Guarantee: if there are no further append requests, the replicas eventually become equal

However, the problem with this specification is that it is a liveness property, meaning that we can’t practically test it by observing a finite sequence of events. Therefore, we resort to a different safety specification that we can test:

If all sent messages are delivered, the two replicas are identical.

Here we convert liveness to safety by explicitly defining the moment when the eventuality should hold.

A liveness property means a good thing eventually happens. A liveness property cannot be refuted by any finite execution. A safety property means a bad thing never happens. A safety property can always be refuted by a finite execution.

To see how replicated file specification plays in action, we use the example given in Prof. McMillan’s presentation. The code is written in Ivy and is pretty self-explanatory. In this demo we only have two processes.

To install Ivy, simply execute virtualenv ivyenv && source ivyenv/bin/activate && pip install ms-ivy. This is tested on Ubuntu 18.04 LTS and may vary slightly on other distros.

#lang ivy1.8

include numbers
include collections
include network

global {
 alias byte = uint[8]
 instance file : vector(byte)
 type pid = {0..1}
 instance net : tcp.net(byte)
}

process host(self:pid) = {
 export action append(val:byte)
 import action show(content:file)
 instance sock : net.socket
 var contents : file

 after init{
 contents := file.empty;
 }

 implement append {
 contents := contents.append(val);
 sock.send(host(1-self).sock.id, val);
 show(contents);
 }

 implement sock.recv(src:tcp.endpoint, val:byte) {
 contents := contents.append(val);
 show(contents);
 }
}

Then we form our specification based on the guarantee that if all sent messages are delivered, the two replicas are identical. The specification is equivalent to the guarded command we’ve talked about earlier.

specification {
 var msg_count : nat

 after init {
 msg_count := 0;
 }

 after host.sock.send(self:pid, dst:tcp.endpoint, val:byte) {
 msg_count := msg_count + 1;
 }

 after host.sock.recv(self:pid, src:tcp.endpoint, val:byte) {
 msg_count := msg_count - 1;
 ensure msg_count = 0 -> host(0).contents.eq(host(1).contents);
 }
}

We wrote the above code into a file named append.ivy and we generate the testing code using ivyc target=test append.ivy. Then we run the code using ivy_launch append.ivy.

Interestingly, the program yields an error message:

`ivy_shell`; ./append "[[0,{addr:0x7f000001,port:49124}],[1,{addr:0x7f000001,port:49125}]]"
> host.append(1,251)
< host.show(1,[251])
< host.show(0,[251])
> host.append(1,46)
< host.show(1,[251,46])
> host.append(0,183)
< host.show(0,[251,183])
< host.show(0,[251,183,46])
< host.show(1,[251,46,183])
assertion_failed("append.ivy: line 49")
append.ivy: line 49: error: assertion failed

What happens is that the tester generates tests that randomize message arrival times, and we can see that a message may arrive after its target has already appended a value of its own, so the two replicas append the same values in different orders and their contents diverge.

Notice that here we are actually running on a real network to find counterexamples. The downside is that a test may be arbitrarily long, depending on the randomized test cases. Instead, we can use bounded model checking (BMC) to test whether the specification holds. This way we rely purely on the logic of our specification instead of running on the real network. The Ivy checker uses the Z3 theorem prover.

BMC constructs a boolean formula that is satisfiable if and only if the underlying state transition system can realize a finite sequence of state transitions that reaches certain states of interest.

To tell Ivy to use bounded model checking, we add the following lines to append.ivy:

axiom host(0).sock.id ~= host(1).sock.id

attribute method=bmc[10]

Executing ivy_check detailed=false append.ivy, we see an error message:

> host.append(1,80)
< host.show(1,[80])
> host.append(0,64)
< host.show(0,[64])
> host.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},80)
< host.show(0,[64,80])
> host.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},64)
< host.show(1,[80,64])
append.ivy: line 49: error: assertion failed

Sometimes BMC can help us find the error faster because it is systematically checking all possible actions. However, increasing the number of steps for the BMC can result in the exploration space growing exponentially, so we are going to use some combination of BMC and randomized test cases.
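
To get a feel for what "systematically checking all possible actions" means, here is a rough Python sketch that explores every interleaving of the append protocol up to a step bound. Note the swapped-in technique: this is an explicit-state breadth-first search, not the boolean formula that Ivy builds and hands to Z3. The two-value domain `VALUES`, the state layout, and the function names are assumptions chosen to keep the search small.

```python
from collections import deque

VALUES = (1, 2)  # assumed tiny value domain, just to keep the search finite


def successors(state):
    """Yield (event, next_state, ok) for every action enabled in `state`."""
    contents, chans = state  # contents[p]: file at host p; chans[p]: in flight to p
    for p in (0, 1):
        for v in VALUES:
            # append at host p: extend the local copy and send v to the peer
            c = list(contents)
            c[p] = c[p] + (v,)
            ch = list(chans)
            ch[1 - p] = ch[1 - p] + (v,)
            yield f"append({p},{v})", (tuple(c), tuple(ch)), True
        if chans[p]:
            # recv at host p: in-order delivery of the oldest pending message
            v = chans[p][0]
            c = list(contents)
            c[p] = c[p] + (v,)
            ch = list(chans)
            ch[p] = ch[p][1:]
            msg_count = len(ch[0]) + len(ch[1])
            # ensure msg_count = 0 -> host(0).contents = host(1).contents
            ok = msg_count != 0 or c[0] == c[1]
            yield f"recv({p},{v})", (tuple(c), tuple(ch)), ok


def find_counterexample(bound):
    """Breadth-first search over all traces of at most `bound` events."""
    init = (((), ()), ((), ()))
    queue = deque([(init, [])])
    while queue:
        state, trace = queue.popleft()
        if len(trace) == bound:
            continue
        for event, nxt, ok in successors(state):
            if not ok:
                return trace + [event]
            queue.append((nxt, trace + [event]))
    return None


print(find_counterexample(3))   # None: three events are not enough
print(find_counterexample(10))  # a shortest violating trace, four events long
```

The shortest violation needs two appends on different hosts followed by two receives, matching the shape of the trace that ivy_check reports above. The number of traces grows exponentially with the bound, which is the blow-up mentioned in the previous paragraph.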

Whiz: Data-Driven Analytics Execution

Sun, 05 Sep 2021 00:00:00 +0000

This paper by UTNS lab appeared in NSDI 2021. It presents a data-analytics framework that decouples intermediate data from computations.

Whiz addresses several challenges posed by current analytics frameworks. The first one is data opacity. Most modern data analytics frameworks rely on a MapReduce execution engine. The developer specifies the map and reduce functions, which then get submitted to the analytics framework. The workflow can be expressed as a logical graph; the physical graph (which includes the cluster configuration, disk quota, etc.) is generated transparently. The workflow is shown below:

The problem is in the region marked yellow. It shows the execution engine has limited runtime visibility into the intermediate data. Thus, adapting processing logic of tasks based on the states of intermediate data becomes challenging.

In addition, task parallelism and the intermediate data partitioning strategy are often static. In the graph above, the intermediate data partition tasks and the final reduce tasks might be determined prematurely, without taking the characteristics of the intermediate data into account. For example, data skew (unevenly distributed data) causes different reduce nodes to process different amounts of data. The graph below illustrates how the shuffle stage can result in disproportionate intermediate data partitions.
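
A small Python sketch makes the skew problem concrete. Everything here is hypothetical: the workload (one "hot" key plus thirty rare keys), the four reducers, and the modulo partitioner standing in for a static hash partitioner. It is not Whiz code, only an illustration of why a partitioning fixed ahead of time can go wrong.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count the records a static modulo partitioner sends to each reducer."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts[p] for p in range(num_partitions)]


# Hypothetical skewed workload: key 0 is "hot" (70 records), keys 1..30 appear once.
keys = [0] * 70 + list(range(1, 31))
sizes = partition_sizes(keys, 4)
print(sizes)                          # [77, 8, 8, 7]
print(max(sizes) / (sum(sizes) / 4))  # load of the busiest reducer relative to the mean
```

The partitioner never looks at how many records each key has, so one reducer ends up with most of the work. An engine that can see the intermediate data could split or rebalance the hot partition at runtime.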

Finally, Whiz addresses the limitation posed by compute-driven scheduling. In compute-driven scheduling, one stage usually relies on the completion of the upstream tasks, which may leave compute idle while it waits for the remaining data to become available, even if a subset of workers in the current stage is ready for execution. Decoupling data from computation enables the execution engine to treat intermediate data as a first-class citizen, thus allowing finer-grained control of data processing.

In summary, Whiz solves two problems presented in compute-centric execution engines:

Tight coupling between intermediate data and compute.
Intermediate data agnosticity.

Thus, Whiz creates a feedback loop between the execution service and the data service so that the execution can dynamically adjust its policy based on the information offered by the data service to optimize system performance.

Whiz classifies itself as a data-driven execution engine, which drives execution based on intermediate data properties. Making intermediate data visible opens the door to optimization opportunities, thus increasing performance. For more technical details regarding the architecture and implementation of Whiz, please refer to the original paper.

In-Network Aggregation for Shared Machine Learning Clusters

Tue, 31 Aug 2021 00:00:00 +0000

This paper by Nadeen appeared in MLSys 2021. It presents an in-network aggregation framework called PANAMA for distributed ML training tasks. PANAMA has two components: (1) an in-network hardware accelerator with support for floating-point gradient aggregation; (2) a domain-specific load-balancing and congestion control protocol.

Motivation

The primary motivation behind PANAMA is that data-parallel training (in which the neural network is replicated across $N$ workers, each of which processes a subset of the training data) requires workers to exchange local gradients at every iteration, thus creating a huge amount of traffic.

For example, for a training job with $1000$ workers and a 1 GB DNN model requiring $1000$ iterations, the total traffic will be about 2 PB.
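
The 2 PB figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes that, in every iteration, each worker sends its full 1 GB of gradients and receives 1 GB of aggregated results, and that 1 PB = 1,000,000 GB. Both are my assumptions about how the number was derived, not something stated above.

```python
workers = 1000
model_size_gb = 1
iterations = 1000

# Assumption: each worker sends 1 GB of gradients and receives 1 GB of
# aggregated results per iteration, hence the factor of 2.
per_iteration_gb = workers * model_size_gb * 2
total_gb = per_iteration_gb * iterations
total_pb = total_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB

print(total_pb)  # 2.0
```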

Network Design

The paper assumes a traditional data center multi-tier folded Clos topology:

PANAMA uses multiple aggregation trees per training job to spread the traffic across multiple paths and avoid congestion hotspots. This differs from the equal-cost multi-path (ECMP) protocol because aggregation flows are typically large: binding such a flow to a single aggregation tree would create network imbalance.

Congestion Control

PANAMA uses implicit acknowledgments instead of traditional point-to-point approaches. Because aggregated packets are constructed on the fly, a one-to-one mapping between packets and acknowledgments is unnecessary: if a worker receives an aggregation result, that automatically serves as an implicit acknowledgment. This eliminates the need to keep per-flow congestion state at PSwitches.

Similar to DCTCP, PANAMA relies on ECN marks in the IP header to react to network congestion. Since aggregation packets are created on the switch, each hardware accelerator needs to perform a bitwise $OR$ on the ECN fields of the received packets to mirror the traditional ECN bit.
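
The ECN handling can be pictured with a tiny Python sketch. The `GradientPacket` type and the `aggregate` function are hypothetical, and the ECN field is simplified to a single congestion flag. The point is only that the aggregated packet sums the gradient values and carries the OR of the incoming congestion marks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientPacket:
    values: List[float]  # gradient chunk carried by this packet
    ecn: int             # simplified ECN field: 1 if congestion was marked, else 0


def aggregate(packets):
    """Build one aggregated packet from the packets of all workers."""
    length = len(packets[0].values)
    # Element-wise sum of the gradients
    summed = [sum(p.values[i] for p in packets) for i in range(length)]
    # Bitwise OR of the ECN marks: one marked input marks the output
    ecn = 0
    for p in packets:
        ecn |= p.ecn
    return GradientPacket(summed, ecn)


result = aggregate([
    GradientPacket([1.0, 2.0], ecn=0),
    GradientPacket([0.5, 0.5], ecn=1),   # this packet saw congestion
    GradientPacket([1.5, -1.0], ecn=0),
])
print(result.values, result.ecn)  # [3.0, 1.5] 1
```

If any input packet crossed a congested link, the workers that receive the aggregated result still see the mark and can slow down, even though the original packets no longer exist.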

Hardware Design

The design of the aggregation accelerator in PANAMA is straightforward: it uses a SIMD architecture in which the gradients are partitioned across adder trees. The adder trees operate in parallel, pack the results, and send them to the output ports. The VID fields are used only to ensure correct aggregation.

Overall, the workflow is really simple and illustrated below:

Deploy Hugo Site to GitHub Pages

Fri, 27 Aug 2021 00:00:00 +0000

Update: The official guide from Hugo is for deploying from a public repo. This post is intended for deploying from a private repo.

This post assumes the user has already set up two separate repositories: a private repository for Hugo source files, and a public repository for GitHub Pages.

Note: test Hugo site by executing hugo server in the source code directory to make sure the site is generated properly.

Then, we need to generate a pair of keys by using the following command:

ssh-keygen -t rsa -b 4096 -C "$(git config user.email)" -f deployment -N ""

This will create two files: deployment and deployment.pub, which correspond to a private key and a public key.

Next, execute cat deployment and copy the private key. Navigate to the private source repository -> Settings -> Secrets -> New repository secret. Paste the private key and save the change.

I’ve already added the private key to the source repository and named it PRIVATE_KEY. You can name it however you want, as long as the workflow file below refers to the same name.

Then, we go to the public repository for hosting our website. Navigate to the public site repository -> Settings -> Deploy keys -> Add deploy key. Execute cat deployment.pub and copy and paste the result. You should see an SSH key added:

Next, create a workflow file in the private repository at the following path: .github/workflows/deploy.yml.

name: github pages

on:
  push:
    branches:
      - main  # Set a branch to deploy
  pull_request:

jobs:
  deploy:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: true  # Fetch Hugo themes (true OR recursive)
          fetch-depth: 0    # Fetch all history for .GitInfo and .Lastmod

      - name: Setup Hugo
        uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
          extended: true

      - name: Build
        run: hugo --minify

      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          deploy_key: ${{ secrets.PRIVATE_KEY }}
          external_repository: your_username/public_repository_name
          publish_branch: branch_to_publish
          publish_dir: ./public

Finally, make sure you create a file named .nojekyll in the root directory of the public repository to prevent GitHub Pages from building the site using Jekyll.

Every time you make commits to the private repository, the site will be automatically generated and published on the public repository.

Quantum State in a Nutshell

Thu, 19 Aug 2021 00:00:00 +0000

There are thousands of articles trying to explain what exactly a quantum state is. Many of them boil down to “the state of a qubit is 0, 1, or 0 and 1 at the same time”. This statement leads to both confusion and misinterpretation. The explanation I found in Quantum computing for the very curious is by far the most elegant and the simplest:

The state of a qubit is a vector in a two-dimensional vector space. This vector space is known as state space.

I will draw on much of the great content from Quantum computing for the very curious to explain things.

Mapping qubits to classical bits

We’ve described what a qubit state is, but provided no link between a qubit state and a classical bit state. There are two possible states for a classical bit: 0 and 1. The corresponding states for a qubit are slightly fancier: $|0\rangle $ and $|1\rangle $.

The notation with $|$ and $\rangle$ is called $ket$ notation, and expressions such as $|0\rangle$ and $|1\rangle$ are called $kets$. A $ket$ is a fancy term for a vector. In fact, $|0\rangle$ is really just $ \begin{bmatrix} 1 \newline 0 \end{bmatrix} $; $|1\rangle$ is really just $ \begin{bmatrix} 0 \newline 1 \end{bmatrix} $.

States between 0 and 1

The states $|0\rangle$ and $|1\rangle$ are called computational basis states, and they map to the classical 0 and 1 states. There are more states for a qubit. We’ve already learned that a quantum state is a two-dimensional vector. An example is $0.6|0\rangle + 0.8|1\rangle$.

Not all linear combinations of the vectors $|0\rangle$ and $|1\rangle$ are qubit states. There is one constraint: the sum of the squares of the amplitudes must be 1. For example, we can compute $0.6^2 + 0.8^2$ and verify the result is 1.

For general quantum states, the amplitudes can be complex numbers as well. Denoting both amplitudes as $\alpha$ and $\beta$, a quantum state can be formally written as:

\[\alpha |0\rangle + \beta |1\rangle \quad \text{where} \quad |\alpha|^2 + |\beta|^2 = 1\]

$|\alpha|^2 + |\beta|^2 = 1$ is called the normalization constraint.

If we think of $|0\rangle$ and $|1\rangle$ as orthonormal vectors, we can visualize the possible linear combinations of these two vectors (with real amplitudes) as a circle of radius 1:

Since amplitudes can be complex numbers, the state space really becomes a sphere. Summing all these up:

the quantum state of a qubit is a vector of unit length in a two-dimensional complex vector space known as state space.

– Quantum computing for the very curious

Measuring a qubit

Suppose we have a qubit in the quantum state $\alpha |0\rangle + \beta |1\rangle$. We want to observe the state of this specific qubit. It turns out the laws of physics prohibit us from figuring out the amplitudes $\alpha$ and $\beta$ if they start out unknown. In short, the quantum state of any system is not directly observable.

To learn anything about the quantum state, we rely on a process called measurement in the computational basis. Suppose a qubit is in the state $\alpha |0\rangle + \beta |1\rangle$. Measuring the state of this qubit gives us the outcome $0$ with probability $|\alpha|^2$, or $1$ with probability $|\beta|^2$. The state of the qubit after the measurement is thus either $|0\rangle$ or $|1\rangle$. After the measurement, $\alpha$ and $\beta$ are gone.
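
Measurement is easy to mimic classically for a single qubit. The Python sketch below is only an illustration under simple assumptions: `is_normalized` and `measure` are hypothetical helper names, amplitudes are ordinary Python numbers (possibly complex), and a pseudo-random generator stands in for the physical randomness.

```python
import random


def is_normalized(alpha, beta, tol=1e-9):
    """Check the normalization constraint |alpha|^2 + |beta|^2 = 1."""
    return abs(abs(alpha) ** 2 + abs(beta) ** 2 - 1) < tol


def measure(alpha, beta, rng):
    """Measure in the computational basis.

    Returns the outcome (0 or 1) and the amplitudes of the state that is
    left afterwards; the original alpha and beta are not recoverable.
    """
    if rng.random() < abs(alpha) ** 2:
        return 0, (1, 0)
    return 1, (0, 1)


alpha, beta = 0.6, 0.8
print(is_normalized(alpha, beta))   # True
print(is_normalized(0.6j, 0.8))     # True: complex amplitudes work too

rng = random.Random(0)
outcomes = [measure(alpha, beta, rng)[0] for _ in range(10_000)]
print(outcomes.count(0) / len(outcomes))  # close to |alpha|^2 = 0.36
```

A single measurement yields just one bit. Only by preparing the same state many times and measuring each copy can we estimate $|\alpha|^2$ and $|\beta|^2$, and even then any phase information in the amplitudes stays hidden.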

Writing in the Sciences - Writing Process

Mon, 09 Aug 2021 00:00:00 +0000

This post covers the topics mentioned in Writing in the Sciences offered on Coursera.

Writing Process

The writing process includes three steps:

Prewriting
- Collect and organize information
- Brainstorm take-home messages
- Work out ideas away from the computer
- Develop a road map
Writing the first draft
- Putting ideas together in organized prose
Revision
- Read out loud
- Cut the clutter
- Verb check
- Get feedback

A lot of people conflate steps 2 and 3. They try to write and revise at the same time, which is anything but efficient. It’s hard to resist the impulse to be perfect. Paying too much attention to details obscures the whole picture. Unsurprisingly, the class poll shows most people focus on the writing step:

A better time allocation would look like:

Prewriting (70%)
Writing the first draft (10%)
Revision (20%)

The Prewriting

The key to prewriting is to get organized first. What it means is you shouldn’t try to write and gather information simultaneously. Instead, you should gather and organize information before writing the first draft. That means you need to have an organization system to help you keep track of various thoughts. I personally found writing this blog a really good way to keep myself motivated but there are definitely alternatives.

Compositional Organization

Here are some simple tips to help organize ideas:

Group like ideas/paragraphs together, which often reveals unnecessary repetition.
Don’t ‘‘bait-and-switch’’ your readers. Switching arguments too many times leads to confusion.
- When discussing a controversy, follow the arguments -> counter-arguments -> rebuttals pattern.

The Writing

This is the hardest step for most people. This is where you pop up a blank window and start writing. The biggest tip is to not be a perfectionist. The first draft should aim to get the ideas down in complete sentences, in order. You should even purposefully set a low bar to get the first draft out quickly.

Focus on logical organization more than sentence-level details.

The recommended order for writing a manuscript is:

Tables and Figures
Results
- Summarize what the data show by (1) pointing out simple relationships; (2) describing big-picture trends; (3) citing figures or tables that present supporting data.
- Avoid simply repeating the numbers already available in tables and figures.
Methods
Introduction
Discussion
Abstract

Steps 1 to 3 involve the most concrete things to put down. They help you frame the introduction.

Tips for Writing Results

Here are a few tips for writing results:

Break into subsections, with headings
Complement the information that is already in tables and figures
- Give precise values that are not available in the figure
- Report the percent change or percent difference if absolute values are given in the table
Repeat/highlight only the most important numbers
Talk about negative and control results
Reserve the term ‘‘significant’’ for statistically significant
Don’t mix results with methods
- Don’t discuss the rationale for statistical analyses within the Results section
Reserve comments on the meaning of your results for the discussion section. (show vs meaning)

Writing Introductions

The good news is that the introduction is easier to write than you may realize. Typically, the recommended range for an introduction is 2 to 5 paragraphs long. The introduction forms a cone structure:

The idea is to start from something general, then quickly narrow down to your specific study. So an introduction starts from some general background, then to what’s unknown. Then we narrow down to our hypothesis. In summary, the introduction is divided into:

What’s known
What’s unknown
Your burning question
Your experimental approach
Why your experimental approach is new and different and important

The structure corresponds to roughly 3 paragraphs: step 1 = paragraph 1; step 2 = paragraph 2; steps 3-5 = paragraph 3.

Some of the tips for writing an introduction include:

Keep paragraphs short
Write for a general audience
Known -> Unknown -> Hypothesis
Emphasize the unknown
Be explicit about your research hypothesis: ‘‘We asked whether’’; ‘‘Our aim/s were’’
Do not answer the research question

The Revision

To my surprise, the first big tip is to read your writing out loud, because the brain processes the spoken word differently than the written word.

The second tip is to do a verb check. You should underline the main verb in each sentence, and watch out for:

Lackluster verbs (e.g. There are …)
Passive verbs (e.g. Something was observed by me.)
Buried verbs (e.g. A careful monitoring of achievement levels before and after the introduction of computers in the teaching of our course revealed no appreciable change in students’ performances.)

Some words should also be cut out:

Dead weight words
Empty words
Long words that can be short

In addition, watch for

Unnecessary jargon and acronyms
Repetitive words
Adverbs

Most of these tips are already covered in Cut the Clutter and Verbs.

The next tip is to do an organizational review. For example, you can tag each paragraph with a phrase or sentence in the margin that sums up its main point. Then you can move paragraphs around to improve logical flow and bring similar ideas together.

Another interesting tip is to get feedback, especially from people without any technical background. Ask whether they can grasp the main findings or significance of the work, and which sentences and paragraphs they found hard to read. If an average Joe can understand your paper, the chances that people in your field can understand it are much higher.

More Tips

Use past tense for completed actions (e.g. We found that…)
Use the present tense for assertions that continue to be true, such as what the tables show, what you believe, and what the data suggest (e.g. Figure 2 shows…)

Other notes including Cut the Clutter, Verbs, and Structure are also available.

Writing in the Sciences - Structure

Sun, 08 Aug 2021 00:00:00 +0000

This post covers how to improve sentence structure and builds up to writing strong paragraphs. Most of the content comes from the Writing in the Sciences course offered on Coursera.

Punctuation

Here is the list of punctuation marks ranked by their power to separate, from weakest to strongest:

Comma (,)
Colon (:)
Dash (-)
Parentheses ( () )
Semicolon (;)
Period (.)

The formality of these punctuation marks is ranked, from least to most formal, as:

Dash (-)
Parentheses ( () )
The others (comma (,), colon (:), semicolon (;), period (.))

A dash is a mark of separation stronger than a comma, less formal than a colon, and more relaxed than parentheses.

– Strunk and White

Semicolon

A semicolon connects two independent clauses (a clause always contains a subject and a predicate; an independent clause can stand alone as a complete sentence).

Here is an example: ‘‘It was the best of times; it was the worst of times.’’

Semicolons can also be used to separate items in lists that contain internal punctuation. If some clauses contain commas, the comma inside the clause is no longer sufficient to separate different items in a list, because we don’t know where the boundaries are.

Parenthesis

Parentheses are used to insert an afterthought or explanation into a passage that is grammatically complete without it.

Colon

Colons are used after an independent clause to introduce a list, quote, explanation, conclusion, or amplification.

The colon has more effect than the comma, less power to separate than the semicolon, and more formality than the dash.

– Strunk and White

Dash

A dash can add emphasis or insert an abrupt definition or description almost anywhere in the sentence.

Use a dash only when a more common mark of punctuation seems inadequate.

– Strunk and White

Here is an example illustrating how a dash emphasizes and adds information: ‘‘Researchers who study shipworms say these mislabeled animals–they are clams, not worms–are actually a scientific treasure’’.

I like the example provided in the class to illustrate how to use a dash to join and condense sentences. The original sentences are:

Finally, the lessons of clinical epidemiology are not meant to be limited to academic physician-epidemiologists, who sometimes have more interest in analyzing data than caring for patients. Clinical epidemiology holds the promise of providing clinicians with the tools necessary to improve the outcomes of their patients.

By using dashes, we can join these two sentences while keeping the description of physician-epidemiologists:

Finally, clinical epidemiology is not limited to academic physician-epidemiologists–who are sometimes more interested in analyzing data than caring for patients–but provides clinicians with tools to improve their patients’ outcomes.

Parallelism

It is often better–in scientific writing–to write pairs of ideas joined by ‘‘and’’, ‘‘or’’, or ‘‘but’’ in parallel form.

Here is an example sentence with a list of things in parallel form: ‘‘NASA’s intrepid Mars rover, Curiosity, has been through a lot in the past year. It flew 354 million miles, blasted through the Mars atmosphere, deployed a supersonic parachute, unfurled a giant sky crane, and touched down gently on the surface of Mars’’.

Paragraph

There are several tips for writing paragraphs:

1 paragraph = 1 idea
Give away the punch line early. Scientists like to present details, details, details, data, and then the conclusion, which is a nightmare for readers. Invert the order!
Paragraph flow is helped by:
- logical flow of ideas. Fewer pointers improve readability. Ideas can flow:
  - sequentially in time
  - from general to specific
  - as logical arguments (if else)
- parallel sentence structures
- if necessary, transition words.
Readers remember the first and the last sentence best.

Repetition

It’s ok to repeat a word. Ask yourself whether the second instance of the word is necessary. If the word is needed, is a synonym really better than repeating the word? Using synonyms–especially in scientific writing–may lead readers to think you are referring to a different instrument, model, etc.

Other notes including Cut the Clutter, Verbs, and Writing Process are also available.

Writing in the Sciences - Verbs

Sat, 31 Jul 2021 00:00:00 +0000

This is an overview of the second chapter of Writing in the Sciences offered by Stanford. This chapter focuses on writing with strong, active verbs. Lessons include how to:

write in the active voice
avoid turning verbs into nouns
choose strong verbs
get to the main verb of a sentence quickly

Active Voice

There are three advantages of using active voice:

Emphasizes author responsibility
Improves readability
Reduces ambiguity

Author responsibility

Here is an example sentence: ‘‘No attempt was made to contact non-responders because they were deemed unimportant to the analysis’’. When we put it in the active voice, we get ‘‘We did not attempt to contact non-responders because we deemed them unimportant to the analysis’’. The active voice version places more emphasis on the role of the authors in the decision making, subtly indicating human judgement and potential fallibility.

Readability

Putting sentences into the active voice often makes them more direct. For example, putting ‘‘a strong correlation was found between use of passive voice and other sins of writing’’ into the active voice yields ‘‘We found a strong correlation between use of the passive voice and other sins of writing’’.

Ambiguity

The example sentence is: ‘‘General dysfunction of the immune system at the leukocyte level is suggested by both animal and human studies’’. Turning the sentence into the active voice gives: ‘‘Both human and animal studies suggest that diabetics have general immune dysfunction at the leukocyte level’’. A sentence in the form agent - verb - recipient forces us to be more specific, thus reducing the ambiguity of the sentence.

It is important to point out that passive voice may be appropriate in the methods section where what was done is more important than who did it.

After all, human agents are responsible for designing experiments, and they are present in the laboratory. Writing awkward phrases to avoid admitting their responsibility and their presence is an odd way of being objective.

– Jane J. Robinson, Science 7 June 1957: 1160.

Write with Verbs

Verbs with Embedded Meaning

For example, a phrase like ‘‘reports that approximately’’ can be shortened to ‘‘estimates’’, which carries ‘‘approximately’’ as its embedded meaning. Such verbs can make a big difference in sentences.

Avoid ‘‘to be’’ verbs

These verbs are rather boring. Replacing ‘‘to be’’ verbs with stronger ones makes sentences livelier.

Don’t Turn Verbs into Nouns

Nouns slow readers down because they lack action. Turning nouns back into verbs gives a clearer picture of what is going on. It has the bonus of avoiding ambiguity.

Turning verbs into nouns sometimes leads to the use of weaker verbs. For example, ‘‘decide’’ can be transformed into ‘‘make a decision’’, where ‘‘make’’ is a much weaker verb than ‘‘decide’’.

Don’t Bury the Main Verb

The principle is to keep the predicate close to the subject. Here is a sentence:

‘‘One study of 930 adults with multiple sclerosis (MS) receiving care in one of two managed care settings or in a fee-for-service setting found that only two-thirds of those needing to contact a neurologist for an MS-related problem in the prior 6 months had done so’’

Readers struggle to understand the sentence due to the clutter between the subject and the predicate. Moving ‘‘found’’ to the front of the sentence gives us ‘‘One study found that…’’. Readers are less bothered by the descriptive material once they have the verb.

Example

Here is a great example provided in the course:

Important studies to examine the descriptive epidemiology of autism, including the prevalence and changes in the characteristics of the population over time, have begun.

There are multiple problems in this sentence: 1) the main verb appears at the very end of the sentence while the main subject ‘‘studies’’ is placed at the beginning; 2) fluff words like ‘‘important’’; 3) redundant phrases: ‘‘changes’’ almost always happen ‘‘over time’’; 4) ‘‘of the population’’ sounds vague. After addressing those issues, the sentence becomes:

Studies have begun to describe the epidemiology of autism, including recent changes in the disorder’s prevalence and characteristics.

Grammar Tips

Data is/are:

‘‘Data’’ is plural.

Compared to/with:

Compared to: used to point out similarities between two objects.
Compared with: (used more often in science) used to point out differences between similar things.

That/which:

‘‘That’’ is the restrictive (defining) pronoun (not set off by commas). Eliminating the essential clause changes the meaning of the sentence.
‘‘Which’’ is the nonrestrictive (non-defining) pronoun (must be set off by commas). Eliminating the non-essential clause does not alter the basic meaning of the sentence.

Careful writers, watchful for small conveniences, go which-hunting, remove the defining which-es, and by doing so improve their work.

– Strunk and White

Singular antecedents:

Do not use ‘’they’’ or ‘’their’’ when the subject is singular. To avoid gender choice, turn to a plural.

Other notes including Cut the Clutter, Structure, and Writing Process are also available.

Writing in the Sciences - Cut the Clutter

Fri, 30 Jul 2021 00:00:00 +0000

This is an overview of the first chapter of Writing in the Sciences offered by Stanford.

The secret of good writing is to strip every sentence to its cleanest components. Every word that serves no function, every long word that could be a short word, every adverb that carries the same meaning that’s already in the verb, every passive construction that leaves the reader unsure of who is doing what. These are the thousand and one adulterants that weaken the strength of a sentence. And they usually occur in proportion to the education and rank.

– William Zinsser in On Writing Well, 1976

Cutting Extra words

Here are some common clutters:

Dead-weight words and phrases such as ‘‘as it is well known’’, ‘‘as it has been shown’’
Empty words and phrases: ‘‘important’’, ‘‘methodologic’’, ‘‘basic tenets of’’
- Hedge words: appreciable changes. One may ask: ‘‘what is an appreciable change?’’ Hedge words introduce ambiguity, probability, or indecisiveness.
Long words or phrases that could be short: a majority of -> most, a number of -> many, ‘‘neonatal population’’ -> ‘‘newborns’’, etc.
Unnecessary jargon and acronyms. No one wants to constantly look up what ‘‘miR’’ means
Repetitive words or phrases: illustrate/demonstrate
Adverbs: very, really, generally, basically

I have only made this letter rather long because I have not had time to make it shorter

– Lettres provinciales, 16, Dec. 14, 1656

Little Tricks

Here are a few other small tricks:

Get rid of negatives. The sentence usually becomes much clearer using the positive construction: ‘‘not honest -> dishonest’’, ‘‘does not have -> lacks’’.
Eliminate superfluous uses of ‘‘there is/are’’. For example, we can change the sentence ‘‘There are few single genes that can cause autism in isolation’’ to ‘‘Few single genes cause autism in isolation’’.
Omit needless prepositions. For example, ‘‘that’’ and ‘‘on’’ are often superfluous. This is useful for trimming an abstract with a word limit. For example, you can simplify ‘‘they agreed that it was true’’ to ‘‘they agreed it was true’’.
Use verbs rather than adjectives: protective for -> protect against.

Example

Here is an example sentence: ‘‘Clinical seizures have been estimated to occur in 0.5% to 2.3% of the neonatal population’’. We can perform the first elimination: ‘‘Clinical seizures ~~have been estimated to~~ occur in 0.5% to 2.3% of the neonatal population’’. The range of percentages already conveys the uncertainty, making ‘‘estimated’’ unnecessary.

At first glance, ‘‘neonatal’’ seems like an essential word. However, on closer inspection, ‘‘neonatal population’’ is merely a fancy way of saying ‘‘newborns’’. So the sentence can be stripped down to ‘‘Clinical seizures occur in 0.5% to 2.3% of newborns’’.

Other notes including Verbs, Structure, and Writing Process are also available.

Unitary Matrix

Wed, 14 Jul 2021 00:00:00 +0000

Recently, I was trying to get the hang of quantum computing. I found that I had forgotten most of the linear algebra I learned in past semesters. So again, I decided to write it down in the hope that some of the knowledge here will stay in my memory a bit longer.

General single-qubit Gates

Trying to understand unitary matrices in the context of pure linear algebra is, I must admit, rather boring. Perhaps that is one reason why I brushed them off so quickly and so easily. However, explaining them in the context of quantum computing feels a lot more fun. Maybe it’s because I can associate a unitary matrix with a quantum gate, which is something a bit more concrete, or simply because the term ‘‘quantum computing’’ makes me sound smarter.

Speaking of something concrete, here are two example unitary matrices: the NOT gate ($X$) and Hadamard gate ($H$):

\[ X =\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} ;\ H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \]

A matrix $U$ is unitary if $U^{\dagger}U = I$, where the adjoint $U^{\dagger}$ is the conjugate transpose of $U$. For example, take the Hadamard gate ($H$) and compute its adjoint $H^{\dagger}$:

\[ H^{\dagger} = \left( \left( \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \right)^{T} \right)^{*} \]

We know the transpose of $H$ is still $H$, and taking the complex conjugate of $H^T$ doesn’t do anything since $H^T$ is a real matrix. Thus, we can verify that $H^{\dagger}H = I$.

There are other single-qubit quantum gates such as the $Y$ and $Z$ matrices (Pauli matrices) introduced by physicist Wolfgang Pauli. It’s a good exercise to verify they are also unitary matrices.
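Doing the exercise by hand is worthwhile, but a quick numeric check is a nice sanity test. Here is a minimal sketch in plain Python, assuming the standard Pauli $Y$ and $Z$ matrices; the helper names (`dagger`, `matmul`, `is_unitary`) and the shear matrix used as a counterexample are my own choices for illustration.

```python
import math

def dagger(m):
    """Conjugate transpose (adjoint) of a square matrix given as nested lists."""
    n = len(m)
    return [[complex(m[j][i]).conjugate() for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_unitary(m, tol=1e-12):
    """True when dagger(m) * m equals the identity up to a small tolerance."""
    n = len(m)
    prod = matmul(dagger(m), m)
    return all(abs(prod[i][j] - (1 if i == j else 0)) <= tol
               for i in range(n) for j in range(n))

s = 1 / math.sqrt(2)
X = [[0, 1], [1, 0]]
H = [[s, s], [s, -s]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
SHEAR = [[1, 1], [0, 1]]  # hypothetical counterexample: invertible but not unitary

for name, gate in [("X", X), ("H", H), ("Y", Y), ("Z", Z), ("shear", SHEAR)]:
    print(name, is_unitary(gate))
```

The four gates should report `True` and the shear `False`. The tolerance is there only because $\frac{1}{\sqrt{2}}$ is not exactly representable in floating point.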

What does it mean for a matrix to be unitary

The most important property of unitary matrices is that they preserve the length of inputs. It means that given a quantum state, represented as a vector $|\psi\rangle$, it must be that $ \left\lVert U|\psi\rangle \right\rVert = \left\lVert |\psi\rangle \right\rVert $.

Proving that a unitary matrix is length-preserving is straightforward. We want to show that $ \left\lVert U |\psi\rangle \right\rVert_2 = \left\lVert |\psi\rangle \right\rVert_2 $:

\[\begin{aligned} \left\lVert U |\psi\rangle \right\rVert_2^2 &= (U |\psi\rangle)^H(U |\psi\rangle) \\ &= |\psi\rangle^H U^H U |\psi\rangle \\ &=|\psi\rangle^H |\psi\rangle \\ &= \left\lVert |\psi\rangle \right\rVert_2^2 \end{aligned}\]
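To see the property on actual numbers, here is a small sketch that applies the Hadamard gate to one example state and compares the 2-norms before and after. The state `psi` and the non-unitary `SHEAR` matrix are arbitrary picks of mine, not anything special.

```python
import math

def apply(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def norm(v):
    """Vector 2-norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(c) ** 2 for c in v))

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]          # unitary
SHEAR = [[1, 1], [0, 1]]       # not unitary (illustrative counterexample)

psi = [0.6, 0.8j]              # example state with norm 1

print(norm(psi))               # length of the input
print(norm(apply(H, psi)))     # same length after a unitary
print(norm(apply(SHEAR, psi))) # a non-unitary matrix changes the length
```

One example proves nothing, of course. The derivation above is what guarantees the equality for every state.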

Why are unitaries the only matrices that preserve length

Previously, we used the ket notation for quantum state vectors. We can extend the two-dimensional quantum state vectors to more general vectors, and the properties of unitary matrices will still hold.

Putting our question in formal terms, we want to show that if $A \in \mathbb{C}^{m \times m}$ preserves length ($\left\lVert A x \right\rVert_2 = \left\lVert x \right\rVert_{2}\ \forall x \in \mathbb{C}^m$), then $A$ is unitary.

We first prove that $(Ax)^H(Ay) = x^Hy$ for all $x$, $y$ by considering that $ \left\lVert x - y \right\rVert_2^2 = \left\lVert A(x - y) \right\rVert_2^2 $. Then we will use the result to evaluate $e_i^H A^HAe_j$.

Let $x$, $y \in \mathbb{C}^m$, then we can use the alternative definition for the vector 2-norm (i.e. $\left\lVert y \right\rVert_2^2 = y^Hy$) for $ \left\lVert x - y \right\rVert_2^2 = \left\lVert A(x - y) \right\rVert_2^2 $,

\[ (x-y)^H(x-y) = (A(x-y))^HA(x-y) \]

Using the Hermitian transpose rule $(Ax)^H = x^HA^H$, we get

\[ (x-y)^H(x-y) = (x-y)^HA^HA(x-y) \]

Multiplying the above formula out,

\[ \begin{aligned} x^Hx - y^Hx - x^Hy + y^Hy &= x^HA^HAx - y^HA^HAx \\ &\quad - x^HA^HAy + y^HA^HAy \end{aligned} \]

Since $y^Hx = \overline{x^Hy}$, we can rewrite the cross terms,

\[ \begin{aligned} x^Hx - (\overline{x^Hy} + x^Hy) + y^Hy &= x^HA^HAx - (\overline{x^HA^HAy} + x^HA^HAy) \\ &\quad + y^HA^HAy \end{aligned} \]

We know that $A$ preserves length, so $x^Hx = x^HA^HAx$ and $y^Hy = y^HA^HAy$ cancel on both sides. Using $\frac{\alpha + \overline{\alpha}}{2} = Re(\alpha)$, we can simplify the above formula to:

\[ Re(x^Hy) = Re((Ax)^H(Ay)) \]

Next, we show that $A^HA = I$ by using the fact that the standard basis vectors have the property that

\[ \begin{equation} e_i^H e_j = \begin{cases} 1 & \text{if $i = j$}\\ 0 & \text{otherwise} \end{cases} \end{equation} \]

Therefore, $e_i^H M e_j$ extracts the $(i,\ j)$th entry of matrix $M$. So we know that

\[ e_i^H A^HA e_i = \left\lVert Ae_i \right\rVert^2 = \left\lVert e_i \right\rVert^2 = 1 \]

We can conclude that all the diagonal elements of $A^HA$ are $1$.

One question remains: how do we prove that all the off-diagonal elements in $A^HA$ are $0$? It turns out to be straightforward if we go back to the ket notation for state vectors.

Suppose we have $|\psi\rangle = |e_i\rangle + |e_j\rangle$ with $i \neq j$. We already know that $\left\lVert A |\psi\rangle \right\rVert^2 = \left\lVert |\psi\rangle \right\rVert^2 = 1 + 1 = 2$, and we can expand $\left\lVert A |\psi\rangle \right\rVert^2$ to $1 + e_i^H A^HA e_j + e_j^H A^HA e_i + 1$, so we get $e_i^H A^HA e_j + e_j^H A^HA e_i = 0$.

Then, suppose instead we have $|\psi\rangle = |e_i\rangle + \mathrm{i}|e_j\rangle$, where $\mathrm{i}$ is the imaginary unit (not the index $i$). Following the same process, we get $e_i^H A^HA e_j - e_j^H A^HA e_i = 0$. Adding this to $e_i^H A^HA e_j + e_j^H A^HA e_i = 0$ gives $e_i^H A^HA e_j = 0$, so we’ve proven that the off-diagonal elements in $A^HA$ are all $0$. We can extend the vector $\psi$ to higher-dimensional vectors and the proof will be similar.

BGP in a Nutshell

Tue, 06 Jul 2021 00:00:00 +0000

Border Gateway Protocol (BGP) has a very simple purpose: choose the fastest and most efficient route to deliver a message from one autonomous system (AS) to another. In layman’s terms, BGP is the GPS for the internet. Much of the content here is credited to Prof. Mohamed G. Gouda.

In a nutshell, BGP informs each router $R$ how to route packets to an IP prefix $pf$ (i.e. block of IP addresses) that is used in $AS_i$, which is different from $AS_j$, where $R$ is located.

BGP consists of two parts:

external BGP (eBGP): informs each gateway
internal BGP (iBGP): informs each non-gateway router

Gateway: A gateway is defined as a router that is connected to computers in two or more ASes.

Abstractly, each router has a BGP routing table in the form of:

\[(\text{prefix in another AS},\ \text{best ngh (next gateway hop) to reach prefix})\]

eBGP

First we will go over eBGP. We know BGP uses TCP to send messages and eBGP is no exception. The TCP connection exists between:

every pair of gateways in the same AS, and
every pair of ‘‘adjacent’’ gateways in different ASes.

These gateway pairs send route advertisements in the following form (represented as a tuple):

\[(prefix,\ AS\text{-}path,\ next\text{-}hop)\]

The BGP next-hop attribute is the next hop IP address that is going to be used to reach a certain destination.

Here is an example illustrating how one AS lets other ASes know the route leading to itself. Assume we have three ASes, and AS1 wants the other ASes to know how to reach it.

To advertise the path to reach itself, AS1 sends the first route advertisement message (1) as:

\[ (pf,\ (AS1),\ x) \]

When AS2 receives the message from AS1, it updates the AS-path by appending itself to the path list. It also needs to update the next-hop attribute, because an incoming message needs to find the best address to reach AS2 before reaching AS1. AS2 then broadcasts the message (2) as:

\[(pf,\ (AS1, AS2),\ y)\]

Each gateway $A_i$ or $B_j$ also adds an entry to its BGP routing table, so the gateways’ routing tables look like: $A_2:\ (pf,\ B_1)$, $B_2:\ (pf,\ \text{ngh to }x)$, $A_3:\ (pf,\ B_2)$.

Notice $B_2$ doesn’t have an explicit next-gateway-hop in its routing table. This is because so far we’ve only covered eBGP which works across different ASes. For a message to go from $B_2$ to $x$, it must go through routers inside AS2 internally, which brings up internal BGP (iBGP).
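The two advertisement messages in the example can be sketched as plain data. This is only a toy model of the tuple $(prefix,\ AS\text{-}path,\ next\text{-}hop)$ as described in this post, not real BGP message handling; the `Advertisement` class and the `readvertise` helper are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Advertisement:
    """Toy eBGP route advertisement: (prefix, AS-path, next-hop)."""
    prefix: str
    as_path: Tuple[str, ...]
    next_hop: str

def readvertise(adv, local_as, local_next_hop):
    """Forward an advertisement: append our AS to the path, rewrite next-hop."""
    return Advertisement(adv.prefix, adv.as_path + (local_as,), local_next_hop)

# Message (1): AS1 announces its own prefix with next hop x.
msg1 = Advertisement("pf", ("AS1",), "x")

# Message (2): AS2 passes it on with itself on the path and next hop y.
msg2 = readvertise(msg1, "AS2", "y")

print(msg1)
print(msg2)
```

Keeping the advertisement immutable (`frozen=True`) mirrors the idea that each AS sends out a new message rather than editing the one it received.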

iBGP

For iBGP, there is a TCP connection between each two routers in the same AS, provided that only one of them is a gateway.

Here is an example. Suppose the gateway $A$ in $AS_j$ needs to broadcast its routing information to other routers in $AS_j$:

The normal eBGP advertisement will be something like $(pf,\ (AS_1,\ …,\ AS_j),\ x)$.

For iBGP, the protocol states that the next hop advertised by eBGP should be carried into iBGP. Therefore, for all routers $C_1,\ …,\ C_n$, the next hop to reach $pf$ will be $x$, which is provided by $AS_i$. It’s your job to make sure $x$ is reachable via an interior gateway protocol (IGP).

The iBGP advertisement is only $(pf,\ x)$. Intuitively, this makes sense, because we only need to worry about the next best hop to reach $pf$ without worrying about changing the AS. All the routers ($C_1,\ …,\ C_n$) receiving the advertisement from $A$ will add the information to their routing table as:

\[ (pf,\ \text{best ngh to }x) \]
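Continuing the toy model, the iBGP step can be sketched as two small functions: one that strips the AS-path and carries the next hop over unchanged, and one that installs the route on an internal router. The `igp_next_hop` dictionary is a hypothetical stand-in for whatever the IGP has computed, and all names here are assumptions for illustration.

```python
def to_ibgp(ebgp_adv):
    """Turn (prefix, as_path, next_hop) into the iBGP form (prefix, next_hop)."""
    prefix, _as_path, next_hop = ebgp_adv
    return (prefix, next_hop)

def install(routing_table, ibgp_adv, igp_next_hop):
    """Store prefix -> best neighbour towards the advertised next hop.

    igp_next_hop maps a destination to this router's best neighbour for it;
    it is a made-up placeholder for the IGP's result.
    """
    prefix, next_hop = ibgp_adv
    if next_hop not in igp_next_hop:
        raise ValueError(f"next hop {next_hop!r} is not reachable via IGP")
    routing_table[prefix] = igp_next_hop[next_hop]
    return routing_table

ebgp_adv = ("pf", ("AS1", "AS2"), "x")
ibgp_adv = to_ibgp(ebgp_adv)

# Router C1 reaches x through its neighbour C2 (an invented topology).
table_c1 = install({}, ibgp_adv, {"x": "C2"})
print(ibgp_adv, table_c1)
```

The `ValueError` branch is the code version of the warning above: if $x$ is not reachable internally, the route is useless.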

That is it. There’s really nothing difficult about BGP in general.

From Autotools to CMake

Mon, 21 Jun 2021 00:00:00 +0000

Since my paper on GPU benchmarking was published, every once in a while I get emails asking why Altis doesn’t build on the sender’s platform. It almost always has something to do with a small script that is responsible for finding CUDA dependencies. This script is invoked every single time make is executed. For some reason, the regular expression in the script sometimes breaks, depending on the Linux distro, the kernel version, the host architecture, or even the CUDA version. After enough requests piled up in my inbox, I decided enough is enough and it’s time to ditch the autotools shenanigans for CMake.

When I started this project, it already had a skeleton build system based on autotools. The old build system generated automake and autoconf files. It worked fine on the server I used, so I never bothered to make big adjustments. However, problems soon arose whenever I upgraded some packages or switched to another server in our lab.

Because our servers are constantly being used for GPU research, the execution environment is constantly changing. Sometimes the benchmark suite would build in the morning and stop working in the afternoon because of a CUDA downgrade. My directories were filled with strange files like configure.ac, Makefile.am, Makefile.in, … The build also relies on the M4 macro processor, which I still don’t quite understand. Hand-made shell scripts are everywhere. automake has a million versions and you never know why it doesn’t build on someone else’s OS. Getting OptiX to work is like having constipation because the LD flag doesn’t get placed in the right location.

Switching to CMake was a lot smoother than I anticipated. It took me about two days to rewrite the entire build system from scratch. Compared with autotools, CMake’s syntax is easier to learn. The terminal output is easier to read by default, instead of throwing every single detail in my face. And it’s colored! Debugging build issues no longer takes multiple hours.

Perhaps the best part is that CUDA has been natively supported since CMake 3.8. Compiling .cu files is as easy as adding CUDA to the project’s LANGUAGES. That alone should be reason enough to use CMake for anything CUDA related. The only caveat is a small difference in how CMake handles CUDA architecture flags across versions. CMAKE_CUDA_ARCHITECTURES was introduced in CMake 3.18. Older versions still require find_package(CUDA) to set CUDA compilation flags though.

How SAT Solver works

Fri, 21 May 2021 00:00:00 +0000

This is a summary of the high-level design of a SAT solver covered in Prof. Dillig’s Automated Logical Reasoning class. It’s meant to cover the basic steps towards determining whether a given boolean formula is satisfiable.

Convert to NNF

The first step in a SAT solver is to convert a given boolean formula to Negation Normal Form (NNF). A normal form of a formula $F$ is another formula $F’$ such that $F$ is equivalent to $F’$, but $F’$ obeys certain syntactic restrictions. NNF has two syntactic restrictions:

The only logical connectives are $\neg$, $\land$, and $\lor$.
Negation appears only in literals (i.e., no $\neg(a \land b)$).
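Both restrictions can be enforced with a small recursive rewrite: eliminate $\rightarrow$ and $\leftrightarrow$, then push negations inward with De Morgan’s laws and remove double negations. Below is a minimal sketch, assuming formulas are encoded as nested tuples like `("and", f, g)`; that encoding and the name `to_nnf` are my own choices, not something from the class.

```python
def to_nnf(f):
    """Convert a formula to negation normal form.

    Formulas are tuples: ("var", name), ("not", f), ("and", f, g),
    ("or", f, g), ("implies", f, g), ("iff", f, g).
    """
    op = f[0]
    if op == "var":
        return f
    if op in ("and", "or"):
        return (op, to_nnf(f[1]), to_nnf(f[2]))
    if op == "implies":  # a -> b  ==  (not a) or b
        return ("or", to_nnf(("not", f[1])), to_nnf(f[2]))
    if op == "iff":      # a <-> b  ==  (a -> b) and (b -> a)
        a, b = f[1], f[2]
        return to_nnf(("and", ("implies", a, b), ("implies", b, a)))
    if op == "not":
        g = f[1]
        if g[0] == "var":   # already a literal
            return f
        if g[0] == "not":   # double negation
            return to_nnf(g[1])
        if g[0] == "and":   # De Morgan
            return ("or", to_nnf(("not", g[1])), to_nnf(("not", g[2])))
        if g[0] == "or":    # De Morgan
            return ("and", to_nnf(("not", g[1])), to_nnf(("not", g[2])))
        # implies / iff: rewrite the inner connective first, then negate
        return to_nnf(("not", to_nnf(g)))
    raise ValueError(f"unknown connective: {op!r}")

a, b = ("var", "a"), ("var", "b")
print(to_nnf(("not", ("and", a, b))))
print(to_nnf(("not", ("implies", a, b))))
```

For $\neg(a \land b)$ the result is $\neg a \lor \neg b$, and for $\neg(a \rightarrow b)$ it is $a \land \neg b$.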

Why not DNF?

A formula in disjunctive normal form (DNF) is a disjunction of conjunctions of literals. It can be expressed as:

\[ \begin{equation} \bigvee_i \bigwedge_j l_{i,j} \mbox{ for literals }l_{i,j} \tag{1} \end{equation} \]

That formula states that $\lor$ can never appear inside $\land$ or $\neg$. Looking at a formula in DNF, we might claim it’s trivial to determine its satisfiability. This is because, if we can find one conjunction that can evaluate to true (one that does not contain both a literal and its negation), then due to the nature of $\lor$, the whole formula can evaluate to true.

In practice, this does not work, because DNF conversion can cause an exponential blow-up in size. Take the formula $(F_1 \lor F_2) \land (F_3 \lor F_4)$ for example: its DNF is $(F_1 \land F_3) \lor (F_1 \land F_4) \lor (F_2 \land F_3) \lor (F_2 \land F_4)$. The main problem is that distributing $\land$ over $\lor$ duplicates subformulas, and every additional two-way disjunction doubles the number of terms.
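The blow-up is easy to reproduce. Here is a small sketch, assuming the input is a conjunction of disjunctions such as $(F_1 \lor F_2) \land (F_3 \lor F_4)$, written as a list of tuples. Distributing $\land$ over $\lor$ then amounts to picking one element from every disjunction; `dnf_of_cnf` and `dnf_size` are hypothetical helper names.

```python
from itertools import product

def dnf_of_cnf(clauses):
    """Distribute 'and' over 'or': one conjunct per way of picking a literal from each clause."""
    return [tuple(choice) for choice in product(*clauses)]

def dnf_size(n):
    """Number of DNF conjuncts for (a0 or b0) and ... and (a(n-1) or b(n-1))."""
    clauses = [(f"a{i}", f"b{i}") for i in range(n)]
    return len(dnf_of_cnf(clauses))

# The two-clause example: four conjuncts.
two = dnf_of_cnf([("F1", "F2"), ("F3", "F4")])
print(two)

# Each extra two-literal clause doubles the number of conjuncts.
for n in (2, 4, 8, 16):
    print(n, dnf_size(n))
```

With $n$ two-literal disjunctions the DNF has $2^n$ conjuncts, which is exactly the exponential growth we want to avoid.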

CNF

The solution is to convert from NNF to conjunctive normal form (CNF). CNF is a conjunction of disjunctions of literals:

\[ \begin{equation} \bigwedge_i \bigvee_j l_{i,j} \mbox{ for literals }l_{i,j} \tag{2} \end{equation} \]

This says that $\land$ is not allowed inside $\lor$ or $\neg$. Each disjunction of literals is called a clause.

CNF vs DNF

Unlike DNF, it is not trivial to determine satisfiability of a formula in CNF. Moreover, converting a formula to an equivalent CNF is just as expensive as converting it to DNF. So why do most solvers convert to CNF although it’s easier to determine satisfiability in DNF?

Tseitin’s Transformation

The answer is Tseitin’s Transformation. The most important thing about Tseitin’s Transformation is that it converts a formula $F$ to an equisatisfiable formula $F’$ in CNF with only a linear increase in size.

Tseitin’s Transformation has three major steps:

Introduce a new variable $p_G$ for every subformula $G$ of $F$.
Consider each subformula $G : G_1 \circ G_2$ and stipulate that $p_G$ is a representative of $G$, i.e. that $p_G \leftrightarrow p_{G_1} \circ p_{G_2}$.
Convert $p_G \leftrightarrow p_{G_1} \circ p_{G_2}$ to CNF.

Eventually, we obtain a new formula, where $S_F$ denotes the set of subformulas of $F$:

\[ \begin{equation} p_F \land \bigwedge_{(G=G_1\circ G_2)\in S_F} CNF(p_G \leftrightarrow p_{G_1} \circ p_{G_2}) \tag{3} \end{equation} \]

More precisely, the size of the resulting formula is bounded by $30n + 2$, where $n$ is the size of the original formula.
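The three steps can be sketched in a few dozen lines. This is a simplified version under my own assumptions: formulas use the connectives $\neg$, $\land$, $\lor$ only and are encoded as nested tuples, clauses are lists of signed integers (DIMACS style), and atoms reuse their own variable instead of getting a fresh one. `brute_force_sat` is a throwaway checker, not part of the transformation.

```python
from itertools import product

def tseitin(formula):
    """Return (clauses, num_vars) for an equisatisfiable CNF of the formula.

    Formulas: ("var", name), ("not", f), ("and", f, g), ("or", f, g).
    A clause is a list of nonzero ints; a negative int is a negated variable.
    """
    var_ids, clauses, counter = {}, [], [0]

    def fresh():
        counter[0] += 1
        return counter[0]

    def encode(f):
        op = f[0]
        if op == "var":
            if f[1] not in var_ids:
                var_ids[f[1]] = fresh()
            return var_ids[f[1]]
        if op == "not":
            a, p = encode(f[1]), fresh()
            clauses.extend([[-p, -a], [p, a]])              # p <-> not a
            return p
        a, b = encode(f[1]), encode(f[2])
        p = fresh()
        if op == "and":
            clauses.extend([[-p, a], [-p, b], [p, -a, -b]])  # p <-> a and b
        elif op == "or":
            clauses.extend([[-p, a, b], [p, -a], [p, -b]])   # p <-> a or b
        else:
            raise ValueError(f"unknown connective: {op!r}")
        return p

    clauses.append([encode(formula)])  # assert the representative of F itself
    return clauses, counter[0]

def brute_force_sat(clauses, num_vars):
    """Try every assignment; fine for tiny examples only."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

a, b = ("var", "a"), ("var", "b")
sat_formula = ("and", ("or", a, b), ("not", a))
unsat_formula = ("and", a, ("not", a))
print(brute_force_sat(*tseitin(sat_formula)))
print(brute_force_sat(*tseitin(unsat_formula)))
```

Every connective contributes at most three clauses, which is where the linear size bound comes from. The output formula is equisatisfiable with the input, not equivalent, because of the fresh variables.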

DPLL

The Davis-Putnam-Logemann-Loveland (DPLL) algorithm can be expressed as:

BCP stands for boolean constraint propagation. It applies when one of the clauses is a unit clause (a clause with a single literal). Performing BCP (or unit resolution) is the same as replacing that literal with true in the original clauses.

choose_variable can use a variety of heuristics. For now, we can assume the variable is picked at random.

The red part is an optimization to the original DPLL called Pure Literal Propagation (PLP). If a variable $p$ appears only in the form $p$, or only in the form $\neg p$, in the entire formula, we set $p$ to true or false respectively, so that all of its occurrences evaluate to true.
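Putting BCP, PLP, and the case split together, a bare-bones DPLL fits in a short Python sketch. It assumes CNF clauses given as lists of signed integers, and `choose_variable` here simply takes the first variable it sees instead of picking randomly or using a real heuristic. Treat it as an illustration of the idea rather than the exact algorithm from the class.

```python
def simplify(clauses, lit):
    """Set lit to true: drop satisfied clauses, delete the opposite literal elsewhere."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def bcp(clauses):
    """Boolean constraint propagation: apply unit resolution until no unit clause is left."""
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            return clauses
        clauses = simplify(clauses, unit)

def plp(clauses):
    """Pure literal propagation: a literal whose negation never occurs is set to true."""
    while True:
        lits = {l for c in clauses for l in c}
        pure = next((l for l in lits if -l not in lits), None)
        if pure is None:
            return clauses
        clauses = simplify(clauses, pure)

def choose_variable(clauses):
    """Placeholder heuristic: first variable of the first clause."""
    return abs(clauses[0][0])

def dpll(clauses):
    """Return True if the CNF formula is satisfiable, False otherwise."""
    clauses = plp(bcp(clauses))
    if not clauses:                   # no clauses left: everything satisfied
        return True
    if any(len(c) == 0 for c in clauses):
        return False                  # an empty clause cannot be satisfied
    p = choose_variable(clauses)
    # Case split: try p = true, then p = false, by adding a unit clause.
    return dpll(clauses + [[p]]) or dpll(clauses + [[-p]])

print(dpll([[1, 2], [-1]]))                      # (a or b) and (not a)
print(dpll([[1, 2], [-1, 2], [1, -2], [-1, -2]]))  # all four sign combinations
```

Adding the decision as a unit clause lets BCP do the substitution on the next call, which keeps the recursion tiny. Real solvers also track the assignment and learn from conflicts, which this sketch leaves out.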

Experience on Dafny Programming

Sun, 16 May 2021 00:00:00 +0000

Because of Professor Dillig’s class, I finally got the chance to try out Dafny, a language made by Microsoft Research with built-in support for formal specification through preconditions, postconditions, loop invariants and loop variants. I often wonder: if we wrote programs in a verification language, would there be far fewer bugs, and would it make our lives much easier than sitting in front of a screen for hours grinding at bugs? Here are some thoughts I want to put down before they are gone from my head.

Here is a simple example illustrating some basic concepts in Dafny. Suppose we want a function that reverses a sequence, for example returning [c, b, a] when given [a, b, c]. The recursive implementation would look like:

// Reverse a sequence recursively: the head goes to the end of the reversed tail.
function reverse<T>(s : seq<T>) : seq<T>
{
 if (|s| == 0) then s
 else reverse(s[1..]) + [s[0]]
}

This looks correct, right? We have a termination condition for the recursion, and the implementation is fairly straightforward. Nothing seems wrong. If the function is correct, it would suggest that the lemma

lemma reverseLemma<T>(s : seq<T>)
ensures s == reverse(reverse(s))
{}

must hold: if we reverse a sequence twice, the result should be the same as the original input. However, the Dafny verifier complains that a postcondition might not hold. This is really strange. If we simply eyeball the implementation of reverse, it’s hard to imagine what could possibly go wrong.

One thing you should constantly keep in mind is that the Dafny verifier doesn’t really understand the implementation of the code. It instead uses the specification to reason about the correctness of the program.

Imagine we are given a function called func1, and we know its signature and return type. Then we are told that calling this function twice on a given input $s$ produces an output $k$ matching $s$ exactly. How should we trust that this claim is valid if func1 is a black box? How do we know that calling func1 with input $s$ will ever terminate?

The solution is to annotate the function with certain properties so that the verifier knows what conditions hold before and after a function is executed.

For example, reverse takes a sequence with length greater than or equal to 0; it wouldn’t make sense to reverse a sequence with negative length. We could write this requirement as requires |s| >= 0.

We also need to state that the length of the output of reverse must equal the length of the input, written as ensures |s| == |reverse(s)|.

Finally, and most importantly, what property must the output of reverse have? The order of the output must be the reverse of the input. How do we express such a property? We should say something like: every element in the output sequence must match the element in the original input sequence at the mirrored position. It will look like: ensures forall i :: 0 <= i < |s| ==> reverse(s)[i] == s[i_reversed], where i_reversed stands for the mirrored index |s| - 1 - i.
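For readers who don’t have Dafny installed, the three annotations can be mimicked as runtime assertions in Python. This is only an analogy I’m sketching here: Python checks the properties on the inputs we happen to try, whereas Dafny proves them for every possible input. The `check_spec` helper is a made-up name.

```python
def reverse(s):
    """Recursive reverse, mirroring the Dafny function on Python lists."""
    if len(s) == 0:
        return s
    return reverse(s[1:]) + [s[0]]

def check_spec(s):
    """Check the pre/postconditions from the text on one concrete input."""
    out = reverse(s)
    assert len(s) >= 0                      # requires |s| >= 0
    assert len(out) == len(s)               # ensures |s| == |reverse(s)|
    assert all(out[i] == s[len(s) - 1 - i]  # element i matches the mirrored index
               for i in range(len(s)))
    assert reverse(out) == s                # the lemma: reversing twice is the identity
    return True

for example in ([], ["a"], ["a", "b", "c"], [1, 2, 2, 3]):
    print(example, "->", reverse(example), check_spec(example))
```

The comments map each assertion back to the corresponding Dafny clause, so the sketch doubles as a checklist when writing the real annotations.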

After spending some good hours writing Dafny, my feelings towards it are mixed. On the plus side, writing down exactly and precisely how a program behaves forces programmers to be more careful when writing code. This would remove a lot of problems after deployment. Reusing verified APIs also removes much of the concern over safety issues.

The tricky part is coming up with annotations that meet the specification requirements. On one hand, this forces people to break down big tasks into smaller functions so that it’s much easier to come up with correct annotations. On the other hand, debugging complex functions requires some magic. The error messages aren’t really helpful for narrowing down possible problems. It takes some guessing and luck to find which annotations are wrong or missing. The example reverse function only has a few pre- and postconditions. However, more sophisticated functions involving multiple conditions as well as invariants are much harder to get right. Often, it requires the programmer to explore in the dark without much guidance before uncovering the solution.

Ethereum

Fri, 14 May 2021 00:00:00 +0000

In my previous post, we went over the high-level structure of blockchain and its attributes. This post covers Ethereum and explores how blockchain can be used not only for money transfer but also for application development.

More Than Money

The idea behind Ethereum was proposed by Vitalik Buterin. He wanted to apply the idea of decentralization to build applications without a central authority in control. He first proposed adding a scripting language to Bitcoin, but the proposal was rejected. Later, Dr. Gavin Wood released the Ethereum yellow paper, covering the Ethereum Virtual Machine (EVM), which is capable of executing smart contracts on the network.

As with any blockchain, all computers across the Ethereum network have a full copy (ledger) of the application code and data. That means the platform can provide services at any time, without censorship or third-party interference (however, this doesn’t necessarily mean extreme scalability or low application response time).

Ethereum Architecture

To understand the high-level architecture of Ethereum, we compare it with the traditional client/server architecture. I really like Zastrin’s illustration.

Here is what a traditional client/server architecture looks like:

The main point is that the web application is deployed on centralized platforms or hosting services like AWS. All client applications need to access the centralized component for requested services.

The architecture of Ethereum is shown below:

The major difference here is that each client interacts with a decentralized application instead. Each computer holds a replica of all the data as well as the code. You can think of your computer as a dedicated mini server handling only your requests. Of course, what this implies is you will need to download the entire Ethereum blockchain to be able to use the application.

This is obviously infeasible. In practice, there are solutions like Metamask to mitigate the problem so your storage space won’t be saturated. But the high-level concept holds.

An Ethereum blockchain consists of two components:

Database: this is similar to the database we use in the traditional client/server architecture. The difference lies in how the data is represented and stored. Ethereum stores events as transactions, which resemble BTC transactions but are much more general, as they can store anything ranging from a git commit message to a blog post. They can even be used to represent something more abstract, such as ownership of artwork in the form of non-fungible tokens. Because Ethereum is a public blockchain, all the data stored in the blockchain is visible to the public.
Smart Contract: this is a fancy name for application code. The code is compiled into Ethereum bytecode and then executed by the EVM.

The interesting part is the association between contract and code. “Contract” implies some form of agreement enforced by certain rules. “Code”, on the other hand, can be much more flexible. The two are associated because code deployed on the Ethereum blockchain can enforce the agreement: once deployed, it cannot be terminated or modified. The beauty of this is that the smart contract is deployed in a distributed network. A successful tampering attempt requires agreement from the majority of the network, which is hard to achieve.

How is identity established?

In a traditional client/server application such as GitHub, a user account is identified by its username. To manage an account, a user must have the corresponding password to access and modify the information in the account. The notion of an account in Ethereum is similar. An account has an Ethereum address, which is visible to the entire network just like a GitHub username. The private key is used to control the account and is not meant to be shared publicly, which corresponds to your GitHub account password. The exception is the contract account, which has no private key and is controlled by its code. Intuitively, this makes sense because it is meant to be accessed by all parties in the contract.

Not a Silver Bullet

There has been a lot of hype around Ethereum and blockchain in the business world. However, Ethereum is not meant to be the solution for all problems.

The root of Ethereum lies in decentralization, and the benefit of decentralization comes at a cost. Because any request on the Ethereum blockchain has to go through multiple stages of verification, validation, and consensus, the response latency makes it unsuitable for any application that needs to meet stringent time requirements.

For example, posting a tweet can be done in a matter of seconds. All it takes is for the client to send the tweet to the server. The server adds the tweet to its database and pushes the update to other clients. Doing this in Ethereum means we first need to seek agreement from multiple other nodes with completely different network conditions and wait for consensus to be achieved (which takes some time) before the tweet becomes valid.

In fact, we have a simple prototype using Ethereum to provide a platform (like Patreon) for artists to showcase their artwork and receive financial support directly from their community without any third-party agencies. The advantage is that, by avoiding third-party agencies, we reduce the cost of transactions. However, the problem is that whenever a user clicks the “like” button, the information is not reflected until minutes later. This would be unacceptable for products like Facebook or Instagram.

Reflections on my CS PhD Application Process

Mon, 10 May 2021 00:00:00 +0000

I’m glad it’s over.

I applied for CS Ph.D. programs this past fall and had interviews with schools from late December all the way to March. Now that the semester has ended, I decided to put down some reflections on this process. This post is not intended to be the most comprehensive CS Ph.D. application tutorial in the world, but merely a half-guide, half-memoir of my journey towards a PhD. Of course, you should take this post with a grain of salt, since I don’t work on admission committees and am nowhere near an expert in the application process.

On Research

The single most important factor in a Ph.D. application is perhaps your research experience. You should absolutely start getting involved in research as early as possible. Through this experience, you will eventually find out whether research is your thing and whether you want to continue down the path in academia.

The general rule of thumb is “the earlier, the better”. Practically speaking, you will likely have something to put on your resume or your statement of purpose by the time you start Ph.D. applications. Since grad school is all about research, having as much research experience as possible is only going to make your profile look stronger.

The earlier, the better

More importantly, grad school applications, especially in Computer Science, have become extremely competitive over the years. The admission rate for most top CS departments is well below 10 percent. Having publications, or even just research experience, is likely gonna decide whether you get accepted. To some extent, “publish or perish” also applies to Ph.D. applications.

I had friends who asked me whether they should wait until their junior or senior year to get started in research, because by then they would have taken enough courses and be better equipped with the background. My suggestion is no. First, part of research is about learning: you settle on a problem and you find ways to solve it. Learning happens throughout this process. My advisor Chris always says you learn more by doing research. My personal experience has reinforced this statement. Second, there simply might not be enough time for you to finish the research project by the time the application process starts. It’s possible to pull a rabbit out of the hat in some areas like theory or machine learning. But for areas like systems, the sheer amount of work makes the submission cycle really long. This is exacerbated by the relatively small number of top conferences. In addition, it takes several months before you are notified of the final decision.

For example, my first GPU project took more than a year to finish. It was then rejected, and it took another several months before it was published. By the time I started my applications, my paper was still under submission and I didn’t hear anything back until late January.

Another question people ask me is how to pick a research area in the first place. I use a small trick I learned from a post on Quora: figure out what you spend the most time on. For a really long time, I thought I would eventually do something related to deep learning, because, well, that’s what everybody is doing. However, I found myself spending much more time browsing through LWN.net or Linux kernel code than actual deep learning. So I decided that doing systems suits me better.

Picking Schools

Once you’ve decided to apply for a PhD, it’s time to pick which schools to apply to. A Ph.D. is equivalent to a research apprenticeship. Therefore, the application process is very much faculty-oriented. I highly recommend csrankings. It filters out a lot of unwanted information and lets you focus on your research interests quickly. It’s much quicker and more convenient to see all the names in one place than to browse through every person’s name at every university. But you should still take a look at each university’s faculty list because csrankings might not contain the most up-to-date information.

In terms of how many schools to apply to, my suggestion is “the more, the better”. Statistically speaking, more applications increase your chance of getting accepted. The major problems are the application fees and the need to tailor each SOP to different schools. I applied to 20 programs, all in the U.S. This is way above the average number and I don’t really recommend that most people do it. However, applying to more programs did get me several interviews from schools I thought weren’t the best fit. For example, there was one AP who had just joined the university and had research interests closely matching mine, but his information was not yet reflected in either csrankings or the school’s website. Another interview I got was with a professor from a more theoretical background who needed students with stronger systems skills to build the underlying infrastructure.

Statement of Purpose

You will need to submit an SOP to every school you apply to. There are many tutorials online on how to come up with the best SOP so I won’t go over them. In general, I think there are three things to consider:

Don’t be creative, be simple.
Focus on research.
Get as many people as possible to read your draft.

If you google “what to avoid in a PhD SOP”, you will be advised to not say something like how you started programming at age 4. This is not the same as your college application essays so don’t be smart about it. Simply stating your goals and interests is more than enough. My SOP opening goes like:

After spending three years in academic research, I see my goal to seek a Ph.D. in Computer Science, more specifically in systems, as a continuation of my increasing involvement with the field, and as a requirement to pursue my research interests and solve related problems in systems as a professor. In particular, I am interested in operating systems, heterogeneity, networking, machine learning systems, and architecture.

Avoid listing every accomplishment in your life. I get it, everyone wants to show off their proud moments. I made this mistake in my first draft by going over all my research projects, including those outside of systems, as well as my side projects. In the end, my essay turned into a hodgepodge with no real focus. It also went way over the page limit. I had to cut everything that was not “systems-related” to make room for more important content. If you want a sample, I’d be glad to share a copy upon request.

The Interview

Now that you have submitted all the applications, interviews start to come in. My general suggestion for preparing for an interview is to keep things simple. First, there are only 30 minutes. Your priority, as a student, is to make the most of these 30 minutes. It’s important to discuss your research and know what you are talking about, but don’t go too far into the details because:

Nobody is able to understand all the technical details in under 30 minutes.
The high-level idea is generally much more important. If you can’t describe it in a few minutes, you likely don’t really know the project very well.
If the professor is interested in digging more into the project, he/she will ask more questions. This is good because it can turn an interview into a discussion, which is both more interactive and less intimidating. Nobody likes a 30-minute monologue.
The purpose of the interview is to find whether there is a match of interests and whether this interest is mutual. You should also reserve some time to ask your own questions.

Other Thoughts

I don’t really recommend taking too many classes while doing research projects. Research is very time-consuming and can take away a lot of sleep. During my first semester working on my GPU project, I was trying to double major, taking all the hard classes, TAing an OS course, and juggling all sorts of stuff. Anxiety became an issue and I wasn’t able to get enough sleep. My productivity took a hit, which resulted in my first paper being rejected. If I could travel back in time, I’d rather do less to achieve more.

If you decide to email potential advisors, keep the message short and go straight to the point. Avoid American novels. Assistant professors, or professors who explicitly write on their page that students should reach out to them, are much more likely to reply to your messages, because they are constantly looking for students to expand their groups.

Keep up with the application status, but do not check TheGradCafe too often because it can get addictive:)

May the force be with you 🤘.

Blockchain

Mon, 19 Apr 2021 00:00:00 +0000

The first time I heard the term “blockchain” was around 2014. Since then, its popularity has grown rapidly. However, I never actually understood what blockchain is, until recently. In fact, I didn’t really understand the difference between blockchain and bitcoin. For me, blockchain was lumped together with cryptocurrencies. So here is a short summary of what blockchain is and why people use blockchain.

What is Blockchain

I tried reading articles about blockchain before, and it didn’t take long before I was completely overwhelmed by technical terms: consensus, asymmetric crypto, consistency, etc. It’s hard to combine all these little pieces and form a big picture. Instead, it’s much easier to understand blockchain from a top-down view. Even better, a small step-by-step example can clear up much of the confusion. I like Prof. Anand’s example given in the class slides:

Suppose we own a comic book store, and we want to sell comic books to some customers.
Every time we sell a book, 10 of our friends record the action. Traditionally, we refer to such a record as a “ledger”. In the world of blockchain, we call this a “distributed ledger”.
Once we sell enough comic books, the 10 records (ledgers) are collected into a book, and all of our friends get a copy of the book. This is very important because we use duplication to achieve consensus.
To make things more secure, all these books are stored in a secure vault. In the digital world, we achieve this through encryption, digital signatures, and so on. An attacker needs to tamper with many copies of the book to disrupt our sales records, which tends to be extremely hard in a real-world scenario.
Now we have this secure vault, which is effectively an immutable block. This block (vault) stores the record of us selling a comic book. If we decide to sell more books, each one will generate an additional block (vault). Each block is appended after the previous block, forming what we call a “blockchain”.

In essence, a blockchain is a series of immutable blocks, each storing the information of an event(s) whose validity is approved by a majority of other participants. Simple as that.

I like using the term ‘‘distributed ledger’’ to characterize blockchain. In Prof. Anand’s slides, this graph summarizes how a distributed ledger differs from traditional centralized ledger:

The main difference is how consensus is achieved. In a centralized ledger, we have a single entity that decides the “golden record”. In a distributed ledger, consensus is achieved if everybody agrees with it. To give an example, we would pay a 45-dollar electricity bill each month to a Texas electricity company because the price is set by the company alone. In a distributed ledger world, we might pay 32 dollars instead, if every single resident living in the building agrees this is the best price. Essentially, we eliminate the centralized entity and distribute the ability to make decisions evenly to each individual.

In super-simple terms, a blockchain is just a computer file for storing data. The reason why it’s so secure is because there doesn’t exist a single central point of attack for hackers to target.

Sketch on Blockchain

Now that we understand what a blockchain is, it’s time to find out how blockchain enables the development of digital currencies such as Bitcoin. There are many great articles covering Bitcoin in detail, but I found the original paper extremely helpful for understanding the motivation behind Bitcoin. In essence, Bitcoin was introduced to eliminate one problem: the need for a trusted third party to process electronic payments. More abstractly, it shifts from a trust-based system to a cryptographic-proof-based system.

The paper claims that the trust-based model suffers from a fundamental weakness: the need for mediation. The logic is simple: mediation is required in the presence of disputes. Because disputes have to be mediated, completely non-reversible transactions are not really possible, and so there is always the possibility of reversal. The possibility of reversal causes the need for trust to spread. To establish trust, a higher price needs to be paid, in the form of money, personal information, etc. Essentially, the need for trust creates a centralized component that participants must rely on. Bitcoin replaces the trust-based system with a cryptographic-proof-based system, the difference being that a cryptographic-proof-based system is distributed in nature.

Imagine a distributed system as a fully connected graph with $n$ nodes where each node represents a buyer/seller. A transaction is an edge connecting two nodes. Let $T$ denote the set of currently ongoing transactions. There could be as many as $n(n-1)$ transactions going on concurrently, and each transaction $t \in T$ is independent of the others. This enables 1) extreme scalability; 2) no reliance on central components. If a transaction is committed to the Bitcoin network, it suggests that the transaction has already gained approval from both the buyer and the seller (why this is the case is more technical, and you should Google how symmetric and asymmetric encryption work).

Now imagine a centralized system, where every buyer node is connected to one node $c$. Node $c$ in turn is connected to every seller node $s$. Assume the centralized component, node $c$, has a fixed capacity limiting the amount of traffic flowing through it at any given moment. To achieve the same level of information flow as in a distributed system, we need to increase node $c$’s capacity, which represents the increased cost of mediation. If a buyer node $b$’s output value differs from a seller node $s$’s input value (a dispute), extra information flow is required from seller node $s$, creating the need for more capacity at node $c$ and thus driving up the cost.

From an abstract point of view, I’d like to imagine a normal transaction in a Bitcoin network as follows:

Transaction is initiated
The buyer, the seller, and all witnesses agree on the validity of the transaction.
With everyone satisfied as the precondition, transaction completes. If there exists a disagreement, transaction doesn’t happen.

On the other hand, in a centralized system, the transaction happens as follows:

Transaction is initiated
Seller receives payment, but there’s a mismatch
Now the centralized component must be engaged to mitigate the issue.
More shit happens, the centralized component must constantly nag both the buyer and the seller until the problem is solved.

In short, I think trust is not ‘‘removed’’, it is merely achieved through a different way. I’d like to modify Prof. Anand’s summary on Bitcoin: Bitcoin is an engineering solution to solve trust issues.

Blockchain Structure

The structure of a blockchain is surprisingly simple. A blockchain consists of a series of blocks, each holding a batch of valid transactions that are hashed and encoded into a Merkle tree, with only the root of the tree included in the block’s hash. Each block also includes the hash of the prior block in the blockchain. This essentially forms a linked list, except that we replace the pointer with the hash of a block.
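To make the Merkle tree part concrete, here is a minimal sketch using Python’s standard `hashlib`. The name `merkle_root`, the sample transactions, and the choice of a single SHA-256 per node are assumptions for illustration only; Bitcoin itself hashes twice and uses its own serialization.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transactions: List[bytes]) -> bytes:
    """Hash transactions pairwise, level by level, until one root hash remains."""
    if not transactions:
        raise ValueError("need at least one transaction")
    level = [sha256(tx) for tx in transactions]  # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption of this sketch: duplicate the last hash on odd levels.
            level.append(level[-1])
        level = [sha256(level[k] + level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]


txs = [b"A pays B 1", b"B pays C 2", b"C pays D 3"]  # made-up transactions
root = merkle_root(txs)
tampered = merkle_root([b"A pays B 100", b"B pays C 2", b"C pays D 3"])
print(root.hex() != tampered.hex())  # changing any transaction changes the root
```

Only the root needs to go into the block, yet it commits to every transaction underneath it.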

Using hash values to link blocks has another benefit: it protects the integrity of all previous blocks. For example, if an attacker modifies the data in one block, the block’s hash changes, which invalidates the hash stored in the next block. The attacker needs to recompute hash values starting from the modified block all the way to the latest block. In addition, the modified blockchain would differ from the copies stored on other nodes across the internet, making an attack even more difficult.
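The linked-list-with-hashes idea can be sketched in a few lines of Python. The block layout (`prev_hash`, `data`) and the helper names are hypothetical and much simpler than a real block header; the point is only to show how one modified block breaks the link that follows it.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(block: Dict) -> str:
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(batches: List[str]) -> List[Dict]:
    chain, prev = [], GENESIS_PREV
    for data in batches:
        block = {"prev_hash": prev, "data": data}
        chain.append(block)
        prev = block_hash(block)
    return chain


def first_broken_link(chain: List[Dict]) -> int:
    """Return the index of the first block whose stored prev_hash is stale, or -1."""
    prev = GENESIS_PREV
    for index, block in enumerate(chain):
        if block["prev_hash"] != prev:
            return index
        prev = block_hash(block)
    return -1


chain = build_chain(["tx batch 0", "tx batch 1", "tx batch 2"])
intact = first_broken_link(chain)          # -1: every link matches
chain[0]["data"] = "tx batch 0 (tampered)"
broken_at = first_broken_link(chain)       # 1: block 1 still stores the old hash of block 0
print(intact, broken_at)
```

Note that this check alone cannot notice a change to the very last block, since no later block stores its hash yet. That is where comparison against the copies held by other nodes comes in.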

Mining

After explaining the basic concepts behind blocks, it becomes easy to understand the purpose of mining. Mining, in essence, is using proof-of-work to implement a distributed timestamp server on a P2P basis.

What a miner does is increment the nonce in a block until it finds a value that gives the block’s hash the required number of leading zero bits. It is that simple 😄. The more zero bits required, the more work is needed to find such a hash value.
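Here is a toy version of that nonce search. The header bytes, the 8-byte nonce encoding, and the single SHA-256 are assumptions made for this sketch; real Bitcoin mining uses a double hash over a fixed header format and compares against a target value.

```python
import hashlib


def leading_zero_bits_ok(digest: bytes, zero_bits: int) -> bool:
    # A 256-bit hash has `zero_bits` leading zeros iff it is below 2**(256 - zero_bits).
    return int.from_bytes(digest, "big") < (1 << (256 - zero_bits))


def block_digest(header: bytes, nonce: int) -> bytes:
    return hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()


def mine(header: bytes, zero_bits: int) -> int:
    """Increment the nonce until the hash has the required leading zero bits."""
    nonce = 0
    while not leading_zero_bits_ok(block_digest(header, nonce), zero_bits):
        nonce += 1
    return nonce


header = b"prev_hash|merkle_root|timestamp"  # hypothetical header bytes
nonce = mine(header, zero_bits=12)           # about 2**12 attempts expected
print(nonce, block_digest(header, nonce).hex()[:8])
```

Each extra zero bit roughly doubles the expected number of attempts, while checking a claimed nonce still takes a single hash.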

One thing one might ask is: multiple miners could produce blocks whose hash values satisfy the requirement. In that case, how should we determine which miner’s blocks get accepted? The original paper explains that proof-of-work also solves the problem of determining representation in majority decision making. The majority decision is represented by the longest chain. That’s it! The longest chain is the one with the most blocks satisfying the zero-bit requirement, and thus the one with the greatest proof-of-work effort invested in it.

What it suggests, in layman’s terms, is that whoever has the most computational power has a higher probability of generating new blocks and thus getting rewarded with bitcoins. That’s why people crave GPUs, FPGAs, and other accelerators: they are much better at parallel computing and have higher throughput than CPUs.

Personally, I have doubts about the way proof-of-work is implemented. Normally, work could be, for example, providing technical support to customers or helping clean your neighbor’s backyard. The work you did creates positive value for society. In the bitcoin case, the work is simply spending electricity to derive a number, whose value is hard to argue for. One way to justify it might be that it provides a fundamental service so that the blockchain can function properly and smoothly. Even then, it still feels like a bubble, not to mention the massive amount of resources wasted. Could there be another way to implement proof-of-work? If calculating the nonce value takes a long time, could we use waiting time instead to mimic the same result while saving resources at the same time?

Update: Recently, a new cryptocurrency called Chia was introduced and caught my attention. It was developed by the inventor of BitTorrent, Bram Cohen. It uses proof of space and time to replace the energy-hungry proof-of-work approach. In short, the way it works is: whenever the blockchain broadcasts a challenge for the next block, farmers scan their plots to see if they have the hash closest to the challenge. The probability of winning a block is roughly proportional to the space a farmer has relative to the entire network.

Obviously, the demand for storage devices will increase dramatically. In fact, according to Tom’s Hardware, in about a month’s time the storage space allocated to the Chia network increased from 120PB all the way to 1143PB, or 1.14 exabytes. 1.14EB equals 1,140,000TB, or about 57,000 20TB hard drives. Looking back at proof-of-work, it feels like choosing between one evil and another.

Transaction

Transactions are the most important part of the bitcoin system. They are represented as data structures that encode the transfer of value between participants. There are many fields in a transaction structure, but the most important components are the input and the output.

The best way to understand how a transaction works is through an example. Suppose we have sender $A$ and receiver $B$. To send some BTC to receiver $B$, $A$ signs a transaction containing the specific details using their private key. This message is sent to the bitcoin network and contains:

input: the source transaction sent to $A$ at an earlier time.
amount: amount of BTC to send to $B$.
output: $B$’s public address.

Here, the miners verify, using $A$’s public key, whether $A$ actually has access to the funds he/she claims to control. Upon verification, new blocks will be created.

Note: to actually understand how public and private keys work, please refer to public-key cryptography, the Diffie-Hellman algorithm, and the use of number theory in encryption.

Hoare Logic

Sat, 17 Apr 2021 00:00:00 +0000

Hoare logic forms the basis of all deductive verification. To illustrate Hoare logic, we first consider a small imperative programming language, IMP.

In IMP, we have three program constructs: expressions, conditionals, and statements:

Expression takes the form $ E := Z\ |\ V\ |\ e_1 + e_2\ |\ e_1 \times e_2 $
Conditional is self-explanatory: $ C := true\ |\ false\ |\ e_1 = e_2\ |\ e_1 \leq e_2 $
Statement consists of several different forms:
- $S := V := E$ (Assignment)
- $S_1; S_2$ (Composition)
- if $C$ then $S_1$ else $S_2$ (If)
- while $C$ do $S$ (While)

Hoare Triple

In Hoare logic, we specify partial correctness of programs using Hoare triples:

\[\{P\} S \{Q\}\]

Here $P$ is the precondition and $Q$ is the post-condition. $S$ is a statement in IMP.

The interpretation of Hoare triple is as follows:

if $S$ is executed in a state satisfying $P$
and if execution of $S$ terminates
then the program state after $S$ terminates satisfies $Q$

Here is an example: $\{x = 0 \} while\ true\ do\ x := 0\ \{x = 1 \}$ is a valid Hoare triple because the execution of the statement never terminates, so the requirement posed by the Hoare triple is satisfied vacuously.

Thus the specification $\{P\} S \{Q\}$ is called a partial correctness specification, because it doesn’t require $S$ to terminate.

There is also a stronger requirement called total correctness. The total correctness specification is written as:

\[ [P] S [Q]\]

Total correctness requires that if $P$ is satisfied when executing $S$, then $S$ must terminate, and the post-condition $Q$ must be satisfied after $S$ terminates.

Thus the example $\{x = 0 \} while\ true\ do\ x := 0\ \{x = 1 \}$ is no longer valid because it never terminates.

In summary, we can say that Total correctness $=$ Partial correctness $+$ termination.
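The difference can be illustrated with a small bounded experiment in Python. This is only a sketch under loud assumptions: states are dictionaries, a loop is given as a condition plus a body function, and "does not terminate" is approximated by running out of a step budget (`fuel`), which is not a real termination check. All names here are made up for the example.

```python
from typing import Callable, Dict, Iterable, Optional

State = Dict[str, int]
Pred = Callable[[State], bool]


def run_while(cond: Pred, body: Callable[[State], State], state: State,
              fuel: int = 1000) -> Optional[State]:
    """Run `while cond do body`. Return the final state, or None if the budget runs out."""
    current = dict(state)
    for _ in range(fuel):
        if not cond(current):
            return current
        current = body(current)
    return None  # treated as "does not terminate" (an approximation)


def holds(pre: Pred, cond: Pred, body: Callable[[State], State], post: Pred,
          states: Iterable[State], total: bool) -> bool:
    for state in states:
        if not pre(state):
            continue
        final = run_while(cond, body, state)
        if final is None:
            if total:
                return False  # total correctness demands termination
            continue          # partial correctness holds vacuously
        if not post(final):
            return False
    return True


# {x = 0} while true do x := 0 {x = 1}
states = [{"x": v} for v in range(-3, 4)]
pre = lambda s: s["x"] == 0
cond = lambda s: True
body = lambda s: {**s, "x": 0}
post = lambda s: s["x"] == 1

partial = holds(pre, cond, body, post, states, total=False)
total = holds(pre, cond, body, post, states, total=True)
print(partial, total)
```

Under these assumptions the triple passes the partial check and fails the total one, matching the discussion above.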

Proving Partial Correctness

We use $\vDash \{P\} S \{Q\} $ to say a Hoare triple is valid and we use $\vdash \{P\} S \{Q\} $ to indicate we can prove validity of a Hoare triple.

Let’s say we are given an assignment $x := y $ with post-condition $x > 2$. The question is, what do we need to know before the assignment happens so that the post-condition, $x > 2$, holds afterwards?

To prove $Q$ holds after the assignment $x := E$, we need to show that $Q$ with $E$ substituting $x$ holds before the assignment. Formally, we write it as:

\[\vdash \{Q[E / x]\}\ x := E \{Q\}\]

For example, given $ \{ x+1 = n\}\ x := x+1 \ \{x=n\} $, we know this triple is provable: take $Q$, which is $x=n$, and substitute $E$, which is $x+1$, for $x$. This turns $x=n$ into $x+1 = n$, which matches the precondition.
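One way to get a feel for $Q[E/x]$ is to treat predicates as Python functions over program states. Substituting $E$ for $x$ in $Q$ then amounts to evaluating $Q$ in the state after the assignment. The helper `assign_pre` and the state encoding are assumptions of this sketch, and checking a finite grid of states is a sanity check, not a proof.

```python
from typing import Callable, Dict

State = Dict[str, int]
Pred = Callable[[State], bool]
Expr = Callable[[State], int]


def assign_pre(var: str, expr: Expr, post: Pred) -> Pred:
    """Precondition Q[E/x] for `var := expr`: evaluate Q in the state after the assignment."""
    return lambda s: post({**s, var: expr(s)})


post = lambda s: s["x"] == s["n"]                    # Q: x = n
pre = assign_pre("x", lambda s: s["x"] + 1, post)    # Q[x+1 / x]

# Compare with the hand-derived precondition x + 1 = n on a small grid of states.
states = [{"x": x, "n": n} for x in range(-3, 4) for n in range(-3, 4)]
agrees = all(pre(s) == (s["x"] + 1 == s["n"]) for s in states)
print(agrees)
```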

Here is another interesting example: the Hoare triple $ \{z = 2\}y:= x \{y = x\} $ is valid but not provable with the assignment rule alone. If we use the above substitution procedure, the resulting precondition is $x=x$, which is always true but differs from the original precondition $z=2$.

Intuitively, we can prove the post-condition $y = x$ for the statement $y := x$ without any assumptions, so even if we do have assumptions like $z=2$, we should still be able to prove it. This is where the proof rule for precondition strengthening comes in.

Proof Rule for Precondition Strengthening

Formally, we define precondition strengthening as:

\[ \frac{ \vdash \{P'\} S \{Q\}\ \ P \Rightarrow P' }{\vdash \{P\} S \{Q\}} \]

Now, with the original triple $ \{z = 2\}y:= x \{y = x\} $, the assignment rule gives the precondition $ x = x \equiv true $, and since $z=2 \Rightarrow true$ is valid, we can now prove the triple!

A Dual: Post-Condition Weakening

Formally, we define post-condition weakening as:

\[ \frac{ \vdash \{P\} S \{Q'\}\ \ Q' \Rightarrow Q }{\vdash \{P\} S \{Q\}} \]

What this means is that if we can prove a post-condition $Q'$, we can always relax it to something weaker.

For example, given that $\vdash \{true\}S\{x=y \land z=2\}$, we can prove $\{true\}S\{x=y\}$ because $x=y$ is weaker than $ x=y \land z=2 $.

Proof Rule for Composition

For composition, we define the rule as:

\[ \frac{ \vdash \{P\}S_1\{Q\}\ \ \vdash \{Q\}S_2 \{R\} }{ \vdash \{P\}S_1;S_2\{R\} }\]

I won’t show why this is true, so this will be left as an exercise.

Proof Rule for If Statements

Naturally, we define the rule for if statement as:

\[ \frac{ \vdash \{P \land C\} S_1 \{Q\} \quad \vdash \{P \land \neg C\} S_2 \{Q\} }{ \vdash \{P\}\ if\ C\ then\ S_1\ else \ S_2 \ \{Q\} } \]

In summary, this means given we know $P$ is true, no matter what $C$ evaluates to, we will come to the same post-condition $Q$. If you still don’t understand it, just stare at it for five minutes and you should figure out why this is the case:)

Proof Rule for While

To understand the proof rule for the while statement, we first need to understand a simple concept: the loop invariant.

Loop Invariant

Loop invariant $I$ has two properties:

$I$ holds initially before the loop
$I$ holds after each loop iteration

For example, given a loop

i := 0;
j := 0;
n := 0;
while i < n do
 i := i + 1;
 j := i + j

Here, $i \leq n $ is a loop invariant but $i < n $ isn’t.
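The two properties can be checked on a concrete run by asserting the candidate before the loop and after every iteration. The sketch below transcribes the loop into Python; making `n` a parameter (the original fixes it at 0) and the name `invariant_holds` are assumptions for illustration. Passing on a few runs is evidence, not a proof.

```python
from typing import Callable

Invariant = Callable[[int, int, int], bool]


def invariant_holds(inv: Invariant, n: int = 0) -> bool:
    i, j = 0, 0
    if not inv(i, j, n):      # property 1: holds before the loop
        return False
    while i < n:
        i = i + 1
        j = i + j
        if not inv(i, j, n):  # property 2: holds after each iteration
            return False
    return True


for n in (0, 5):
    print(n,
          invariant_holds(lambda i, j, n: i <= n, n),
          invariant_holds(lambda i, j, n: i < n, n))
```

For both values of `n`, the candidate $i \leq n$ survives while $i < n$ fails: with `n = 0` it is already false before the loop, and with `n = 5` it breaks after the last iteration, when `i` reaches `n`.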

Now, we put the properties of the loop invariant $I$ in formal terms. Let $C$ be the loop condition. By definition, $I$ holds initially before the loop, and the body $S$ only executes when $C$ is true, so $I \land C$ holds whenever an iteration begins.

The second property says that $I$ holds after each loop iteration. That means $\{ I \land C\ \} S \{I\} $ holds. Writing $P$ for the invariant, we express this formally as $ \vdash \{P \land C\} S \{P\} $.

Now, we know if a loop terminates, it must be that condition $C$ no longer holds, meaning $ P \land \neg C $ must be true after loop terminates. This is because $P$ is a loop invariant and always holds after each loop iteration, including termination.

Putting all this together, we form the proof rule for while loop:

\[ \frac{ \vdash \{P \land C\} S \{P\} }{ \vdash \{P\} while \ C \ do \ S\{P \land \neg C\} }\]

Inductive Loop Invariant

It’s not always the case that we can prove a loop invariant directly. Here is a counterexample:

Consider the candidate invariant $ I = j \geq 1 $ and the code:

\[i := 1; j := 1; while \ i < n\ do\ \{j := j+i; i := i + 1\}\]

The candidate invariant is $I = j \geq 1$ and $C$ (the loop condition) is $i < n$. So we need to prove the Hoare triple:

\[ \{ j \geq 1 \land i < n \}\ j := j + i;\ i := i + 1 \ \{j \geq 1\} \]

This triple is not valid: the precondition says nothing useful about $i$, so we could pick a state with $i = -100$, $j = 1$, and $n = 0$, and after executing the body once the post-condition $j \geq 1$ no longer holds.

However, if we strengthen the invariant to $j \geq 1 \land i \geq 1$, the new Hoare triple is valid. The strengthened $I$ is an inductive invariant because we can prove that the loop body preserves it.
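A brute-force search over a small range of states makes the difference visible. The sketch below looks for a state that satisfies the candidate invariant and the loop condition $i < n$ (as written in the code) but violates the candidate after one execution of the body. The function names and the search bound are assumptions of this example, and finding no counterexample within the bound is not a proof.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def body(i: int, j: int) -> Tuple[int, int]:
    # Loop body: j := j + i; i := i + 1
    j = j + i
    i = i + 1
    return i, j


def find_counterexample(inv: Callable[[int, int], bool],
                        bound: int = 5) -> Optional[Tuple[int, int, int]]:
    """Search for (i, j, n) where inv and i < n hold, but inv fails after the body."""
    for i, j, n in product(range(-bound, bound + 1), repeat=3):
        if inv(i, j) and i < n:
            i_next, j_next = body(i, j)
            if not inv(i_next, j_next):
                return (i, j, n)
    return None


weak = find_counterexample(lambda i, j: j >= 1)
strong = find_counterexample(lambda i, j: j >= 1 and i >= 1)
print(weak, strong)
```

The weak candidate $j \geq 1$ admits a counterexample with negative $i$, while none turns up for the strengthened candidate within the bound.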

To put everything in action, here is an example showing how to find inductive loop invariant to prove the following Hoare triple:

\[ \{i = 0 \land j = 0 \land n = 5\} \] \[while\ i < n\ do\ i := i + 1; \ j := j + i \] \[\{j = 15\} \]

If we have $ j = \frac{i(i+1)}{2} $, this is a loop invariant because we can prove that:

\[\{j = \frac{i(i+1)}{2} \land i < n\} i = i + 1;\ j = j+ i\ \{j = \frac{i(i+1)}{2}\} \]

However, if we only conjoin this invariant with the exit condition $i \geq n$, we can’t show that $j = 15$, because nothing bounds $i$ from above or fixes the value of $n$.

If we strengthen the invariant with $n = 5$ and $i \leq n$, then conjoining it with the exit condition $ i \geq n$ gives $ i = n = 5$, and thus $j = \frac{5 \cdot 6}{2} = 15$, which proves the given Hoare triple.

How to come up with $j = \frac{i(i+1)}{2}$ is, however, not trivial, and this step requires some human effort in program verification.

Basic Idea behind Program Verification

Automating Reasoning in Hoare Logic

It’s reasonable to automate the tedious part of program verification: proving correctness. The basic idea is to assume an oracle (a human or another program) provides the loop invariants, and to automate the rest of the reasoning.

Automating Hoare logic is based on generating verification conditions (VCs). Essentially, a verification condition is a formula $\phi$ such that the program is correct iff $\phi$ is valid.

There are two ways to generate verification conditions: forwards and backwards.

As the names suggest, a forwards analysis starts from the precondition and generates formulas to prove the post-condition; it computes strongest post-conditions (sp). In contrast, a backwards analysis starts from the post-condition and works back towards the precondition; it computes weakest preconditions (wp).

Here, we start from the backwards method.

Weakest Preconditions

Formally, we define the weakest precondition of $Q$ with respect to $S$ as $wp(S, Q)$.

$wp(S, Q)$ has the property that it is the weakest condition (least amount of information we need to have) that guarantees $Q$ holds after $S$ in any execution.

Thus, Hoare triple $ \{P\}S\{Q\} $ is valid iff $ P\Rightarrow wp(S, Q) $.

Weakest preconditions are defined inductively and follow Hoare’s proof rules:

$wp(x := E, Q) = Q[E/x]$
$ wp(s_1 ; s_2, Q) = wp(s_1, wp(s_2, Q) ) $
$wp(if \ C\ then \ s_1\ else \ s_2, Q) = (C \rightarrow wp(s_1, Q)) \land (\neg C \rightarrow wp(s_2, Q)) $
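
As a small worked example of these rules (the program and post-condition are made up for illustration), take $S$ to be $x := x + 1;\ y := y + x$ and $Q$ to be $y > 5$. Applying the composition rule once and the assignment rule twice gives:

\[
\begin{aligned}
wp(x := x + 1;\ y := y + x,\ y > 5) &= wp(x := x + 1,\ wp(y := y + x,\ y > 5)) \\
&= wp(x := x + 1,\ y + x > 5) \\
&= y + (x + 1) > 5 \\
&= x + y > 4
\end{aligned}
\]

So $\{P\}\ S\ \{y > 5\}$ is valid exactly when $P \Rightarrow x + y > 4$.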

However, for loops, we might not be able to compute the weakest precondition exactly, because in general we don’t know how many times the loop executes.

Thus, we relax our requirement by computing $awp(S,Q)$ ($a$ stands for approximate) instead, hoping that $awp(S, Q)$ is weak enough to be implied by $P$ although it may not be the weakest.

Now, assuming all loops are annotated with invariants, $while \ C \ do \ [I]\ S$, we simply define $awp(while \ C \ do \ [I]\ S, Q) \equiv I$.

However, there is another problem: since $awp$ is only an approximation, $P \Rightarrow awp(S, Q)$ does not necessarily mean that $ \{P\}S\{Q\} $ is valid. There are two reasons:

We don’t know if the loop invariant $I$ provided by the oracle is correct, since it might be provided by a human, and humans make mistakes.
Even if $I$ is correct, we don’t know if $I \land \neg C$ is sufficient to establish $Q$.

Thus, for each statement $S$, we need to generate a verification condition (VC) $ VC(S,Q) $ which encodes the additional conditions to prove.

Verification Conditions

So how do we formulate VC generation rules for loops?

\[ VC(while\ C\ do\ [I]\ S,Q) = ?\]

First, we need to ensure that $Q$ is satisfied after loop, which means $ I \land \neg C \Rightarrow Q $.

To show that $I$ is actually correct, we also need $ \{I \land C\} S \{I\} $.

This implies that we need to show $ I \land C \Rightarrow awp(S, I) $. In case $S$ contains nested loops, we also add $VC(S, I)$.

In summary, to show that the loop invariant $I$ provided by the oracle is correct, we need to show $ I \land C \Rightarrow awp(S,I) \land VC(S, I) $.

To show $I$ is strong enough to establish $Q$, we need to show $ I \land \neg C \Rightarrow Q $.

Putting this together, to address the two reasons why $P \Rightarrow awp(S, Q)$ may not imply that $ \{P\}S\{Q\} $ is valid, the VC for a while loop $ S' = while \ C \ do \ [I]\ S $ is:

\[ VC(S', Q) = (I \land C \Rightarrow awp(S, I) \land VC(S, I) ) \land (I \land \neg C \Rightarrow Q) \]

In essence, the verification condition stands for the additional checks we need to discharge before we can claim that $ P \Rightarrow awp(S, Q) $ implies $ \{P\} S \{Q\} $ is valid.

The verification condition for other statements is as follows:

For assignment, we don’t need any additional checks for precondition because if $ P \Rightarrow wp(S, Q) $, it implies that $ \{P\} S \{Q\} $ is valid. Thus, $ VC(x:= E, Q) = true $.
For composition, we have $ VC(s_1 ; s_2, Q) = VC(s_2, Q) \land VC(s_1, awp(s_2 , Q)) $.
For if statement, we have $ VC(if \ C \ then \ s_1\ else \ s_2, Q) = VC(s_1, Q) \land VC(s_2, Q) $.

Quick question: for the if statement, why don’t we instead use the verification condition generation rule $ (C \Rightarrow VC(s_1, Q)) \land (\neg C \Rightarrow VC(s_2, Q)) $?

Here is a counterexample. Suppose we have $ S = if\ (x > 0) \ while (*)\ x{-}{-};\ else \ skip$, and we are given the loop invariant $x \geq 0$.

If we use the original rule $ VC(s_1, Q) \land VC(s_2, Q) $, then by the VC generation rule for while loops we have to verify that the loop invariant $I$ is correct, so the VC includes $ \{ x \geq 0 \}\ x{-}{-}\ \{ x \geq 0 \} $. This is not valid (take $x = 0$), so the VC correctly rejects the given invariant.

However, if we instead use the rule $ (C \Rightarrow VC(s_1, Q)) \land (\neg C \Rightarrow VC(s_2, Q)) $, the VC becomes $ x > 0 \Rightarrow (\{ x \geq 0 \}\ x{-}{-}\ \{ x \geq 0 \}) $, which is valid, so we would accept an invariant that is not actually inductive. The branch condition $x > 0$ is only known to hold when the loop is entered, not at the start of every iteration. Thus we can’t use this VC generation rule.

Verification of Hoare Triple

Thus, to show validity of Hoare triple $ \{P\} S \{ Q \} $, we need to compute:

$ awp(S, Q) $
$ VC(S, Q) $

Therefore, a Hoare triple is valid if the following formula holds:

\begin{equation}\tag{*} VC(S, Q) \land (P \rightarrow awp(S, Q)) \end{equation}

Thus, if we can prove the validity of formula (*), we know the program obeys its specification.
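
To see the pieces fit together, here is a sketch of the computation for the earlier summation loop. It assumes the oracle supplies the strengthened invariant $I \equiv j = \frac{i(i+1)}{2} \land i \leq n \land n = 5$, with $P \equiv i = 0 \land j = 0 \land n = 5$, $Q \equiv j = 15$, loop condition $C \equiv i < n$, and body $S \equiv i := i + 1;\ j := j + i$. The names $S_{loop}$, $(1)$, $(2)$ and $(3)$ are introduced here only for the example.

\[
\begin{aligned}
awp(S_{loop}, Q) &= I \\
awp(S, I) &= wp(i := i + 1,\ wp(j := j + i,\ I)) \\
&= j + (i + 1) = \tfrac{(i+1)(i+2)}{2} \land i + 1 \leq n \land n = 5 \\
VC(S, I) &= true \\
VC(S_{loop}, Q) &= (I \land i < n \Rightarrow awp(S, I)) \land (I \land \neg (i < n) \Rightarrow Q)
\end{aligned}
\]

The three obligations are then: $(1)$ $I \land i < n \Rightarrow awp(S, I)$, which holds over the integers because $\frac{i(i+1)}{2} + (i + 1) = \frac{(i+1)(i+2)}{2}$ and $i < n$ gives $i + 1 \leq n$; $(2)$ $I \land i \geq n \Rightarrow j = 15$, which holds because $i = n = 5$; and $(3)$ $P \Rightarrow I$, which holds because $0 = \frac{0 \cdot 1}{2}$ and $0 \leq 5$. All three are valid, so the Hoare triple is valid.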

Congruence Closure

Sat, 27 Mar 2021 00:00:00 +0000

This is a summary of how to compute congruence closure. I implemented the algorithm to compute congruence closure and thought I’d never forget it. But my memory started to get blurry after just two days. So I figured I’d put things down so I don’t have to watch the entire lecture again the next time I need it.

Equivalence Relation

An equivalence relation has three properties: it is reflexive, symmetric, and transitive. (E.g., $\geq$ is not an equivalence relation because it breaks the symmetric property: $6 \geq 4$ does not imply $4 \geq 6$.) Formally, a binary relation $R$ over a set $S$ meeting these three properties satisfies:

Reflexive: $\forall s \in S.\ sRs$
Symmetric : $\forall s_1, s_2 \in S.\ s_1 R s_2 \rightarrow s_2 R s_1$
Transitive: $\forall s_1, s_2, s_3 \in S.\ s_1 R s_2 \land s_2 R s_3 \rightarrow s_1 Rs_3$

Congruence Relation

Given a set $S$ equipped with functions $F = \{f_1, \dots, f_n\}$, a relation $R$ over $S$ is a congruence relation if $R$ is an equivalence relation and for every $n$-ary function $f \in F$ we have:

\[\forall \overset{\rightarrow}{s}, \overset{\rightarrow}{t}.\ \bigwedge\limits_{i=1}^{n}s_i R t_i \rightarrow f(\overset{\rightarrow}{s}) R f(\overset{\rightarrow}{t})\]

A counterexample: let $R(x, y)$ be defined as $|x| = |y|$ over the integers, and let $f(x) = x + 1$ (the successor function). $R$ is an equivalence relation, and $R(2, -2)$ holds, but $R(f(2), f(-2)) = R(3, -1)$ does not, so $R$ violates the congruence condition above.

Equivalence Closure

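In short, the equivalence closure $R^E$ is the smallest equivalence relation that includes $R$. This is best illustrated through an example. Given a set $S = \{a, b, c, d\}$ and binary relation $R = \{\langle a, b \rangle , \langle b, c \rangle, \langle d, d \rangle\}$, $R^E$ contains every pair in $R$ plus every pair required by the three properties of an equivalence relation: the reflexive pairs $\langle a, a \rangle$, $\langle b, b \rangle$, $\langle c, c \rangle$; the symmetric pairs $\langle b, a \rangle$, $\langle c, b \rangle$; and the transitive pair $\langle a, c \rangle$ together with its symmetric counterpart $\langle c, a \rangle$.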

Congruence Closure

Naturally, the congruence closure $R^C$ is the smallest congruence relation that contains $R$. This means $R^C$ contains $R^E$ (the equivalence closure we derived before), plus any pair that the congruence condition forces: whenever two elements are related, their images under each function must be related too. For example, given $S = \{a, b, c\}$, a function $f$ such that $f(a) = b$, $f(b) = c$, $f(c) = c$, and a relation $R$ containing $\langle a, b \rangle$, the congruence closure contains nine elements in total. First, we use the procedure above to generate the equivalence closure. Then, because $a$ and $b$ are related and $f(a) = b$, $f(b) = c$, the congruence condition relates $b$ and $c$. Now we apply the procedure for generating the equivalence closure again, which relates every pair of elements in $S$.

Algorithm to Compute Congruence Closure

The high-level description of the algorithm is as follows:

To decide satisfiability of a $T_{=}$ (equality theory) formula:

\[F\ : \ s_1 = t_1 \land \dots \land s_m = t_m \land s_{m+1} \neq t_{m+1} \land \dots \land s_n \neq t_n\]

Compute subterms and construct the initial DAG (each node’s representative is itself).
For each $i \in [1,m]$, process equality $s_i = t_i$ as described. (Essentially, process all equalities first.)
For each $i \in [m + 1,n]$, check if $Rep(s_i) = Rep(t_i)$. (Check whether any disequality contradicts the equalities.)
If there exists some $i \in [m + 1, n]$ for which $Rep(s_i) = Rep(t_i)$, return UNSAT.
If for all $i \in [m + 1, n]$, $Rep(s_i) \neq Rep(t_i)$, return SAT.

This is an example for illustration purposes, borrowed from Prof. Dillig’s slides:

Given formula $F\ : \ f^3(a) = a \land f^5(a) = a \land f(a) \neq a$

The initial DAG would be:

Process equality $f^3(a) = a$ gives us:

Recursively merging the parents results in:

Process equality $f^5(a) = a$ gives us:

Now in this step, $f^2(a)$ and $a$ are in the same congruence class, so we perform the same operation on their parents, processing equality $f^3(a) = f(a)$:

We find that $f(a) \neq a$ conflicts with the equalities, because node $a$’s representative is $f(a)$, meaning the two are in the same congruence class. Thus the formula is UNSAT.

Program Loading and Memory Mapping in Linux

Tue, 03 Nov 2020 00:00:00 +0000

This is a summary of program loading, dynamic paging, signal handling, and memory mapping in Linux.

execve Syscall

One of an operating system’s basic services is to load programs into memory for execution. Programs rely on the execve syscall to get the OS to load a program into memory and start executing it as a process. The kernel version we used for testing is 5.4.0. A quick search inside Elixir gives us:

SYSCALL_DEFINE3(execve,
 const char __user *, filename,
 const char __user *const __user *, argv,
 const char __user *const __user *, envp)
{
 return do_execve(getname(filename), argv, envp);
}

Following the function calls, we eventually reach __do_execve_file. The comment on this function says “sys_execve() executes a new program”, which is pretty straightforward. This function first checks the filename pointer. Then it checks the flags of the current process to make sure the limit on the number of running processes is not exceeded:

if (IS_ERR(filename))
 return PTR_ERR(filename);

/*
 * We move the actual failure in case of RLIMIT_NPROC excess from
 * set*uid() to execve() because too many poorly written programs
 * don't check setuid() return code. Here we additionally recheck
 * whether NPROC limit is still exceeded.
 */
if ((current->flags & PF_NPROC_EXCEEDED) &&
 atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
 retval = -EAGAIN;
 goto out_ret;
}

/* We're below the limit (still or again), so we don't want to make
 * further execve() calls fail. */
current->flags &= ~PF_NPROC_EXCEEDED;

The next important task is to allocate the struct linux_binprm structure defined here. This structure is used to hold the arguments that are used when loading binaries.

bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 if (!bprm)
 goto out_files;

Next, the function performs a series of tasks to prepare the bprm struct. Refer to the linux-insides book for more information on how exactly the bprm structure is filled in.

The most important function called by __do_execve_file is search_binary_handler. Based on the comment, this function cycles the list of binary formats handler, until one recognizes the image. We can find one section of the code surrounded by binfmt_lock:

list_for_each_entry(fmt, &formats, lh) {
 if (!try_module_get(fmt->module))
 continue;
 read_unlock(&binfmt_lock);

 bprm->recursion_depth++;
 retval = fmt->load_binary(bprm);
 bprm->recursion_depth--;

 read_lock(&binfmt_lock);
 put_binfmt(fmt);
 if (retval < 0 && !bprm->mm) {
 /* we got to flush_old_exec() and failed after it */
 read_unlock(&binfmt_lock);
 force_sigsegv(SIGSEGV);
 return retval;
 }
 if (retval != -ENOEXEC || !bprm->file) {
 read_unlock(&binfmt_lock);
 return retval;
 }
}

We can see it calls into load_binary:

retval = fmt->load_binary(bprm);

Here, the load_binary is a pointer in a linux_binfmt struct. For elf format, it can be found here:

static struct linux_binfmt elf_format = {
 .module = THIS_MODULE,
 .load_binary = load_elf_binary,
 .load_shlib = load_elf_library,
 .core_dump = elf_core_dump,
 .min_coredump = ELF_EXEC_PAGESIZE,
};

We can find the load_elf_binary function defined in the fs/binfmt_elf.c file. The function first checks the magic number in the ELF file header. You can find the ELF format on the wiki. For both 32-bit and 64-bit systems, the e_ident field should contain the magic number for ELF files.

/* Get the exec-header */
loc->elf_ex = *((struct elfhdr *)bprm->buf);

retval = -ENOEXEC;
/* First of all, some simple consistency checks */
if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
 goto out;

Then, load_elf_binary will do some tasks to prepare for the executable file. After that, it will try to load the program header table:

elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
 goto out;

Then it traverses the program header table and finds the interpreter, which is responsible for setting up the stack and mapping the ELF binary into the correct location in memory. After the interpreter is obtained, the function performs simple consistency checks on it and loads the interpreter program headers:

/* Load the interpreter program headers */
interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
 interpreter);
if (!interp_elf_phdata)
 goto out_free_dentry;

This function will call setup_arg_pages to finalize the stack vm_area_struct:

/* Do this so that we can load the interpreter, if need be. We will
 change some of these later */
retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
 executable_stack);
if (retval < 0)
 goto out_free_dentry;

It will also mmap the elf image into the correct location in memory. The bss and brk sections are prepared for the executable file:

/* Now we do a little grungy work by mmapping the ELF image into
 the correct location in memory. */
for(i = 0, elf_ppnt = elf_phdata;
 i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {

 ...

 /* There was a PT_LOAD segment with p_memsz > p_filesz
 before this one. Map anonymous pages, if needed,
 and clear the area. */
 retval = set_brk(elf_bss + load_bias,
 elf_brk + load_bias,
 bss_prot);
 if (retval)
 goto out_free_dentry;
 nbyte = ELF_PAGEOFFSET(elf_bss);
 if (nbyte) {
 nbyte = ELF_MIN_ALIGN - nbyte;
 if (nbyte > elf_brk - elf_bss)
 nbyte = elf_brk - elf_bss;
 if (clear_user((void __user *)elf_bss +
 load_bias, nbyte)) {
 }

It will also call elf_map to map the segment to [vaddr, vaddr + file size] and align and then perform some checks:

error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
 elf_prot, elf_flags, total_size);

The interpreter is then loaded:

elf_entry = load_elf_interp(&loc->interp_elf_ex,
 interpreter,
 &interp_map_addr,
 load_bias, interp_elf_phdata);

Finally, the ELF tables are created:

retval = create_elf_tables(bprm, &loc->elf_ex,
 load_addr, interp_load_addr);

After everything is prepared, we can call the start_thread function, which prepares the new task’s registers and segments for execution. We pass this function the set of registers for the new task, the address of the entry point of the new task, and the address of the top of the stack for the new task.

start_thread(regs, elf_entry, bprm->p);

A lot of the information here can also be found in the linux-insides book. I found it very helpful in clearing up my confusion.

In our own implementation, we will not call the loaded program’s main function. Instead, our loader transfers control to the entry point of the loaded program via the jmp instruction. This differs from calling main in two major ways:

Jumping to the entry point means we execute the glibc startup functions before main is called. This includes setting up thread-local storage. Calling main directly would simply jump to main with the loader’s TLS, with no other setup involved.
jmp doesn’t push a return address on the stack. When the loaded program finishes execution, it exits the loader program instead of giving control back to the caller.

Scheduler Activation

Sat, 24 Oct 2020 00:00:00 +0000

This is a summary of scheduler activation. To discuss scheduler activation, we must first understand what a thread is. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.

Kernel Level Threads Pros/Cons

Good functionality, system wide integration
Threads are seen and scheduled only by the kernel. A lot of kernel information should be invisible to user threads and can be useful for scheduling
Poor performance: every thread-related call traps. This situation was a lot worse in the 1990s than it is now, mainly due to clock speed. The scheduling quanta are roughly the same, but because clock speeds are much faster today, you can execute orders of magnitude more instructions per quantum than you could in 1990. Even if a trap costs, let’s say, 10 cycles to complete, it would be a much bigger fraction of the quantum in 1990 than it is today.

User Level Threads Pros/Cons

Good performance (most thread operations don’t involve the kernel)
Good scheduling policy flexibility: done by thread lib
Poor system-wide integration
Multi-programmed workloads are hard to schedule
I/O, page faults invisible
Potential for incorrect behavior
- User level scheduler may not be cooperative. With user threads running on kernel threads, a kernel thread may block when its user thread blocks, so an application can run out of kernel threads to run its user threads. May be gilding the lily.

Some Problems about User-Level Threads on Kernel Interface

Insufficient visibility between the kernel and user thread lib
Kernel events such as preemption or I/O are not visible to the user lib
- For example, if a user-level thread blocks, then the kernel thread serving it also blocks.
Kernel threads are scheduled obliviously with respect to the user-level thread library, so the two schedulers can interfere with each other.
Kernel time-slicing of threads
- For example, a user-level thread holding a spin-lock can be preempted, which can potentially cause all other user threads to wait.

Scheduler Activation

The basic principle of scheduler activation is to expose revocation: tell me when you take something away. This is basically the same idea as the exokernel. For example, it uses interfaces like

add_processor()
has_blocked()

The basics about scheduler activation are

Multi-threaded programs are still given an address space
Facilitate flow of kernel information between user and kernel threads
Kernel explicitly vectors kernel events to the user-level thread
- via scheduler activation (upcall)
Extended kernel interface for processor allocation-related events
- Essentially exchanging information

Scheduler Activation vs Kernel Threads

Key differences:

Preempted threads are never resumed by the kernel directly.
- Essentially, every new SA is a brand new context.
- For example, if you do blocking I/O, the kernel will provide a new scheduler activation and vector into that application’s address space. There isn’t a notion of “resume”. The kernel simply finds some new scheduler activation to notify you that a piece of work has unblocked. In modern kernels, you would do something like stack unwinding to get back into user space.

An important problem is what happens if a user thread is descheduled while it’s running inside the user-level scheduler. The user thread will hold a lock on the user-level run queue. That means no other user thread can be scheduled to run, because none of them can acquire the lock. Because there’s no notion of “resume” in scheduler activation, we can’t really resume execution inside the scheduler. Thus, we run into a deadlock.

One solution is to detect whether we are using a lock and keep executing until we leave the locked region. Of course, there are too many gotchas in this solution.

Another solution is that the kernel can make a copy of the critical section and execute the critical section itself regardless of what the user thread chooses to do. Therefore, we can guarantee by the time you vector back into user space the lock is no longer held. So the kernel is basically executing the user code! Crazy, right? Now we ran into more gotchas. What if the code is written in Java? How to find a locked region in userspace? What if …

Another thing we want to mention is page faults. A page fault indicates that part of your address space is missing. So there will be a notification with a new scheduler activation. Once you do something with it, you will likely touch that same part of the address space and fault again.

What is the solution?

Add MathJax Support to Jekyll and Hugo

Thu, 22 Oct 2020 00:00:00 +0000

I was using MathJax v2 for a while and heard that v3 performs significantly faster than v2. Many great tutorials explain how to add MathJax support to Jekyll websites, but some of them only cover MathJax v2. So here is a brief summary of how to add MathJax v3 support to your Jekyll website (I’ve recently migrated to Hugo, and adding support to Hugo is pretty similar).

In the _config.yml located in your root directory, add this line:

markdown: kramdown

Create a file called mathjax.html inside _includes/ and add these lines (these settings come from the MathJax documentation):

<script>
MathJax = {
 tex: {
 inlineMath: [ ['$', '$'], ['\\(', '\\)'] ]
 },
 svg: {
 fontCache: 'global'
 }
};
</script>
<script
 type="text/javascript" id="MathJax-script" async
 src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js">
</script>

For a Hugo website, the script is exactly the same. The only difference is that instead of putting mathjax.html into _includes/, you put it inside layouts/partials. For example, I put my mathjax.html into the theme directory themes/mini/layouts/partials.

For Jekyll, add this line in your _includes/head.html before </head>:

{% include mathjax.html %}

For Hugo, add the following line before </head> in your theme’s layouts/partials/header.html:

{{ partial "mathjax.html" . }}

Now you can write in-line math equations in your markdown file like:

\\(f(x) = x^2\\)

$f(x) = x^2$

The above text will be rendered as: $f(x) = x^2$

If you are already using MathJax v2 and just wish to convert your configuration to v3, you may also try this configuration converter. There is a much more detailed guide on how to migrate from MathJax v2 to v3, but it may contain information unnecessary for average Hugo or Jekyll users. The most useful resource is the official MathJax documentation.

Linux Program Measurement and mmap

Wed, 23 Sep 2020 00:00:00 +0000

This is a summary of Linux kernel program measurement and mmap. The specs of our experiment environment are listed below. For more details regarding the CPU, please refer to cpu world. This is the system spec:

Attribute	Value
Processor name (BIOS)	Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
Cores	6
Logical processors	12
TLB/Cache details	64-byte Prefetching Data TLB: 1-GB pages, 4-way set associative, 4 entries Data TLB: 4-KB Pages, 4-way set associative, 64 entries Instruction TLB: 4-KByte pages, 8-way set associative, 64 entries L2 TLB: 1-MB, 4-way set associative, 64-byte line size Shared 2nd-Level TLB: 4-KB / 2-MB pages, 6-way associative, 1536 entries. Plus, 1-GB pages, 4-way, 16 entries
RAM	32GB
Operating System	Ubuntu 20.04.1 LTS
Kernel Version	5.4.0-47-generic

> 8-way set associative means the CPU cache is made up of sets that can fit 8 blocks each.

Here are the details for the CPU cache, which we will need later:

Cache	L1 data	L1 instruction	L2	L3
Size	6 x 32 KB	6 x 32 KB	6 x 256 KB	15 MB
Associativity	8-way set associative	8-way set associative	8-way set associative	20-way set associative
Line size:	64 bytes	64 bytes	64 bytes	64 bytes
Comments:	Direct-mapped	Direct-mapped	Non-inclusive Direct-mapped	Inclusive Shared between all cores

Memory Map

To print the /proc/self/maps file for a process, we use sprintf to construct the file name and then use system from stdlib to cat the contents of the running process’s address space. Executing the program shows (also available on gist):

address perms offset dev inode pathname
559e3e51f000-559e3e520000 r--p 00000000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
559e3e520000-559e3e521000 r-xp 00001000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
559e3e521000-559e3e522000 r--p 00002000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
559e3e522000-559e3e523000 r--p 00002000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
559e3e523000-559e3e524000 rw-p 00003000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
7faf5c477000-7faf5c49c000 r--p 00000000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c49c000-7faf5c614000 r-xp 00025000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c614000-7faf5c65e000 r--p 0019d000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c65e000-7faf5c65f000 ---p 001e7000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c65f000-7faf5c662000 r--p 001e7000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c662000-7faf5c665000 rw-p 001ea000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7faf5c665000-7faf5c66b000 rw-p 00000000 00:00 0
7faf5c685000-7faf5c686000 r--p 00000000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7faf5c686000-7faf5c6a9000 r-xp 00001000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7faf5c6a9000-7faf5c6b1000 r--p 00024000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7faf5c6b2000-7faf5c6b3000 r--p 0002c000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7faf5c6b3000-7faf5c6b4000 rw-p 0002d000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7faf5c6b4000-7faf5c6b5000 rw-p 00000000 00:00 0
7ffcddb8d000-7ffcddbae000 rw-p 00000000 00:00 0 [stack]
7ffcddbe0000-7ffcddbe3000 r--p 00000000 00:00 0 [vvar]
7ffcddbe3000-7ffcddbe4000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

Based on the Linux man page, each column has the following meaning. The address field is the address range in the process’s address space that the mapping occupies. The perms field is a set of permissions:

r = read
w = write
x = execute
s = shared
p = private (copy on write)

The offset field is the offset into the file/whatever; dev is the device (major:minor); inode is the inode on that device. 0 indicates that no inode is associated with the memory region, as would be the case with BSS (uninitialized data).

The pathname field will usually be the file that is backing the mapping. For ELF files, you can easily coordinate with the offset field by looking at the Offset field in the ELF program headers (readelf -l). In addition, we can see a few other pseudo-paths:

[stack]: the initial process’s (also known as the main thread’s) stack.
[vdso]: The virtual dynamically linked shared object. More detailed descriptions can be found on lwn.
[vvar]: location of kernel space variables mapped in user space needed by virtual system calls. Essentially, a kernel-space physical address is mapped into the userspace.
[vsyscall]: similar to vDSO, vsyscall is another segment used to accelerate certain system calls in Linux. Vsyscall has some limitations; among other things, there is only space for a handful of virtual system calls. More detailed descriptions can be found on lwn.

One thing interesting here is that when we execute the same program twice, we can see after the first run, the output is

7fffbc92f000-7fffbc930000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

Type the same command again:

7ffd6a94d000-7ffd6a94e000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

Note that the vDSO area has moved, while the vsyscall page remains at the same location. The location of the vsyscall page is nailed down in the kernel ABI, but the vDSO area, like most other areas in the user-space memory layout, has its location randomized every time it is mapped. The vsyscall page is a legacy implementation of user-space system call acceleration. Since it has a fixed address, it is vulnerable to security issues. Because applications depend on the existence and exact address of that page, most functions are simply removed and replaced by a special trap instruction. A more detailed explanation can be found on lwn.net.

Another interesting thing we observed is that the base address of the executable (the start of the text section) and the start address of libc are far apart. This is also a result of ASLR, which helps prevent return-to-libc attacks.

getrusage

Then, we call getrusage at the end of our program and print out the fields. We will need getrusage later. Here is a sample output for some fields inside struct rusage:

utime: 1306
stime: 0
maxrss: 2692
minflt: 76
majflt: 0
inblock: 0
oublock: 0
nvcsw: 2
nivcsw: 0

Here is a short list of descriptions for each of these fields. More detailed information can be found on the GNU website.

utime: time spent executing user instructions.
stime: time spent in operating system code on behalf of processes.
maxrss: the maximum resident set size used, in kilobytes. That is, the maximum number of kilobytes of physical memory that processes used simultaneously.
minflt: the number of page faults which were serviced without requiring any I/O.
majflt: the number of page faults which were serviced by doing I/O.
inblock: the number of times the file system had to read from the disk on behalf of processes.
oublock: the number of times the file system had to write to the disk on behalf of processes.
nvcsw: the number of times processes voluntarily invoked a context switch (usually to wait for some service).
nivcsw: the number of times an involuntary context switch took place (because a time slice expired, or another process of higher priority was scheduled).

perf_event_open

The perf_event_open interface is useful for measuring numerous system events. However, glibc doesn’t provide a wrapper for this system call. Instead, we need to use syscall directly.

To use perf_event_open, we can create a wrapper function that does the actual syscall for us. Take the example from the Linux man page:

static int
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
 int cpu, int group_fd, unsigned long flags)
{
 int ret;

 ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
 group_fd, flags);
 return ret;
}

Here __NR_perf_event_open specifies the syscall number. On our local machine, we can go to /usr/include/x86_64-linux-gnu/sys/syscall.h, which points to the header that defines __NR_perf_event_open. In our case, it is defined in /usr/include/x86_64-linux-gnu/asm/unistd_64.h.

If we call objdump -d on the binary file, we will see something like this

000000000000119a <perf_event_open>:
 119a: 55 push %rbp
 119b: 48 89 e5 mov %rsp,%rbp
 119e: 48 83 ec 30 sub $0x30,%rsp
 11a2: 48 89 7d e8 mov %rdi,-0x18(%rbp)
 11a6: 89 75 e4 mov %esi,-0x1c(%rbp)
 11a9: 89 55 e0 mov %edx,-0x20(%rbp)
 11ac: 89 4d dc mov %ecx,-0x24(%rbp)
 11af: 4c 89 45 d0 mov %r8,-0x30(%rbp)
 11b3: 48 8b 7d d0 mov -0x30(%rbp),%rdi
 11b7: 8b 75 dc mov -0x24(%rbp),%esi
 11ba: 8b 4d e0 mov -0x20(%rbp),%ecx
 11bd: 8b 55 e4 mov -0x1c(%rbp),%edx
 11c0: 48 8b 45 e8 mov -0x18(%rbp),%rax
 11c4: 49 89 f9 mov %rdi,%r9
 11c7: 41 89 f0 mov %esi,%r8d
 11ca: 48 89 c6 mov %rax,%rsi
 11cd: bf 2a 01 00 00 mov $0x12a,%edi
 11d2: b8 00 00 00 00 mov $0x0,%eax
 11d7: e8 84 fe ff ff callq 1060 <syscall@plt>
 11dc: 89 45 fc mov %eax,-0x4(%rbp)
 11df: 8b 45 fc mov -0x4(%rbp),%eax
 11e2: 48 98 cltq
 11e4: c9 leaveq
 11e5: c3 retq

We notice there’s one interesting line

callq 1060 <syscall@plt>

PLT stands for Procedure Linkage Table. This line is a call to the syscall entry in the PLT. The PLT allows the absolute addresses of functions in shared libraries to be resolved at runtime.

Looking at the <syscall@plt> entry in the disassembly of section .plt, we see

0000000000001060 <syscall@plt>:
 1060: ff 25 62 2f 00 00 jmpq *0x2f62(%rip) #3fc8<syscall@GLIBC_2.2.5>
 1066: 68 03 00 00 00 pushq $0x3
 106b: e9 b0 ff ff ff jmpq 1020 <.plt>

Notice that this jump is indirect: it goes through a pointer stored in the GOT (Global Offset Table). The GOT slot will eventually hold the absolute address of syscall. On the first call, the slot points back to the instruction right after the jump in the PLT, 0x1066. From there, the second jump enters the runtime linker code, which resolves syscall in the shared library that provides it.

We also see the comment for the first jump instruction

#3fc8<syscall@GLIBC_2.2.5>

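Using objdump -R, we can see the dynamic relocation entries in the file. The slot at 3fc8 from the comment above appears as the R_X86_64_JUMP_SLOT entry for syscall: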

DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
0000000000003d98 R_X86_64_RELATIVE *ABS*+0x0000000000001190
0000000000003da0 R_X86_64_RELATIVE *ABS*+0x0000000000001150
0000000000004008 R_X86_64_RELATIVE *ABS*+0x0000000000004008
0000000000003fd8 R_X86_64_GLOB_DAT _ITM_deregisterTMCloneTable
0000000000003fe0 R_X86_64_GLOB_DAT __libc_start_main@GLIBC_2.2.5
0000000000003fe8 R_X86_64_GLOB_DAT __gmon_start__
0000000000003ff0 R_X86_64_GLOB_DAT _ITM_registerTMCloneTable
0000000000003ff8 R_X86_64_GLOB_DAT __cxa_finalize@GLIBC_2.2.5
0000000000003fb0 R_X86_64_JUMP_SLOT getpid@GLIBC_2.2.5
0000000000003fb8 R_X86_64_JUMP_SLOT __stack_chk_fail@GLIBC_2.4
0000000000003fc0 R_X86_64_JUMP_SLOT system@GLIBC_2.2.5
0000000000003fc8 R_X86_64_JUMP_SLOT syscall@GLIBC_2.2.5
0000000000003fd0 R_X86_64_JUMP_SLOT sprintf@GLIBC_2.2.5

Monitor Events

Next, we are going to look at L1 data cache metrics. We are interested in L1 data cache accesses, misses, and data TLB misses. We will measure this code in our experiment. CACHE_LINE_SIZE is defined as 64 to match our CPU specs.

// p points to a region that is 1GB (ideally)
void do_mem_access(char* p, int size) {
 int i, j, count, outer, locality;
 int ws_base = 0;
 int max_base = ((size / CACHE_LINE_SIZE) - 512);
 for(outer = 0; outer < (1<<20); ++outer) {
 long r = simplerand() % max_base;
 // Pick a starting offset
 if( opt_random_access ) {
 ws_base = r;
 } else {
 ws_base += 512;
 if( ws_base >= max_base ) {
 ws_base = 0;
 }
 }
 for(locality = 0; locality < 16; locality++) {
 volatile char *a;
 char c;
 for(i = 0; i < 512; i++) {
 // Working set of 512 cache lines, 32KB
 a = p + (ws_base + i) * CACHE_LINE_SIZE;
 if((i%8) == 0) {
 *a = 1;
 } else {
 c = *a;
 }
 }
 }
 }
}

This routine picks a working set of 512 cache lines and touches each line once: every eighth access is a write and the rest are reads. The pass over the working set is repeated 16 times during each outer iteration. Each read or write lands on a new cache line, and since 512 lines of 64 bytes add up to 32KB, the innermost loop covers a region the size of the entire L1 data cache.

When opt_random_access is true, the starting cache line of the working set is picked at random. Otherwise, it advances by 512 cache lines (one working set) on each outer iteration. The main difference is that with opt_random_access set to true, the hardware cannot predict the next base address, which is likely to increase the miss rate.

To measure L1 data cache metrics, we use the perf_event_open interface discussed above. To count L1 data cache reads, we configure our struct perf_event_attr as follows (the version shown counts accesses; PERF_COUNT_HW_CACHE_RESULT_MISS counts misses):

#define CALC_CONFIG(perf_hw_cache_id, perf_hw_cache_op_id, perf_hw_cache_op_result_id) \
((perf_hw_cache_id) | (perf_hw_cache_op_id << 8) | (perf_hw_cache_op_result_id << 16))

hw_event.type = PERF_TYPE_HW_CACHE;
hw_event.size = sizeof(struct perf_event_attr);
hw_event.disabled = 1; // disable at init time
hw_event.exclude_kernel = 1;
hw_event.config = CALC_CONFIG(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_ACCESS);

The exact details can be found in the Linux man page. The important part is:

hw_event.config = CALC_CONFIG(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_ACCESS);

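With PERF_COUNT_HW_CACHE_RESULT_ACCESS, this configuration counts L1 data cache read accesses; passing PERF_COUNT_HW_CACHE_RESULT_MISS as the last argument counts read misses instead. The arguments passed to perf_event_open are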

pid_t pid = 0;
int cpu = -1;
int group_fd = -1;
unsigned long flags = 0;

The choice of these parameters is also explained in the Linux man page. After perf_event_open is called, we reset and enable the counter by calling

ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

These calls reset the event count of the given file descriptor to zero and then enable the event. After do_mem_access(p, size) has executed, we call ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) to disable the event and then read the result with read(fd, &result, sizeof(long long)). The layout of result depends on how PERF_FORMAT_* was specified. You can also check lxr to see how __perf_event_read_size calculates the size of the data that is read. In our case, it’s simply a u64.
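
Putting the steps together, a minimal sketch of the whole open, reset, enable, run, disable, read cycle could look like the following. The names perf_open, count_l1d_read_misses, and touch_buffer are hypothetical and are not the code used for the measurements in this post; the sketch also assumes the default read format, so a single 64-bit value is read.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper, since glibc does not export perf_event_open(). */
static int perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                     int group_fd, unsigned long flags) {
    return (int)syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Count user-space L1D read misses while work(arg) runs.
 * Returns 0 and stores the count in *count, or -1 if the counter could
 * not be opened or read (for example, because of missing privileges).
 */
int count_l1d_read_misses(void (*work)(void *), void *arg, long long *count) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.disabled = 1;          /* start disabled, enable explicitly below */
    attr.exclude_kernel = 1;
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

    int fd = perf_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}

/* Hypothetical workload: read every byte of a 4KB buffer. */
void touch_buffer(void *arg) {
    volatile char *p = arg;
    char sink = 0;
    for (int i = 0; i < 4096; i++)
        sink += p[i];
    (void)sink;
}
```

Returning -1 instead of aborting keeps the caller in control when the counter is unavailable, which is common inside containers and VMs.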

Be aware that running the binary as an unprivileged user might cause perf_event_open to fail (in which case it returns -1). Using sudo is one workaround. Run cat /proc/sys/kernel/perf_event_paranoid and check the value: -1 means you have raw access to kernel tracepoints. Otherwise, you might have trouble accessing the performance counters without root privileges. Check this stackexchange post for more details.

To make the results more repeatable, we should flush the L1 data cache before enabling the performance counters. We do this by touching every byte of a memory buffer that is larger than the per-core L1 data cache:

size_t buffer_size = 32 * 1024 + 1024; // slightly larger than the 32KB L1D
char *buff = malloc(buffer_size);
for (size_t i = 0; i < buffer_size; i++) {
 buff[i] = rand(); // touching every byte evicts the previous cache contents
}

We will also lock the process onto a single processor by using the sched_setaffinity function. Our example is

cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(7, &set);
int aff = sched_setaffinity(0, sizeof(cpu_set_t), &set);

We perform each of the above experiments 5 times. First, we turn on random cache line base address generation. On average, we see around 1010665367 L1 data cache read misses with a standard deviation of 61010967 misses. When random access is disabled, we see on average 964420324 read misses with a standard deviation of 65787193 misses. We can also measure the number of L1 data cache write misses by using the PERF_COUNT_HW_CACHE_OP_WRITE config instead, and PERF_COUNT_HW_CACHE_OP_PREFETCH gives us prefetch misses. In our case, both of these metrics are unavailable, which we can confirm by checking /arch/x86/events/intel/core.c in lxr.

We can also use the PERF_COUNT_HW_CACHE_DTLB config option for data TLB measurements. With random access turned on, reads incur on average 3390719 misses (std dev 17579), while writes incur 1486451 misses (std dev 13455). The prefetch metrics for the TLB are unavailable in our case. To find out which metrics are supported, check the constant static __initconst const u64 skl_hw_cache_event_ids for your specific kernel version.

With random cache line access turned off, we see 517335 data TLB read misses with a standard deviation of 3820 misses. For writes we see on average 809671 misses with a standard deviation of 9580 misses. This is a significant reduction compared to the random access case.

To calculate the L1 cache miss rate and the data TLB miss rate, we use 100.0 * cache_misses / cache_accesses and 100.0 * tlb_misses / cache_accesses. With random access turned off, we get an L1 read miss rate of $$miss_{cache} = 1.5\%$$ and a TLB read miss rate of $$miss_{tlb} \approx 0$$. When random access is turned on, we have $$miss_{cache} = 1.4\%$$ and $$miss_{tlb} \approx 0$$. The miss rate is very low in all scenarios. This is mainly because the innermost loop in our routine operates on a working set that is already present in the L1 cache and the TLB. The read/write operations use contiguous cache lines, so there are almost no misses while we access the 512 cache lines. If one miss causes the entire new working set to be cached, then there are no subsequent misses until the entire working set has been iterated over.
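
The rate formula is simple enough to put in a helper. This is a sketch with a made-up name, miss_rate_percent, and the guard against zero accesses is an addition that the formula in the text does not mention.

```c
/* Miss rate as a percentage; returns 0.0 when there were no accesses. */
double miss_rate_percent(unsigned long long misses,
                         unsigned long long accesses) {
    if (accesses == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)accesses;
}
```

For instance, 15 misses out of 1000 accesses (illustrative numbers, not measurements) give a rate of 1.5.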

If we use getrusage we can see the metrics listed below:

Metrics	Mean	std dev
utime	868629	126044
stime	253586	20112
maxrss	1049691	43
minflt	262214	1
majflt	0	0
inblock	0	0
oublock	0	0
nvcsw	0.4	0.54
nivcsw	47	7

mmap

Next we are going to explore the behavior of mmap. Previously, we used malloc for data allocation; now we use mmap instead and see what happens. We only use read accesses for the benchmark metrics since they are available for both the L1 cache and the TLB.

First, we use the MAP_ANONYMOUS as a flag passed to mmap. This flag means the mapping is not backed by any file; its contents are initialized to zero. The complete call is

mmap(NULL, length, PROT_READ | PROT_WRITE,
 MAP_PRIVATE | MAP_ANONYMOUS, fd_ignore, offset);

For more details, refer to the mmap man page.

When we turn on random access and use the perf_event_open interface to collect metrics, we see 956148031 L1 data cache read misses (std dev 84631843) and 3370309 data TLB read misses (std dev 17792). This is not really different from the malloc approach we used before. Running strace shows that malloc calls mmap. The memory that backs malloc() allocations is handled by the kernel in much the same way as the memory that backs private anonymous mappings created with mmap().

Then, we try to use mmap() to create mapping in the virtual address space backed by a file instead of using MAP_ANONYMOUS.

We first test mmap with MAP_PRIVATE. According to the man page, this flag creates a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.

Note that we should call fallocate() on the newly created file first; otherwise the file is empty and accessing the mapping raises a bus error (SIGBUS).
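
The file-backed setup can be sketched as follows. This is a minimal illustration rather than the benchmark code: the function name map_file_and_touch is invented, posix_fallocate is used here as the portable sibling of fallocate(), and MAP_SHARED is chosen so that the final msync has something to write back.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) path, size it to len bytes, map it MAP_SHARED,
 * write one byte per page, and flush the result to the file.
 * Returns 0 on success, -1 on failure.
 */
int map_file_and_touch(const char *path, size_t len) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Without this step the file is empty, and touching the mapping
       would raise SIGBUS. */
    if (posix_fallocate(fd, 0, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = 1;

    int rc = msync(p, len, MS_SYNC) == 0 ? 0 : -1;
    munmap(p, len);
    return rc;
}
```

Swapping MAP_SHARED for MAP_PRIVATE in the mmap call gives the copy-on-write variant, in which case the writes never reach the file.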

When we measure the L1 data cache misses, the count is around 946128512 (std dev 956148031), so nothing special happens. With the MAP_SHARED flag, the result is similar. The results fluctuate from run to run, but overall they are not much different. After all, the benchmark just reads from memory, and whether the address is backed by a file or not doesn’t play a big role in the cache miss rate. The L1 data cache misses are shown below:

Flag	PRIVATE	PRIVATE+POPULATE	SHARED	SHARED+POPULATE
Mean	783864673	769314361	842915231	816749524
Std dev	77816766	53913082	54613278	60580595

If we take a look at the data TLB misses, the result is

Flag	PRIVATE	PRIVATE+POPULATE	SHARED	SHARED+POPULATE
Mean	3372303	3370740	3381755	3377370
Std dev	9884	13567	17626	11776

Still, there doesn’t seem to be any significant fluctuation in the number of data TLB misses. This pattern also applies to sequential access, except that the number of data TLB misses is a lot lower with sequential access.

If we instead use getrusage(), we get something like this

Flag	PRIVATE	PRIVATE+POPULATE	SHARED	SHARED+POPULATE
Usec/std dev	20/0	20/0	20/0	20/0
usec/std dev	801512/ 78346	793452/ 143556	872342/ 124124	671957/ 229314
Ssec/std dev	0/0	0/0	0/0	0/0
ssec/std dev	475977/ 54355	475678/ 134253	445467/ 99345	536041/ 98797
oublock/std dev	0/0	0/0	2997152/ 82256	2097152/ 19760

The most interesting part here is that oublock becomes nonzero as soon as MAP_SHARED is enabled. As mentioned previously, oublock is the number of times the file system had to write to the disk on behalf of processes. Because the address is now backed by a file, write operations cause the file system to write the contents back to the file.

mmap() creates a new mapping in the virtual address space of the calling process. However, it doesn’t allocate RAM. If we call memset() followed by msync() with the MS_SYNC flag, we get some interesting results from getrusage. These observations are summarized here:

Kernel-space time is much higher. It usually takes 1 sec (no std dev) as opposed to 0. Synchronizing to files on disk requires more kernel participation.
minflt (the number of page faults serviced without requiring any I/O) is much higher, at around 540782 (std dev 3). With more memory mapped in, faults that require I/O become less likely.
oublock is much higher, at around 4196512 (std dev 1). The sync operation means approximately double the amount of writes to disk.
nvcsw is higher, meaning there are more voluntary context switches. Writing results to disk takes time, so the process likely needs to context switch while waiting for I/O to finish.

We may notice that the number of data TLB misses is lower than the total number of pages the application uses. One obvious explanation is the use of huge pages: one huge page covers many small pages. Also, because the TLB prefetches and the working set access pattern is contiguous, the TLB hit rate will be high. Because we have a set-associative TLB and we access memory in a fairly deterministic way, it’s easy to predict where the next access will point. For example, if the replacement policy is FIFO, then each entry stays in place for the same number of clock cycles before it is replaced. This also applies to other policies. One way to determine the replacement algorithm is to use P-Chase.

strace

We then use strace to trace the syscalls of our application. The output contains some interesting information, for example

access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
...
arch_prctl(ARCH_SET_FS, 0x7fdc6ad83540) = 0

According to the arch_prctl man page, arch_prctl() sets architecture-specific process or thread state. The ARCH_SET_FS option sets the 64-bit base for the FS register to addr, which in our case is 0x7fdc6ad83540. Let’s set a breakpoint at arch_prctl and backtrace from there:

#0 0x00007ffff7febb55 in ?? () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7fd104c in ?? () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7fd0108 in ?? () from /lib64/ld-linux-x86-64.so.2
#3 0x0000000000000001 in ?? ()
#4 0x00007fffffffe2fa in ?? ()
#5 0x0000000000000000 in ?? ()

We can see that the FS segment base is set by ld-linux, which is part of glibc, during program loading. A simple Google search tells us /lib64/ld-linux-x86-64.so.2 is the dynamic linker. A more detailed description can be found in this post and on lwn.net. During startup, the loader initializes TLS. This includes allocating memory and setting the FS base value to point to the beginning of the TLS, which is done via the arch_prctl syscall. More can be found here. This init_tls() is called here, which subsequently makes the actual syscall in tls.h.

/etc/ld.so.preload is similar to LD_PRELOAD, except that it doesn’t suffer from the security limitations imposed on LD_PRELOAD (explanation here). This is a feature of glibc.

Competing for Memory

Next we are going to fork another process that will compete for memory with our process under test. We will use this code snippet which is going to be executed by both the parent and the child process

int compete_for_memory(void* unused) {
 long mem_size = get_mem_size();
 int page_sz = sysconf(_SC_PAGE_SIZE);
 printf("Total memsize is %3.2f GBs\n",
 (double)mem_size/(1024*1024*1024));
 fflush(stdout);
 char* p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
 MAP_NORESERVE|MAP_PRIVATE|MAP_ANONYMOUS, -1, (off_t) 0);
 if (p == MAP_FAILED) {
  perror("Failed anon MMAP competition");
  return -1; // nothing to touch without a mapping
 }

 int i = 0;
 while(1) {
 volatile char *a;
 long r = simplerand() % (mem_size/page_sz);
 char c = 0; // initialized so the read-accumulate below is well defined
 if( i >= mem_size/page_sz ) {
 i = 0;
 }
 // One read and write per page
 //a = p + i * page_sz; // sequential access
 a = p + r * page_sz;
 c += *a;
 if((i%8) == 0) {
 *a = 1;
 }
 i++;
 }
 return 0;
}

get_mem_size() is implemented using this portable code (where the function is named getMemorySize())

#if defined(_WIN32)
#include <Windows.h>

#elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) && defined(__MACH__))
#include <unistd.h>
#include <sys/types.h>
#include <sys/param.h>
#if defined(BSD)
#include <sys/sysctl.h>
#endif

#else
#error "Unable to define getMemorySize( ) for an unknown OS."
#endif

/**
 * Returns the size of physical memory (RAM) in bytes.
 */
size_t getMemorySize( )
{
#if defined(_WIN32) && (defined(__CYGWIN__) || defined(__CYGWIN32__))
 /* Cygwin under Windows. ------------------------------------ */
 /* New 64-bit MEMORYSTATUSEX isn't available. Use old 32.bit */
 MEMORYSTATUS status;
 status.dwLength = sizeof(status);
 GlobalMemoryStatus( &status );
 return (size_t)status.dwTotalPhys;

#elif defined(_WIN32)
 /* Windows. ------------------------------------------------- */
 /* Use new 64-bit MEMORYSTATUSEX, not old 32-bit MEMORYSTATUS */
 MEMORYSTATUSEX status;
 status.dwLength = sizeof(status);
 GlobalMemoryStatusEx( &status );
 return (size_t)status.ullTotalPhys;

#elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) && defined(__MACH__))
 /* UNIX variants. ------------------------------------------- */
 /* Prefer sysctl() over sysconf() except sysctl() HW_REALMEM and HW_PHYSMEM */

#if defined(CTL_HW) && (defined(HW_MEMSIZE) || defined(HW_PHYSMEM64))
 int mib[2];
 mib[0] = CTL_HW;
#if defined(HW_MEMSIZE)
 mib[1] = HW_MEMSIZE; /* OSX. --------------------- */
#elif defined(HW_PHYSMEM64)
 mib[1] = HW_PHYSMEM64; /* NetBSD, OpenBSD. --------- */
#endif
 int64_t size = 0; /* 64-bit */
 size_t len = sizeof( size );
 if ( sysctl( mib, 2, &size, &len, NULL, 0 ) == 0 )
 return (size_t)size;
 return 0L; /* Failed? */

#elif defined(_SC_AIX_REALMEM)
 /* AIX. ----------------------------------------------------- */
 return (size_t)sysconf( _SC_AIX_REALMEM ) * (size_t)1024L;

#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
 /* FreeBSD, Linux, OpenBSD, and Solaris. -------------------- */
 return (size_t)sysconf( _SC_PHYS_PAGES ) *
 (size_t)sysconf( _SC_PAGESIZE );

#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGE_SIZE)
 /* Legacy. -------------------------------------------------- */
 return (size_t)sysconf( _SC_PHYS_PAGES ) *
 (size_t)sysconf( _SC_PAGE_SIZE );

#elif defined(CTL_HW) && (defined(HW_PHYSMEM) || defined(HW_REALMEM))
 /* DragonFly BSD, FreeBSD, NetBSD, OpenBSD, and OSX. -------- */
 int mib[2];
 mib[0] = CTL_HW;
#if defined(HW_REALMEM)
 mib[1] = HW_REALMEM; /* FreeBSD. ----------------- */
#elif defined(HW_PHYSMEM)
 mib[1] = HW_PHYSMEM; /* Others. ------------------ */
#endif
 unsigned int size = 0; /* 32-bit */
 size_t len = sizeof( size );
 if ( sysctl( mib, 2, &size, &len, NULL, 0 ) == 0 )
 return (size_t)size;
 return 0L; /* Failed? */
#endif /* sysctl and sysconf variants */

#else
 return 0L; /* Unknown OS. */
#endif
}

The important line is

 return (size_t)sysconf( _SC_PHYS_PAGES ) *
 (size_t)sysconf( _SC_PAGESIZE );

One thing to notice in the routine for competing for memory is that we call fflush after the printf. The purpose of fflush(stream) is to push any data buffered in user space for that stream out to the underlying file. This is needed because stdout is buffered: when it refers to a terminal it is line buffered and flushed at each newline, but when it is redirected to a file or a pipe it is fully buffered, so the message could sit in the buffer while the loop below runs. fflush forces the write even in the absence of a newline. stderr is unbuffered, so fflush would not be necessary there.

For this experiment, we ran the test in a VM, because the contending process takes all the RAM and would completely halt the machine if run on the host. To ensure our VM has enough swap space, we follow this tutorial to create a 4GB swap area (we allocated 2GB of RAM to the VM).

One thing we observe is that the program takes significantly longer to run. In our experiment we need to reduce the number of iterations from 1 << 20 to 1 << 8 to get sensible results without running for days.

When we use the PRIVATE and ANONYMOUS options with random access turned on, the number of data TLB misses is 335009 (std dev 7298). We can’t collect L1 cache data because the session is automatically logged out whenever the L1D counter is used. Here are some interesting things to notice:

MAP_PRIVATE + MAP_ANONYMOUS: TLB misses: 335009 (std dev 17298)
minflt: 4220 (std dev 231)
oublock: 8 (std dev 4)
nivcsw: 19 (std dev 10)
MAP_SHARED: TLB misses: 251284 (std dev 103292)
minflt: 2784 (std dev 231)
majflt: 247 (std dev 65)
oublock: 18200 (std dev 2987)
nivcsw: 8 (std dev 7)

The most important difference here is that oublock is triggered much more easily because of the constant swapping. When file-backed memory is used, we also notice that majflt is much higher. Because pages are constantly traveling between the swap area and memory, the page fault rate becomes a lot higher. oublock also follows the previous pattern, as file-backed memory requires filesystem involvement.

Finally, we also modify the kernel itself (or more precisely, its LRU page replacement algorithm). In mm/vmscan.c there is a function called shrink_page_list. In it, there is a switch statement with a PAGEREF_ACTIVATE case, which is the case where the kernel sees that the page has been recently accessed. In this case the kernel jumps to activate_locked, but we change it to do the same thing as the PAGEREF_RECLAIM case. We can simply move the case down so that it falls through to the PAGEREF_RECLAIM case. After that, we need to recompile the kernel for the VM. Again, we summarize the most interesting results:

MAP_PRIVATE + MAP_ANONYMOUS: TLB misses: 308031 (std dev 17298)
minflt: 4223 (std dev 791)
oublock: 8 (std dev 1)
nivcsw: 11 (std dev 5)
MAP_SHARED: TLB misses: 251284 (std dev 103292)
minflt: 2724 (std dev 231)
majflt: 0 (std dev 0)
oublock: 18200 (std dev 2987)
nivcsw: 8 (std dev 7)

We can see that most of the patterns follow the previous results after the modified kernel is installed. One main difference is that the majflt value drops back down to 0.

Memory Resource Management in VMware ESX Server

Mon, 21 Sep 2020 00:00:00 +0000

VMware ESX Server is a software layer designed to multiplex hardware resources among virtual machines running unmodified commodity operating systems. Unlike VMware Workstation, ESX Server is a type 1 hypervisor, which means it runs directly on bare metal. ESX Server focuses on running guest VMs without modifying the guest OSes at all, which is challenging.

Memory virtualization is done by interposing an extra abstraction layer between a physical address, as seen from the VM’s point of view, and a machine address, which represents the actual hardware memory. ESX Server maintains a pmap data structure for each VM to translate physical page numbers (PPNs) to machine page numbers (MPNs). A separate shadow page table, kept consistent with the physical-to-machine mappings, holds virtual-to-machine page mappings. This avoids additional overhead, as the hardware TLB caches direct virtual-to-machine address translations read from the shadow page table.
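
The relationship between the three tables can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: flat arrays stand in for real page tables, NPAGES and INVALID are made-up constants, and build_shadow is a hypothetical name rather than anything from ESX Server.

```c
#include <stddef.h>

#define NPAGES 8
#define INVALID (-1)

/*
 * guest_pt maps a virtual page number to a "physical" page number (PPN),
 * and pmap maps a PPN to a machine page number (MPN). The shadow table
 * stores the composed virtual-to-machine mapping so that a lookup needs
 * one step instead of two.
 */
void build_shadow(const int guest_pt[NPAGES], const int pmap[NPAGES],
                  int shadow[NPAGES]) {
    for (size_t vpn = 0; vpn < NPAGES; vpn++) {
        int ppn = guest_pt[vpn];
        shadow[vpn] = (ppn >= 0 && ppn < NPAGES) ? pmap[ppn] : INVALID;
    }
}
```

Whenever either input table changes, the affected shadow entries have to be recomputed, which is the consistency requirement mentioned above.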

Key features

Ballooning is a technique used by the server to reclaim memory. As its name suggests, the hypervisor inflates the balloon by instructing the balloon driver module inside the guest to allocate pinned physical pages, and deflates it by instructing the driver to deallocate previously allocated pages. The idea behind this technique is that the hypervisor is unaware of the specific usage patterns or policies of its guests, so page replacement decisions are best made in the guest VM. When the hypervisor overcommits memory, it needs some way to reclaim memory from the VMs. It does so by consuming some of the memory that the guest OS believes is physically present in the virtual machine. The guest OS then swaps memory to disk, reducing the load on the host’s physical memory, and the host reallocates that memory to other VMs. A detailed description of ballooning can be found in this post.

Page coloring can be used to reduce cache misses or to partition resources, but it may complicate memory management, especially in the presence of huge pages. Because coloring enforces ownership, it may result in distinct L2 cache entries.

Sharing memory is achieved by comparing the contents of pages, since modifying guest operating system internals is not possible. Because comparing every page against every other page would be $O(n^2)$, hashing is used to identify candidate pages and make the process more efficient. By letting VMs share pages based on their contents, the host can potentially save a dramatic amount of space. For example, zero pages are a great opportunity for page sharing: one zero page can be mapped into multiple VMs. A hint is a hash hit, but it doesn’t guarantee that the content of the page hasn’t changed in the meantime.
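
The hash-then-verify idea can be sketched in a few lines. The hash below is FNV-1a, picked only because it is short; it is a stand-in and not the hash function ESX Server uses, and the names page_hash, can_share, and PAGE_BYTES are made up for this example.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* FNV-1a over one page, used here as a stand-in content hash. */
uint64_t page_hash(const unsigned char *page) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < PAGE_BYTES; i++) {
        h ^= page[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/*
 * Two pages may be backed by one machine page only if their hashes match
 * and a full comparison confirms it; a hash hit alone is just a hint.
 */
int can_share(const unsigned char *a, const unsigned char *b) {
    if (page_hash(a) != page_hash(b))
        return 0;
    return memcmp(a, b, PAGE_BYTES) == 0;
}
```

In a real system the hashes would be kept in a table keyed by hash value, so that each new page is compared only against the few candidates with the same hash instead of against every other page.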

Idle memory presents a problem for pure proportional-share algorithms because they do not incorporate any information about active memory usage, and memory demand may change dynamically. ESX Server collects an idle memory tax from VMs to mitigate this issue. A client is charged more for an idle page than for an active one; the cost of idle memory is inflated by the tax rate. Statistics on idle pages in guests are collected by the hypervisor without the guests’ involvement, by periodically sampling pages inside the VMs at random.
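
To make the tax concrete, here is a small sketch of the adjusted shares-per-page ratio as I understand it from the paper: with S shares, P pages, active fraction f, and tax rate tau, an idle page costs k = 1 / (1 - tau) times as much as an active one, and the ratio is S / (P * (f + k * (1 - f))). Treat the formula as my reading of the paper and check it against the original before relying on it; the function name is invented.

```c
/*
 * Adjusted shares-per-page ratio.
 *   shares: S, pages: P, active_fraction: f in [0, 1],
 *   tax_rate: tau in [0, 1).
 * An idle page is charged k = 1 / (1 - tau) times an active page.
 */
double adjusted_shares_per_page(double shares, double pages,
                                double active_fraction, double tax_rate) {
    double k = 1.0 / (1.0 - tax_rate);
    return shares /
           (pages * (active_fraction + k * (1.0 - active_fraction)));
}
```

With a tax rate of 0.75, a fully active client with 100 shares and 50 pages gets a ratio of 2.0, while a fully idle client with the same allocation gets 0.5, which makes the idle client the preferred victim when memory has to be reclaimed.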

Questions

a. What is the overhead of ballooning? Triggering memory management in the VM by “tricking” it into thinking that memory is scarce or plentiful may lead to unexpected behavior.
b. Does content-based sharing pose security vulnerabilities?
c. Remapping hot I/O pages to low memory can be a bottleneck if the number of pages is high. How do modern hypervisors cope with this issue?

Xen and the Art of Virtualization

Wed, 16 Sep 2020 00:00:00 +0000

Xen is an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource-managed fashion, without sacrificing either performance or functionality. Xen is a type 1 hypervisor, which runs directly on bare metal. We will summarize what Xen is and what its attributes are.

paravirtualization - presents a virtual machine abstraction that is similar but not identical to the underlying hardware.

The Virtual Machine Interface

Memory is hard to virtualize mostly because x86 doesn’t support a software-managed TLB. A tagged TLB allows the guest OS and the hypervisor to coexist because each entry can be associated with an address-space identifier. This is not possible on x86, so changing address spaces likely requires flushing the TLB. Thus, to achieve better performance, guest OSes are responsible for managing the hardware page tables. Batching can be used by the guest OS to cut down on repeated requests to the hypervisor for new pages when new processes are created.

CPU virtualization has implications for guest OSes. Principally, the OS is the most privileged entity on top of the hardware. A hypervisor in the middle means guest OSes must be modified to run at a lower privilege level. On x86, this is not a problem since OSes execute in ring 0 while applications execute in ring 3, leaving ring 1 and ring 2 unused. Privileged instructions executed by the guest generally have to be checked by the hypervisor. For performance reasons, system call exceptions can be handled directly by the CPU. Page faults, however, need to go through the hypervisor because only code in ring 0 can read the faulting address from CR2.

Device I/O is implemented by transferring data between the guest and Xen using shared-memory, asynchronous buffer-descriptor rings. Event delivery is achieved by the hypervisor sending notifications to its guest asynchronously. When and whether to hold off these callbacks is at the discretion of the guest.
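
A descriptor ring of this kind can be sketched as a fixed-size circular buffer with a producer index and a consumer index. The version below is a single-threaded toy with invented names (struct ring, ring_put, ring_get); real Xen rings carry separate request and response indices and need memory barriers between guest and hypervisor, none of which is shown here.

```c
#include <stdint.h>

#define RING_SIZE 8u   /* power of two, so indices can wrap with a mask */

struct ring {
    uint32_t prod;             /* next slot the producer will fill */
    uint32_t cons;             /* next slot the consumer will read */
    uint64_t desc[RING_SIZE];  /* descriptors, e.g. buffer addresses */
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_put(struct ring *r, uint64_t d) {
    if (r->prod - r->cons == RING_SIZE)
        return -1;
    r->desc[r->prod & (RING_SIZE - 1)] = d;
    r->prod++;
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int ring_get(struct ring *r, uint64_t *d) {
    if (r->prod == r->cons)
        return -1;
    *d = r->desc[r->cons & (RING_SIZE - 1)];
    r->cons++;
    return 0;
}
```

Because only descriptors travel through the ring, the data buffers themselves stay in place, which is what keeps this kind of transfer cheap.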

Essentially, the virtualization interface design is based on a number of factors. The hypervisor acts as a security guard that validates guest requests which would normally go directly to hardware if the guest were running in ring 0. The bottom line is that the hypervisor shouldn’t be involved unless there are hardware limitations, or when resource validation or management is required. The goal is to separate policy from mechanism wherever possible. This is similar to an exokernel in that the hypervisor merely provides basic functionality without understanding higher-level issues.

Questions

a. Why does x86 make it hard to support efficient virtualization?
b. How does having Xen live in a 64MB section at the top of every address space avoid TLB flushes when entering and leaving the hypervisor?

Start Linux Kernel Hacking

Mon, 14 Sep 2020 00:00:00 +0000

This is a summary of how to compile and boot the Linux kernel on the KVM-qemu virtual machine. It covers how to get a VM running in KVM, how to build a customized kernel, and how to use GDB with the Linux kernel. The experiment is conducted on an amd64 architecture CPU. We use Ubuntu as our testing environment but the steps covered here should apply to other distros as well.

Getting the VM running in KVM

The Ubuntu ISO image is downloaded from the Canonical website. The kernel is downloaded directly from kernel.org. The specs of our test environment are:

CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
RAM: 32 GB
Host and Guest OS: Ubuntu 20.04.1 LTS
Host Kernel Version: 5.4.0-47-generic
GCC: 7.5.0
QEMU emulator version: 4.2.0
Guest Kernel Version: 5.8.6

After we obtain the Ubuntu ISO image, we use the virt-manager GUI to install the OS. One thing to note here is that the default directory for virtual disks is /var/lib/libvirt/images. Since my system partition is located on a separate SSD with limited space, I changed the virtual disk directory to my /home directory instead.

We also create the new virtual disk inside virt-manager. We chose raw format instead of qcow2. Creating a new image file can also be done in command line using:

qemu-img create -f raw -o preallocation=full vmdisk.img 40G

Preallocation can be turned on or off depending on personal preference. After the disk image is created, we proceed in virt-manager to install Ubuntu on the newly allocated virtual disk. We enabled storage for this virtual machine so that we don’t need to repeat the installation process every time we launch the VM. One thing to note here is that we don’t need a swap area inside the virtual machine; we can simply use the whole virtual disk for the / partition.

To start the VM from the command line, you might need to change the owner of the disk image. We add the user to both the kvm and libvirt groups. Images created or accessed by virt-manager seem to have their owner changed to libvirt-qemu, which may cause problems when starting from the command line.

After the installation is finished, we can simply launch the virtual machine from the virt-manager GUI. We can also use the command line to start the VM:

kvm -accel kvm -m 8G -smp 6 --snapshot -drive format=raw,file=/home/ed/virtimg/ubuntu20.04

The argument -accel kvm enables Kernel-based Virtual Machine full virtualization, which uses hardware acceleration. Without this option the VM will become extremely slow. The -m 8G assigns the given amount of memory to the VM. The -smp 6 assigns the given number of cores to the guest if the host has multiple cores. The --snapshot ensures that no changes are made to your image during an execution so you can do something dangerous and have the original image file preserved. The -drive option specifies the location of the virtual disk and its format. We will use some of these options later.

To confirm the VM has internet access, simply execute apt install pkg-name in the guest terminal. The absence of error messages indicates that network access from the guest VM is working. For example, when we execute sudo apt install llvm it shows:

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
 llvm-runtime
The following NEW packages will be installed:
 llvm llvm-runtime
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 6,796 B of archives.
After this operation, 128 kB of additional disk space will be used.
Do you want to continue? [Y/n]

Building the Kernel

We can use our customized kernel for our newly created VM. After we obtain the Linux kernel from kernel.org, we extract the source into <kernel dir> and create a separate build directory <kbuild> (outside <kernel dir>).

Then we enter the <kbuild> directory, run

yes "" | make -C /home/ed/Desktop/linux_kernel/kbuild O=$(pwd) config

This creates a .config file inside <kbuild> with the default options selected. We then open the configuration file and ensure CONFIG_SATA_AHCI=y, which builds the SATA disk driver into the kernel. This allows the kernel to boot off a (virtual) SATA drive without having to load a module first.

Next we build the kernel by running make in <kbuild>. We use the -j 6 option to speed up the build using multiple processor cores. This process can take a long time.

Build and Install Kernel Modules

To build modules locally on host, we create another separate <install_mod_dir> directory for building kernel modules. Then in <kbuild>, execute

make INSTALL_MOD_PATH=/home/ed/Desktop/linux_kernel/install_mod_dir modules_install

Now there is a lib directory inside /home/ed/Desktop/linux_kernel/install_mod_dir, which holds all the kernel modules we are about to install.

The complete list of built-in modules can be printed using cat modules.builtin inside lib/modules/5.8.6. Here is a link to all the modules being built. We didn’t modify anything in the configuration.

Then we use guestmount to mount the virtual disk to a mount point on the host

guestmount -a /home/ed/virtimg/ubuntu20.04 -i ~/vm/linux/

In Ubuntu this step yields the following message:

libguestfs: error: /usr/bin/supermin exited with error status 1.
To see full error messages you may need to enable debugging.
Do:
 export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1
and run the command again. For further information, read:
 http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs
You can also run 'libguestfs-test-tool' and post the *complete* output
into a bug report or message to the libguestfs mailing list.

According to the post and the bug report on Ubuntu Launchpad, the underlying problem is that the kernel image cannot be read.

To fix the issue, we need to run

sudo chmod +r /boot/vmlinuz-*

We can verify the contents inside ~/vm/linux by simply changing into that directory.

To install the modules we just built, we copy <install_mod_dir>/lib/modules into the mounted filesystem at <mount_point>/lib/modules.

Finally, we unmount the filesystem by doing

fusermount -u /mnt/hdd1/vm/linux

Booting KVM with the new Kernel

To boot up the VM with the new kernel, we will add a few extra command line options to kvm. For convenience, we put the scripts into a file. It’s also available on gist:

#!/bin/bash

kvm \
 -s \
 -display gtk \
 -cpu host \
 -vga qxl \
 -accel kvm \
 -kernel "/home/ed/Desktop/linux_kernel/kbuild/arch/x86/boot/bzImage" \
 -append "root=/dev/sda1 console=ttyS0,115200n8 nokaslr" \
 -drive format=raw,file=/home/ed/virtimg/ubuntu20.04 \
 -m 8G \
 -smp 6 \
 --snapshot \
 -S

Aside from the command line arguments we discussed before, there are a few new ones here. The -s switch is shorthand for -gdb tcp::1234. The -display gtk option is optional and selects the GTK display output. -cpu host says the guest should emulate the host processor. -vga qxl enables 3D acceleration on the guest system; -vga virtio also offers good performance in our case. -kernel tells QEMU to boot the given kernel image. The -append option and its arguments specify where the root partition of the hard disk is, and the console parameter adds a serial console at boot so you can see boot messages. The --snapshot option in QEMU says the images that refer to an original image will use Redirect-on-Write to avoid changing the original image. The -S option freezes the CPU at startup, so the kernel won’t start executing until we attach a debugger and continue. We only use it later in the debugging stage.

Again, we can verify that the new kernel has internet access by running apt update. No errors are shown, which indicates the network is functioning correctly.

Booting Process

Now that we can boot the VM successfully, we can first measure how much time the kernel spends booting. Running dmesg -d shows the timestamp and the time delta between messages. The final line shows [10.842998]. If we use systemd-analyze, it outputs

Startup finished in 795ms (kernel) + 5.451s (userspace) = 6.247s
graphical.target reached after 5.439s in userspace

The gap between these two measurements arises because dmesg is not a reliable measure of how long the boot process takes. dmesg merely collects messages. Drivers and other system processes can output messages at any point in time, and there may or may not be processes spawning between those messages.

Next, we look at how PCI devices are involved in kernel startup. lspci outputs the following:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)

We can use the PCI address here to search for corresponding information in dmesg. For example, if we use the domain value 0000: as the query, we get something like:

[ 0.295026] PCI host bridge to bus 0000:00
[ 0.299055] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
[ 0.300133] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
[ 0.301163] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
[ 0.311006] pci 0000:00:02.0: [1af4:1050] type 00 class 0x030000
[ 0.319650] pci 0000:00:03.0: [8086:100e] type 00 class 0x020000

The full result is also available as gist.

The lspci command prints the type of device right after the address. For example, the first one is a host bridge. We specifically selected the messages in the type 00 class format here, because the class value tells us the type of the corresponding device. The upper 16 bits of the 24-bit class value are the class code, and the lowest byte is the programming interface. We can look up each class code in include/linux/pci_ids.h. For example,

#define PCI_CLASS_NETWORK_ETHERNET 0x0200

This line shows that the value 0x0200 corresponds to an Ethernet network device, which matches the class 0x020000 entry for 0000:00:03.0 in our dmesg output as well as the lspci result.
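To make the mapping between the dmesg class values and the pci_ids.h macros concrete, here is a minimal Python sketch. The function names are ours, chosen only for illustration, and the table is a small subset of the macros in include/linux/pci_ids.h that happen to match the devices listed above.

```python
# Subset of the 16-bit class codes defined in include/linux/pci_ids.h.
PCI_CLASSES = {
    0x0101: "PCI_CLASS_STORAGE_IDE",
    0x0200: "PCI_CLASS_NETWORK_ETHERNET",
    0x0300: "PCI_CLASS_DISPLAY_VGA",
    0x0600: "PCI_CLASS_BRIDGE_HOST",
    0x0601: "PCI_CLASS_BRIDGE_ISA",
}


def decode_pci_class(class_value):
    """Split a 24-bit class value into (16-bit class code, prog-if byte)."""
    class_code = class_value >> 8   # base class and subclass
    prog_if = class_value & 0xFF    # programming interface
    return class_code, prog_if


def class_name(class_value):
    """Return the matching macro name, or "unknown" if it is not in our subset."""
    class_code, _ = decode_pci_class(class_value)
    return PCI_CLASSES.get(class_code, "unknown")


if __name__ == "__main__":
    # Class values copied from the dmesg output above.
    for value in (0x060000, 0x060100, 0x010180, 0x030000, 0x020000):
        code, prog_if = decode_pci_class(value)
        print(f"{value:#08x}: class {code:#06x}, prog-if {prog_if:#04x}, {class_name(value)}")
```

Running it against the five class values from dmesg should name a host bridge, an ISA bridge, an IDE controller, a VGA controller, and an Ethernet controller, in the same order as the lspci listing.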

Debugging Kernel

To build a KVM+GDB-friendly kernel, we need to have the proper CONFIG_DEBUG* options set in the .config file. More specifically, we need the following options:

CONFIG_DEBUG_INFO y: compile the kernel with debug info. The full list of definitions can be found here.
CONFIG_DEBUG_INFO_DWARF4 y: generate dwarf4 debug info. Definition can be found here.
CONFIG_GDB_SCRIPTS y: creates the required links to GDB helper scripts in the build directory. Full definition can be found here.
CONFIG_DEBUG_INFO_REDUCED n: disable reduced debug info.
CONFIG_KGDB y: kernel debugging location. Full list of definitions found here.
CONFIG_FRAME_POINTER y: compile the kernel with frame pointers. Full list of definitions found here.
CONFIG_SATA_AHCI y: this option enables support for AHCI Serial ATA. Definition found here.
CONFIG_KVM_GUEST y: this option enables various optimizations for running under the KVM hypervisor. Definition found here.
CONFIG_RANDOMIZE_BASE n: drop support for Kernel Address Space Layout Randomization (KASLR). Definition found here. We also added nokaslr in our qemu arguments.
CONFIG_SMP y: enable Symmetric multi-processing support. Definition found here.

Now we can recompile the kernel and attach gdb to it. We simply add the -S option to kvm so that the VM only starts running once gdb is attached. Then we enter our <kbuild> directory and execute:

gdb vmlinux
(gdb) target remote :1234

The step is also documented in the kernel community documentation.

Set Breakpoints

Spin locks are easy to find in a kernel, so we will set breakpoints on spin_lock. For kernel 5.8.6, we see that spin_lock is defined in https://elixir.bootlin.com/linux/v5.8.6/source/include/linux/spinlock.h#L351 as an inline function. If we trace the function, we can see that the actual function we should break on is _raw_spin_lock, defined here:

#ifndef CONFIG_INLINE_SPIN_LOCK
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
 __raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif

If we need to break the execution only when a given program is executed, we can use the program’s PID as the condition. The problem is, how do we get the program PID if it doesn’t last for long?

We could instead first set a breakpoint on fork. We can break its kernel call at _do_fork which is defined here. After that, we can simply continue executing the kernel until we run the program.

Note: we need to compile the program and open a new terminal first, since both involve forking new processes, which would hit _do_fork before our program runs.

Then we print the process PID using p $lx_current().pid. We then use this value as the condition for b _raw_spin_lock if $lx_current().pid == pid_value inside gdb.

If we want _raw_spin_lock to break under different contexts, we can simply use different PIDs to select the contexts. We can also set breakpoints in functions from different contexts that call spin_lock and see what they do. For example, we can set a breakpoint at expand_downwards, defined here. If we backtrace from this function, we get a series of calls; we mention the important ones here:

#1 0xffffffff81284c4e in expand_stack
#3 0xffffffff813843db in load_elf_binary
#8 do_execve
#12 0xffffffff81b1f658 in do_syscall_64

We also added a helper script in .gdbinit to print out the name of the current process, which is ‘‘anacron’’ in this case. In short, this process executes commands periodically, and it performs a system call that loads an ELF binary, which requires stack expansion.

Another example is timer interrupt. The get_next_timer_interrupt calls _raw_spin_lock. We select some messages from backtrace:

#1 0xffffffff8113b224 in get_next_timer_interrupt
#2 0xffffffff8114d52e in tick_nohz_next_event
#4 tick_nohz_idle_stop_tick ()
#5 0xffffffff810df567 in cpuidle_idle_call ()

In short, this is a timer interrupt path that gets called when the CPU is idle.

The last example is hrtimer_interrupt. The selected messages are:

#4 0xffffffff8114d80c in tick_sched_timer
#7 0xffffffff8113c8e7 in hrtimer_interrupt
#12 run_on_irqstack_cond
#14 0xffffffff81c00cc2 in asm_sysvec_apic_timer_interrupt

In summary, hrtimer_interrupt is called as an event handler. This function is responsible for selecting all timers that have expired and either moving them to the expiration list (if they may be processed in softIRQ context) or calling the handler function directly.

Syscall

Essentially, the processor switches from user mode to kernel mode and starts executing the system call entry point, entry_SYSCALL_64, whose definition can be found here. This is the only entry point used for 64-bit system calls. We can set a breakpoint here. When the breakpoint is hit, we use info registers in gdb to get the value of cr3. In our case, it is 0x22a6d5806. Then we step from this breakpoint and soon reach SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp. After this macro executes, the value in cr3 has changed to 0x22a6d4006. The macro is defined here.

We can see that whenever the processor switches from user mode to kernel mode, the value of cr3 changes. The root cause is Page Table Isolation (PTI), a countermeasure against attacks on the shared user/kernel address space such as ‘‘Meltdown’’. To mitigate this class of attacks, two independent page table copies are created, one used in kernel mode and one used in user mode. The cr3 register enables the processor to translate linear addresses into physical addresses by locating the page directory and page tables for the current task. So whenever the process enters kernel mode, the page directory address of the kernel copy must be loaded into cr3.

If we add nopti to -append in the QEMU command line and perform the same steps, we get 0x231466005 both before and after SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp is executed. According to the description in the Linux kernel tree, nopti on X86_64 is equivalent to pti=off, which explains the constant value of cr3.
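The two cr3 values observed with PTI enabled differ in exactly two bits, which we can check with a few lines of Python. This is only a sketch under assumptions: we assume that bit 12 selects the user half of an 8 KiB page-directory pair and that bit 11 marks the user PCID, which is our reading of the masks used in arch/x86/entry/calling.h. The helper names are ours, and the real macro also handles TLB flush bookkeeping that is ignored here.

```python
PAGE_SHIFT = 12
PTI_USER_PGTABLE_BIT = PAGE_SHIFT    # assumed: user PGD is the second page of an 8 KiB pair
PTI_USER_PCID_BIT = 11               # assumed: user-mode PCIDs have this bit set
LOW_BITS_MASK = (1 << PAGE_SHIFT) - 1


def split_cr3(cr3):
    """Return (page-directory base, low 12 bits) of a cr3 value."""
    return cr3 & ~LOW_BITS_MASK, cr3 & LOW_BITS_MASK


def to_kernel_cr3(user_cr3):
    """Clear the two user bits, mimicking the effect we observed in gdb."""
    user_mask = (1 << PTI_USER_PGTABLE_BIT) | (1 << PTI_USER_PCID_BIT)
    return user_cr3 & ~user_mask


if __name__ == "__main__":
    before, after = 0x22A6D5806, 0x22A6D4006   # values read with info registers
    print(hex(before ^ after))                 # bits that changed
    print([hex(part) for part in split_cr3(before)])
    print([hex(part) for part in split_cr3(after)])
    print(hex(to_kernel_cr3(before)))
```

Under these assumptions the kernel page directory sits exactly one page below the user one, so switching address spaces only needs a mask instead of a memory lookup.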

Performance Anomaly of 802.11b

Sun, 13 Sep 2020 00:00:00 +0000

This research was conducted by Martin Heusse, Franck Rousseau, Gilles Berger-Sabbatel, and Andrzej Duda, and analyzes the performance of IEEE 802.11b wireless local area networks. The degraded transmission rate is caused by the CSMA/CA channel access method.

Overview

IEEE 802.11b wireless local area networks exhibit degraded performance when some mobile hosts use a lower bit rate than the others, which is caused by the CSMA/CA channel access method. When one host changes its modulation type and thereby lowers its bit rate, it occupies the channel for a longer time, penalizing the other hosts that still use the higher bit rate. The paper Performance Anomaly of 802.11b analyzes how this anomaly arises.

Transmission Overhead

Consider a single host in an 802.11b cell transmitting a single data frame. The overall transmission time is expressed as:

$$T = t_{tr} + t_{ov}$$

where the constant overhead

$$t_{ov} = DIFS + t_{pr} + SIFS + t_{pr} + t_{ack}$$

The transmission process can be represented by the graph

When there are multiple hosts attempting to transmit, a host will execute the exponential backoff algorithm - it waits for a random interval to avoid saturating the channel, resulting in extra time spent in the contention procedure:

$$T(N) = t_{tr} + t_{ov} + t_{cont}(N)$$

Finally, the useful throughput obtained by a host depends on the number of hosts:

$$p(N) = t_{tr} / T(N)$$

This indicates the useful throughput is smaller than the nominal bit rate and largely depends on the number of competing hosts.

Anomaly

Assume there are $N$ hosts: $N-1$ hosts use the high transmission rate $R=11$ Mb/s, and one host transmits at rate $r=5.5$, $2$, or $1$ Mb/s. The transmission time of a fast host is:

$$T_f = t_{ov}^{R} + \frac{s_d}{R} + t_{cont}$$

The transmission time of the slow host is:

$$T_s = t_{ov}^{r} + \frac{s_d}{r} + t_{cont}$$

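The short-term behavior of CSMA/CA is shown to be unfair, but over the long term every host gets the same chance to access the channel. The fraction of channel time used by a fast host is therefore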

$$U_f = \frac{T_f}{(N-1)T_f + T_s + P_c(N)\times t_{jam} \times N}$$

$t_{jam}$ is the average time spent in collisions, calculated over all possible pairs among the fast hosts and the slow one:

$$t_{jam} = \frac{2}{N}T_s + (1 - \frac{2}{N})T_f$$

The throughput at the MAC layer of each fast host is:

$$X_f = U_f \times p_f(N) \times R$$

given that:

$$p_f(N) = \frac{s_d}{RT_f}$$

We apply the same process for the slow host, given $p_s(N) = \frac{s_d}{rT_s}$, what we get eventually is:

$$X_f=X_s = X$$

The key point here is that the fast hosts transmitting at the higher rate R obtain the same throughput as the slow host transmitting at the lower rate.
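The equality $X_f = X_s$ can be checked numerically by plugging the formulas above into a short Python sketch. The function name and every numeric value below are placeholders we picked for illustration, not measurements from the paper, and the slow host's utilization $U_s$ is assumed to have the same form as $U_f$ with $T_s$ in the numerator. The overhead is passed separately for each host so that either a shared or a rate-dependent value can be tried.

```python
def mac_throughputs(n_hosts, s_d, R, r, t_ov_fast, t_ov_slow, t_cont, p_collision):
    """Return (X_f, X_s) in bit/s following the equations above."""
    T_f = t_ov_fast + s_d / R + t_cont   # time to send one frame at the fast rate
    T_s = t_ov_slow + s_d / r + t_cont   # time to send one frame at the slow rate
    t_jam = (2 / n_hosts) * T_s + (1 - 2 / n_hosts) * T_f
    cycle = (n_hosts - 1) * T_f + T_s + p_collision * t_jam * n_hosts
    U_f = T_f / cycle                    # channel share of one fast host
    U_s = T_s / cycle                    # assumed by symmetry for the slow host
    p_f = s_d / (R * T_f)
    p_s = s_d / (r * T_s)
    return U_f * p_f * R, U_s * p_s * r


if __name__ == "__main__":
    # Placeholder parameters: 1500-byte frames, 4 fast hosts and 1 slow host.
    X_f, X_s = mac_throughputs(n_hosts=5, s_d=12000, R=11e6, r=1e6,
                               t_ov_fast=300e-6, t_ov_slow=800e-6,
                               t_cont=100e-6, p_collision=0.05)
    print(f"X_f = {X_f / 1e6:.3f} Mb/s, X_s = {X_s / 1e6:.3f} Mb/s")
```

Both products collapse to $s_d$ divided by the same denominator, which is why the two values agree regardless of the parameters chosen.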

Simulation and Measurement Results

In general, the experimental value of $P_c(N)$ seems to match the theoretical model. One thing the paper could illustrate better is how the experimental value matches the equation as the number of hosts increases. The average and cumulative throughput values also seem reasonable compared to the expressions discussed before.

The throughput is measured using three different tools: netperf, tcpperf, and udpperf. This idea of duplication makes the data collected more reliable and persuasive, which is especially useful in benchmarking since the results can be sensitive to environmental variable changes.

The presented results justify the statement made in the paper. For example, the measured TCP throughput for two hosts is shown to degrade as time passes:

One thing the paper could articulate better is how this seemingly periodic pattern relates to the model. Another concern is the number of devices used to conduct these experiments, which seems to be much smaller than in a real-world scenario. It would be interesting to see how performance is affected when many devices compete for a channel. This could be further extended to measuring performance with multiple devices using a lower bit rate, which is more likely to capture real-world use cases. The potential performance impact is not clear from the present measurements.

The paper also claims that the useful throughput strongly depends on the number of competing hosts. More data on how the number of hosts relates to the performance impact would make this paper more interesting. This may be hard to achieve, as many papers resort to simulation.

This paper improves over previous work in that it studies the performance of 802.11 WLANs with one host using a lower bit rate, whereas many others assume that all hosts communicate using the same bit rate. This is a step toward capturing more realistic situations. Overall, the paper does a good job of proving its point. It captures the most critical information and the concept is easy to follow. However, the compact structure means readers without sufficient background may need extra time to catch up, since the background section may not be enough for beginners.

Conclusion

Overall, this paper brings a novel approach to analyzing the performance of 802.11 WLANs with varying bit rates, and it offers new insights into the 802.11 standard. The paper focuses on the TCP and UDP protocols. Applying the method discussed in the paper to a lesser-known protocol such as DCTCP could yield more insight into how different protocols affect the throughput. Another direction is to generalize the model to multiple hosts with degraded bit rates and study their behavior.

The bit rates used in the paper also seem pretty low compared to modern standards. With the introduction of 5G networks, bit rates have become a lot higher, and it would be interesting to see how extremely high bit rates affect the performance of 802.11.

Exokernel

Tue, 01 Sep 2020 00:00:00 +0000

Exokernel is a term every system researcher has heard of at some point in life. However, according to the PDOS group at MIT, there aren’t any exokernel-based operating systems in active use today. It’s interesting to discover what ideas exokernels brought to the OS high-level design and some potential drawbacks of such design choice.

Perhaps the most important thing to keep in mind is that the exokernel architecture pushes management of physical resources to the application level, contrary to what most monolithic kernels do: providing hardware resource management through some form of abstraction, usually hiding hardware-related details.

Limitations of Traditional Approaches

Monolithic kernels usually enforce centralized resource management via a set of abstractions. In microkernel-based systems, these abstractions are usually provided through some form of trusted user-level servers. There are several drawbacks:

Too general. Overgeneralizing can limit application diversity and has performance implications (a domain- or application-specific approach usually performs better, at the cost of, well, being more “specific”). For example, in UNIX, two applications exhibiting rather different memory access patterns are subject to the same general-purpose OS scheduler and page replacement policy. Letting applications define such policies can open doors for performance improvements since applications have better knowledge of their own behavior.
Hidden information. This expands on the previous point. Applications tend to have better “self-awareness” and can implement custom policies that outclass the general-purpose ones provided by the kernel.
Limited functionality. Having limited resources in hand can inhibit the implementation of new ideas.

However, generalization may not be a bad thing. As discussed in the UNIX Time-Sharing System paper, having a generalized and unified yet limited file system API can simplify programming. Both ordinary files and I/O devices are accessed through a unified interface. Nobody today wants to implement a different set of policies just for a character device or a block device.

Design

Essentially, an exokernel consists of a thin veneer that multiplexes and exports physical resources through a set of primitives. The libraries, running in application space, use these primitives to implement special-purpose functionality at a higher level of abstraction. The architecture is shown in the paper:

There are three major tasks in separating protection from management:

Track ownership of resources.
Ensure protection by guarding resource usage.
Revoke access.

The paper presents three techniques to achieve these goals:

secure binding: lib OS can securely bind to machine resources.
visible revocation: lib OS can participate in a resource revocation protocol. (Keep in mind why the revocation needs to be visible)
abort protocol: exokernel itself can break secure binding of uncooperative lib OS.

In general, an exokernel should expose hardware resources such as disk, memory, CPU, and interrupts through low-level primitives with as few abstractions as possible. The resource management policy should be enforced by the library OS instead. The exokernel’s policy control boils down to whether it permits a resource allocation.

Secure Binding

One of the primary tasks of an exokernel is to multiplex resources securely, providing protection for mutually distrustful applications. Secure binding allows the kernel to protect resources without understanding them.

There are three techniques to implement secure bindings: hardware mechanisms, software caching, and downloading application code.

Understanding Secure Binding through Examples

Secure binding is a rather abstract and hard-to-comprehend concept without concrete examples. Here are some examples illustrating how secure multiplexing is achieved through secure binding.

Take memory allocation as an example. When a library OS tries to allocate a physical memory page, the exokernel creates a secure binding for that page by recording the owner and the capabilities specified by the library OS. Essentially, memory is accessed through capabilities. The exokernel acts as a doorkeeper that checks the validity of the capability presented by the library OS.

I personally like to think of the exokernel’s role in the memory system as a security guard that protects resources the library OS can access through some form of interface. For example, if the hardware defines a page-table interface that can be accessed by the lib OS, the exokernel must guard the page table. If the lib OS tries to enter a new virtual-to-physical memory mapping, the exokernel must check the corresponding memory capability.

In summary, privileged machine operations must be guarded by the exokernel.
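The page allocation example can be sketched in a few lines of Python. This is a toy model, not Aegis code: the class, the method names, and the use of plain strings as capabilities are all our own assumptions, meant only to show the exokernel recording a binding at allocation time and checking it whenever a mapping is requested.

```python
class Exokernel:
    """Toy model: tracks page ownership and guards mapping requests."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.bindings = {}   # page -> (owner, capability)

    def alloc_page(self, owner, capability):
        """Create a secure binding for a free page; the policy stays with the lib OS."""
        if not self.free_pages:
            raise MemoryError("no free physical pages")
        page = self.free_pages.pop(0)
        self.bindings[page] = (owner, capability)
        return page

    def map_page(self, owner, page, capability, tlb, vaddr):
        """Install a virtual-to-physical mapping only if the capability matches."""
        if self.bindings.get(page) != (owner, capability):
            raise PermissionError(f"{owner} may not map page {page}")
        tlb[vaddr] = page


if __name__ == "__main__":
    kernel = Exokernel(num_pages=4)
    tlb = {}
    page = kernel.alloc_page("libos_a", "cap-a")
    kernel.map_page("libos_a", page, "cap-a", tlb, 0x1000)   # allowed
    try:
        kernel.map_page("libos_b", page, "forged", tlb, 0x2000)
    except PermissionError as err:
        print("rejected:", err)
```

Note that the kernel never interprets what the page is used for. It only remembers who owns it and which capability unlocks it, which is the sense in which protection is separated from management.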

Aegis: the exokernel

Up to this point I still find it hard to fully understand what an exokernel is capable of. Having a concrete system to study is much more helpful. So here comes Aegis.

Here is a subset of Aegis’s primitives and sys call interfaces that encapsulate these exported primitives. Having a concrete list feels so much better than reading a list of abstract terms!

Here is a sublist of primitives:

Primitive Operations	Description
TLBwr	Insert mapping into TLB
TLBvadelete	Delete virtual address from TLB

And here is a sublist of system call interfaces:

System Call	Description
Yield	Yield processor to named process
Alloc	Allocation of resources
Scall	Synchronous protected control transfer

Address Translation

It’s important to first mention that Aegis provides a small number of guaranteed mappings by partitioning an application’s virtual address space into two segments. The first segment holds normal application data; the other has guaranteed mappings and holds exception code and page tables. (A guaranteed mapping is sort of a safety lock.)

When a TLB miss happens, there are several steps happening:

Aegis checks which segment the virtual address resides in. If it’s in the standard user segment, the exception is dispatched to the application. Otherwise, the exokernel handles the exception itself or forwards it to the application, depending on whether there is a guaranteed mapping.
The application looks up the address in its page table, constructs the TLB entry and the associated capability, then invokes the Aegis system routine.
Aegis validates the capability. Upon approval, the mapping is installed.
Control returns from the kernel and the application resumes execution.
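The steps above can be sketched as a small Python model. All class and method names here are hypothetical, the segment boundary is an arbitrary number, and capabilities are plain strings; the sketch only mirrors the control flow described in the list, not the real Aegis implementation.

```python
class Aegis:
    """Toy exokernel: dispatches TLB misses and validates capabilities."""

    def __init__(self, segment_boundary):
        self.segment_boundary = segment_boundary   # addresses at or above it are in the second segment
        self.guaranteed = {}     # vaddr -> page, mappings the kernel installs itself
        self.capabilities = {}   # page -> capability recorded at allocation time
        self.tlb = {}

    def tlb_miss(self, vaddr, app):
        """Step 1: pick who handles the miss based on the segment."""
        if vaddr >= self.segment_boundary and vaddr in self.guaranteed:
            self.tlb[vaddr] = self.guaranteed[vaddr]
            return "kernel"
        app.handle_miss(self, vaddr)
        return "application"

    def install_mapping(self, vaddr, page, capability):
        """Step 3: install the entry only if the capability checks out."""
        if self.capabilities.get(page) != capability:
            raise PermissionError("invalid capability")
        self.tlb[vaddr] = page


class Application:
    """Toy library OS side: owns the page table and the capabilities."""

    def __init__(self, page_table, capabilities):
        self.page_table = page_table
        self.capabilities = capabilities

    def handle_miss(self, kernel, vaddr):
        """Step 2: look up the page table and ask the kernel to install the entry."""
        page = self.page_table[vaddr]   # a KeyError here models a segmentation fault
        kernel.install_mapping(vaddr, page, self.capabilities[page])


if __name__ == "__main__":
    aegis = Aegis(segment_boundary=0x8000)
    aegis.capabilities[7] = "cap7"
    aegis.guaranteed[0x9000] = 3
    app = Application(page_table={0x1000: 7}, capabilities={7: "cap7"})
    print(aegis.tlb_miss(0x1000, app), aegis.tlb)
    print(aegis.tlb_miss(0x9000, app), aegis.tlb)
```

Step 4, resuming the application, is simply the return from tlb_miss in this model.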

The key takeaway here is that the exokernel itself is involved only in a few privileged operations, such as interacting directly with the hardware via low-level primitives. The bulk of the work is done at the application level.

Because the kernel contains minimal functionalities, it can be extremely fast compared to a monolithic kernel. However, does that mean the overhead is shifted to the library OS instead?

ExOS: the Library OS

The most prominent feature of a library OS is that it manages operating system abstractions at the application level.

The GEMM operation doesn’t seem to perform much differently on ExOS and Ultrix (a monolithic kernel OS), since GEMM doesn’t use any special abilities of either OS. It does indicate that the performance gain from the minimal design of the exokernel is somewhat cancelled out by the application-space overhead.

The exokernel paper mentions that, in the context of networking, the major reason for ExOS to download code is that the network buffers on the authors’ machines cannot be easily mapped into application space in a secure way. Downloading the code into the kernel allows applications to integrate operations such as checksumming into the copy of the message from these buffers to user space. However, I’m a little skeptical of this statement today. A highly performant TCP stack is usually implemented in userspace, along with some polling (DPDK for example). Still, it would be interesting to compare the exokernel approach to the gigantic Linux TCP stack. The second reason is that the runtime of downloaded code is bounded, so it can run in situations where a full context switch to an unscheduled application would be impractical.

I do find the graph in the exokernel paper interesting. It shows that when application-level message handlers are downloaded into the kernel, the roundtrip latency is almost unaffected by the number of processes. Since the operation is performed inside the kernel upon message arrival, no handling is needed from the application. This is an advantage over an application-level handler, which is subject to scheduling and therefore pays a performance penalty. (The choice of scheduler is the key bottleneck here.)

Modularity

Modularity is a natural property of the exokernel, since the exokernel itself is simplistic. Operating system abstractions can be redefined simply by changing the library OS, so applications have finer-grained control over resources. However, I think it comes at a cost. In a monolithic kernel, applications are subject to a general-purpose scheduler. Having modular domain-specific schedulers can indeed improve performance, but it might also lead to contention among multiple schedulers, which is not covered in the paper.

Conclusion

The exokernel does offer some new insights into system design. Its simple design has major performance benefits, and its limited set of primitives gives much freedom to the application. However, that means the library OS has to take on more responsibility. The paper doesn’t provide enough analysis of more general use cases. The performance gain seems to come from some highly specialized, exokernel-specific implementations of OS abstractions (such as IPC, VM, etc.). The more general case, such as GEMM, seems to gain much less performance compared to traditional approaches. It would be good to see how the exokernel performs under more diverse workloads.

I’ve also heard that one reason microkernels never took off is their performance slowdown compared to monolithic kernels. Since the exokernel shares many similarities with microkernels (it seems like a more stripped-down version of a microkernel, since it barely has an OS core), it will likely fall into the same trap. However, there doesn’t seem to be a comprehensive benchmark comparing all major types of kernels.

Sketch on the UNIX Timesharing System

Thu, 27 Aug 2020 00:00:00 +0000

Unix is a general-purpose, multi-user, interactive operating system. It offers several features that were hardly found even in larger operating systems back in the day. These features include (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and inter-process I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages.

Simplicity at its Core

Simplicity was engraved into the gene of Unix since its birth, as the paper states: “Perhaps the most important achievement of UNIX is to demonstrate that a powerful operating system for interactive use need not be expensive either in equipment or in human effort”. Therefore, it is important to keep in mind how simplicity is reflected in the design of Unix.

The File System

The file system is perhaps the single most important part of Unix. Its “everything is a file” concept has influenced all modern system designs. Here is a short description of each major file type.

Ordinary Files: no particular structuring is expected by the system. The structure of files is controlled by the programs which use them, not by the system.
Directories: provide the mapping between the names of files and the files themselves, inducing a structure on the file system. The only difference between a directory and an ordinary file is that a directory can’t be written by unprivileged programs, meaning the contents of directories are controlled by the system.

Linking allows the same non-directory file to appear in several directories under possibly different names; a directory entry for a file is sometimes called a link. All links to a file have equal rights. A directory entry for a file consists merely of its name and a pointer to the file metadata. Therefore a file exists independently of any directory entry. A directory can thus be considered a collection of links.

Special Files: perhaps the most prominent feature of the “everything is a file” principle. They are read and written just like ordinary disk files, but requests to read and write will result in activation of the I/O device. It blurs the line between file and device I/O since they share identical interfaces and are subject to the same protection mechanism.

Removable File System

The Unix file system has a mount system request which, in effect, replaces a leaf of the hierarchy tree (the ordinary file) by a whole new subtree (the hierarchy stored on the removable volume). It provides a unified abstraction of the file system hierarchy where the underlying storage components become transparent to the user.

There is one exception to the identical treatment of files on different devices: no link may exist between one file system hierarchy and another. Otherwise, some form of bookkeeping would be required to ensure the links are removed when the removable volume is dismounted.

Protection

Each user is assigned a unique user ID. A file, upon its creation, is marked with the user ID of its owner. Also given for new files is a set of seven protection bits. Six of these specify independently read, write, and execute permission for the owner of the file and for all other users. The seventh is the set-user-ID bit. This is a good example of an ACL (access control list) system.
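To make the protection bits concrete, here is a small Python sketch that decodes a mode word using the constants in the standard `stat` module. It only looks at the owner bits, the "other" bits, and the set-user-ID bit, mirroring the seven bits of the paper; modern Unix systems also carry group bits, which this sketch ignores. The function name `describe` is made up.

```python
import stat

def describe(mode):
    """Decode owner/other rwx bits and the set-user-ID bit of a mode word."""
    def rwx(r, w, x):
        pairs = (("r", r), ("w", w), ("x", x))
        return "".join(ch if mode & bit else "-" for ch, bit in pairs)
    return {
        "owner": rwx(stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
        "other": rwx(stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
        "setuid": bool(mode & stat.S_ISUID),
    }

if __name__ == "__main__":
    print(describe(0o4755))
```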

I/O Calls

Once again, we see how Unix tries to provide a unified interface, so that performing I/O on different devices does not require different access patterns or styles. There is no distinction between “random” and sequential I/O, nor is any logical record size imposed by the system. Calls like open, seek, read, and write can be found in all major Unix-like systems today.
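These calls survive almost unchanged in Python's `os` module, which makes for a quick illustration. The sketch below assumes a POSIX-like system and uses a temporary file; "random" access is nothing more than a seek followed by an ordinary read.

```python
import os
import tempfile

def io_demo():
    """Write sequentially, then read from two different offsets."""
    fd, path = tempfile.mkstemp()       # returns an already-open descriptor
    try:
        os.write(fd, b"hello, unix")    # sequential write from offset 0
        os.lseek(fd, 7, os.SEEK_SET)    # move the offset to byte 7
        tail = os.read(fd, 4)
        os.lseek(fd, 0, os.SEEK_SET)    # back to the beginning
        head = os.read(fd, 5)
        return head, tail
    finally:
        os.close(fd)
        os.remove(path)

if __name__ == "__main__":
    print(io_demo())
```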

I found it interesting that the authors argue why there are no user-visible locks in the file system. The first argument says: “they are unnecessary because we are not faced with large, single-file data bases maintained by independent processes”. Things might be different on modern systems, so I have some doubts about that argument. The next one is “they are insufficient because locks in the ordinary sense, whereby one user is prevented from writing on a file which another user is reading, cannot prevent confusion when, for example, both users are editing a file with an editor which makes a copy of the file being edited.” This is certainly true: during editing the copies are separate files with distinct metadata, but once the editing is finished it becomes tricky to write the updated content back to the original file without some form of synchronization or ordering.

The paper further explains that the system has sufficient internal interlocks to prevent these situations from happening. The exact details of how they work are not quite clear at this stage.

Implementation

As we already know, a directory entry contains only a name for the associated file and a pointer to the file itself. This pointer is an integer called the i-number. When the file is accessed, its i-number is used as an index into a system table (the i-list) stored in a known part of the device on which the directory resides.

Directory entry -> (File Name, i-number) -> i-list -> i-node -> description of the file

Because the file is described by its corresponding i-node, linking and deleting operations only modify a directory entry or the link-count field of the i-node, without actually touching the bulk of the file itself.

It is important to distinguish between a file descriptor and an inode. By definition, files are represented by inodes. The inode of a file is a structure kept by the filesystem which holds information about the file, such as its type, owner, permissions, and link count. On the other hand, a file descriptor is the value returned by an open call and is essentially an index into an array of open files kept by the kernel. A file has one inode in the i-list, but every process can have its own file descriptor for that file.
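The distinction can be checked directly. The following Python sketch, assuming a POSIX system where a temporary file can be reopened by name, opens the same file twice: the two descriptors differ and keep independent offsets, while both report the same inode.

```python
import os
import tempfile

def descriptor_demo():
    """Open one file twice and compare descriptors, inodes, and offsets."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"abcdef")
        tmp.flush()
        fd1 = os.open(tmp.name, os.O_RDONLY)
        fd2 = os.open(tmp.name, os.O_RDONLY)
        try:
            same_inode = os.fstat(fd1).st_ino == os.fstat(fd2).st_ino
            first = os.read(fd1, 3)     # advances only fd1's offset
            second = os.read(fd2, 3)    # fd2 still starts at offset 0
            return fd1 != fd2, same_inode, first, second
        finally:
            os.close(fd1)
            os.close(fd2)

if __name__ == "__main__":
    print(descriptor_demo())
```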

Processes

A process is the execution of an image. An image is a computer execution environment. It includes a core image, general register values, status of open files, current directory, and the like. An image is the current state of a pseudo computer. You can imagine the image as a motionless snapshot of the current state of the processor, or you can imagine it as the content saved to main memory when a currently executing process is preempted by another one.

The user-core part of an image has three logical segments. The program text segment starts at location 0. At the first 8K byte boundary above the text segment is a non-shared, writable data segment. Starting at the highest address in the virtual address space is a stack segment, which grows downward.

One key feature of UNIX is that a new process can come into existence only by use of the fork system call. Another major system primitive is execute. This call resembles a “jump” machine instruction rather than a subroutine call.
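Modern Unix-like systems still expose this pair of primitives. Here is a minimal Python sketch, assuming a POSIX system where `os.fork` is available; the helper name `spawn` is made up. The child replaces its own image with a fresh interpreter, which is why exec behaves like a jump: on success, nothing after it in the child ever runs.

```python
import os
import sys

def spawn(exit_code):
    """Fork a child, replace its image via exec, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: exec only returns (by raising) if it fails.
        try:
            program = f"import sys; sys.exit({exit_code})"
            os.execv(sys.executable, [sys.executable, "-c", program])
        finally:
            os._exit(127)               # reached only when exec failed
    _, status = os.waitpid(pid, 0)      # parent waits for the child
    return os.WEXITSTATUS(status)

if __name__ == "__main__":
    print(spawn(7))
```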

Shell

The Shell is a command line interpreter. Programs executed by the Shell start off with two open files which have file descriptors 0 and 1, representing files for reading and writing. The symbols “<” and “>” specify which files the file descriptors 0 and 1 will refer to for the duration of the command passed to the Shell.

A pipe, represented by “|”, connects the standard output of one command to the standard input of the next. A filter is a program that copies its standard input to its standard output, with processing.

The command separator, represented by “;”, is used to separate multiple commands. A related feature is “&”, which executes the command in the background. When the shell doesn’t wait for the completion of a command, the identification of the process running that command is printed. In addition, parentheses can be used to enforce order of execution.

It’s worth noting that the shell is itself a command, and may be called recursively.
Since it’s a command, it also enjoys the luxury of having standard I/O file descriptors. Thus, a command such as:
sh < file_containing_shell_commands
would work.

The last step in the initialization of UNIX is the creation of a single process and the invocation of a program called init. init has various sub-instances that prompt for user login information. If the login succeeds, init performs an execute of the Shell. Essentially, init is the parent process of the Shell.

Monads in Haskell

Sun, 01 Mar 2020 00:00:00 +0000

I’ve scratched my head for quite a while trying to understand the concept of monads in Haskell. This is a brief summary of monads. I take William Cook’s Anatomy of Programming Languages as my reference.

Definitions of Monads

A monad is defined as a computational structure that involves three parts:

A generic data type $m$
A return function $return_m$ :: $t\rightarrow mt$
A bind function $\triangleright_m$ :: $mt\rightarrow (t\rightarrow ms)\rightarrow ms$

Here the symbol $m$ gives the name of the monad as well as the shape of the computation. We call a program that uses the monad $m$ an m-computation. The instantiation of the generic type $mt$ at a particular type $t$ represents an m-computation that produces a value of type $t$. The m-computation indicates that, in addition to producing a value of type $t$, some additional requirements or effects will take place. This is the essence of monads.

The definition of the return function states how values are converted into m-computations. The return will just return the value of type $t$. For example, if we pass in stateful memory information, return shouldn’t modify the actual state but only provide a context in which the value lies. The reason we convert a value into an m-computation is that any later error can then be handled by the monad itself, without adding extra error-checking code.

The bind function $\triangleright_m$ specifies how computations are combined together. The general idea is that in $A\triangleright_m F$ the m-computation $A$ is performed first, and the value it produces is passed to the function $F$ to create a second m-computation. Because $A$ is an m-computation, if an error happens, the computation will stop and $F$ will not be performed.

Monads in Haskell

In Haskell, monads are expressed through a type class, which is defined as:

class Monad m where
 (>>=) :: m t -> (t -> m s) -> m s
 return :: t -> m t

For a generic type $m$ to be a Monad, it must have those two functions defined. A type class allows us to overload functions according to their type.

So why do we need Monads in the first place? Suppose we are given a function $func1$ which takes an Int and an (Int, Int) tuple and produces a new (Int, Int) tuple. We can link calls to this function together to form a chain of computation. If we declare the function and a reverse-application operator like this:

func1 :: Int -> (Int, Int) -> (Int, Int)
x & f = f x

we could use the output of the function as the input to the same function to produce another value. This process can be repeated and thus form a chain of operation:

(0, 0) & func1 1 & func1 2 & func1 3 ...

However, the function $func1$ could potentially return Nothing if the given input doesn’t meet certain conditions (e.g. division by zero). Therefore, $func1$ can be modified to:

func1 :: Int -> (Int, Int) -> Maybe (Int, Int)

The definition of $func1$ still says that it takes a raw (Int, Int) tuple as input. If we now feed the output of $func1$ directly to the next $func1$ in the chain, a type error occurs, because the output is an (Int, Int) tuple wrapped in a Maybe context. The & operator is not able to pass an argument with a context to the next func1. Fortunately, we have the bind operator defined.

If we look at the definition of the >>= in Monad definition, we see:

(>>=) :: m t -> (t -> m s) -> m s

This means >>= is able to take a value within a certain context and apply to it a function that takes the raw value as input. We can simply switch the & operator to >>= so that the chaining still works:

return (0, 0) >>= func1 1 >>= func1 2 >>= func1 3 ...

If an error occurs in one part of the chain (let’s assume one computation yields Nothing), the Nothing value is propagated to the next function, which automatically yields Nothing as well. Without >>=, we would have to write error-checking code after every single computation to check its output.
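The propagation of Nothing can be mimicked outside Haskell. The sketch below is a Python stand-in, not Haskell's actual Maybe: `None` plays the role of Nothing, `bind` plays the role of >>=, and the body of `func1` is entirely made up (the post never defines it) so that one input, 0, triggers a failure.

```python
def bind(value, f):
    """Stand-in for >>= on Maybe: skip f once a step has produced None."""
    return None if value is None else f(value)

def func1(n):
    """Hypothetical step: add 12 // n to the total and count the step."""
    def step(pair):
        total, count = pair
        if n == 0:
            return None                 # "Nothing": division by zero
        return (total + 12 // n, count + 1)
    return step

def chain(start, steps):
    """Fold a list of steps over a starting value using bind."""
    result = start
    for step in steps:
        result = bind(result, step)
    return result

if __name__ == "__main__":
    print(chain((0, 0), [func1(1), func1(2), func1(3)]))
    print(chain((0, 0), [func1(1), func1(0), func1(3)]))
```

No step checks its input for failure; the check lives in `bind` once, which is the point of the paragraph above.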

In short, >>= is just a way to chain functions with parametric polymorphism together.

Haskell do Notation

Using the do notation can simplify the use of the bind operator. The basic pattern of do notation is:

do
 x <- e1
 e2

which is equivalent to:

e1 >>= (\x -> e2)

The <- notation simply indicates that $x$ is bound to the value the computation generates. In other words, $x$ doesn’t lie in a context. If $e1$ returns Nothing, $x$ is not bound to anything. It’s important to remember that do expressions are just different syntax for chaining monadic values.

For a more detailed explanation of Monads, I found A Fistful of Monads to be extremely helpful in terms of clarifying the concept.

Singular Value Decomposition

Mon, 10 Feb 2020 00:00:00 +0000

Unitary matrices and the Singular Value Decomposition (SVD) are two important concepts in linear algebra. In order to fully understand these concepts, we will need to first discuss orthogonality. Most of the material is covered in Advanced Linear Algebra: Foundations to Frontiers, taught by Professor Robert van de Geijn. This is a brief summary of the important concepts covered in Chapter 2.

Components in the direction of a vector

We can decompose a vector $b$ as $b = \chi a + c$, where $\chi a$ is the component of $b$ in the direction of $a$, $c$ is orthogonal to $a$, and $\chi$ is a scalar. Since $c = b - \chi a$ is orthogonal to $a$, we have

\[a^T (b-\chi a) = 0\]

Solving it gives us $\chi = \frac{a^T b}{a^T a}$. We have $\frac{a^T b}{a^T a}a = \frac{a a^T}{a^T a}b$, so the matrix $\frac{a a^T}{a^T a}$ maps a vector $b$ to its component in the direction of $a$. The component of $b$ orthogonal to $a$ can thus be calculated by applying $I-\frac{a a^T}{a^T a}$ to $b$.

The linear transformation can be simplified by letting $\left\lVert a\right\rVert_{2}=1$ because this will render $a^T a = 1$.
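As a quick numerical check of these formulas, here is a small pure-Python sketch for real vectors. The function names are made up, and plain lists are used instead of a linear algebra library to keep it self-contained.

```python
def dot(u, v):
    """Inner product u^T v of two real vectors."""
    return sum(x * y for x, y in zip(u, v))

def decompose(a, b):
    """Split b into chi * a (along a) and the remainder orthogonal to a."""
    chi = dot(a, b) / dot(a, a)
    along = [chi * x for x in a]
    ortho = [y - p for y, p in zip(b, along)]
    return along, ortho

if __name__ == "__main__":
    along, ortho = decompose([1.0, 1.0], [3.0, 1.0])
    print(along, ortho, dot([1.0, 1.0], ortho))
```

For $a = (1, 1)$ and $b = (3, 1)$ we get $\chi = 2$, so the component along $a$ is $(2, 2)$ and the orthogonal component is $(1, -1)$.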

Unitary Matrix

A matrix $U$ is said to be a unitary matrix if $U$ is a square matrix and satisfies $U^H U= I$.

In addition, unitary matrices have some nice properties. First, the product of a sequence of unitary matrices is also a unitary matrix. This can be proven by first expanding the product $(U_0 U_1)^H (U_0 U_1) = U_1^H U_0^H U_0 U_1 = I$, showing that $U_0 U_1$ is a unitary matrix, and then performing induction.

Unitary matrices also preserve length. This is shown by $\left\lVert Ux \right\rVert_2^2 = (Ux)^H (Ux) = x^H U^H U x = x^H x= \left\lVert x \right\rVert_2^2$.

Change of orthonormal basis

We mentioned that we can map a vector $x$ to another vector in the same direction as a vector $a$. Now we extend this to express a vector $x$ using an orthonormal basis, given by the columns of $U$.

We know that $x = Ix= UU^Hx=U(U^Hx)=u_0^Hxu_0+…+u_{m-1}^Hxu_{m-1}$. Each $u_i^Hx$ is a scalar, so writing $a_i = u_i^Hx$ we can write the equation as $U(U^Hx)=a_0u_0+…+a_{m-1}u_{m-1}$. We have successfully expressed the vector $x$ in terms of the orthonormal basis.
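The same computation can be tried numerically. This is a pure-Python sketch for the real case (so the conjugate transpose is just the transpose); the basis below is an arbitrary example of an orthonormal basis of the plane, and the helper names are made up.

```python
import math

def dot(u, v):
    """Inner product of two real vectors."""
    return sum(x * y for x, y in zip(u, v))

def coefficients(basis, x):
    """a_i = u_i^T x for each orthonormal basis vector u_i."""
    return [dot(u, x) for u in basis]

def reconstruct(basis, coeffs):
    """x = a_0 u_0 + ... + a_{m-1} u_{m-1}."""
    n = len(basis[0])
    return [sum(a * u[i] for a, u in zip(coeffs, basis)) for i in range(n)]

# Example orthonormal basis of R^2 (a rotated copy of the standard basis).
BASIS = [(0.6, 0.8), (-0.8, 0.6)]

if __name__ == "__main__":
    a = coefficients(BASIS, (1.0, 2.0))
    print(a, reconstruct(BASIS, a))
```

Up to floating-point rounding, the coefficients are 2.2 and 0.4, and summing the scaled basis vectors gives back $(1, 2)$.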

TODO

Understanding Probabilistic Clock Synchronization

Tue, 17 Sep 2019 00:00:00 +0000

This post discusses the probabilistic clock synchronization technique. The main goal of this technique is to put an upper bound on the difference between clocks across the network. Formally, we define the problem as $|P(t)-Q(t)|\leq \varepsilon$. We will go over the technical details and discuss what these symbols represent in later sections. Most of this material is from Prof. Mok’s slides for his dependable systems classes.

Perfect Synchronization

The motivation behind this technique is that synchronization always involves overhead. In a perfect environment where network delay and request processing time are both 0, the clocks can be synchronized with ease. A slave P sends ‘‘Time = ?’’ at global time $t$ to master Q, and master Q replies ‘‘Time = Q(t)’’ instantaneously at global time $t$. Then P adjusts its clock P(t) according to Q(t). However, such a case only exists in imagination.

Amortization

Suppose the difference between the clocks of P and Q is $\Delta$ at synchronization. Our goal is to adjust P’s logical clock C(t) to eliminate the difference. The adjustment is simple:

\[C(t)=H(t)+A(t)\]

Here C(t) is P’s logical clock, H(t) is P’s hardware clock, and A(t) is the adjustment function (which can also be written A(H(t))).

A naive method would be to simply subtract $\Delta$ from or add it to C(t). However, this creates a discontinuity in P’s clock, which may disrupt system services. For example, if $\Delta = 2$ seconds, the logical clock will instantly jump ahead 2 seconds, and a stopwatch driven by it will skip those seconds.

So the adjustment function is as follows:

\[A(t)=m\cdot H(t)+N\]

Now the logical clock can be derived as follows:

\[C(t)=(1+m)\cdot H(t)+N\]

This process is called amortization.

However, how do we determine the values of m and N? Let’s look at the moment when the amortization process starts. The logical time of P at this moment is:

\[L=(1+m)\cdot H+N \qquad (1)\]

The amortization lasts for a time period $\alpha$. Here M is the master’s logical clock value sent by master Q, corresponding to P’s hardware time H at the start of amortization. At the end of the amortization, P’s hardware clock reads $H+\alpha$ and the master’s clock has advanced to $M+\alpha$, so slave P should have caught up with its master’s logical clock after the period $\alpha$. Therefore, we have:

\[M+\alpha = (1+m)(H+\alpha)+N \qquad (2)\]

Solving (1) and (2) together, we now get:

\[m = \frac{M-L}{\alpha}\]

\[N = L - (1+m)H\]

Thus, after the end of amortization, at any time $t$ where $H(t) > H+\alpha$, we would want the following to be true:

\[C(t)=C(H+\alpha)+(H(t)-H(H+\alpha))=H(t)+M-H\]
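The formulas for $m$ and $N$ are easy to sanity-check numerically. The Python sketch below uses made-up function names and example numbers; it computes $m$ and $N$ from equations (1) and (2) and evaluates the logical clock during and after amortization.

```python
import math

def amortization_params(L, M, H, alpha):
    """Solve equations (1) and (2) for the rate m and the offset N."""
    m = (M - L) / alpha
    N = L - (1 + m) * H
    return m, N

def clock_during(h, m, N):
    """Logical clock C = (1 + m) * h + N while amortizing (h = hardware clock)."""
    return (1 + m) * h + N

def clock_after(h, M, H):
    """Logical clock once amortization has ended: C = h + M - H."""
    return h + M - H

if __name__ == "__main__":
    # Example: P reads L = 100 while the master reads M = 102, at hardware
    # time H = 50, and the difference is spread over alpha = 10.
    m, N = amortization_params(100.0, 102.0, 50.0, 10.0)
    print(m, N, clock_during(50.0, m, N), clock_during(60.0, m, N))
```

With these numbers the clock starts at $L = 100$, runs 20% fast for 10 units of hardware time, and reaches $M + \alpha = 112$ without any jump.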

Here is a question: why is N required in this case? Couldn’t we simply use m to amortize the time difference? Here’s my interpretation (feel free to ping me if you have something else in mind): if N is set to 0, then at the beginning of amortization, we would have:

\[L=(1+m)H\]

Therefore, $m = \frac{L-H}{H}$. Now m is fixed by L and H alone. Compared to $m=\frac{M-L}{\alpha}$, m is now a constant that does not depend on $\alpha$. We lose control of the amortization rate $m$, which is not desirable.

General Case

We now return to the general case where network delay and processing time are both present. The situation is represented below:

Looking at this graph, we can see that slave P takes 2d of real time for a round trip. Let’s also assume that 2D is the round-trip delay measured by P’s clock between sending and receiving. Then we can bound the clock time 2D based on the drift rate $\rho$ of the clock:

\[2d(1-\rho)\leq 2D \leq2d(1+\rho)\]

Ignoring higher order terms of $\rho$, we now have $2d\leq(1+\rho)2D$.

When looking at the graph above, one thing to notice is that we do not know the values of $\alpha$ and $\beta$. However, if we had to pick one, $\beta$ would be more important than $\alpha$. This is because if we know the value of $\beta$, then we know the lower bound of the round-trip delay. Here we assume min is the minimum amount of time required for a network transfer, and $\beta$ is the time master Q spends between processing the request and getting the result back to P.

Now we’ve narrowed down our focus to $min+\beta$. The time interval between $Q(t)=T$ and the arrival of ‘‘Time=T’’ at P will be at least $min(1-\rho)$. This is based on $\beta=0$ and clock drift rate.

The upper bound of the interval will be $(min+\beta)(1+\rho)$, assuming no time is wasted for $Q$ to wait until it starts processing the request from P. The time required will be $min+\beta$ and we need to take Q’s drift rate $\rho$ into account. We can also see that the total round-trip real time is $2d=2min+\alpha+\beta$. Thus we get:

\[\beta=2d-2min-\alpha \leq 2d-2min\]

With this equation, we can see that the upper bound measured from Q(t)=T is also bounded. Thus, we have:

\[ \begin{eqnarray} (min+\beta)(1+\rho) &\leq& (min+2d-2min)(1+\rho) \nonumber \newline &=& (2d-min)(1+\rho) \nonumber \newline &=&(1+\rho)2d-min(1+\rho) \nonumber \newline &\leq&(1+\rho)2D(1+\rho)-min(1+\rho) \nonumber \newline &=&(1+2\rho +\rho^2)2D-min(1+\rho) \nonumber \newline &\approx&(1+2\rho )2D-min(1+\rho) \nonumber \end{eqnarray} \]

Now we can see that master Q’s clock time when P receives the response is bounded by the interval $[T+min(1-\rho), T+2D(1+2\rho )-min(1+\rho)]$. The takeaway here is that we can’t use real time $t$ in a distributed system: it’s merely an abstract concept, since all systems in a network ultimately rely on their own clock time. We need to find the relationship between T and the master’s clock because P will rely on T, not real time $t$.
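To get a feel for the interval, here is a short Python sketch that evaluates the two endpoints derived above. The function and parameter names are made up, and the numbers in the example are arbitrary.

```python
def master_clock_bounds(T, D, min_delay, rho):
    """Interval containing Q's clock when P receives the reply "Time = T".

    T: timestamp sent by the master, D: half the round trip measured on P's
    clock, min_delay: minimum one-way network delay, rho: clock drift rate.
    """
    lower = T + min_delay * (1 - rho)
    upper = T + 2 * D * (1 + 2 * rho) - min_delay * (1 + rho)
    return lower, upper

if __name__ == "__main__":
    print(master_clock_bounds(T=100.0, D=5.0, min_delay=2.0, rho=0.0))
    print(master_clock_bounds(T=100.0, D=5.0, min_delay=2.0, rho=1e-5))
```

With no drift the interval is simply $[T + min, T + 2D - min]$; a nonzero $\rho$ widens it slightly at both ends.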

How to Put Papers on ArXiv

Tue, 25 Jun 2019 00:00:00 +0000

Recently, I was trying to put my research paper draft on ArXiv. I thought it would be as simple as submitting the pdf file, which should take less than ten minutes. I was wrong. It took several hours to figure out what was going on. I have included some tips here to keep the mistakes I made from happening again.

The first mistake I made was assuming that submitting a single pdf file would be sufficient. ArXiv apparently has mechanisms for detecting whether the submitted pdf file was generated using TeX/LaTeX. According to ArXiv:

a PDF file created from a TeX/LaTeX file will be rejected. There are good reasons why arXiv insists on TeX/LaTeX source if it is available. arXiv produces PDF automatically from all TeX submitted source. For information on viewing the PDF provided by arXiv, see our PDF browsing help.

So, the first thing I came up with was to somehow make the pdf appear “anonymous” to ArXiv. There were several methods, but none of them appeared to be practical. If you are interested, there is a link to some methods that might be useful. pdfprivacy is a package used to remove or suppress pdf metadata; it sounds promising, but I haven’t tried it yet.

So the only option left was to follow the restriction described. It was confusing at the beginning because everything worked like a charm on Overleaf but completely fell apart when I tried to compile the sources locally. I was under the impression that if it worked on Overleaf it should work everywhere else, which caused many hours of searching for potential problems related to my local environment.

After hours of frustration, it started to appear that there was nothing wrong with my local environment. The pdf produced by Overleaf only “appeared” correct. There were several syntax issues in my .bib file, mostly caused by careless copy-and-paste and duplicate records. Overleaf simply suppressed some of those errors, which led me to think everything was fine.

There were also messages popping up during compilation. Most of them were related to undefined references or incomplete bibliography entries. Something like:

Warning--empty journal in article

The problem was that bibliographic information obtained from Google Scholar might include serious mistakes. The warning message was saying that entries of type @article require a non-empty journal field. For example, the entry could look like:

@article{article,
 title={Something Cool},
 author={Somebody},
 year={2019},
 publisher={IET}
}

The four required fields for entries of type @article are author, title, journal, and year. This is why the warning message showed up. But it doesn’t really affect the compilation on ArXiv.
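A check like this is easy to automate before uploading. The Python sketch below is a rough regex-based scan, not a real BibTeX parser: it assumes each entry closes with a brace on its own line and each field starts on its own line, as in the example above. The function name and the `REQUIRED` table are made up.

```python
import re

# Required fields per entry type; only @article is covered in this sketch.
REQUIRED = {"article": ("author", "title", "journal", "year")}

def missing_fields(bibtex):
    """Return {citation_key: [missing required fields]} for known entry types."""
    problems = {}
    # Header like "@article{key," then the body up to a "}" on its own line.
    for match in re.finditer(r"@(\w+)\{([^,]+),(.*?)\n\}", bibtex, re.S):
        kind = match.group(1).lower()
        key = match.group(2).strip()
        body = match.group(3)
        present = {f.lower() for f in re.findall(r"^\s*(\w+)\s*=", body, re.M)}
        missing = [f for f in REQUIRED.get(kind, ()) if f not in present]
        if missing:
            problems[key] = missing
    return problems

SAMPLE = """@article{article,
 title={Something Cool},
 author={Somebody},
 year={2019},
 publisher={IET}
}
"""

if __name__ == "__main__":
    print(missing_fields(SAMPLE))
```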

When I finally compiled all sources locally with success, I immediately uploaded all the sources to ArXiv, hoping it would finally work. It didn’t.

! LaTeX Error: File 'shoc.pdf' not found.

I had no idea why this occurred. All the sources I used to compile were uploaded to ArXiv, so there was no reason for it to fail. More surprisingly, the references only failed for my .eps files but not my .png files. According to ArXiv, there are several reasons why PostScript (PS/EPS) figures might fail on ArXiv. Judging from the error message, it appears the system was trying to find a file called shoc.pdf to insert into the main pdf but somehow couldn’t locate it.

The solution was to upload the pdf files produced locally to ArXiv. However, the locally generated files have slightly different names: every file is renamed to “name-eps-converted-to.pdf”. What a hassle!

Overall, uploading to ArXiv was not the most pleasant experience. LaTeX’s compilation system is the one to blame.

A Little Review on Barrelfish Memory Managements

Mon, 18 Feb 2019 00:00:00 +0000

Memory management has been discussed numerous times and still remains a huge topic. Virtual vs. physical memory, physical frame allocation, MMUs, page faults, address space layout, and demand paging and swapping are familiar terms for every undergrad in college. In monolithic kernels such as Linux, much of the functionality is handled in the kernel. However, there are OSes, such as Barrelfish, that take a different approach by pushing this functionality to user space. Many concepts here will thus be borrowed from the Barrelfish OS. I also borrow some material from the main pdf of the Barrelfish course materials provided by Professor Simon Peter.

Memory Management in General

Microkernels like L4, Mach, Chorus, and Spring trapped page faults in the kernel but then reflected them up to other processes which carried out the actual page fault handling. This was done on a per-region basis, so each area of virtual memory was associated with some paging server. Memory objects could be shared between different processes and mapped differently in different address spaces.

Such abstraction means that what happens when a page fault happens is entirely dependent on the code in the user-level pager. This design is highly extensible since it’s all user code and thus isolated, which means that if a user-level pager crashes, there’s a good chance the rest of the OS can continue quite happily since much of the functionality is moved away from the kernel.

However, moving functionality out of the kernel raises an important question: if user-space processes can manipulate virtual address spaces, how can we make sure that one user’s program can’t manipulate another’s address space and memory? Here we will introduce the concept of capabilities.

Capabilities

Capabilities are introduced to solve the access control problem in operating systems. Access control is the problem of specifying, and enforcing, which subjects (or principals) can perform particular actions on particular objects in an operating system.

The Barrelfish documentation does a good job illustrating capabilities: abstractly, access control can be thought of as a matrix, which represents all possible combinations of operations in the system. Each row of the matrix represents a different subject, and each column represents a different object. Each entry in the matrix contains a list of permissible actions.

Thus, we have two things to emphasize: the subject and the object. The ACL (access control list) focuses on the object being operated on.

A good example: whenever you enter ls -l in a Linux terminal, you get a list of entries specifying the attributes of each file. Here the attributes represent how an object (in this case, a file) may be accessed.

On the other hand, a capability can be thought of as a “key” or “license”. It is an unforgeable token which grants authority. Possession of a capability for an object gives the holder the right to perform certain operations on the object.

A good example will be the file descriptor in Linux. A file is accessed through its file descriptor. Here the file descriptor serves as the “key” to gain access to the file itself. Capabilities provide fine-grained access control: it is easy to provide access to specific subjects, and it is easy to delegate permissions to others in a controlled manner.
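The two views of the access matrix can be sketched in a few lines of Python. This is only an illustration of the abstract model, not of how any real OS stores it: the subjects, objects, rights, and function names below are all made up. An ACL is one column of the matrix kept with the object, while a capability list is one row kept with the subject.

```python
# Hypothetical access matrix: (subject, object) -> set of permitted actions.
MATRIX = {
    ("alice", "notes.txt"): {"read", "write"},
    ("bob", "notes.txt"): {"read"},
    ("alice", "printer"): {"write"},
}

def acl_for(obj, matrix=MATRIX):
    """ACL view: one column of the matrix, stored with the object."""
    return {s: rights for (s, o), rights in matrix.items() if o == obj}

def capabilities_of(subject, matrix=MATRIX):
    """Capability view: one row of the matrix, held by the subject."""
    return {o: rights for (s, o), rights in matrix.items() if s == subject}

if __name__ == "__main__":
    print(acl_for("notes.txt"))
    print(capabilities_of("alice"))
```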

Note that to be correct, any capability representation must protect capabilities against forgery. Capabilities can be implemented in various ways such as tagged capabilities, sparse capabilities, or partitioned capabilities. Barrelfish uses partitioned capabilities.

In partitioned capabilities, the kernel ensures that memory used to store capabilities is always separated from that used by user processes to store data and code, for example by using the MMU or ensuring that capability memory is only accessible in kernel mode. The OS maintains the list of capabilities each user principal holds (the clist), and explicitly validates access when performing any privileged operation. Thus, whenever the user accesses memory, the operation can only be done through the resource’s corresponding capability. For example, one can map a page frame into a page table through a function call that takes only capabilities:

Caprefa->install(Caprefb, slot, flags)

Capabilities in Barrelfish

According to the Barrelfish documentation, all memory in Barrelfish (and some other system resources which do not occupy memory) is described using capabilities. Capabilities are typed, and capabilities can be retyped by users holding them according to certain rules and restrictions. The official documentation has a very good explanation of capability management in Barrelfish. Here are the permissible types for the retype invocation:


Capabilities referring to memory regions can also be split, resulting in two new capabilities of the same type, one for each half of the region. Some of the more important capability types in Barrelfish are shown in the figure below. The picture is from the Barrelfish manual provided in the CS378 Multi-core class by Simon Peter:

Allocation and management of physical memory is achieved by retyping and splitting operations on capabilities. For most kernels, the implementation is to constantly allocate and deallocate memory for a wide variety of purposes, much as any large C program relies heavily on malloc and free.

The problem is what the kernel should do when this runs out. The current solution in Linux is little more than “kill a random process and reclaim its memory”, which can be a problem for system stability. In Barrelfish, all kernel objects are actually allocated by user programs. If a user process wants to create another process (or dispatcher in Barrelfish parlance), it has to get a capability to a DRAM area of the right size, retype this capability to type Dispatcher, and hand this to the kernel. This will be covered in later posts. To access different types of memory resources, the corresponding capability has to be retyped to the right type.

More On Implementation

In Barrelfish, every capability resides in a slot in a CNode, so a pair (CNode, slot) would identify a capability. It is important to point out that the CNode is another capability itself. Each process in Barrelfish has a CSpace which is structured as a two-level table. So there are actually two different CNode capability types - one for the first level of the table, and one for the second. Every process has, within its “dispatcher control block”, a pointer to the top-level or root CNode which the kernel can traverse.

A capability reference in Barrelfish is very similar to a virtual address (VA): the first few bits represent an index into the first-level L1CNode, while the next few bits refer to a slot in the CNode referred to by the capability in that L1CNode slot. Here is a picture from the main pdf showing how the CSpace is represented in Barrelfish:
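The two-level lookup described above can be sketched as simple bit manipulation, much like splitting a virtual address into page-table indices. The bit width below is a made-up value for illustration only, not Barrelfish's actual layout, and the function name is hypothetical.

```python
L2_BITS = 8  # hypothetical number of low bits selecting the slot

def split_capref(capref, l2_bits=L2_BITS):
    """Split a capability reference into (L1CNode index, slot in that CNode)."""
    l1_index = capref >> l2_bits               # high bits: first-level index
    l2_slot = capref & ((1 << l2_bits) - 1)    # low bits: slot in the CNode
    return l1_index, l2_slot

if __name__ == "__main__":
    print(split_capref(0x0305))
```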

Thoughts on Design Decisions

Even though it is pretty straightforward to understand the CSpace structure, the actual implementation is a lot more complicated than that. Since the CSpace is not directly accessible by user-space programs, there are additional data structures used to keep track of available memory resources.

In our implementation, the user process keeps a doubly linked list of struct mmnode to indicate the memory available for allocation. Each element in the free list tracks the information corresponding to one capability. However, there is a big problem with this seemingly simple implementation. Every time we allocate a piece of memory from the memory region, a new capability is created while the old capability still remains, pointing to the memory range as it was before the allocation happened. Therefore, the old capability covers memory that is already allocated and managed by other capabilities.

To solve this problem, we maintain the allocation information in the struct mmnode each time an allocation occurs. If a capability covering physical address space from 0 to 100 is requested for 20 units of memory space, then the memory available for the next allocation would be from 20 to 100 even though the capability itself still manages 0 to 100. By restricting subsequent accesses only to the new memory range, the old capability can still be kept around and used later for retyping.
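The bookkeeping in this example can be sketched as follows. This is a Python illustration of the idea only; the real struct mmnode is a C structure, and the field and method names here are made up. The capability keeps covering its full range while an offset records how much of it has already been handed out.

```python
from dataclasses import dataclass

@dataclass
class MMNode:
    """Tracks one capability plus how much of its range is already allocated."""
    cap_base: int      # start of the range the capability covers
    cap_size: int      # size of the range the capability covers
    offset: int = 0    # units already handed out from the start of the range

    def alloc(self, size):
        """Hand out `size` units, or return None if not enough is left."""
        if self.offset + size > self.cap_size:
            return None
        start = self.cap_base + self.offset
        self.offset += size
        return (start, start + size)

    def free_range(self):
        """Range still available for the next allocation."""
        return (self.cap_base + self.offset, self.cap_base + self.cap_size)

if __name__ == "__main__":
    node = MMNode(cap_base=0, cap_size=100)
    print(node.alloc(20), node.free_range())
```

After allocating 20 units from a capability covering 0 to 100, the next allocation is served from 20 to 100, while `cap_base` and `cap_size` still describe the original capability.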

Another problem emerges when we try to free memory. Since everything is managed by capabilities, freeing a piece of memory also involves managing the capability responsible for that memory. So an intuitive thought would be that whenever a piece of memory is freed, the corresponding capability is merged with an adjacent piece of memory managed by a different capability.

However, since capabilities cannot be merged, an alternative choice would be to simply destroy the capability during free. This turns out to be an even bigger problem in Barrelfish.

Imagine the scenario where capability A is partially allocated, with memory from 20 to 100 still free. Later on another piece of memory is freed, and that piece is managed by a capability with base 100 and size 20, so its range covers 100 to 120, which indicates that the two capabilities could be “merged”.

In this case, if the first capability is destroyed, all children of the first capability will also be destroyed, so the already allocated memory from 0 to 20 will be thrown away, which is not desired. If the second capability is destroyed, the first one must also be destroyed in order to create a new capability covering 20 to 120, which still results in the destruction of capability A.

Our assumption here is that the parent or root capability is never destroyed when added to the free list. Whenever a capability needs to be freed, the memory manager is responsible for making sure the capability is only merged with another capability from the same parent capability.

This is done by creating another list of nodes that tracks all parent capabilities. A node is only added to it when the memory manager adds new capabilities to the free list. After the user initiates a free, the memory manager first creates a new free struct mmnode, then finds the node’s parent node and copies the parent’s capability and attributes to the newly created node, with an updated offset to indicate that the memory hasn’t been freed yet.

After that, the memory manager inserts the node into the free list. If the memory manager finds that there are capabilities adjacent to the just-added node, then we simply need to update the attributes of the corresponding mmnode to indicate that merging succeeded. The old mmnode is simply thrown away.

The advantage of this implementation is that root or parent capabilities are kept around and the next retype will be fairly simple. The implementation is also very straightforward.
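To make the merging rule concrete, here is a minimal Python sketch of a free list that only merges adjacent ranges derived from the same parent capability. It is a simplified model, not the Barrelfish code: `MMNode`, `parent_id` and `free_range` are hypothetical names, and capabilities are reduced to plain integer ids.

```python
from dataclasses import dataclass


@dataclass
class MMNode:
    parent_id: int  # hypothetical id of the parent (root) capability
    base: int       # start of the free range
    size: int       # length of the free range


def free_range(free_list, parent_id, base, size):
    """Add a freed range and merge it with adjacent ranges of the same parent."""
    node = MMNode(parent_id, base, size)
    result = []
    for other in sorted(free_list, key=lambda n: n.base):
        same_parent = other.parent_id == node.parent_id
        if same_parent and other.base + other.size == node.base:
            # `other` ends exactly where the new node starts: extend downwards
            node = MMNode(parent_id, other.base, other.size + node.size)
        elif same_parent and node.base + node.size == other.base:
            # the new node ends exactly where `other` starts: extend upwards
            node = MMNode(parent_id, node.base, node.size + other.size)
        else:
            result.append(other)  # not adjacent, or a different parent
    result.append(node)
    result.sort(key=lambda n: n.base)
    return result


free_list = [MMNode(parent_id=1, base=30, size=70)]  # free range 30..100
print(free_range(free_list, 1, 100, 20))  # same parent: one node 30..120
print(free_range(free_list, 2, 100, 20))  # other parent: stays separate
```

Only the bookkeeping node changes on a merge; the parent capability itself is never touched, which is the point of the design described above.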

There are of course more efficient solutions than a linked list. For example, Linux uses both a linked list and a red-black tree to store thread information. The redundant data structures can each be used in the scenarios where they fit best. However, we only use this simplified version as a proof of concept. Optimizations vary, but the general concept still works pretty well.

Pascal GPU memory and cache hierarchy

Tue, 15 Jan 2019 00:00:00 +0000

Efficient memory access is essential to fully utilizing the computational power of graphics processing units (GPUs). However, GPU vendors like NVIDIA keep many details of the GPU memory hierarchy secret. This makes it hard to measure GPU performance and sets up barriers to understanding memory access patterns, which are a key component of improving a program’s performance.

We introduce a novel fine-grained micro-benchmark approach and apply it to the Pascal generation. The Turing architecture might give different results, but the method we use here can be applied to it as well with slight modification. The method in this guide is inspired by the research paper Dissecting GPU Memory Hierarchy through Microbenchmarking. Here we will explain how P-Chase works and walk through a small example.

Memory Hierarchy Overview

The GPU memory hierarchy differs from the CPU memory hierarchy. In CUDA terminology, GPU memory space can be categorized into these groups: register, constant memory, shared memory, texture memory, local memory, and global memory. Each memory space has its own properties. Since we are interested in the cache systems, here is a picture demonstrating the memory hierarchy of an NVIDIA GPU:

Image source

The characteristics of each memory space can be found in the NVIDIA CUDA C Programming Guide. Here we will focus on the memory spaces we are interested in. The paper lists some properties of our target memory spaces:

Memory	Type	Cached	Scope
Global	R/W	Yes	All Threads
Shared	R/W	N/A	Thread Blocks
Texture	R	Yes	All Threads

Even though the paper targets the Fermi, Kepler and Maxwell generations of GPUs, the properties in the table still hold for Pascal GPUs and possibly Turing as well. The cached global/texture memory uses a two-level caching system. The L1 cache is located in each streaming multiprocessor (SM), while the L2 cache sits outside the SMs and is shared among all of them. It is unified for instruction, data and page table access. According to the CUDA documentation, like Maxwell, Pascal combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function was previously served by the separate L1 cache in Fermi and Kepler.

The page table is used by the GPU to map virtual addresses to physical addresses, and is usually stored in global memory. The page table is cached in the TLB to reduce memory access latency. When a thread cannot find the page entry in the TLB, it accesses global memory to search the page table, which introduces significant memory access latency.

The GPU-specific shared memory is located in the SMs. On Fermi and Kepler devices, it shares memory space with the L1 data cache. On Maxwell and Pascal devices, it has a dedicated space, since the functionality of the L1 and texture caches has been merged. One thing to note here is that shared memory is accessed per thread block. Thread blocks remain limited to 48 KB of shared memory in Pascal, while the Pascal tuning guide lists 64 KB of shared memory per SM for GP100 and 96 KB per SM for GP104. Therefore, NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block. This would, for example, allow at least two thread blocks to fit per GP100 SM, or 3 thread blocks per GP104 SM.

However, we should be careful: by default, GP100 caches global loads in the L1/Texture cache. In contrast, GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless the LDG read-only data cache mechanism introduced in Kepler is used. As with previous architectures, GP104 allows the developer to opt in to caching all global loads in the unified L1/Texture cache by passing the -Xptxas -dlcm=ca flag to nvcc at compile time. Even though both GP100 and GP104 belong to the Pascal family, we only focus on GP100 here because that’s the GPU we use. Another thing to notice is that unlike Maxwell but similar to Kepler, Pascal caches thread-local memory in the L1 cache. This can mitigate the cost of register spills compared to Maxwell. To illustrate our point, we checked both cudaDevAttrGlobalL1CacheSupported and cudaDevAttrLocalL1CacheSupported on a Tesla P100 and a GTX 1080 and found both attributes to be 1.

In addition to the L2 data cache, global memory data that is read-only for the entire lifetime of a kernel can be cached in the read-only data cache on devices of compute capability 3.5 or above. We will also explore the size of this read-only cache using the __ldg() intrinsic.

P-Chase

Most existing GPU microbenchmark studies on cache architecture assume a classical set-associative cache model with the least recently used (LRU) replacement policy, the same as the conventional CPU cache. So here we will use this assumption and proceed with our experiments. Here are some notations we will use throughout this post.

Notation	Description	Notation	Description
C	Cache Size	N	array size
b	cache line size	s	stride size
a	cache associativity	k	iterations
T	number of cache sets	r	cache miss rate

Under our assumptions, data is loaded from main memory into the cache in units of a cache line. The number of words in a cache line is referred to as the line size ($b$). For the LRU set-associative cache, the cache memory is divided into $T$ cache sets, each of which consists of $a$ cache lines. The following three assumptions are essential when using this kind of cache model:

Assumption 1 All cache sets have the same size. The cache parameter should satisfy $T \cdot a \cdot b = C$.
Assumption 2 In the memory address, the bits representing the cache set are immediately followed by the bits representing the offset.
Assumption 3 Cache replacement policy should be LRU.

We will see later why these assumptions are essential as we proceed with the experiment. We won’t go through exactly how P-Chase works; the paper does a good job of illustrating it. The takeaway is that we need to traverse an array with one more element than the cache can hold, so that cache misses start to occur periodically. An array with no more elements than the cache capacity always results in cache hits, so no access overhead is introduced after all data is loaded into the cache. This is the algorithm the paper proposes, and we will use it for the experiment:

__global__ void KernelFunction(...) {
    // declare shared memory space
    __shared__ unsigned int s_tvalue[];
    __shared__ unsigned int s_index[];
    preheat the data; // implementation varies
    for (it = 0; it < iter; it++) {
        start_time = clock();
        j = my_array[j];
        // Store the array index. This line is essential: due to
        // instruction-level parallelism (ILP), clock() may overlap
        // with the previous instruction and even return before it
        // finishes, so end_time = clock() could return before
        // j = my_array[j] completes. Because s_index[it] = j has a
        // data dependency on the load, the memory access has to be
        // over before end_time = clock() starts.
        s_index[it] = j;
        end_time = clock();
        // store the access latency
        s_tvalue[it] = end_time - start_time;
    }
}

The steps are the same as the paper proposes, so here we show the paper’s method:

Determine cache size C. We set s to 1. We then initialize N with a small value and increase it gradually until the first cache miss appears. C equals the maximum N where all memory accesses are cache hits.
Determine cache line size b. We set s to 1. We begin with N = C + 1 and increase N gradually again. When N < C + b + 1, the numbers of cache misses are close. When N is increased to C + b + 1, there is a sudden increase in the number of cache misses, even though we only increase N by 1. Accordingly we can find b. Based on the memory access patterns, we can also get a general idea of the cache replacement policy.
Determine number of cache sets T. We set s to b. We then start with N = C and increase N at the granularity of b. Every increment causes cache misses in a new cache set. When N > C + (T − 1)b, all cache sets are missed. We can then deduce T from the cache miss patterns.
Determine cache replacement policy. As mentioned before, if the cache replacement policy is LRU, then the memory access process should be periodic and all the cache ways in the cache set are missed. If the memory access process is aperiodic, then the replacement policy cannot be LRU. Under this circumstance, we set N = C + b, s = b with a considerably large k (k ≫ N/s) so that we can traverse the array multiple times. All cache misses are from one cache set. Every cache miss is caused by its former cache replacement because we overflow the cache by only one cache line. We have the accessed data indices, so we can reproduce the full memory access process and find how the cache lines are updated.
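The first two steps can be tried out on the host with a small simulation. The sketch below models an LRU set-associative cache that satisfies the three assumptions and replays the P-Chase traversal against it. Everything in it is illustrative: the class and function names are made up, and the toy geometry (T = 4 sets, a = 2 ways, b = 8 words, so C = 64 words) is an assumption chosen to keep the run short, not a measured GPU value.

```python
from collections import OrderedDict


class LRUCache:
    """Set-associative cache with LRU replacement; addresses are in words."""

    def __init__(self, num_sets, ways, line_words):
        self.T, self.a, self.b = num_sets, ways, line_words
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, word_addr):
        """Return True on a hit, False on a miss."""
        line = word_addr // self.b
        cache_set = self.sets[line % self.T]  # set bits sit right above the offset
        if line in cache_set:
            cache_set.move_to_end(line)       # mark as most recently used
            return True
        if len(cache_set) == self.a:
            cache_set.popitem(last=False)     # evict the least recently used line
        cache_set[line] = True
        return False


def chase_misses(cache, n_words, stride=1, passes=2):
    """Walk an array of n_words `passes` times; return the misses of the last pass."""
    misses = 0
    for _ in range(passes):
        misses = 0
        for addr in range(0, n_words, stride):
            if not cache.access(addr):
                misses += 1
    return misses


def find_cache_size(make_cache, limit):
    """Step 1: the largest N for which a warmed-up traversal never misses."""
    n = 1
    while n <= limit and chase_misses(make_cache(), n) == 0:
        n += 1
    return n - 1


def find_line_size(make_cache, C):
    """Step 2: grow N from C + 1 until the miss count jumps, which is at C + b + 1."""
    base = chase_misses(make_cache(), C + 1)
    n = C + 1
    while chase_misses(make_cache(), n) == base:
        n += 1
    return n - C - 1


def make_toy():
    return LRUCache(num_sets=4, ways=2, line_words=8)


C = find_cache_size(make_toy, limit=1000)
b = find_line_size(make_toy, C)
print(C, b)  # 64 8 for the toy geometry
```

On real hardware the hit or miss decision comes from the measured latency in s_tvalue rather than from a simulated cache, but the way N is swept is the same.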

Texture L1 Cache and Read-only Data Cache

We use the code with our own data preheat implementation, because the texture L1 cache can potentially be larger than the shared memory. The original code uses the first iteration of the loop in the algorithm as a way to preheat data:

const int it = 6144; // texture L1 may hold more elements, so the
                     // first iteration may not cold-miss on all
                     // elements; some cold misses can spill into
                     // the second iteration, causing confusion
for (int cnt = 0; cnt < it; cnt++) {
    start = clock();
    j = tex1Dfetch(tex_ref, j);
    s_value[cnt] = j;
    end = clock();
    s_tvalue[cnt] = (end - start);
}

However, if the texture L1 cache is larger than the shared memory allowed for each thread block, then some reads in the second loop will trigger cache misses. But such misses are in fact cold misses, not misses that occur after the texture L1 cache is completely filled up. One solution is to increase the iteration count to a much larger number so that the first iteration always fills up the texture L1 cache. Note that if you move the data preheat out into its own loop, such as

for (int cnt=0; cnt < it; cnt++) {
 tmp=tex1Dfetch(tex_ref, tmp);
}

the compiler can optimize this whole step out, and thus nothing actually gets executed.

After we run the modified code, the results show that cache misses start when we set our array size to 6145, indicating that the texture L1 cache can hold 6144 ints, which is equivalent to 24 KB. We also notice that each miss is followed by 7 consecutive hits. This means the cache line size is 8 words (b = 32 bytes). The structure of the texture L1 cache is shown below; notice there are 192 lines in each set:

Set1	Set2	Set3	Set4
1-8	33-40	65-72	97-104
9-16	41-48	…	…
17-24	49-56	…	…
25-32	57-64	89-96	121-128
129-136
…	…	…	…
2969-2976	3001-3008	3033-3040	3065-3072
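The numbers above fit together arithmetically. As a quick check, the following sketch recomputes the cache geometry from the measured values, assuming a 4-byte int and taking the four sets from the table (the variable names are only for illustration):

```python
WORD_BYTES = 4          # size of one int element (assumed 4 bytes)
capacity_words = 6144   # largest array that produces no misses
line_words = 8          # one miss followed by 7 hits
num_sets = 4            # Set1..Set4 in the table

capacity_bytes = capacity_words * WORD_BYTES   # total cache size C
line_bytes = line_words * WORD_BYTES           # cache line size b
total_lines = capacity_words // line_words     # lines in the whole cache
lines_per_set = total_lines // num_sets        # associativity a

# T * a * b = C must hold under Assumption 1
assert num_sets * lines_per_set * line_bytes == capacity_bytes

print(capacity_bytes // 1024, "KB,", line_bytes, "B lines,",
      total_lines, "lines,", lines_per_set, "lines per set")
```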

According to the CUDA documentation, GK110 adds the ability for read-only data in global memory to be loaded through the same cache used by the texture pipeline via a standard pointer, without the need to bind a texture beforehand and without the sizing limitations of standard textures. The read-only data cache is loaded by calling __ldg(const restricted * address). We modified the code used to test the texture L1 cache; the basic logic remains the same.

When the array size is set to 6144 integers, no cache misses occur with the stride set to 32 bytes (s = 32 bytes). As soon as we add one more element to the array, cache misses start occurring. This shows the read-only cache is 24 KB. We then noticed that the misses occur in groups of either 4 or 8. We infer the cache line to be 32 bytes and the replacement policy to be LRU, the same as Maxwell.

When we increase the array to 6248 elements (6144 + 32 × 3 + 8: 6144 is the maximum capacity of the cache, there are 32 consecutive numbers in a set, 32 × 3 causes cache misses in set1, set2, and set3, and only 8 more are needed to cause cache misses in set4 since s = 32 bytes), no cache hits occur. Therefore, we infer the number of cache sets to be 4, each cache line to be 32 bytes, and each set to contain 192 cache lines, the same as the texture L1 cache. The memory mapping seems arbitrary because the hit and miss patterns didn’t follow those of the texture L1 cache.

Map Reduce

Sun, 01 Apr 2018 00:00:00 +0000

I have been intrigued by the name “map reduce” ever since I first heard the term two years ago. But I never put any effort into learning the concept until Chris mentioned it in class, because it will be on the next exam, so I figured I’d better figure out what is going on before it was too late. Just kidding :) But map reduce does borrow a lot of characteristics from traditional relational databases, even though many useful and important RDBMS features are eliminated from the map reduce system. You can check this long list of roasts on map reduce here.

But the intention of this post is not to roast map reduce, so if you absolutely resent how map reduce is such a disgrace to RDBMS, you are in the wrong place. Essentially, MapReduce is a programming model. Users define a map function that processes a key/value pair and produces a set of intermediate key/value pairs; a reduce function then reads these intermediate pairs, merging pairs with the same intermediate key. It is important to realize that MapReduce is a programming model because it allows programmers to follow this model without having to worry about the technical details needed to coordinate operations across a cluster. In fact, the programming model is very easy to understand. Everything you need is already summarized in the name MapReduce.

Basically, the computation takes a set of key/value pairs as input and outputs a set of key/value pairs. The user writes the map function, which takes an input pair and produces a set of intermediate key/value pairs (we will see why the output is intermediate). The MapReduce library takes all intermediate pairs, groups the ones with the same key, and passes them to the reduce function. The reduce function is also written by the user. It takes an intermediate key with a set of values corresponding to that key, and merges those values in the hope of forming a smaller set of values. In practice, the reduce function usually produces zero or just one output value. The intermediate values are supplied to the reduce function via an iterator. There might be occasions where memory doesn’t have enough space for all intermediate values, and thus some values need to be pushed to permanent storage.
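The flow described above can be sketched in a few lines of single-process Python, using the classic word count as the job. This is only a toy model of the programming model: `run_mapreduce` is a hypothetical stand-in for the MapReduce library, and nothing here is distributed or spilled to disk.

```python
from collections import defaultdict


def map_fn(key, value):
    """User-defined map: (document name, text) -> intermediate (word, 1) pairs."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reduce: (word, iterator of counts) -> a single total."""
    yield sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy 'library': run map, group by intermediate key, then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    output = {}
    for inter_key in sorted(groups):
        # values are handed to reduce through an iterator, as in the description
        output[inter_key] = list(reduce_fn(inter_key, iter(groups[inter_key])))
    return output


docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"])  # [3]
```

The user only supplies `map_fn` and `reduce_fn`; the grouping step in the middle is what the library does on the user’s behalf.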

Networks

Mon, 13 Nov 2017 00:00:00 +0000

The concept of a worldwide network of information was introduced long before the technology used to build the internet existed. The first workable prototype came in the late 1960s with the creation of ARPANET (the Advanced Research Projects Agency Network). The famous TCP/IP, or Transmission Control Protocol and Internet Protocol, was developed by Robert Kahn and Vinton Cerf in the 1970s. In the 1980s, research by Tim Berners-Lee gave birth to the World Wide Web, linking hypertext documents into an information system and making them accessible from any node on the network (History of Internet). The implementation of the internet has kept evolving and improving ever since. Today, for most users, the internet feels like smoke and mirrors, since requiring everyone to understand the technical implementation would be way too harsh. Software developers, however, are much more likely to deal with networks at some point. This article is meant to unveil the technical details of networks mainly from a programmer’s perspective, so the focus will be on the software side.

The OS View of Networks

For the operating system, the network is perceived as an extra device. The Network Interface Controller (NIC), a hardware device used to connect the computer to a computer network, is added to the bus. Data can be transferred between memory and the NIC through two methods: DMA or memory-mapped I/O. DMA refers to direct memory access. The name suggests that the hardware is able to read or write memory without the involvement of the CPU. Memory-mapped I/O, on the other hand, means the CPU controls the hardware by reading or writing specific memory addresses, so the CPU itself does the work of moving the data. DMA is usually used for high-bandwidth operations such as disk I/O, while memory-mapped I/O is used for low-bandwidth operations like changing control bits.

Layers of Network

The OSI (Open Systems Interconnection) 7 Layer Model divides network communication into seven layers. Here I will discuss each layer and its corresponding function, starting from the lowest one:

Layer 1: This is the physical layer, which is concerned with the transmission of an unstructured raw bit stream over the physical medium. Thus the protocol data unit (PDU) for this layer is the bit.

Layer 2: This is the data link layer. It is in charge of reliably transferring data frames between two nodes connected by a physical layer. The PDU of this layer is the frame.

Layer 3: This is the network layer. It is in charge of structuring and managing a multi-node network. Examples include addressing, routing, and load control. The PDU for this layer is the packet.

Layer 4: This is the transport layer. It is used to deliver messages error-free, in sequence, and without duplications or losses. The PDU of this layer is the segment or datagram (segment for TCP, datagram for UDP).

Layer 5: This is the session layer. It allows session establishment between processes running on different stations. Layer 5 is often implemented in the OS or a library. From here on, the PDU is generalized to data.

Layer 6: This is the presentation layer. As the name suggests, it formats the data and presents it to the application layer. It can be viewed as the translator for the network. It is usually implemented in the OS or a library.

Layer 7: The final layer is called the application layer. It serves as the window for the users and application processes to access the network.

Note that the Department of Defense Four-Layer Model has only four categories: the Network Access Layer (layers 1-2), the Internet Layer (layer 3), the Host-to-Host Layer (layer 4), and the Process Layer (layers 5-7).

More on Layer 2 Network

There are three types of layer 2 networks: System Area Network(SAN), Local Area Network(LAN), and Wide Area Network(WAN).

File System Design

Mon, 30 Oct 2017 00:00:00 +0000

What exactly is a file system? The general concept is that the file system provides naming organization, such as directories. It manages the physical disk layout: picking the blocks that constitute a file, balancing locality with expandability, and managing free space. It translates from a file name and offset to the actual data block. In a nutshell, it is a servant that manages all the dirty details of moving data between the system and the hardware in an optimal way, which you aren’t required to understand, so you can go on and do other things with your life.

File

Let’s describe the file system bottom-up, by first introducing the most fundamental unit: the file itself. A file is composed of two parts: the metadata and the actual data. The metadata is a file header that holds information about where the data is stored and attributes of the file such as permissions, access time, owner id, size, and so on. One thing to note is that metadata blocks are stored at a location known to the OS, and thus they can be accessed without having to check another data structure. The actual data is the part users care about. There are two kinds of data blocks (there can be more, but we will only discuss two here): the directory data block, which maps file names to file headers, and the file data block, which contains the information we care about.

Design File Layout

There are several factors we need to take into consideration when designing file layout:

Support for sequential and random access. Almost every file operation is either sequential or random.
Lay out the files on the physical disk.
Maintain file location information. This makes sense since we need an agent to keep track of all files because we users are too lazy to do that.
In Unix most files are small in size so we need to support small files, which means block size can’t be too large due to internal fragmentation.
Most disk space is consumed by large files so we also need to support large files and accessing them should be efficient as well.
I/O operations target both types of files.

Block VS Sector

Before we dig deeper into file system design, it’s important to note that the block size of the file system is different from the disk’s block size. According to Practical File System Design, a block is the smallest unit writable by a disk or file system. Everything a file system does is composed of operations done on blocks. A file system block is always the same size as or larger (in integer multiples) than the disk block size. Also, each block consists of consecutive sectors so that sequential access becomes possible. A larger block size increases transfer efficiency, again because of sequential access, since the head doesn’t have to move as many times. It may also be convenient if the block size matches the machine’s page size, because a block then maps onto exactly one page. Many systems allow the transfer of many sectors between interrupts.

Allocation methods

Contiguous Allocation

OS maintains an ordered list of free disk blocks.
OS allocates a contiguous chunk of free blocks when it creates a file.
Placement/allocation policy can be first-fit, best-fit, or worst-fit.
File header specifies starting block and length.
Pros:
- All file data stored contiguously on disk.
- Simple to implement; a bump pointer is a common implementation.
- Best performance for the initial write of a file, due to the locality that results from contiguous allocation.
Cons:
- External fragmentation: some allocations for large files become impossible even though enough free space exists in total, resulting in wasted unallocated space, and files are hard to grow in size.
- Later writes may cause the file to grow which would require it to be copied and moved.
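A minimal sketch of contiguous allocation with a first-fit policy is shown below. The free space is modeled as a sorted list of (start, length) extents, and the returned start block together with the requested length is what a file header would record. The function name and the example extents are made up for illustration.

```python
def allocate_first_fit(free_extents, n_blocks):
    """Allocate n_blocks contiguous blocks using first-fit.

    free_extents is a sorted list of (start, length) tuples.
    Returns (start, new_free_extents), or (None, free_extents) on failure.
    """
    for i, (start, length) in enumerate(free_extents):
        if length >= n_blocks:
            rest = free_extents[:i] + free_extents[i + 1:]
            if length > n_blocks:
                # keep the unused tail of this extent on the free list
                rest.insert(i, (start + n_blocks, length - n_blocks))
            return start, rest
    return None, free_extents


free = [(0, 3), (10, 5), (20, 4)]          # 12 free blocks in total
start, remaining = allocate_first_fit(free, 4)
print(start, remaining)                     # 10 [(0, 3), (14, 1), (20, 4)]

# External fragmentation: 12 blocks are free, but no single extent holds 6.
print(allocate_first_fit(free, 6)[0])       # None
```

The second call shows the main drawback listed above: the request fails even though the total free space would be sufficient.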

Linked Allocation

Files are stored as a linked list of blocks: in each sector, there’s a pointer to the next sector. (This is a hardware-level implementation; we still use blocks for later discussion.)
The file header keeps a pointer to the first and last sector/block allocated to that file.
There are two types of implementations for Linked allocation:
- Linked list of disk blocks, data blocks point to other blocks
- Linked list in a table (FAT file system)
Pros:
- Reduce or eliminate external fragmentation since blocks can fit in if there are free blocks available.
- Easy to grow file just like adding elements into a linked list.
- Sequential access is somewhat efficient (it’s a linked list, what do you expect? O(1)?).
Cons:
- Random access takes linear time, since the list has to be walked from the start.

FAT File System (File Allocation Table)

FAT32 is a very important file system created by Microsoft. It was introduced to solve the volume size problem posed by FAT16. Although named FAT32, only 28 of the 32 bits are actually used and the remaining 4 bits are “reserved for future use”. As a result, a FAT32 partition has a maximum cluster count of 2^28 - 1 (268,435,455). I found this description of FAT32 on StackExchange useful:

Although VFAT was a clever system, it did not address the limitations of FAT16. With the appearance of the FAT32 file system, the maximum number of clusters per partition increased from 65535 to 268,435,455 (2^28 - 1). FAT32 thus allows much bigger partitions (up to 8 terabytes). Although the maximum theoretical size of a FAT32 partition is 8 TB, Microsoft has voluntarily limited it to 32 GB on Windows 9x systems to promote NTFS

FAT32 is implemented in a completely different way. Unlike FFS in UNIX, each entry in the file allocation table (FAT) merely represents a block of data. Each entry is able to point to another entry, so a file that spans multiple blocks is represented by multiple entries in the table. Each file’s file number is the index of its first entry in the table. Thus, in order to locate a specific block of a file, we need to follow the entries in the table sequentially.
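The table walk can be sketched as follows. This is a simplified model rather than the on-disk FAT32 format: the table is a plain Python list, and the `FREE` and `END_OF_CHAIN` markers are placeholder values chosen for the example.

```python
FREE = 0            # placeholder marker for an unused entry
END_OF_CHAIN = -1   # placeholder marker for the last block of a file


def block_chain(fat, first_block):
    """Return every block of a file by following the table from its first entry."""
    chain = []
    block = first_block
    while block != END_OF_CHAIN:
        chain.append(block)
        block = fat[block]      # each entry points to the next block
    return chain


def nth_block(fat, first_block, n):
    """Find the n-th block (0-based) of a file; this takes n table lookups."""
    block = first_block
    for _ in range(n):
        block = fat[block]
        if block == END_OF_CHAIN:
            raise IndexError("file has fewer blocks than requested")
    return block


# Two files: one starts at block 1 (1 -> 4 -> 7 -> 3), one at block 2 (2 -> 5).
fat = [FREE, 4, 5, END_OF_CHAIN, 7, END_OF_CHAIN, FREE, 3]
print(block_chain(fat, 1))    # [1, 4, 7, 3]
print(nth_block(fat, 1, 2))   # 7
```

`nth_block` makes the cost of random access visible: reaching block n of a file requires n sequential lookups in the table.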

Disk Introduction

Wed, 25 Oct 2017 00:00:00 +0000

This chapter is all about disks. Before we start: we won’t go deep into the mechanical part of disk operation; rather, we will focus on general concepts related to disks and algorithms to improve disk performance.

The Evaluation Criteria

Here we are introducing the basic components used to evaluate the performance of disk operation.

Seek Time

This is the time to position the head over the track. The maximum is the time to go from the innermost track to the outermost track. It usually ranges from 10 ms to over 20 ms. However, the average seek time is usually the time to seek 1/3 of the way across the disk.

Head Switch Time

This is the time spent moving from one track on one surface to another track on a different surface. The range is similar to seek time.

Rotation Delay

This is the time spent waiting for the sector to spin underneath the head. It varies depending on how fast the disk rotates.

Transfer Time

The time spent reading or writing a sector as it spins by.

Transfer time: time to move the bytes from disk to memory
Surface transfer time: time to transfer one or more sequential sectors to/from surface after head reads/writes first sector
Host transfer time: time to transfer data between host memory and disk buffer

Disk Head Scheduling (Mainly focusing on HDD)

Now that we’ve looked at the basic performance evaluation criteria for HDDs, it’s reasonable to discuss how to reduce head movement, so that the amount of time spent moving the head from one track to another decreases. This is possible because disk I/O requests can be stored in a queue and reordered. (Note that seek time takes the largest share of the time, so it’s reasonable to reduce it.)

FIFO

This technique is easy to understand: the head moves to the corresponding track based on the order of the queue of requests. Since the requested data can be read/written on random tracks, the performance can be heavily affected.

SSTF (Shortest Seek Time first)

The queue of requests is reordered such that the head always moves to the closest requested track, ignoring the global state of the locations of all requests. This is a greedy algorithm and thus can get trapped in a local optimum.

SCAN/Elevator/LOOK

Simply move the head to one direction until the request that is closest to that end of the disk is reached, then reverse the direction of the head and find the rest of the requests.

Optimization: the head reverses direction as soon as no more requests exist between the current head position and the approaching edge of the disk (LOOK scheduling).

C-SCAN/C-LOOK (“Circular Scan” scheduling)

Move the head in one direction until an edge of the disk is reached and then reset to the opposite edge. Optimization: the head is reset when no more requests exist between the current head position and the approaching edge of the disk (called C-LOOK scheduling).

Note that the only difference between SCAN and C-SCAN is that in C-SCAN, after the head reaches one edge, an optimized jump implemented by the hardware is used to directly move the head to the opposite edge instead of reversing the movement direction.
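To compare the policies, here is a small sketch that computes the service order and the total head movement for a queue of track requests. The request queue and starting head position are a common textbook-style example rather than measurements, the LOOK and C-LOOK variants assume the head initially moves toward higher track numbers, and the C-LOOK jump back is counted as ordinary movement.

```python
def fifo(head, requests):
    """Serve requests strictly in arrival order."""
    return list(requests)


def sstf(head, requests):
    """Always serve the pending request closest to the current head position."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda track: abs(track - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order


def look(head, requests):
    """Sweep upwards to the last request, then reverse and sweep downwards."""
    up = sorted(t for t in requests if t >= head)
    down = sorted((t for t in requests if t < head), reverse=True)
    return up + down


def c_look(head, requests):
    """Sweep upwards, jump back to the lowest request, then sweep upwards again."""
    up = sorted(t for t in requests if t >= head)
    wrapped = sorted(t for t in requests if t < head)
    return up + wrapped


def total_movement(head, order):
    """Sum of track distances travelled when serving `order` from `head`."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total


queue = [98, 183, 37, 122, 14, 124, 65, 67]
start = 53
for policy in (fifo, sstf, look, c_look):
    print(policy.__name__, total_movement(start, policy(start, queue)))
```

For this queue FIFO moves the head by 640 tracks, SSTF by 236, LOOK by 299 and C-LOOK by 322, which illustrates why reordering the queue pays off.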

Partitioning

Disks are partitioned in order to minimize the largest possible seek time, since each partition is a logically separate disk. (It’s merely a collection of cylinders.) More information about partitioning will be covered in the file system chapter.

Other Techniques to Reduce Overhead

To minimize rotational latency and seek time, we can also:

Make disks smaller (less movement distance)
Spin disks faster
Schedule disk operations to minimize head movement(we’ve just discussed)
Lay out data on disk so that related data is on nearby tracks(locality and also less movement)
Place commonly used files on disk
Block size: (note disk is presented with sector address using logical block address converted by the controller)
- Too small: low transfer rate because we need to perform more seeks for same amount of data.
- Too big: internal fragmentation

SSD

The basic advantage of SSD is that it doesn’t have moving parts and thus random access is blazingly fast. It’s implemented using NAND and is non-volatile.

Basic NAND Flash Units

The fundamental unit is a page, which is 4 KB. 128 pages are organized together to form a block of 512 KB. Blocks in turn are the units that form a plane: there are 1024 blocks on one plane, so the size of a plane is 512 MB.

Operations

Read page: fast, on the order of tens of microseconds, compared to milliseconds for a spinning disk.
Write page: can only write to empty page, same as above.
Erase block: (ms) Before a page can be written, all bits in the page need to be set to 1. Note the only way to set bits in a page to 1 is to erase the whole block.
Read and Write occur in page unit.
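The sizes and the write/erase rule can be captured in a short sketch. The constants come from the numbers above; the `FlashBlock` class is a hypothetical model that only tracks whether a page is erased, not the actual bit patterns.

```python
PAGE_SIZE = 4 * 1024        # 4 KB per page
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 1024

BLOCK_BYTES = PAGE_SIZE * PAGES_PER_BLOCK      # 512 KB
PLANE_BYTES = BLOCK_BYTES * BLOCKS_PER_PLANE   # 512 MB


class FlashBlock:
    """Toy NAND block: pages are written individually, erased only as a block."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None means erased

    def read_page(self, index):
        return self.pages[index]

    def write_page(self, index, data):
        if self.pages[index] is not None:
            # a page cannot be overwritten in place
            raise ValueError("page must be erased before it can be written")
        self.pages[index] = data

    def erase(self):
        """Erasing is the only way to make pages writable again."""
        self.pages = [None] * PAGES_PER_BLOCK


block = FlashBlock()
block.write_page(0, b"hello")
block.erase()                  # updating page 0 requires erasing the whole block
block.write_page(0, b"world")
print(BLOCK_BYTES // 1024, "KB per block,", PLANE_BYTES // 2**20, "MB per plane")
```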

Virtual Memory Mechanisms

Thu, 19 Oct 2017 00:00:00 +0000

As we saw in the previous post, all the allocation algorithms we discussed lead to external fragmentation. As time goes by, external fragmentation gets worse, and we need solutions to the problem. We can use swap areas to swap memory out onto the disk, or move allocated memory together (a process named memory compaction), leaving the empty spaces together. Even though these approaches can reduce external fragmentation and allow a higher degree of multiprogramming, they are not perfect. In addition, it is possible to have a single process that is just too big to fit into memory. We are going to discuss methods used to completely eliminate external fragmentation. More than that, we will discuss how to make memory sharing possible and how to allow more processes to execute at once.

Too big to fit

It’s easy for us to assume that the amount of memory available for allocation is not a big problem. Programmers tend to assume the available memory is almost infinite, and thus we rarely care about the situation in which the code we write occupies all of it. But let’s consider the scenario where we write a program that later creates a process that is just too big to fit into memory. What should we do?

The natural response would be: just cut it into pieces! This is a technique called overlay: programmers manually cut the program into pieces, or overlays. When the program executes, an overlay manager swaps pieces in and out, keeping only the necessary pieces in memory at any given time. But tell me, when was the last time you saw a user-level application manually cut into “pieces” by the programmer? Doing things manually is not a desirable trait in a programmer. Programmers should always be lazy and automate things, or just leave it to someone else!

Paging

I’m pretty sure you don’t like the idea of overlaying as it requires you to do things manually. That’s where paging comes into play. Instead of dividing the program by the programmer, why don’t we let the system do the dirty job? Before we start, I’m going to throw two questions to you: why can a virtual address space be bigger than the physical memory? How are each piece of a process brought into the memory?

The technique of dividing an address space into fixed-size pages is called paging. A copy of the address space is stored on disk. Physical memory is viewed as a series of equal-sized page frames. We will discuss later how the system chooses which frame to load a page into and how it manages pages that are currently in memory.

So how do we use virtual addresses with our newly introduced pages to find a location in memory? As we have seen, a virtual address space is divided into pages, each with a fixed number of entries. To identify a page and an entry within it, we need two variables:

p - page number (there are p_MAX pages)
o - page offset (the difference between the byte address we are looking for and the start of its page; the _MAX suffix denotes a total count, so o_MAX is the number of entries in a page)
Virtual address calculation: $o_{MAX} \times p + o$ (here o is the offset within page p)
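To make the formula concrete, here is a minimal sketch in Python. The 4 KB page size and the function names are my own assumptions for illustration; the idea is simply that integer division by the page size gives p and the remainder gives o.

```python
PAGE_SIZE = 4096  # assumed o_MAX: number of bytes in one page (4 KB)

def split_virtual_address(vaddr, page_size=PAGE_SIZE):
    """Return (page number p, offset o) for a virtual address."""
    return divmod(vaddr, page_size)

def join_virtual_address(p, o, page_size=PAGE_SIZE):
    """Rebuild the virtual address as o_MAX * p + o."""
    return page_size * p + o

p, o = split_virtual_address(0x12345)
print(p, hex(o))                        # 18 0x345
print(hex(join_virtual_address(p, o)))  # 0x12345
```

Because the page size is a power of two, real hardware does this split by slicing bits rather than dividing, but the result is the same.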

The frame size is equal to the page size. This is easy to understand: since everything stored in a page has to fit into a frame, the two need to be the same size. Note that since the virtual address space can be bigger than physical memory, the number of frames can be smaller than the number of pages, which means the number of bits reserved for the frame number can be smaller than the number of bits used for the page number.


From Virtual to Physical: Allocation Policy

We’ve discussed how a process’s virtual address space can be divided into pages and mapped to frames in physical memory. Here we are going to discuss some policies used to implement the mapping. I’m going to leave three questions for you to think about here as well: why are pages arbitrarily located in physical memory? How do we find them if they are arbitrarily located in physical memory? Why aren’t all pages mapped to frames? The answers will become clearer as the discussion progresses.

Here’s the solution: a page table. Each process has one page table that contains the mapping information for every possible virtual address belonging to that process. Even though we call it a table, it’s merely a data structure used by the operating system’s virtual memory system to store the mapping between virtual addresses and physical addresses. The mapping is invisible to the process. The protection mechanism is the same as in the dynamic relocation we discussed before.

Virtual Address Translation

Now let’s go through a step-by-step description of how to translate a virtual address into a physical address.

First, the process gives the CPU a virtual address to translate.
Then the MMU splits the address into two parts: the page number and the offset.
Since a page and a frame are the same size, the offset of the virtual address is passed along to physical memory without modification.
The page number is used to find the corresponding entry in the page table.
The entry is checked to see whether the page exists in physical memory.
If the page does exist in physical memory, the frame number is sent along. If the requested page is on disk, the page is first moved into memory and its frame number is recorded in the entry.
The offset is appended to the end of the frame number to get the physical address.
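The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the page table is a plain dict, `PageTableEntry` and `free_frames` are hypothetical names, and "loading from disk" is reduced to grabbing a free frame (no replacement policy, no actual I/O).

```python
PAGE_SIZE = 4096  # assumed page/frame size in bytes

class PageTableEntry:
    def __init__(self, frame=None, resident=False):
        self.frame = frame        # frame number, meaningful only when resident
        self.resident = resident  # True if the page is in physical memory

def translate(vaddr, page_table, free_frames):
    """Translate a virtual address with a single-level page table (a dict)."""
    page, offset = divmod(vaddr, PAGE_SIZE)  # split into page number and offset
    entry = page_table[page]                 # find the page table entry
    if not entry.resident:                   # the page is still on disk
        entry.frame = free_frames.pop()      # hypothetical: take any free frame
        entry.resident = True                # (copying the page in is omitted)
    return entry.frame * PAGE_SIZE + offset  # append the offset to the frame number

page_table = {0: PageTableEntry(frame=5, resident=True), 1: PageTableEntry()}
free_frames = [9]
print(hex(translate(0x0123, page_table, free_frames)))  # 0x5123
print(hex(translate(0x1004, page_table, free_frames)))  # 0x9004
```

The second call touches a page that is not resident, so the sketch assigns it frame 9 before finishing the translation. The offset is never changed along the way.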

So, by using paging we’ve achieved several goals:

Reduce or even eliminate external fragmentation.
Easy to grow processes.
Allow processes that are too big to fit into memory to execute.
Easy to allocate and deallocate.
Sharing memory between processes is easy since the memory used by a process no longer has to be contiguous. Even though the pages may sit at different positions in each process’s address space, they can be mapped to the same physical address in memory.

More about Page Table

One thing to notice is that there’s only one page table for each process. The page table is part of the process’s state and serves as a protection mechanism that prevents processes from accessing each other’s memory. Each page table entry also contains several elements:

Flags: such as the dirty bit and the resident bit (we will talk about them later). The flags are stored at the beginning of each entry.
Frame number: stored in the remaining bits. It tells where the page lives in physical memory.

However, the page table still has its disadvantages. The most important one is that each virtual memory reference now takes two memory accesses: the first fetches the page table entry, and the second fetches the actual data from memory, if it’s present. As we know, memory access is slow and expensive compared with the CPU, so we need something faster.

Translation Look-aside Buffer (TLB)

Since it’s hard to improve the speed on the algorithm side, let’s drop the algorithm for a minute and switch our focus to the hardware. Here we will discuss how to speed up memory references by adding a hardware component called the TLB. Here are several basic characteristics of the TLB:

The TLB holds recently used frame/page pairings.
It has high hit ratios due to locality.
For a TLB hit, translation can be finished in one cycle.

So how does the TLB help with efficiency? It’s actually really simple. The system sends the page number to both the page table and the TLB simultaneously. If there’s a TLB hit, the TLB sends the frame number to memory without having to look into the page table, which avoids the first memory reference to read the page table. If there’s a TLB miss, everything stays the same: look up the page table in memory and update the TLB.
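Here is a small software model of that hit/miss behavior. It is a sketch, not how hardware is built: the `TLB` class, its capacity of 4, and the LRU eviction policy are all my assumptions, and the lookup checks the TLB first instead of querying both structures in parallel.

```python
from collections import OrderedDict

class TLB:
    """Tiny fully associative TLB with LRU replacement (an assumed policy)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number -> frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:              # TLB hit: no page table access
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        self.misses += 1                      # TLB miss: consult the page table
        frame = page_table[page]
        self.entries[page] = frame            # update the TLB
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used pair
        return frame

page_table = {p: p + 100 for p in range(16)}  # made-up page -> frame mapping
tlb = TLB(capacity=4)
for page in [0, 1, 0, 1, 2, 0, 1, 2]:         # accesses with strong locality
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)                   # 5 3
```

Only the first touch of each page misses. Every later access to the same page is served from the TLB, which is exactly why locality gives high hit ratios.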


Problems with Page Table

Now we’ve solved the problem of external fragmentation. It seems paging works like a charm and makes things a lot easier. However, it’s still not perfect in terms of space usage:

Data structure overhead (The page table can be huge!)
An inappropriate page size can lead to internal fragmentation, and fewer processes can exist in memory at the same time (when pages are too big)!

Thus we need more complex methods to solve the above issues.

Multi-level Page Tables

The basic concept of a multi-level page table is to subdivide the page number into n parts (n is the number of page table levels). n is decided by the architecture of the system. Each entry in a page table above the last level points to the corresponding page table in the next level. The last-level page table entries hold the flag bits and frame number associated with the page.

So how does it work exactly? First, there is only one first-level page table. We extract the first subdivision of the virtual address and add it to the PTBR (page table base register) to get the corresponding entry in the first-level page table. Then we extract the second subdivision of the virtual address and add it to the starting address of the second-level page table, which we got from that first-level entry. This process continues until we reach the last-level page table. From the corresponding entry we get the frame number. The offset is preserved, so we just need to append the offset to the frame number and we’re done! One reminder is that a multi-level page table requires several lookups to eventually find the frame number, so the TLB becomes extremely important here for performance.
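The walk can be sketched for the two-level case as follows. The 10/10/12 bit split and the nested-dict representation are assumptions for illustration; a real table lives in memory and is indexed from the PTBR, and a missing entry here simply raises `KeyError` instead of triggering a fault handler.

```python
OFFSET_BITS = 12  # assumed 4 KB pages
INDEX_BITS = 10   # assumed 10 index bits per level

def walk(vaddr, first_level):
    """Walk a two-level page table stored as nested dicts."""
    mask = (1 << INDEX_BITS) - 1
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    idx2 = (vaddr >> OFFSET_BITS) & mask                 # second subdivision
    idx1 = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & mask  # first subdivision
    second_level = first_level[idx1]  # first lookup: find the second-level table
    frame = second_level[idx2]        # second lookup: find the frame number
    return (frame << OFFSET_BITS) | offset  # append the offset to the frame number

first_level = {1: {2: 0x42}}  # only one second-level table exists
vaddr = (1 << 22) | (2 << 12) | 0xABC
print(hex(walk(vaddr, first_level)))  # 0x42abc
```

Note that `first_level` only holds the second-level tables that are actually in use, which is the property that saves space.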

How does a multi-level page table save space?

You’re probably still confused about how a multi-level page table saves space by adding more tables. Don’t worry, I will walk you through an example to illustrate the magic behind the scenes :)

Assume a process has $2^{20}$ pages, with each PTE occupying 4 bytes (32-bit system). Without a multi-level page table, we need $2^{20} \times 4 = 4MB$ for one page table stored in memory. Even if we need only a portion of the pages, the whole page table has to be present in memory to find the corresponding frames. Now, if we divide the virtual address into 3 sections, with the last one being the offset, we have a two-level page table. The first 10 bits index the first-level page table and the next 10 bits index a second-level page table. If we only use virtual addresses that differ in the second 10 bits and share the same first 10 bits, then we only need one entry in the first-level page table. Since the first-level page table always has to be present in memory, it consumes $2^{10} \times 4 = 4KB$ of memory. We also need the second-level page table pointed to by that first-level entry, which requires another $2^{10} \times 4 = 4KB$ of memory. So we only need 4 + 4 = 8KB in total, instead of the 4MB needed without multi-level page tables.

Another interesting fact is that if we need to use all the pages of a process, a multi-level page table will actually increase the space needed. Let’s take the above example and assume we need every single page of the process. We need to store the first-level page table, which takes $2^{10} \times 4 = 4KB$. Then, for each entry in the first-level page table, there’s a corresponding second-level page table of size $2^{10} \times 4 = 4KB$. Since the first-level table has $2^{10}$ entries, there are $2^{10}$ second-level page tables, so the total space is $2^{10} \times 4KB + 4KB = 4MB + 4KB$. The bottom line is: if we need to map every page to its frame, the total number of entries in the last level equals the number of pages regardless of how many levels we use, since each page has to have a mapping. In that case, the memory used by the last-level page tables is the same as the amount used by one huge page table. The additional space comes from the upper levels, each of which only stores as many entries as there are tables in the next level.
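If you want to check the arithmetic of both cases yourself, here is a short sketch. The constants mirror the example (4-byte PTEs, 10 index bits per level); the function names are made up for illustration.

```python
PTE_SIZE = 4                 # bytes per page table entry
ENTRIES_PER_TABLE = 2 ** 10  # 10 index bits per level
TOTAL_PAGES = 2 ** 20        # pages in the address space
KB = 1024

def single_level_bytes():
    """One flat table with an entry for every page."""
    return TOTAL_PAGES * PTE_SIZE

def two_level_bytes(second_level_tables_in_use):
    """The first-level table plus only the second-level tables in use."""
    table_bytes = ENTRIES_PER_TABLE * PTE_SIZE
    return table_bytes + second_level_tables_in_use * table_bytes

print(single_level_bytes() // KB)      # 4096 (KB), i.e. 4MB
print(two_level_bytes(1) // KB)        # 8 (KB)
print(two_level_bytes(2 ** 10) // KB)  # 4100 (KB), i.e. 4MB + 4KB
```

The saving depends entirely on how sparse the address space is: the fewer second-level tables a process touches, the smaller the total.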


Virtual Memory Overview

Sun, 08 Oct 2017 00:00:00 +0000

I love pointers. Pointers are a very useful feature in programming languages like C/C++. I can pass weird hexadecimal numbers to a function and it will magically locate where the program is in memory. However, all those values we see are merely virtual addresses, a running program’s view of memory in the system. Any address we can see while writing user-level programs is a virtual address. It is no more than an illusion of where the data is actually laid out in memory. Only the almighty OS knows where the data is actually located in physical memory.

Three Goals of VM

When the OS provides this layer of abstraction to translate virtual addresses to physical addresses, we say that the OS is virtualizing memory. In order to achieve virtualization, there are three goals to keep in mind.

Transparency: The OS should implement virtual memory in a way that is invisible to the running program. Usually, transparency would suggest a clear understanding of how things work. However, when we are talking about VM, it means the running program is unaware of the fact that its memory is translated by the OS (and hardware) behind the scenes.

Efficiency: The OS should do its best to ensure the efficiency of virtualization, in both time and space. Some of the methods used to improve efficiency, including hardware components like the translation look-aside buffer (TLB), will be discussed in the following chapter.

Protection: Being able to protect processes from interfering with each other, or with the OS, is important. When one process performs actions like reads and writes, it should be isolated so that it’s unable to modify the data of other processes or behave maliciously.

Basic Concept: Address Space

Before we start, there are several terms I will use throughout the discussion so it’s better to get familiar with them.

Physical address space: a collection of the physical addresses used by the hardware. An address can range from 0 to MAX_sys. Physical memory uses the address to fetch the contents stored there.

Logical/Virtual address space: a collection of the addresses a process is able to access (the process is not aware of the actual physical location). It can be bigger than the physical address space thanks to a technique called paging, which will be discussed later.

Segment: A chunk of memory assigned to the process to use.

How does an address get generated

Uniprogramming: This is a very simple scenario. There’s only one process executing at a time, its addresses always start at 0, and the OS is loaded into a fixed part of memory. The process executes in a contiguous section of memory. It can use all the available memory as long as it doesn’t touch the OS’s portion.

Relocation

As in uniprogramming, the OS sits at the highest location in memory. The OS allocates a contiguous segment of memory for each process to use. If there isn’t enough space, it simply waits for a running process to terminate. The relocation address (also known as the base address) is the first physical address a process can use. The limit address is the largest physical address the process can use.

Also note there are two types of relocation:

static: The address is produced at load time and the process is loaded into the given location to execute. The OS can’t move the process once it is loaded. Note that static addresses can be changed in both the linking and the loading stage. The linking stage might involve library routines, and the loading stage might increment the base address by the amount of memory used by processes already present in memory. Note that with static relocation, the addresses the program sees are actual physical addresses, not virtual addresses.
dynamic: The physical address is obtained by adding the base register to the virtual address. The result has to be less than the bound register or the processor will raise an exception.
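Dynamic relocation is simple enough to sketch directly. In the snippet below, `AddressError` stands in for the processor exception, and the base and bound values are made up; the bound is treated as a physical limit, matching the description above.

```python
class AddressError(Exception):
    """Stands in for the exception the processor would raise."""

def relocate(vaddr, base, bound):
    """Dynamic relocation: physical address = base + virtual address."""
    paddr = base + vaddr
    if vaddr < 0 or paddr >= bound:  # the result must stay below the bound
        raise AddressError(f"address {vaddr:#x} is outside the segment")
    return paddr

BASE, BOUND = 0x4000, 0x6000  # hypothetical register values
print(hex(relocate(0x10, BASE, BOUND)))  # 0x4010
try:
    relocate(0x2000, BASE, BOUND)        # 0x6000 is not below the bound
except AddressError as err:
    print("fault:", err)
```

Moving the process only requires copying its memory and updating the base and bound registers, because the process itself never sees a physical address.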

Problem with relocation:

Even though the concept of relocation is easy to understand, it can easily lead to problems with unused memory space. The biggest problem is external fragmentation. When a process finishes executing and the memory it occupied is deallocated, it leaves a “hole” behind that becomes available for other processes to use. The problem is that, since the memory size of the previously running process can be arbitrary, the “hole” it leaves behind may be too small for other processes to fit in. Even if it is big enough for another process, the leftover space may eventually break down into fragments that are too small for any process to use, which is external fragmentation. There’s also another problem called internal fragmentation, but it won’t be discussed here.

How to minimize external fragmentation?

As we can see, relocation leaves room for external fragmentation, which is a problem since the unused space is wasted. Are there any methods we can use to minimize external fragmentation and make better use of the free blocks? We will discuss three policies that can help. A reminder: they can’t completely eliminate external fragmentation, only reduce it to a certain level.

Policies

First-Fit policy

Find the first free block that can hold the requested amount of memory.

Requirements:

All free blocks are tracked in a list and sorted by address.
Allocation requires a search through the list to find the first free block that fits the request.
After deallocation, the freed block might need to be merged with other free ones.

Pros:

Very easy to implement.
Tends to produce large free blocks towards the end of the address space.

Cons:

Allocation is slow.
It will eventually lead to fragmentation.
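Here is a minimal first-fit sketch that follows the requirements above: an address-sorted free list, a linear search on allocation, and merging on deallocation. The `FirstFitAllocator` class and its method names are hypothetical, and the caller is trusted to pass back the right size when freeing.

```python
class FirstFitAllocator:
    def __init__(self, size):
        self.free = [(0, size)]  # free blocks as (start, size), sorted by address

    def allocate(self, size):
        for i, (start, block_size) in enumerate(self.free):
            if block_size >= size:         # the first block that fits
                if block_size == size:
                    del self.free[i]
                else:                      # the leftover becomes a smaller hole
                    self.free[i] = (start + size, block_size - size)
                return start
        return None                        # no single hole is big enough

    def deallocate(self, start, size):
        self.free.append((start, size))
        self.free.sort()                   # keep the list sorted by address
        merged = [self.free[0]]
        for s, sz in self.free[1:]:
            last_start, last_size = merged[-1]
            if last_start + last_size == s:  # adjacent holes: merge them
                merged[-1] = (last_start, last_size + sz)
            else:
                merged.append((s, sz))
        self.free = merged

heap = FirstFitAllocator(100)
a = heap.allocate(30)
b = heap.allocate(30)
c = heap.allocate(30)
heap.deallocate(a, 30)
heap.deallocate(c, 30)
print(heap.free)          # [(0, 30), (60, 40)]
print(heap.allocate(50))  # None
```

The last line is external fragmentation in action: 70 bytes are free in total, but no single hole can hold 50.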

Best-Fit policy

Find the smallest free block that can hold the requested memory.

Goal:

Avoid fragmenting huge free blocks.
Minimize the size of external fragmentation.

Requirements:

All free blocks are tracked in a list and sorted by address.
Allocation still needs a search in the list to find a suitable block to fit in.
After deallocation, the freed block may need to be merged with other free blocks.

Pros:

Kinda simple to implement
Reduces external fragmentation; works well when allocation sizes are small.

Cons:

Still leaves room for external fragmentation.
When a block is deallocated and merged back, the new free block may need to be re-sorted into the list.

Worst-Fit policy

Find the largest free block to allocate the requested number of bytes.

Goal:

To avoid having too many small free segments (reduce external fragmentation).

Requirement:

Free blocks are sorted by size.
Allocation is fast since the first block is always the largest one.
After deallocation, we need to check whether the new free block should be merged with its neighbors, and re-sort the list.

Pros:

Works great if all allocations are of medium size.

Cons:

Still leaves external fragmentation.
Deallocation is slow (need to re-sort and merge).
Tends to break down large free blocks, which can lead to failure to allocate large blocks.
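To compare the three policies side by side, the sketch below only shows which free block each one would pick for the same request. The free list and the request size are made-up values, and splitting and merging are left out on purpose.

```python
def first_fit(free_blocks, size):
    for block in sorted(free_blocks):  # scan in address order
        if block[1] >= size:
            return block
    return None

def best_fit(free_blocks, size):
    candidates = [b for b in free_blocks if b[1] >= size]
    return min(candidates, key=lambda b: b[1], default=None)  # smallest that fits

def worst_fit(free_blocks, size):
    candidates = [b for b in free_blocks if b[1] >= size]
    return max(candidates, key=lambda b: b[1], default=None)  # largest block

free_blocks = [(0, 50), (100, 20), (200, 80)]  # (start address, size)
print(first_fit(free_blocks, 15))  # (0, 50)
print(best_fit(free_blocks, 15))   # (100, 20)
print(worst_fit(free_blocks, 15))  # (200, 80)
```

For this request, first-fit leaves a 35-byte remainder, best-fit leaves a tiny 5-byte hole, and worst-fit chips away at the only block that could have served a large request later.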

Techniques to further reduce and even eliminate external fragmentation will be discussed later.

Dynamic Relocation

Advantages

Processes can move during execution.
Processes can grow over time.
It’s easy to provide protection since we only need two registers.
Fast and simple

Disadvantages

Allocation is contiguous.
Sharing memory is hard since there’s no way to set the base and bound registers to be the same for more than one process.
Multiprogramming is limited since all active processes have to be loaded into memory, which means physical memory limits how many processes can be loaded. (Swapping might help, but all active processes still need to be in memory.)
An addition is needed on every memory reference.
Memory management is a mess.
Everyone has the same permission throughout the address space, which can potentially create problems.
