<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>std::bodun::blog</title><link>https://www.bodunhu.com/blog/</link><description>PhD student at University of Texas at Austin 🤘. Doing systems for ML.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><atom:link href="https://www.bodunhu.com/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>Four Years into PhD</title><link>https://www.bodunhu.com/blog/posts/four-years-into-phd/</link><pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/four-years-into-phd/</guid><description>&lt;p&gt;I just submitted another paper to &lt;strong&gt;SOSP 2025&lt;/strong&gt;, and it’s hard to believe it’s been nearly &lt;strong&gt;four years&lt;/strong&gt; since I started my PhD. A lot has changed since my &lt;a href="https://www.bodunhu.com/blog/posts/starting-out-phd/"&gt;last post&lt;/a&gt; about my PhD journey—looking back, I seemed pretty desperate then.&lt;/p&gt;
&lt;p&gt;So here I am, reflecting on the past few years. I feel far more confident now—not just in my research decisions but in navigating the space of SysML in general.&lt;/p&gt;
&lt;p&gt;When I started, I wasn&amp;rsquo;t sure about pretty much anything. But one thing I was certain about was ML inference. Admittedly, I didn&amp;rsquo;t grasp its full complexity or what compelling research directions existed, if at all. But I remembered reading in &lt;a href="https://www.usenix.org/system/files/atc21-romero.pdf"&gt;INFaaS&lt;/a&gt; that inference workloads account for &lt;strong&gt;90% of ML infrastructure costs in AWS&lt;/strong&gt;. That fact alone gave me hope—if inference drives such high traffic, it must be, or will become, important in the future.&lt;/p&gt;
&lt;p&gt;Yet I kept wondering: &lt;strong&gt;Is there anything I can do at the model level?&lt;/strong&gt; Many systems ML papers treated models as fixed-sized black boxes with deterministic execution latency and resource consumption. This assumption felt limiting.&lt;/p&gt;
&lt;p&gt;That period was rough. Tons of methods for dynamic DNNs had already been proposed—early exit strategies, Mixture of Experts (MoE), model ensembles—but there just wasn&amp;rsquo;t a clear justification for designing systems specifically to optimize these approaches.&lt;/p&gt;
&lt;p&gt;Honestly, I&amp;rsquo;m not sure how I made it through besides furiously searching &amp;ldquo;dynamic neural network&amp;rdquo;, hoping to find something worth pursuing. Of course, the rise of ChatGPT changed everything, but that&amp;rsquo;s another story.&lt;/p&gt;
&lt;p&gt;Everything shifted after my first paper was accepted to EMNLP. That was the moment I realized publishing isn&amp;rsquo;t as impossible as it seemed. Before, I kept trying to build an end-to-end system, a process that consumed far too much time.&lt;/p&gt;
&lt;p&gt;Instead, I learned that starting with a clear motivation and simply writing it down first is a much better approach. Writing helps untangle confusion—it forces clarity.&lt;/p&gt;
&lt;p&gt;The next few projects moved at a much faster pace. What changed? Honestly, the biggest shift was that I stopped fixating on the future. The long-term uncertainty used to kill my productivity; it felt overwhelming. But I realized that instead of worrying about whether a proposed method might fail, it&amp;rsquo;s far more productive to just write the next paragraph in Overleaf and move forward. After all, if you write a paper, there &lt;strong&gt;exists&lt;/strong&gt; a conference that is willing to accept it.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll stop here and revisit this topic once I fully recover from the SOSP grind. For now, time to rest.&lt;/p&gt;</description></item><item><title>Stitchable Neural Networks</title><link>https://www.bodunhu.com/blog/posts/stichable-neural-networks/</link><pubDate>Sun, 01 Sep 2024 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/stichable-neural-networks/</guid><description>&lt;p&gt;TL;DR: the &lt;a href="https://arxiv.org/abs/2302.06586"&gt;Stitchable Neural Networks&lt;/a&gt; (SN-Net) paper includes some interesting concepts. It creates multiple neural networks with varying complexity and performance trade-offs from a family of pretrained models.&lt;/p&gt;
&lt;h2 id="key-principles"&gt;Key Principles&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;How to choose &lt;strong&gt;anchors&lt;/strong&gt; from well-performing pretrained models in a model family&lt;/li&gt;
&lt;li&gt;The design of stitching layers&lt;/li&gt;
&lt;li&gt;The stitching direction and strategy&lt;/li&gt;
&lt;li&gt;A simple but effective training strategy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A key question about combining sub-networks from different pretrained models is how to maintain accuracy. The paper concludes that the final performance of these combinations is nearly predictable due to an interpolation-like performance curve between anchors. This predictability allows for selective pre-training of stitches based on various deployment scenarios.&lt;/p&gt;
&lt;h2 id="the-choice-of-anchors"&gt;The Choice of Anchors&lt;/h2&gt;
&lt;p&gt;Anchors that are pretrained on different tasks can learn very different representations due to the large distribution gap of different domains. Therefore, the selected anchors
should be consistent in terms of the pretrained domain.&lt;/p&gt;
&lt;h2 id="the-stitching-layer-and-its-initialization"&gt;The Stitching Layer and its Initialization&lt;/h2&gt;
&lt;p&gt;SN-Net is built upon pretrained models. Therefore, the anchors have already learned good representations, which allows us to directly obtain an accurate transformation matrix by solving the least squares problem:&lt;/p&gt;
&lt;p&gt;$$\|AM_o - B\|_F = \min_M \|AM - B\|_F$$&lt;/p&gt;
&lt;p&gt;where $A \in \mathbb{R}^{N \times D_1}$ and $B \in \mathbb{R}^{N \times D_2}$ are two feature maps of the same spatial size but with different numbers of hidden dimensions, and $M \in \mathbb{R}^{D_1 \times D_2}$ is the transformation matrix.&lt;/p&gt;
&lt;p&gt;This problem admits a closed-form solution based on singular value decomposition, where the optimal solution is obtained through an orthogonal projection in the space of matrices:&lt;/p&gt;
&lt;p&gt;$$M_o = A^\dagger B$$&lt;/p&gt;
&lt;p&gt;where $A^\dagger$ denotes the Moore-Penrose pseudoinverse of $A$.&lt;/p&gt;
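&lt;p&gt;To make the initialization concrete, here is a minimal NumPy sketch of it. The function name &lt;code&gt;init_stitch&lt;/code&gt;, the toy sizes, and the random data are my own assumptions for illustration, not code from the paper. It computes $M_o = A^\dagger B$ and cross-checks the result against a generic least-squares solver.&lt;/p&gt;

```python
import numpy as np


def init_stitch(A, B):
    """Least-squares stitching matrix M_o = pinv(A) @ B (hypothetical helper)."""
    return np.linalg.pinv(A) @ B


rng = np.random.default_rng(0)
N, D1, D2 = 64, 8, 12                 # assumed toy sizes: N tokens, D1 and D2 hidden dims
A = rng.standard_normal((N, D1))      # features from the first anchor
M_true = rng.standard_normal((D1, D2))
B = A @ M_true                        # toy case: B is an exact linear map of A

M_o = init_stitch(A, B)
M_lstsq = np.linalg.lstsq(A, B, rcond=None)[0]   # same problem, generic solver
residual = np.linalg.norm(A @ M_o - B)           # Frobenius norm of the mismatch
```

&lt;p&gt;In this toy setup the two solvers should agree and the residual should be close to zero. With real feature maps the residual will generally not vanish, which is presumably why the stitching layers are still trained afterwards.&lt;/p&gt;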
&lt;h2 id="where-to-stitch"&gt;Where to Stitch&lt;/h2&gt;
&lt;p&gt;SN-Net takes Fast-to-Slow as the default stitching direction, meaning it stitches a bigger and slower network after a smaller and faster one to achieve better model performance. It also proposes a &lt;strong&gt;nearest stitching strategy&lt;/strong&gt; that limits stitching to pairs of anchors with the nearest model complexity/performance.&lt;/p&gt;
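&lt;p&gt;As a rough sketch of how I read these two rules together: sort the anchors by cost, then only pair neighbors, with the cheaper anchor first. The helper &lt;code&gt;nearest_stitch_pairs&lt;/code&gt;, the anchor names, and the cost values below are all made up for illustration.&lt;/p&gt;

```python
def nearest_stitch_pairs(anchors):
    """Return (fast, slow) pairs of anchors that are adjacent in cost."""
    ordered = sorted(anchors, key=lambda item: item[1])   # cheapest first
    names = [name for name, _cost in ordered]
    # Fast-to-Slow: the cheaper anchor supplies the early layers,
    # the costlier anchor supplies the later layers.
    return list(zip(names[:-1], names[1:]))


# Hypothetical anchors as (name, relative cost); the values are placeholders.
anchors = [("large", 16.0), ("small", 1.0), ("medium", 4.0)]
pairs = nearest_stitch_pairs(anchors)
print(pairs)
```

&lt;p&gt;Note that &lt;code&gt;small&lt;/code&gt; and &lt;code&gt;large&lt;/code&gt; are never paired directly under this rule, since they are not neighbors in cost.&lt;/p&gt;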
&lt;h2 id="way-to-stitch"&gt;Way to Stitch&lt;/h2&gt;
&lt;p&gt;Prior work shows that neighboring layers dealing with feature maps of the same scale share similar representations. Therefore, SN-Net uses a &lt;strong&gt;sliding window&lt;/strong&gt;, where stitches within the same window share a common stitching layer.&lt;/p&gt;
&lt;p&gt;&lt;img src="stitching.png" alt="sliding-window"&gt;&lt;/p&gt;
&lt;h2 id="stitching-space"&gt;Stitching Space&lt;/h2&gt;
&lt;p&gt;The stitching space is controlled by configuring the sliding window kernel size $k$ and step size $s$.&lt;/p&gt;
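&lt;p&gt;To get a feel for how $k$ and $s$ shape the space, here is a small sketch that enumerates windows over layer indices. The helper &lt;code&gt;sliding_windows&lt;/code&gt; and the 12-layer example are assumptions for illustration, and the paper&amp;rsquo;s exact bookkeeping of stitches may differ.&lt;/p&gt;

```python
def sliding_windows(num_layers, k, s):
    """List the layer indices covered by each window of size k and step s."""
    starts = range(0, num_layers - k + 1, s)
    return [list(range(start, start + k)) for start in starts]


# Hypothetical 12-layer anchor: overlapping windows versus disjoint windows.
dense = sliding_windows(num_layers=12, k=2, s=1)
sparse = sliding_windows(num_layers=12, k=2, s=2)
print(len(dense), len(sparse))
```

&lt;p&gt;Under this reading, each window would get one shared stitching layer, so a larger step size gives fewer windows and a smaller stitching space.&lt;/p&gt;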
&lt;h2 id="training-strategy"&gt;Training Strategy&lt;/h2&gt;
&lt;p&gt;The training algorithm of SN-Net can be described as:&lt;/p&gt;
&lt;p&gt;&lt;img src="stitch-net-train.png" alt="sn-net-training"&gt;&lt;/p&gt;
&lt;p&gt;In short, training proceeds in three steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First, define a configuration set that contains all possible stitches&lt;/li&gt;
&lt;li&gt;Initialize all stitching layers with least-squares matching&lt;/li&gt;
&lt;li&gt;At each training iteration, randomly sample a stitch and train it following the standard training process&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Blog Archive</title><link>https://www.bodunhu.com/blog/posts/blog-archive/</link><pubDate>Wed, 28 Aug 2024 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/blog-archive/</guid><description>&lt;p&gt;This is an archive of blogs I find useful or interesting. Hopefully the updates will keep coming.&lt;/p&gt;
&lt;h2 id="hardware"&gt;Hardware&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.servethehome.com/"&gt;ServeTheHome&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="ml"&gt;ML&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://research.google/blog/"&gt;Google Research Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/blog/"&gt;Microsoft Research Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://magazine.sebastianraschka.com/"&gt;Ahead of AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aisnakeoil.com/"&gt;AI Snake Oil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog"&gt;Hugging Face Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/papers"&gt;Hugging Face Daily Papers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lilianweng.github.io/"&gt;Lil&amp;rsquo;Log&lt;/a&gt;: really good ML learning notes.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ghost.oxen.ai/"&gt;Oxen.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://newsletter.languagemodels.co/"&gt;Language Models &amp;amp; Co.&lt;/a&gt;: many great visual illustrations&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thinkingmachines.ai/blog/"&gt;Thinking Machines Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="security"&gt;Security&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nullprogram.com/"&gt;null program&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="datacenter"&gt;Datacenter&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.semianalysis.com/"&gt;SemiAnalysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="tech-news"&gt;Tech News&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ruanyifeng.com/blog/"&gt;阮一峰的网络日志&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="personal-blog"&gt;Personal Blog&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://antirez.com/latest/0"&gt;Antirez&lt;/a&gt;: author of Redis.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>TensorIR Transformation</title><link>https://www.bodunhu.com/blog/posts/tensorir-transformation/</link><pubDate>Tue, 30 Aug 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/tensorir-transformation/</guid><description>&lt;p&gt;In the previous &lt;a href="https://www.bodunhu.com/blog/posts/dive-into-tensorir/"&gt;post&lt;/a&gt;, we explored how to write primitive functions in TensorIR. Here, we will see how to transform TensorIR into other (potentially more performant) variants. The content is derived from the &lt;a href="https://mlc.ai/summer22/"&gt;MLC&lt;/a&gt; course taught by &lt;a href="https://tqchen.com/"&gt;Tianqi Chen&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="batched-bmm-relu"&gt;Batched BMM ReLU&lt;/h2&gt;
&lt;p&gt;A batched matrix multiplication followed by a ReLU operation can be expressed using NumPy as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lnumpy_mm_relu_v2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Translating the NumPy code into TensorIR, we get:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyBmmRule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bmm_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                 &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                 &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# we must allocate the buffer here!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;Y_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alloc_buffer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;Y_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;Y_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;R&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Our ultimate goal is to transform the TensorIR above to the following form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TargetModule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bmm_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alloc_buffer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax0_init&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_init&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax0_init&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax1_0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax1_1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_update&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax1_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax1_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i2_1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;R&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i2_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Before we perform the transformation, let&amp;rsquo;s understand what the transformed TensorIR is doing by looking at several loops here.&lt;/p&gt;
&lt;p&gt;First, take a look at the initialization loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax0_init&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_init&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax0_init&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This block initializes the &lt;code&gt;Y&lt;/code&gt; matrix to 0. It does so 8 consecutive elements of a row of &lt;code&gt;Y&lt;/code&gt; at a time, using a &lt;em&gt;vectorized&lt;/em&gt; operation (which might be faster than 8 scalar stores).&lt;/p&gt;
&lt;p&gt;The next loop is a bit tricky:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax1_0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax1_1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_update&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax1_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax1_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This loop is actually performing the matrix multiplication of &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt;. We multiply a row in &lt;code&gt;A&lt;/code&gt; with a column in &lt;code&gt;B&lt;/code&gt; and sum the products into a single number.&lt;/p&gt;
&lt;p&gt;Here, &lt;code&gt;i&lt;/code&gt; is mapped to &lt;code&gt;i1&lt;/code&gt;, which means we access &lt;code&gt;A&lt;/code&gt; one row at a time. &lt;code&gt;k = T.axis.reduce(128, ax1_0 * 4 + ax1_1)&lt;/code&gt; means we walk one row of matrix &lt;code&gt;A&lt;/code&gt; and one column of matrix &lt;code&gt;B&lt;/code&gt; sequentially during the multiplication, while unrolling by 4 in the hope of better access efficiency (\(128 = 32\times 4\)). &lt;code&gt;j = T.axis.spatial(128, i2_0 * 8 + ax0)&lt;/code&gt; just means the columns are accessed sequentially, nothing special.&lt;/p&gt;
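&lt;p&gt;To make the three blocks concrete, here is a minimal NumPy sketch of the same loop structure: initialize a tile of 8 columns, accumulate over the split &lt;code&gt;k&lt;/code&gt; loop, then apply ReLU to the tile. It is only a model of the TensorIR above, not TVM code. The function name &lt;code&gt;bmm_relu_tiled&lt;/code&gt;, its parameters, and the smaller input shapes (chosen so the pure Python loops finish quickly) are assumptions made for illustration.&lt;/p&gt;

```python
import numpy as np


def bmm_relu_tiled(A, B, j_tile=8, k_outer=32, k_inner=4):
    """NumPy model of the transformed loop nest: init, update, then ReLU per tile."""
    batch, rows, depth = A.shape
    cols = B.shape[2]
    assert depth == k_outer * k_inner and cols % j_tile == 0
    Y = np.empty((batch, rows, cols), dtype=A.dtype)
    C = np.empty_like(Y)
    for n in range(batch):                      # i0 (parallel in the TensorIR)
        for i in range(rows):                   # i1
            for j0 in range(cols // j_tile):    # i2_0
                tile = slice(j0 * j_tile, (j0 + 1) * j_tile)
                Y[n, i, tile] = 0               # "M_init": one store of 8 zeros
                for k0 in range(k_outer):       # ax1_0 (serial)
                    for k1 in range(k_inner):   # ax1_1 (unrolled in the TensorIR)
                        k = k0 * k_inner + k1
                        # "M_update": the ax0 loop over 8 columns, as one slice op
                        Y[n, i, tile] += A[n, i, k] * B[n, k, tile]
                # "R": ReLU over the same 8 columns
                C[n, i, tile] = np.maximum(Y[n, i, tile], 0)
    return C


# Hypothetical small inputs; the post uses shape (16, 128, 128).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 16, 128)).astype(np.float32)
B = rng.standard_normal((2, 128, 128)).astype(np.float32)
C = bmm_relu_tiled(A, B)
```

&lt;p&gt;Comparing &lt;code&gt;C&lt;/code&gt; against &lt;code&gt;np.maximum(np.matmul(A, B), 0)&lt;/code&gt; with a small tolerance is a quick way to check that the tiling has not changed the result.&lt;/p&gt;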
&lt;h2 id="perform-transformation"&gt;Perform Transformation&lt;/h2&gt;
&lt;p&gt;To perform a transformation on any TensorIR, it&amp;rsquo;s very &lt;strong&gt;important to follow the steps listed below&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Get block&lt;/li&gt;
&lt;li&gt;Get loops&lt;/li&gt;
&lt;li&gt;Organize loops by split, reorder, compute_at/reverse_compute_at&lt;/li&gt;
&lt;li&gt;Decompose reduction&lt;/li&gt;
&lt;li&gt;vectorize/unroll/parallel&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Applying steps 1, 2, and 3, we first get the block and its loops from the original TensorIR, then split the loops:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tvm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MyBmmRule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Step 1. Get blocks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;block_M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Step 2. Get loops&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Step 3. Organize loops&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;k0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;j0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;factors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The reason we split &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt; this way is twofold. As mentioned, the &lt;code&gt;k&lt;/code&gt; dimension is accessed sequentially but with unrolling by 4. And when matrix &lt;code&gt;Y&lt;/code&gt; is initialized, a vectorized operation over 8 elements is applied along dimension &lt;code&gt;j&lt;/code&gt;, that is, to every 8 elements in one row (TVM is row-major, so this might be faster).&lt;/p&gt;
&lt;p&gt;But the next question is: how do we reorder the split loops? I spent a lot of time trying to figure that out. Turns out the simplest way is to write out the implementation in NumPy and proceed from there. Remember, we&amp;rsquo;ve already split &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt;, which are used during matrix multiplication, so our new matrix multiplication in NumPy would be:&lt;/p&gt;
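&lt;p&gt;As a sanity check on the split factors, the index arithmetic can be modeled in plain Python: a loop of extent 128 split into &lt;code&gt;[outer, inner]&lt;/code&gt; visits the original index &lt;code&gt;outer_idx * inner + inner_idx&lt;/code&gt;. The helper &lt;code&gt;split_indices&lt;/code&gt; below is hypothetical and only mirrors that arithmetic; it is not part of the TVM API.&lt;/p&gt;

```python
def split_indices(extent, factors):
    """Enumerate (outer, inner, fused) for a loop of `extent` split into two factors."""
    outer_extent, inner_extent = factors
    assert outer_extent * inner_extent == extent
    return [
        (outer, inner, outer * inner_extent + inner)
        for outer in range(outer_extent)
        for inner in range(inner_extent)
    ]


# Same factors as the sch.split calls: k into [32, 4], j into [16, 8].
k_triples = split_indices(128, [32, 4])
j_triples = split_indices(128, [16, 8])

# The fused index walks 0..127 in order, each value exactly once.
print([fused for _, _, fused in k_triples] == list(range(128)))
print([fused for _, _, fused in j_triples] == list(range(128)))
```

&lt;p&gt;Both checks should print &lt;code&gt;True&lt;/code&gt;, which is why &lt;code&gt;ax1_0 * 4 + ax1_1&lt;/code&gt; and &lt;code&gt;i2_0 * 8 + ax0&lt;/code&gt; in the transformed TensorIR still cover the full range of &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt;.&lt;/p&gt;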
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;j0&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;j1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;k0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;k0&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;j0&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;j1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because we move to the next column in &lt;code&gt;B&lt;/code&gt; after traversing the previous column, we put &lt;code&gt;j1&lt;/code&gt; in the innermost loop. Therefore, the transformation for TensorIR would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can print out the transformed TensorIR with &lt;code&gt;print(sch.mod.script())&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nd"&gt;@tir.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bmm_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alloc_buffer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;vk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;R&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SSS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we just need to move the ReLU operation (block &lt;code&gt;R&lt;/code&gt;, the &lt;code&gt;for n, i, j in tir.grid(16, 128, 128):&lt;/code&gt; loop) into the loop nest above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;block_M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
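&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;block_R&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;R&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reverse_compute_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_R&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;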
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Step 4 separates initialization from the matrix multiplication, so we use &lt;code&gt;M_init = sch.decompose_reduction(block_M, k0)&lt;/code&gt;, which results in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="nd"&gt;@tir.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bmm_relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# function attr dict&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# body&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# with tir.block(&amp;#34;root&amp;#34;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alloc_buffer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;float32&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j_1_init&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_init&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j_1_init&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_update&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;R&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;SS&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;                        &lt;span class="n"&gt;vj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ax0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vj&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The final step is easy. Apply vectorize/parallel/unroll to the corresponding loops:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_1_init&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M_init&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j_1_init&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i2_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_R&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i2_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;block_M_update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;M_update&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;bmm_relu&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_loops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_M_update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Print out the final TensorIR to find out its final form ( ͡❛ ͜ʖ ͡❛).&lt;/p&gt;</description></item><item><title>Dive into TensorIR</title><link>https://www.bodunhu.com/blog/posts/dive-into-tensorir/</link><pubDate>Sun, 28 Aug 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/dive-into-tensorir/</guid><description>&lt;p&gt;&lt;a href="https://arxiv.org/abs/2207.04296"&gt;TensorIR&lt;/a&gt; is a compiler abstraction for optimizing programs with tensor computation primitives in &lt;a href="https://tvm.apache.org/"&gt;TVM&lt;/a&gt;. Imagine a DNN task as a graph, where each node represents a tensor computation. TensorIR explains how each node/tensor computation primitive in the graph is carried out. This post explains my attempt to implement 2D convolution using TensorIR. It is derived from the &lt;a href="https://mlc.ai/summer22/"&gt;Machine Learning Compilation&lt;/a&gt; course offered by &lt;a href="https://tqchen.com/"&gt;Tianqi Chen&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="implement-2d-convolution"&gt;Implement 2D Convolution&lt;/h2&gt;
&lt;p&gt;2D convolution is a common operation in image processing. The image below captures how 2D convolution operates. I won&amp;rsquo;t go into details here, but you can find plenty of information about convolution online.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/dive-into-tensorIR/2d-conv.png" alt="2D-convolution"&gt;&lt;/p&gt;
&lt;p&gt;First, we initialize both the input matrix and the weight matrix:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# batch, input_channel_dim, image_height, image_width, output_channel_dim, kernel_width &amp;amp; height&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# output_height, output_width, assuming kernel has stride=1 and padding=0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can validate the results using &lt;code&gt;torch.nn.functional.conv2d()&lt;/code&gt; from PyTorch.&lt;/p&gt;
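&lt;p&gt;The output size used above, &lt;code&gt;H - K + 1&lt;/code&gt;, is the stride-1, zero-padding case of the usual convolution output-size formula. The sketch below spells out the general case (without dilation). The helper name &lt;code&gt;conv_output_size&lt;/code&gt; and its &lt;code&gt;stride&lt;/code&gt; and &lt;code&gt;padding&lt;/code&gt; parameters are made up for illustration and are not part of the original setup.&lt;/p&gt;

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a convolution (no dilation)."""
    # Standard formula: floor((in_size + 2 * padding - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1


# Same numbers as in the post: 8x8 input, 3x3 kernel, stride=1, padding=0.
H, W, K = 8, 8, 3
OUT_H, OUT_W = conv_output_size(H, K), conv_output_size(W, K)
print(OUT_H, OUT_W)  # prints "6 6"
```

&lt;p&gt;With the defaults this reduces to &lt;code&gt;H - K + 1&lt;/code&gt;, which is the expression the rest of this post relies on.&lt;/p&gt;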
&lt;p&gt;One thing Tianqi recommended for starters is to write the implementation first in numpy, and then translate the numpy implementation to TensorIR. I started my implementation directly in TensorIR and quickly got confused. So here&amp;rsquo;s how I approached the problem.&lt;/p&gt;
&lt;p&gt;First, and perhaps most importantly, you should figure out the access pattern of the output matrix, and gradually fill in the compute rules for each element in the output matrix. We know the output matrix has a shape of &lt;code&gt;(N, CO, OUT_H, OUT_W)&lt;/code&gt; (which corresponds to batch, number of output channels, output height, and output width). The numpy loop will look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, we access each element in the output matrix one by one and initialize it to 0. Next, we figure out how to compute each element. Each element in the output matrix is the sum of the element-wise product of the 2D convolutional kernel (1 by 3 by 3) and the corresponding area in the input matrix (1 by 3 by 3):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# init to 0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# 2d conv kernel&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# reduction&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can verify the function has the same output as &lt;code&gt;torch.nn.functional.conv2d()&lt;/code&gt; from PyTorch.&lt;/p&gt;
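&lt;p&gt;If PyTorch is not at hand, the loop nest can also be cross-checked against a vectorized numpy formulation. The sketch below is one way to do that, not the course&amp;rsquo;s reference solution. It wraps the loops in a function that reads from the &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;weight&lt;/code&gt; arrays defined earlier and compares the result with a sliding-window version. The function names &lt;code&gt;conv2d_loops&lt;/code&gt; and &lt;code&gt;conv2d_windows&lt;/code&gt; are made up for illustration, and &lt;code&gt;sliding_window_view&lt;/code&gt; assumes numpy 1.20 or newer.&lt;/p&gt;

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Same setup as earlier in the post.
N, CI, H, W, CO, K = 1, 1, 8, 8, 2, 3
data = np.arange(N * CI * H * W).reshape(N, CI, H, W)
weight = np.arange(CO * CI * K * K).reshape(CO, CI, K, K)


def conv2d_loops(data, weight):
    """Loop-based conv2d (stride=1, padding=0), mirroring the loop nest above."""
    n, c_in, height, width = data.shape
    c_out, _, k, _ = weight.shape
    out_h, out_w = height - k + 1, width - k + 1
    Y = np.zeros((n, c_out, out_h, out_w), dtype=data.dtype)
    for b in range(n):
        for co in range(c_out):
            for h in range(out_h):
                for w in range(out_w):
                    for ci in range(c_in):
                        for kh in range(k):
                            for kw in range(k):
                                # reduction over input channel and kernel window
                                Y[b, co, h, w] += data[b, ci, h + kh, w + kw] * weight[co, ci, kh, kw]
    return Y


def conv2d_windows(data, weight):
    """Vectorized conv2d (stride=1, padding=0) used only as a cross-check."""
    k = weight.shape[2]
    # windows has shape (N, CI, OUT_H, OUT_W, K, K)
    windows = sliding_window_view(data, (k, k), axis=(2, 3))
    # Contract CI, kh, kw; the result has shape (N, OUT_H, OUT_W, CO)
    out = np.tensordot(windows, weight, axes=([1, 4, 5], [1, 2, 3]))
    # Move the output-channel axis to position 1: (N, CO, OUT_H, OUT_W)
    return out.transpose(0, 3, 1, 2)


Y = conv2d_loops(data, weight)
assert np.array_equal(Y, conv2d_windows(data, weight))
print(Y.shape)  # prints "(1, 2, 6, 6)"
```

&lt;p&gt;Both versions work on integer arrays, so an exact comparison with &lt;code&gt;np.array_equal&lt;/code&gt; is enough here. With floating-point inputs, &lt;code&gt;np.allclose&lt;/code&gt; would be the safer check.&lt;/p&gt;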
&lt;p&gt;The next part is to translate the numpy code into TensorIR. I won&amp;rsquo;t go into the details of every single line here, but you can find explanations in this &lt;a href="https://mlc.ai/chapter_tensor_program/case_study.html"&gt;note&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The nested loop can be encapsulated using &lt;code&gt;T.grid()&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyConv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;conv2d&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# loop through each elem in the output matrix&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# kernel access pattern&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, we define the block (a basic unit of computation in TensorIR). A block contains a set of block axes &lt;code&gt;(vi, vj, vk)&lt;/code&gt; and computations defined around them. Here, we define the properties of each block axis:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyConv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;conv2d&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# impl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;A&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vc_o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vc_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The outer loop variables all bind to &lt;code&gt;T.axis.spatial()&lt;/code&gt;, because each of them simply indexes one element of the output tensor, and the output is visited element by element (spatially). The inner loop variables, on the other hand, bind to &lt;code&gt;T.axis.reduce()&lt;/code&gt;. Remember, each element in the output tensor is just the sum of the element-wise product of the convolution kernel (1 by 3 by 3) and the corresponding window of the input (1 by 3 by 3). Therefore, after the element-wise multiplication finishes, we need to perform a reduction over all three axes. More concretely, we sum over the row (K), column (K), and channel (CI) axes: (1, 3, 3) -&amp;gt; (1)&lt;/p&gt;
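&lt;p&gt;To make the spatial/reduce split concrete, here is a minimal NumPy sketch of the same computation. It assumes stride 1 and no padding, and the name &lt;code&gt;conv2d_reference&lt;/code&gt; is made up for illustration (it is not part of TVM). The four outer loops play the role of the spatial axes; the &lt;code&gt;np.sum&lt;/code&gt; call collapses the three reduce axes.&lt;/p&gt;

```python
import numpy as np


def conv2d_reference(data, weight):
    """Naive NCHW convolution, assuming stride 1 and no padding."""
    n, ci, h, w = data.shape          # data:   (N, CI, H, W)
    co, _, k, _ = weight.shape        # weight: (CO, CI, K, K)
    out_h, out_w = h - k + 1, w - k + 1
    result = np.zeros((n, co, out_h, out_w), dtype=data.dtype)
    # Spatial axes: one iteration per output element.
    for b in range(n):
        for o in range(co):
            for y in range(out_h):
                for x in range(out_w):
                    # Reduce axes: a (CI, K, K) window collapses to one scalar.
                    window = data[b, :, y:y + k, x:x + k]
                    result[b, o, y, x] = np.sum(window * weight[o])
    return result
```

&lt;p&gt;A reference like this is handy for checking the output of the TensorIR module on small random inputs.&lt;/p&gt;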
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@tvm.script.ir_module&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyConv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@T.prim_func&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;int64&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;func_attr&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;global_symbol&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;conv2d&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;tir.noalias&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# impl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;A&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vc_o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT_H&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spatial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT_W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vc_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vw_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vc_o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# compute rule&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vc_o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vc_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vh&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;vw_h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vw&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;vw_w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vc_o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vc_i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vw_h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vw_w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item><item><title>Pathways: Google's New ML System</title><link>https://www.bodunhu.com/blog/posts/pathways-googles-new-ml-system/</link><pubDate>Thu, 31 Mar 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/pathways-googles-new-ml-system/</guid><description>&lt;aside class="toc"&gt;
&lt;h4&gt;Table of Contents&lt;/h4&gt;
&lt;nav id="TableOfContents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#single-controller"&gt;Single-Controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#multi-controller-systems"&gt;Multi-Controller Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#going-back-to-single-controller"&gt;Going Back to Single-Controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#deadlock"&gt;Deadlock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#solutions-to-deadlock"&gt;Solutions to Deadlock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/nav&gt;
&lt;/aside&gt;
&lt;p&gt;Google recently released a paper about its new ML system called &lt;a href="https://arxiv.org/pdf/2203.12533.pdf"&gt;Pathways&lt;/a&gt;. I was a bit surprised, since I expected it to introduce a brand new model architecture. In fact, this paper is not easy to digest at all. I feel like it&amp;rsquo;s written for people who have spent many years developing ML frameworks. Anyway, we will try to understand why it was developed and how it works. Also, you should check out this &lt;a href="https://zhuanlan.zhihu.com/p/495592456"&gt;post&lt;/a&gt; (in Chinese), which explains many concepts in Pathways much more clearly. Much of the content here is credited to that post.&lt;/p&gt;
&lt;p&gt;The paper spends a long time discussing single-controller and multi-controller systems. It&amp;rsquo;s easy to get lost among terms like SPMD, MPMD, single-controller, and multi-controller. Pathways claims that future ML frameworks should go &lt;strong&gt;back&lt;/strong&gt; to single-controller. By &amp;ldquo;back&amp;rdquo; I mean that ML frameworks were originally single-controller, then adopted multi-controller, and are now returning to single-controller.&lt;/p&gt;
&lt;h2 id="single-controller"&gt;Single-Controller&lt;/h2&gt;
&lt;p&gt;TensorFlow v1 is a classic example of a single-controller system. The high-level idea is that the user defines a dataflow graph through a Python client. This graph is then submitted to the runtime system via &lt;code&gt;session.run&lt;/code&gt;. The system consists of a single master and many workers. The master compiles the dataflow graph submitted by the client and divides it into sub-graphs, then submits those sub-graphs to the workers.&lt;/p&gt;
&lt;p&gt;In this case, each worker computes its own share of the graph. Together, the client and master act as the controller.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/tf1-spmd.png" alt="tf1-spmd"&gt;&lt;/p&gt;
&lt;center&gt;Fig. Control messages (orange lines) need to go through slow DCN between Ctrlr and hosts&lt;/center&gt;
&lt;p&gt;As the paper suggests, dispatching computations in a single-controller system requires communication across the data center network (DCN). All the orange lines are control messages flowing through DCN. We can see the workers are idle for a long time between steps, even though there&amp;rsquo;s no gap between adjacent steps on the controller.&lt;/p&gt;
&lt;p&gt;The controller submits jobs to all workers in each step, then waits for all workers to finish computing their own sub-graphs. The problems are: 1) waiting for all workers to finish computation in a lock-step fashion is inefficient; 2) sending and waiting for control messages (orange lines) is costly, since these messages go through the slow DCN.&lt;/p&gt;
&lt;h2 id="multi-controller-systems"&gt;Multi-Controller Systems&lt;/h2&gt;
&lt;p&gt;In contrast to single-controller systems, multi-controller systems like JAX adopt a different philosophy. Under a multi-controller system, every worker runs the same code, though each may execute a different stage or branch of it. This is why they are called SPMD (single-program-multiple-data) systems.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/jax-spmd.png" alt="jax-spmd"&gt;&lt;/p&gt;
&lt;center&gt;Fig. Dispatching jobs only happens locally on hosts without going through DCN&lt;/center&gt;
&lt;p&gt;Take MPI as an example: every MPI process is an entry point (client) to the program, whereas in single-controller systems only the client and master can be the entry point.&lt;/p&gt;
&lt;p&gt;Since multi-controller systems don&amp;rsquo;t have a centralized coordinator, all workers can initiate communication with each other, using much faster channels such as PCIe or NVLink. In the multi-controller figure, the black dotted lines represent messages between hosts and devices (through PCIe); communication between devices happens through fast NVLink. So we avoid the large overhead introduced by DCN.&lt;/p&gt;
&lt;p&gt;If you want to get a taste of how the PyTorch and TensorFlow v1 (multi-controller vs single-controller) programming styles feel, here are two examples: &lt;a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html"&gt;Writing Distributed Applications with PyTorch&lt;/a&gt; and &lt;a href="https://netweblog.wordpress.com/2018/04/10/distributed-tensorflow-sample-code-and-how-it-works/"&gt;End-to-End Tutorial for Distributed TensorFlow 1.x&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="going-back-to-single-controller"&gt;Going Back to Single-Controller&lt;/h2&gt;
&lt;p&gt;We could stick with multi-controller systems forever. If every worker node shares symmetric workloads and communications (like all-reduce, all-gather, etc.), then there&amp;rsquo;s nothing to be worried about. After all, multi-controller seems much more efficient than single-controller based on what we&amp;rsquo;ve discussed so far.&lt;/p&gt;
&lt;p&gt;However, pipeline parallelism changes the story. Under pipeline parallelism, different workers in the pipeline execute different programs. Thus we have MPMD (multiple-program-multiple-data). For example, we can have one worker doing convolution for batch 1 while another worker is doing encoding work on batch 2. At each stage of the pipeline, the worker is doing a different job on a different data batch (think of a CPU pipeline where each stage is executing a different instruction).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/tf1-non-spmd.png" alt="tf1-non-spmd"&gt;&lt;/p&gt;
&lt;p&gt;Take the above figure as an example, and assume we have three workers 1, 2, 3 from top to bottom. Each worker performs asymmetric workloads and does irregular point-to-point communication (instead of symmetric communication like all-gather). Multi-controller is a poor fit for this kind of workload: how do you write a single copy of code that performs all these irregular communications across multiple processes?&lt;/p&gt;
&lt;p&gt;Thus, Pathways proposes we should go back to single-controller, so that we can let the master node handle all these nasty communication patterns.&lt;/p&gt;
&lt;h2 id="deadlock"&gt;Deadlock&lt;/h2&gt;
&lt;p&gt;Single-controller brings back &lt;em&gt;gang-scheduling&lt;/em&gt; and a &lt;em&gt;centralized coordinator&lt;/em&gt;. The reason to use them is to help prevent deadlocks. However, the rationale behind this design decision is hard to extract from the paper. I&amp;rsquo;m going to use the &lt;a href="https://zhuanlan.zhihu.com/p/495592456"&gt;post&lt;/a&gt; from Jinhui Yan (the developer behind &lt;a href="https://github.com/Oneflow-Inc/oneflow"&gt;OneFlow&lt;/a&gt;) to explain why &lt;em&gt;gang-scheduling&lt;/em&gt; and a &lt;em&gt;centralized coordinator&lt;/em&gt; prevent deadlocks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can think of a computing device as a FIFO task queue (e.g. CUDA streams, TPU, or CPU&amp;hellip;). Each FIFO task queue essentially has a stream of tasks to process.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/fifo.jpg" alt="FIFO-queue"&gt;&lt;/p&gt;
&lt;center&gt;Src. Jinhui Yan&lt;/center&gt;
&lt;p&gt;The paper emphasizes that TPUs are single-threaded and only run non-preemptible kernels. That means we can think of each TPU as a single FIFO task queue. Once a task reaches the head of the queue and starts running, it cannot be preempted. We need to wait until this task finishes its computation before we can execute the next task in the queue. This is a problem!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/deadlock.jpg" alt="deadlock"&gt;&lt;/p&gt;
&lt;center&gt;Src. Jinhui Yan&lt;/center&gt;
&lt;p&gt;Imagine we have two devices (1 and 2), represented as two FIFO queues. Device 1 chooses to enqueue task &lt;code&gt;A&lt;/code&gt; first and then &lt;code&gt;B&lt;/code&gt;; device 2 decides to enqueue task &lt;code&gt;B&lt;/code&gt; first and then &lt;code&gt;A&lt;/code&gt;. Both tasks &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; are performing an all-scatter operation. Therefore, task &lt;code&gt;A&lt;/code&gt; on device 1 needs to wait for messages from task &lt;code&gt;A&lt;/code&gt; on device 2. Similarly, task &lt;code&gt;B&lt;/code&gt; on device 2 needs to wait for messages from task &lt;code&gt;B&lt;/code&gt; on device 1. But task &lt;code&gt;A&lt;/code&gt; on device 2 is stuck behind &lt;code&gt;B&lt;/code&gt;, and task &lt;code&gt;B&lt;/code&gt; on device 1 is stuck behind &lt;code&gt;A&lt;/code&gt;, so neither device can ever make progress.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/deadlock-conditions.png" alt="deadlock-conditions"&gt;&lt;/p&gt;
&lt;p&gt;This is a classical example of deadlock in operating systems.&lt;/p&gt;
&lt;h2 id="solutions-to-deadlock"&gt;Solutions to Deadlock&lt;/h2&gt;
&lt;p&gt;Gang-scheduling helps prevent deadlocks because it enforces a global enqueueing order across multiple FIFO queues, instead of letting each queue handle tasks independently.&lt;/p&gt;
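&lt;p&gt;The difference can be sketched with a toy simulation. This is only an illustrative model, not how Pathways is implemented: each device is a FIFO queue of collective task names, every task is assumed to involve every device, and a collective completes only when it sits at the head of all queues at once. The function names &lt;code&gt;run_collectives&lt;/code&gt; and &lt;code&gt;gang_schedule&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from collections import deque


def run_collectives(queues):
    """Return True if every queue drains, False if the devices deadlock.

    Assumes each task is a collective that involves every device, and that
    the task at the head of a queue is non-preemptible.
    """
    queues = [deque(q) for q in queues]
    while any(queues):
        heads = {q[0] for q in queues if q}
        # Progress requires the same task at the head of every queue.
        if len(heads) != 1 or not all(queues):
            return False
        for q in queues:
            q.popleft()
    return True


def gang_schedule(tasks, num_devices):
    """A central scheduler hands every device the same enqueue order."""
    return [list(tasks) for _ in range(num_devices)]


# Each device picks its own order: A waits on B and B waits on A.
independent = [["A", "B"], ["B", "A"]]
# One global order enforced across both queues.
ganged = gang_schedule(["A", "B"], num_devices=2)
deadlocked = not run_collectives(independent)
completed = run_collectives(ganged)
```

&lt;p&gt;With independent ordering the two heads never match, so the simulation reports a deadlock; with the ganged order both queues drain.&lt;/p&gt;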
&lt;p&gt;The paper also mentions that allowing devices (e.g. GPUs) to execute tasks concurrently can prevent deadlocks. This is because concurrency removes the non-preemption property, which is one of the conditions required for a deadlock to happen.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/pathways/concurrency.jpg" alt="concurrency"&gt;&lt;/p&gt;
&lt;center&gt;Src. Jinhui Yan&lt;/center&gt;
&lt;p&gt;If each device allows concurrent execution (each device has multiple queues), then the task on one queue can be preempted to let the other task start executing, so there is no deadlock. (This is not strictly the case: the &lt;a href="https://zhuanlan.zhihu.com/p/495592456"&gt;post&lt;/a&gt; explains an interesting scenario in &lt;a href="https://developer.nvidia.com/nccl"&gt;NCCL&lt;/a&gt; where deadlocks can still happen if there are too many communications.)&lt;/p&gt;</description></item><item><title>FlexFlow</title><link>https://www.bodunhu.com/blog/posts/flexflow/</link><pubDate>Tue, 22 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/flexflow/</guid><description>&lt;p&gt;&lt;a href="https://flexflow.ai/"&gt;FlexFlow&lt;/a&gt; is a deep learning framework that discovers a fast parallelization strategy for distributed DNN training. It searches the &lt;em&gt;SOAP&lt;/em&gt; (Sample-Operation-Attribute-Parameter) space of parallelization strategies. In short, FlexFlow automates the parallelization of model training.&lt;/p&gt;
&lt;p&gt;The four elements in &lt;em&gt;SOAP&lt;/em&gt; search space represent something that can be sliced into smaller chunks. For example, &lt;em&gt;sample&lt;/em&gt; and &lt;em&gt;parameter&lt;/em&gt; can be thought of as slicing training data and model parameters. &lt;em&gt;Operation&lt;/em&gt; describes how operations (e.g. &lt;code&gt;matmul&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, etc.) can be parallelized. &lt;em&gt;Attribute&lt;/em&gt; further describes how to partition a sample.&lt;/p&gt;
&lt;h2 id="problem-inputs"&gt;Problem Inputs&lt;/h2&gt;
&lt;p&gt;Since FlexFlow is about searching for solutions, the framework is given two inputs: an &lt;strong&gt;operator graph&lt;/strong&gt; \(\mathcal{G}\), which includes all operations and state in a DNN model, and a &lt;strong&gt;device topology&lt;/strong&gt; \(\mathcal{D}\). Both are described as graphs.&lt;/p&gt;
&lt;p&gt;Each node \(o_i \in \mathcal{G}\) is an operation (e.g. &lt;code&gt;matmul&lt;/code&gt;). Each edge \((o_i, o_j) \in \mathcal{G}\) is a tensor. In contrast, each node \(d_i \in \mathcal{D}\) is a computing device, and each edge \((d_i, d_j) \in \mathcal{D}\) is a hardware connection (e.g. NVLink, network link, etc.). Each edge is also labeled with its bandwidth and latency.&lt;/p&gt;
&lt;p&gt;The FlexFlow optimizer takes the operator graph \(\mathcal{G}\) and the device topology graph \(\mathcal{D}\) and searches for a parallelization strategy, which is then handed to a distributed runtime.&lt;/p&gt;
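&lt;p&gt;As a rough picture of these two inputs, the sketch below encodes both graphs as plain Python dictionaries. Everything here is hypothetical: the node names, the bandwidth and latency numbers, and the helper &lt;code&gt;transfer_time&lt;/code&gt; are made up for illustration and are not FlexFlow&amp;rsquo;s actual API. The cost formula (latency plus size divided by bandwidth) is likewise a simple assumption about how the edge labels could be used.&lt;/p&gt;

```python
# Operator graph G: nodes are operations, edges are tensors (illustrative names).
operator_graph = {
    "nodes": ["conv1", "matmul1", "add1"],
    "edges": [("conv1", "matmul1"), ("matmul1", "add1")],
}

# Device topology D: nodes are devices, edges are hardware links labeled
# with bandwidth and latency (numbers are invented for this example).
device_topology = {
    "nodes": ["gpu0", "gpu1"],
    "edges": {
        ("gpu0", "gpu1"): {"bandwidth_bytes_per_s": 1e10, "latency_s": 1e-5},
    },
}


def transfer_time(topology, src, dst, num_bytes):
    """Estimate seconds to move a tensor over one link (assumed cost model)."""
    if src == dst:
        return 0.0
    links = topology["edges"]
    # Links are treated as undirected, so look up either orientation.
    link = links.get((src, dst)) or links[(dst, src)]
    return link["latency_s"] + num_bytes / link["bandwidth_bytes_per_s"]
```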
&lt;h2 id="how-to-search-for-parallelization-strategies"&gt;How to search for parallelization strategies&lt;/h2&gt;
&lt;p&gt;Ultimately, FlexFlow is trying to achieve two things: find parallelization configuration on the operator graph \(\mathcal{G}\), and map the output the device topology \(\mathcal{D}\).&lt;/p&gt;
&lt;p&gt;For an operation \(o_i\), it is given &lt;em&gt;parallelizable dimensions&lt;/em&gt; \(\mathcal{P}_i\), which is the set of all divisible dimensions in its &lt;em&gt;output&lt;/em&gt; tensor. The &lt;a href="https://arxiv.org/pdf/1807.05358.pdf"&gt;paper&lt;/a&gt; provides a 1D convolution example:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/flexflow/parallelization.png" alt="parallelization"&gt;&lt;/p&gt;
&lt;p&gt;For data parallelism, we can see the input data is split into smaller micro-batches. In model parallelism, the batch dimension remains the same, while the model is split and each part handles the same input data. The intuition is that, for a given tensor, there exist many ways to divide it.&lt;/p&gt;
&lt;p&gt;A parallelization configuration \(c_i\) assigns a degree of parallelism to each dimension in \(\mathcal{P}_i\). The product of these degrees, written \(|c_i|\), is the total number of pieces the output tensor is divided into.&lt;/p&gt;
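&lt;p&gt;As a small illustration of \(|c_i|\): if a configuration gives each parallelizable dimension a degree of parallelism, the number of pieces is the product of those degrees. The dimension names and degrees below are made up, not taken from the paper.&lt;/p&gt;

```python
from itertools import product
from math import prod

# Hypothetical configuration for one operation: a degree of parallelism
# for each parallelizable dimension of its output tensor.
config = {"sample": 2, "channel": 1, "length": 4}

# |c_i| is the product of the degrees over all parallelizable dimensions.
num_tasks = prod(config.values())

# Each piece is identified by one index per dimension.
task_ids = list(product(*(range(degree) for degree in config.values())))

print(num_tasks)  # 8
```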
&lt;p&gt;Each parallelization configuration \(c_i\) partitions the operation \(o_i\) into \(|c_i|\) &lt;em&gt;tasks&lt;/em&gt; (denoted as \(t_{i:1}, \ldots, t_{i:|c_i|}\)). Each task represents a divided operation and is assigned to a device. The paper claims that, given the output tensor of a task and its operation type, we can infer the input tensors needed to execute each task. It gives an example of dividing the &lt;code&gt;matmul&lt;/code&gt; operation:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/flexflow/matmul.png" alt="divided-matmul"&gt;&lt;/p&gt;
&lt;p&gt;Given that the output tensor is split along its sample (batch) dimension and its feature dimension, and that the task type is &lt;code&gt;matmul&lt;/code&gt;, we can use this information to infer the slices of the input tensors \(X\) and \(W\) that each task needs.&lt;/p&gt;
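&lt;p&gt;The inference for &lt;code&gt;matmul&lt;/code&gt; can be sketched in a few lines. This is a toy illustration, not FlexFlow code: the helper names are hypothetical, the matrices are made up, and the dimensions are assumed to divide evenly. An output tile along the sample dimension needs the matching rows of \(X\), and an output tile along the feature dimension needs the matching columns of \(W\).&lt;/p&gt;

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def input_slices(batch, out_features, sample_degree, feature_degree, i, j):
    """Rows of X and columns of W needed by output tile (i, j) of Y = X W."""
    rows = batch // sample_degree          # assumes even divisibility
    cols = out_features // feature_degree  # assumes even divisibility
    return slice(i * rows, (i + 1) * rows), slice(j * cols, (j + 1) * cols)


# X has shape (batch=4, in_features=3); W has shape (in_features=3, out_features=6).
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
W = [[1, 0, 2, 0, 3, 0], [0, 1, 0, 2, 0, 3], [1, 1, 1, 1, 1, 1]]
Y = matmul(X, W)

# Split the output 2 ways along the sample dimension and 3 ways along the feature dimension.
x_rows, w_cols = input_slices(4, 6, sample_degree=2, feature_degree=3, i=1, j=2)

# The task only needs a slice of each input to produce its tile of the output.
tile = matmul(X[x_rows], [row[w_cols] for row in W])
expected = [row[w_cols] for row in Y[x_rows]]
assert tile == expected
```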
&lt;script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"&gt;&lt;/script&gt;
&lt;script&gt;
(function() {
if (window.mermaidInitialized) return;
window.mermaidInitialized = true;
const isDark = window.matchMedia('(prefers-color-scheme: dark)').matches;
mermaid.initialize({
theme: isDark ? 'dark' : 'default',
startOnLoad: true,
securityLevel: 'strict'
});
})();
&lt;/script&gt;
&lt;div class="mermaid"&gt;
graph TD;
Operator-Graph--&gt;Parallelization-Strategy;
Device-Topology--&gt;Parallelization-Strategy;
&lt;/div&gt;
&lt;p&gt;The parallelization configurations \(c_i\) for each operation \(o_i\) are combined into a final strategy \(\mathcal{S}\).&lt;/p&gt;
&lt;h2 id="building-task-graph"&gt;Building Task Graph&lt;/h2&gt;
&lt;p&gt;Now that we have the operator graph \(\mathcal{G}\), the device topology graph \(\mathcal{D}\), and the parallelization strategy \(\mathcal{S}\), we can construct the &lt;em&gt;task graph&lt;/em&gt;.&lt;/p&gt;
&lt;div class="mermaid"&gt;
graph TD;
Operator-Graph--&gt;Task-Graph;
Device-Topology--&gt;Task-Graph;
Parallelization-Strategy--&gt;Task-Graph;
&lt;/div&gt;
&lt;p&gt;In essence, the task graph specifies the dependencies between computation and communication tasks. The task graph is denoted as \(\mathcal{T} = (\mathcal{T}_N , \mathcal{T}_E)\), where \(\mathcal{T}_N\) is the set of tasks and \(\mathcal{T}_E\) is the set of dependencies between them. If two dependent tasks are assigned to the same computation device (e.g. the same GPU), no communication task is required. Otherwise, we add a communication task to \(\mathcal{T}_N\), with edges in \(\mathcal{T}_E\) linking it to both tasks. For example, given an operator graph with a set of configurations \(\mathcal{S}\):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/flexflow/operator-graph.png" alt="operator-graph"&gt;&lt;/p&gt;
&lt;p&gt;The task graph reflects the logical dependencies between tasks:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_pics/posts/flexflow/task-graph.png" alt="task-graph"&gt;&lt;/p&gt;
&lt;p&gt;Each computation task is also marked with its average execution time &lt;code&gt;exeTime&lt;/code&gt; (from running on the real device multiple times). A communication task&amp;rsquo;s &lt;code&gt;exeTime&lt;/code&gt; is calculated by dividing the tensor size by the bandwidth.&lt;/p&gt;
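&lt;p&gt;Both estimates are simple enough to sketch. The function names and all numbers below are hypothetical, and the communication estimate ignores link latency, as the description above does.&lt;/p&gt;

```python
from statistics import mean


def compute_exe_time(measured_runs):
    """Average of several measured runs of the same task on the real device."""
    return mean(measured_runs)


def comm_exe_time(tensor_bytes, bandwidth_bytes_per_s):
    """Estimated transfer time: tensor size divided by link bandwidth."""
    return tensor_bytes / bandwidth_bytes_per_s


# Three made-up timings (in seconds) for one computation task.
compute_time = compute_exe_time([1.9e-3, 2.1e-3, 2.0e-3])

# A 64 x 1000 float32 tensor (4 bytes per element) over a 12.5 GB/s link.
tensor_bytes = 64 * 1000 * 4
comm_time = comm_exe_time(tensor_bytes, 12.5e9)
```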
&lt;h2 id="use-simulation-to-estimate-execution-overhead"&gt;Use Simulation to Estimate Execution Overhead&lt;/h2&gt;
&lt;p&gt;Now that we have the task graph with all dependencies specified, it&amp;rsquo;s time to evaluate (or simulate) the execution time of the whole task graph.&lt;/p&gt;
&lt;p&gt;In essence, we know how a model is &lt;strong&gt;partitioned&lt;/strong&gt; and &lt;strong&gt;placed&lt;/strong&gt; in a cluster; what remains is to figure out how to &lt;strong&gt;schedule&lt;/strong&gt; the execution.&lt;/p&gt;
&lt;p&gt;The simplest way to simulate the task graph execution is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Given a task graph, any task nodes that have no inputs (meaning they represent the first layers of the neural network) are put into a ready queue, waiting to be executed.&lt;/li&gt;
&lt;li&gt;Next, we dequeue the task with the earliest ready time (the time it was enqueued). It starts at its ready time or at the finish time of the task previously executed on the same device, whichever is later.&lt;/li&gt;
&lt;li&gt;After this task finishes its (simulated) execution, we look at the tasks that depend on it. If all of a dependent task&amp;rsquo;s predecessors have finished execution, that task is put into the ready queue.&lt;/li&gt;
&lt;/ul&gt;
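&lt;p&gt;The steps above can be sketched as a small event-driven loop. This is a minimal illustration under simplifying assumptions, not the paper&amp;rsquo;s implementation: each device (including a communication link) runs one task at a time, and the task names, devices, and timings are made up.&lt;/p&gt;

```python
import heapq
from collections import defaultdict


def simulate(exe_time, device, deps):
    """Return the simulated finish time of a task graph.

    exe_time: task -> execution time
    device:   task -> device (or link) the task is assigned to
    deps:     task -> list of tasks it depends on
    """
    children = defaultdict(list)
    pending = {t: len(deps.get(t, [])) for t in exe_time}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    # Step 1: tasks with no inputs are ready at time 0.
    ready = [(0.0, t) for t in sorted(exe_time) if pending[t] == 0]
    heapq.heapify(ready)
    device_free = defaultdict(float)  # when each device finishes its last task
    end = {}

    while ready:
        # Step 2: take the task with the earliest ready time; it starts once
        # it is ready and its device has finished the previous task.
        ready_time, t = heapq.heappop(ready)
        start = max(ready_time, device_free[device[t]])
        end[t] = start + exe_time[t]
        device_free[device[t]] = end[t]

        # Step 3: a dependent task becomes ready when its last predecessor finishes.
        for child in children[t]:
            pending[child] -= 1
            if pending[child] == 0:
                child_ready = max(end[p] for p in deps[child])
                heapq.heappush(ready, (child_ready, child))

    return max(end.values())


# Hypothetical task graph: a and b share gpu0; a's output is sent to c on gpu1.
exe_time = {"a": 2.0, "b": 1.0, "a_to_c": 0.5, "c": 3.0}
device = {"a": "gpu0", "b": "gpu0", "a_to_c": "link01", "c": "gpu1"}
deps = {"a_to_c": ["a"], "c": ["a_to_c"]}

total = simulate(exe_time, device, deps)
print(total)  # 5.5
```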
&lt;p&gt;However, we haven&amp;rsquo;t seen how the task graph \(\mathcal{T}\) changes once we update the configuration of an operation node \(o_i\). FlexFlow proposes a new parallelization strategy by changing the configuration of a single operation \(o_i\) at a time. Therefore, whenever we generate a new configuration for an operator, we only need to re-simulate the tasks involved in the portion of the execution timeline that changes. This means we can derive the new task graph from the previous one, which speeds up the simulation process.&lt;/p&gt;
&lt;h2 id="execution-optimizer"&gt;Execution Optimizer&lt;/h2&gt;
&lt;p&gt;Previously, we assumed the parallelization strategy is generated through some black box function. In fact, the &lt;em&gt;execution optimizer&lt;/em&gt; is in charge of taking an operator graph and a device topology as inputs and finding an efficient parallelization strategy.&lt;/p&gt;
&lt;p&gt;The optimizer uses a Markov chain Monte Carlo (MCMC) method to sample parallelization configurations. It uses the simulated cost as an oracle, so that newly proposed configurations are more likely to be sampled from those with a lower simulated cost. This method is very greedy, but the authors argue it can still escape from local minima.&lt;/p&gt;</description></item><item><title>Add Mermaid to Hugo with Dark Mode</title><link>https://www.bodunhu.com/blog/posts/add-mermaid-to-hugo-with-dark-mode/</link><pubDate>Tue, 15 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/add-mermaid-to-hugo-with-dark-mode/</guid><description>&lt;p&gt;Recently, I was revisiting materials in Deep Learning, and I needed a tool that generates diagrams easily. Drawing the graphs from scratch and uploading them individually to an image hosting platform is a daunting process. This is where &lt;a href="https://github.com/mermaid-js/mermaid"&gt;Mermaid&lt;/a&gt; comes to the rescue. Now I can generate diagrams directly using Markdown. Here&amp;rsquo;s how to do it inside a &lt;a href="https://gohugo.io/"&gt;Hugo&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;I use the &lt;a href="https://github.com/LukasJoswiak/etch"&gt;etch&lt;/a&gt; theme, but this process should apply to all sites using Hugo. First, we create a new file &lt;code&gt;/layouts/shortcodes/mermaid.html&lt;/code&gt;. We fill &lt;code&gt;mermaid.html&lt;/code&gt; with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-html" data-lang="html"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;script&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;script&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;script&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;isDark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;matchMedia&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;(prefers-color-scheme: dark)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;mermaidTheme&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isDark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;dark&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;default&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;mermaidConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;mermaidTheme&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;logLevel&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;fatal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;securityLevel&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;strict&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;startOnLoad&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;arrowMarkerAbsolute&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;er&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;diagramPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;layoutDirection&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;TB&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;minEntityWidth&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;minEntityHeight&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;entityPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;stroke&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;honeydew&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;fontSize&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;useMaxWidth&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;flowchart&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;diagramPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;htmlLabels&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;curve&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;basis&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;sequence&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;diagramMarginX&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;diagramMarginY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;actorMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;boxMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;boxTextMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;noteMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;messageMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;messageAlign&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;center&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;mirrorActors&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;bottomMarginAdj&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;useMaxWidth&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;rightAngles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;showSequenceNumbers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;gantt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;titleTopMargin&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;barHeight&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;barGap&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;topPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;leftPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;gridLineStartPadding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;fontSize&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;fontFamily&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#34;Open-Sans&amp;#34;, &amp;#34;sans-serif&amp;#34;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;numberSectionStyles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;axisFormat&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;%Y-%m-%d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;topAxis&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;mermaid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mermaidConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;script&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This setup allows us to change Mermaid-generated diagrams&amp;rsquo; theme based on the website&amp;rsquo;s current (light/dark) theme. This configuration is borrowed from the &lt;a href="https://github.com/mermaid-js/mermaid/blob/develop/docs/Setup.md"&gt;Setup.md&lt;/a&gt; from mermaid-js (except the &lt;code&gt;theme&lt;/code&gt; part). You can find more information there about configuring mermaid.&lt;/p&gt;
&lt;p&gt;You can also do this in &lt;code&gt;/partials&lt;/code&gt;, but it will slow down page loads because the mermaid js file is always loaded, regardless of whether you are actually using mermaid.&lt;/p&gt;
&lt;p&gt;Next, we add the following lines to the file &lt;code&gt;/layouts/shortcodes/mermaid.html&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-html" data-lang="html"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;center&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;mermaid&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; {{.Inner}}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;center&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Feel free to remove the &lt;code&gt;&amp;lt;center&amp;gt;&lt;/code&gt; tag if you want to customize the diagram&amp;rsquo;s layout. And&amp;hellip; we are done!&lt;/p&gt;
&lt;p&gt;Here is an example sequenceDiagram. You should see that this diagram will adjust its theme accordingly based on light/dark mode. We use the example code from &lt;a href="https://mermaid-js.github.io/mermaid/#/"&gt;mermaid doc&lt;/a&gt; (just uncomment &lt;code&gt;mermaid&lt;/code&gt; in the shortcode &lt;code&gt;{{/*&amp;lt; mermaid &amp;gt;*/}}&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{{/*&amp;lt; mermaid &amp;gt;*/}}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sequenceDiagram
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; participant Alice
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; participant Bob
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Alice-&amp;gt;&amp;gt;John: Hello John, how are you?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; loop Healthcheck
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; John-&amp;gt;&amp;gt;John: Fight against hypochondria
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; end
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Note right of John: Rational thoughts &amp;lt;br/&amp;gt;prevail!
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; John--&amp;gt;&amp;gt;Alice: Great!
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; John-&amp;gt;&amp;gt;Bob: How about you?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Bob--&amp;gt;&amp;gt;John: Jolly good!
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{{/*&amp;lt; /mermaid &amp;gt;*/}}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"&gt;&lt;/script&gt;
&lt;script&gt;
(function() {
if (window.mermaidInitialized) return;
window.mermaidInitialized = true;
const isDark = window.matchMedia('(prefers-color-scheme: dark)').matches;
mermaid.initialize({
theme: isDark ? 'dark' : 'default',
startOnLoad: true,
securityLevel: 'strict'
});
})();
&lt;/script&gt;
&lt;div class="mermaid"&gt;
sequenceDiagram
participant Alice
participant Bob
Alice-&gt;&gt;John: Hello John, how are you?
loop Healthcheck
John-&gt;&gt;John: Fight against hypochondria
end
Note right of John: Rational thoughts &lt;br/&gt;prevail!
John--&gt;&gt;Alice: Great!
John-&gt;&gt;Bob: How about you?
Bob--&gt;&gt;John: Jolly good!
&lt;/div&gt;
&lt;p&gt;This diagram adjusts its theme to match the light/dark mode. You can find more features on the Mermaid &lt;a href="https://mermaid-js.github.io/mermaid/#/"&gt;website&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Cross Entropy Loss</title><link>https://www.bodunhu.com/blog/posts/cross-entropy-loss/</link><pubDate>Sun, 13 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/cross-entropy-loss/</guid><description>&lt;p&gt;Many deep learning tasks involve classification, where a model outputs a probability for each possible label. The goal is to correctly predict a given input&amp;rsquo;s label. Mathematically, this means assigning the highest probability to the correct label. The probabilities are generated through a process called &lt;a href="https://d2l.ai/chapter_linear-networks/softmax-regression.html#softmax-operation"&gt;softmax&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The softmax function outputs a vector \(\hat{y}\), which represents the estimated conditional probabilities of each class given an input \(x\). For example, \(\hat{y}_1 = P(y=\textrm{car}\ |\ x)\). Assume we have \(k\) examples with features \(x^{(i)}\) and corresponding labels \(y^{(i)}\). Then the probability the model assigns to all the labels can be expressed succinctly as&lt;/p&gt;
&lt;p&gt;\[
P(Y\ |\ X) = \prod^{k}_{i=1} P(y^{(i)} | \ x^{(i)})
\]&lt;/p&gt;
&lt;p&gt;Our goal is to maximize \(P(Y | X)\). This is equivalent to minimizing the negative log-likelihood \( -\textrm{log} P(Y\ |\ X) = \sum^{k}_{i=1} -\textrm{log} P(y^{(i)} | \ x^{(i)}) \).&lt;/p&gt;
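&lt;p&gt;To spell out the step between the two expressions (a short derivation that uses only the definitions above): the logarithm is monotonically increasing, so maximizing \(P(Y | X)\) is the same as maximizing \(\log P(Y | X)\), which is the same as minimizing its negative. The logarithm also turns the product into a sum:&lt;/p&gt;
&lt;p&gt;\[
-\log P(Y\ |\ X) = -\log \prod^{k}_{i=1} P(y^{(i)} | \ x^{(i)}) = \sum^{k}_{i=1} -\log P(y^{(i)} | \ x^{(i)})
\]&lt;/p&gt;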
&lt;p&gt;This loss function is called the &lt;em&gt;cross-entropy loss&lt;/em&gt;. It is widely used in classification tasks. Our objective is to reduce the value of this loss function, which is equivalent to maximizing the predicted probability for the correct label.&lt;/p&gt;
&lt;p&gt;To see why this works, let&amp;rsquo;s take a toy example. Suppose we have three classes. Our model produces a vector of three probabilities for each input.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# two probability vectors, one for each of two inputs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The labels are represented as indices into the probability vectors in &lt;code&gt;y_hat&lt;/code&gt;. Indexing with them gives us the predicted probability for the correct label.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we implement the cross-entropy loss function as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, we calculate the loss value for our given probability vectors:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The result is &lt;code&gt;array([2.30258509, 0.69314718])&lt;/code&gt;. In the first output &lt;code&gt;[0.1, 0.3, 0.6]&lt;/code&gt;, the label is at index 0. But our model gives max probability to index 2, and only \(0.1\) to the label, thus the greater loss value. In the second probability vector &lt;code&gt;[0.2, 0.3, 0.5]&lt;/code&gt;, we made the right prediction as we give the max probability to index 2 corresponding to the label, thus the smaller loss value.&lt;/p&gt;</description></item><item><title>Maximum Likelihood for Classification</title><link>https://www.bodunhu.com/blog/posts/maximum-likelihood-for-classification/</link><pubDate>Mon, 24 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/maximum-likelihood-for-classification/</guid><description>&lt;p&gt;Let&amp;rsquo;s say we want to classify an input text \(y\) and give it a label \(x\). Formally, we want to find:&lt;/p&gt;
&lt;p&gt;\[
\textrm{argmax} P(x | y)
\]&lt;/p&gt;
&lt;p&gt;By Bayes&amp;rsquo; rule this is the same as&lt;/p&gt;
&lt;p&gt;\[
\textrm{argmax} \frac{P(y|x)P(x)}{P(y)}
\]&lt;/p&gt;
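&lt;p&gt;A short derivation may help here. It is a sketch that assumes the usual naive Bayes setup, where the words \(W = (w_1, \dots, w_n)\) of \(y\) are treated as conditionally independent given the label. The evidence \(P(y)\) does not depend on \(x\), so it can be dropped from the maximization:&lt;/p&gt;
&lt;p&gt;\[
\underset{x}{\textrm{argmax}}\ P(x | y) = \underset{x}{\textrm{argmax}}\ P(y | x)P(x) = \underset{x}{\textrm{argmax}}\ P(x) \prod_{i=1}^{n} p(w_i | x)
\]&lt;/p&gt;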
&lt;p&gt;Suppose we have four labeled documents as training data and a fifth document, \(d_5\), as the test input. Our objective is to assign a label to the test document.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/MLE/text-example.png" alt="text-example"&gt;&lt;/p&gt;
&lt;center&gt;Credit: Eunsol Choi&lt;/center&gt;
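&lt;p&gt;Let&amp;rsquo;s define the probability of a class \(x\) as the fraction of training documents carrying that label (\(count(x)\) is the number of training documents labeled \(x\), and \(N\) is the total number of training documents)&lt;/p&gt;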
&lt;p&gt;\[
p(x) = \frac{count(x)}{N}
\]&lt;/p&gt;
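&lt;p&gt;and the probability of a word \(w_i\) appearing given a class label, with add-one smoothing. Here \(count(w_i,x)\) is the number of times \(w_i\) occurs in documents of class \(x\), \(count(x)\) in the denominator is the total number of word tokens in those documents, and \(|V|\) is the vocabulary size:&lt;/p&gt;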
&lt;p&gt;\[
p(w_i|x) = \frac{count(w_i,x) + 1}{count(x) + |V|}
\]&lt;/p&gt;
&lt;p&gt;The conditional probabilities \(p(w_i|x)\) are&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/MLE/conditional_prob.png" alt="conditional-probabilities"&gt;&lt;/p&gt;
&lt;p&gt;Now, we want to find out which language label we should assign to the sentence &amp;ldquo;Chinese Chinese Chinese Tokyo Japan&amp;rdquo;. This is the same as asking which label \(x\) we should pick so that \(P(W|x)P(x)\) yields the greatest value. Because \(x\) ranges over a small, discrete set of labels, we can simply evaluate \(P(W|x)P(x)\) for each candidate label and keep the largest.&lt;/p&gt;
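&lt;p&gt;The comparison can be sketched in a few lines of Python. The training documents below are an assumption, since the figure&amp;rsquo;s contents are not reproduced in the text; they were picked so that the prior and smoothed conditional probabilities agree with the values used in this post. The names &lt;code&gt;train&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, and &lt;code&gt;best&lt;/code&gt; are illustrative only.&lt;/p&gt;

```python
from collections import Counter
from math import prod

# Assumed training set (not taken from the figure): four labeled documents,
# three with label "c" and one with label "j".
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

# Vocabulary, per-class document counts, and per-class word counts.
vocab = {w for text, _ in train for w in text.split()}
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(text.split())


def score(label):
    """Unnormalized posterior P(W|x)P(x) with add-one smoothing."""
    prior = doc_counts[label] / len(train)
    total_tokens = sum(word_counts[label].values())
    likelihood = prod(
        (word_counts[label][w] + 1) / (total_tokens + len(vocab))
        for w in test_doc
    )
    return prior * likelihood


scores = {label: score(label) for label in doc_counts}
best = max(scores, key=scores.get)
print(scores, best)
```

&lt;p&gt;Here &lt;code&gt;scores&lt;/code&gt; holds one unnormalized value per label and &lt;code&gt;best&lt;/code&gt; is the label with the largest one. For long documents it is common to sum log-probabilities instead of multiplying, to avoid underflow.&lt;/p&gt;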
&lt;p&gt;If we label the sentence as j (Japanese), we have \(P(j | d_5) \propto \frac{1}{4}\cdot (\frac{2}{9})^3\cdot \frac{2}{9}\cdot \frac{2}{9} \approx 0.0001\). If we calculate \(P(c|d_5)\) the same way, we get roughly 0.0003, which is the larger value, so we assign the sentence the label c.&lt;/p&gt;</description></item><item><title>Machine Learning System Resources</title><link>https://www.bodunhu.com/blog/posts/machine-learning-system-resources/</link><pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/machine-learning-system-resources/</guid><description>&lt;p&gt;This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think there&amp;rsquo;s something worth mentioning. I will try to update this page frequently to include the most recent stuff in MLSys.&lt;/p&gt;
&lt;h2 id="resources"&gt;Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/facebookresearch/metaseq"&gt;Facebook&amp;rsquo;s external large-scale work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html"&gt;NGC Container Doc&lt;/a&gt;: great for development, without having to manually install CUDA, PyTorch, and other dependencies.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning"&gt;Awesome-System-for-Machine-Learning&lt;/a&gt;: A curated list of research in machine learning systems (MLSys). Paper notes are also provided.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/"&gt;Mastering LLM Techniques: Inference Optimization&lt;/a&gt;: summary of techniques used for LLM deployment&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bbycroft.net/llm"&gt;LLM Visualization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="courses"&gt;Courses&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dlsyscourse.org/"&gt;Deep Learning Systems: Algorithms and Implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlc.ai/summer22/"&gt;Machine Learning Compilation&lt;/a&gt;: offered by &lt;a href="https://tqchen.com/"&gt;Tianqi Chen&lt;/a&gt;, an intro to ML compilers. Open to all.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://catalyst.cs.cmu.edu/15-884-mlsys-sp21/"&gt;15-884: Machine Learning Systems&lt;/a&gt;: offered by &lt;a href="https://tqchen.com/"&gt;Tianqi Chen&lt;/a&gt; at CMU.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cseweb.ucsd.edu/classes/wi19/cse291-f/"&gt;CSE 291F: Advanced Data Analytics and ML Systems&lt;/a&gt;: offered by &lt;a href="https://cseweb.ucsd.edu/~arunkk/"&gt;Arun Kumar&lt;/a&gt; at UCSD.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tqchen/tinyflow"&gt;Tinyflow&lt;/a&gt;: tutorial code on how to build your own Deep Learning System in 2k Lines.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dlsys.cs.washington.edu/"&gt;CSE 599W: Systems for ML&lt;/a&gt;: offered by &lt;a href="https://tqchen.com/"&gt;Tianqi Chen&lt;/a&gt; at UW.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/document/d/1aLkd6Nxhoa9s7_zf8AiT9dcFBxtM3oJAywRHBMHHvmg/edit"&gt;CS8803-SMR: Special Topics: Systems for Machine Learning&lt;/a&gt;: offered by &lt;a href="https://faculty.cc.gatech.edu/~atumanov/"&gt;Alexey Tumanov&lt;/a&gt; at Georgia Tech. [&lt;a href="https://docs.google.com/spreadsheets/d/14sRj5WJ1P0UpZULLj5ysS5Avl1czeAVUv47gEr40nu4/edit"&gt;schedule&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pages.cs.wisc.edu/~akella/CS744/F17/"&gt;CS 744: Big Data Systems&lt;/a&gt;: offered by &lt;a href="https://www.cs.utexas.edu/~akella/"&gt;Aditya Akella&lt;/a&gt; back at UW-Madison.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stanford-cs329s.github.io/"&gt;CS 329S: Machine Learning Systems Design&lt;/a&gt;: offered by Stanford.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mosharaf/eecs598/tree/w21-ai"&gt;EECS 598: Systems for AI&lt;/a&gt;: offered by &lt;a href="https://www.mosharaf.com/"&gt;Mosharaf Chowdhury&lt;/a&gt; at UMich.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.utexas.edu/~akella/CS378/F24/index.html"&gt;CS 378: Systems for Machine Learning&lt;/a&gt;: offered by &lt;a href="https://www.cs.utexas.edu/~akella/"&gt;Aditya Akella&lt;/a&gt; at UT.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ucbrise.github.io/cs294-ai-sys-fa19/"&gt;Machine Learning Systems (Fall 2019)&lt;/a&gt;: from UCB&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.utexas.edu/neeraja/sysml-computer-systems-and-machine-learning-interplay-spring-2023/"&gt;ECE 382V SysML: Computer Systems and Machine Learning Interplay&lt;/a&gt;: taught by &lt;a href="https://sites.utexas.edu/neeraja/"&gt;Neeraja Yadwadkar&lt;/a&gt; at UT.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hao-ai-lab.github.io/cse234-w25/"&gt;CSE 234 Data Systems for Machine Learning from UCSD&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stanford-cs336.github.io/spring2025/"&gt;CS336: Language Modeling from Scratch from Stanford&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="labs--faculties"&gt;Labs &amp;amp; Faculty&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://catalyst.cs.cmu.edu/"&gt;CMU Catalyst&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rise.cs.berkeley.edu/"&gt;Berkeley RISE Lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://dsail.csail.mit.edu/"&gt;MIT DSAIL Lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sampl.cs.washington.edu/"&gt;UW SAMPL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://symbioticlab.org/"&gt;SymbioticLab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://shivaram.org"&gt;Shivaram Venkataraman at UW-Madison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://utns.cs.utexas.edu/"&gt;UTNS Lab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.utexas.edu/neeraja/"&gt;Neeraja Yadwadkar at UT Austin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://faculty.cc.gatech.edu/~atumanov/index.html#researchgroup"&gt;SAIL: Systems for Artificial Intelligence Lab @ GT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/people/amar/"&gt;Amar Phanishayee @ MSR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hongzhangblaze.github.io/"&gt;Hong Zhang @ University of Waterloo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://people.csail.mit.edu/ghobadi/"&gt;Manya Ghobadi @ MIT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wuklab.io/"&gt;WukLab @ UCSD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hao-ai-lab.github.io/"&gt;Hao AI Lab @ UCSD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://people.cs.uchicago.edu/~junchenj/"&gt;Junchen Jiang&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="tutorials"&gt;Tutorials&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://jalammar.github.io/illustrated-transformer/"&gt;The Illustrated Transformer&lt;/a&gt;: the best introduction to the Transformer.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://d2l.ai/"&gt;Dive into Deep Learning&lt;/a&gt;: interactive deep learning book with code, math, and discussions.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs231n.github.io/"&gt;CS231n: Convolutional Neural Networks for Visual Recognition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.ezyang.com/2019/05/pytorch-internals/"&gt;Pytorch-Internals&lt;/a&gt;: must-read for PyTorch basics.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ericmjl.github.io/dl-workshop/index.html"&gt;Differential Programming with JAX&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://roberttlange.github.io/posts/2020/03/blog-post-10/"&gt;Getting started with JAX (MLPs, CNNs &amp;amp; RNNs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://physicsbaseddeeplearning.org/intro.html"&gt;Physics-based Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openmlsys.github.io/"&gt;机器学习系统：设计和实现&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tvm.d2l.ai/"&gt;Dive into Deep Learning Compiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jax.readthedocs.io/en/latest/autodidax.html"&gt;Autodidax: JAX core from scratch&lt;/a&gt;: a really good resource for learning JAX internals.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dfm/extending-jax"&gt;Extending JAX with custom C++ and CUDA code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cs231n.stanford.edu/vecDerivs.pdf"&gt;Vector, Matrix, and Tensor Derivatives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dlsys.cs.washington.edu/pdf/lecture9.pdf"&gt;ML Memory Optimization&lt;/a&gt;: slides from UW. Visualization of dataflow graph helps understand how to optimize memory.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/652193676"&gt;Pytorch模型加速系列（一）——新的Torch-TensorRT以及TorchScript/FX/dynamo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.semianalysis.com/"&gt;Semianalysis&lt;/a&gt;: many good posts&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/ibm-data-ai/how-to-load-pytorch-models-340-times-faster-with-ray-8be751a6944c"&gt;How to Load PyTorch Models 340 Times Faster with Ray&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/30976469"&gt;分布式深度学习系统&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/104444471"&gt;MLsys各方向综述&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.zhihu.com/people/jin-xue-feng/columns"&gt;金雪锋&lt;/a&gt;: MindSpore 技术负责人&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/495592456"&gt;解读谷歌Pathways架构（一）：Single-controller与Multi-controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://petewarden.com/2021/12/24/why-are-ml-compilers-so-hard/"&gt;Why are ML Compilers so Hard?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html"&gt;Alpa: Automated Model-Parallel Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.telesens.co/2022/04/23/data-transfer-speed-comparison-ray-plasma-store-vs-s3/"&gt;Data Transfer Speed Comparison: Ray Plasma Store vs. S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.zhihu.com/column/giantpandacv"&gt;从零开始学深度学习编译器&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fkong.tech/posts/2023-05-20-dynamo/"&gt;一文搞懂 TorchDynamo 原理&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="llm-optimization"&gt;LLM Optimization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://abderrahmanskiredj.github.io/the-illustrated-grpo/"&gt;The Illustrated GRPO: Group Relative Policy Optimization Explained&lt;/a&gt;: very good tutorial showing how GRPO works with examples.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="communication"&gt;Communication&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev-discuss.pytorch.org/t/pytorch-symmetricmemory-harnessing-nvlink-programmability-with-ease/2798"&gt;PyTorch SymmetricMemory: Harnessing NVLink Programmability with Ease&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2511.15076"&gt;GPU-Initiated Networking for NCCL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2504.19442"&gt;Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="seminars"&gt;Seminars&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlsys.stanford.edu/"&gt;Stanford MLSys Seminar&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="papers"&gt;Papers&lt;/h2&gt;
&lt;p&gt;This section could potentially be extremely long&amp;hellip;&lt;/p&gt;
&lt;h3 id="training"&gt;Training&lt;/h3&gt;
&lt;p&gt;Really broad topic&amp;hellip;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2407.21783"&gt;The Llama 3 Herd of Models&lt;/a&gt;: really great paper explaining how SOTA models are trained in real world!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="llm"&gt;LLM&lt;/h2&gt;
&lt;p&gt;You can also refer to &lt;a href="https://github.com/Hannibal046/Awesome-LLM"&gt;Awesome-LLM&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lilianweng.github.io/posts/2023-01-10-inference-optimization/"&gt;Large Transformer Model Inference Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2303.17580.pdf"&gt;HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face&lt;/a&gt;: uses an LLM as the controller to coordinate with external models for complicated tasks&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2309.10285.pdf"&gt;Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity&lt;/a&gt;: load as sparse, compute as dense&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2312.04916.pdf"&gt;EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism&lt;/a&gt;: Overlapping LLM forward with KV caching computation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2401.02669.pdf"&gt;Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache&lt;/a&gt;: managing KV cache in distributed settings.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2310.07707"&gt;MatFormer: Nested Transformer for Elastic Inference&lt;/a&gt;: adaptive blocks during inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="nas"&gt;NAS&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2411.19146"&gt;Puzzle: Distillation-Based NAS for Inference-Optimized LLMs&lt;/a&gt;: Applying block-wise local distillation to every alternative subblock replacement in parallel and scoring its quality and inference cost to build a &amp;ldquo;library&amp;rdquo; of blocks. Then, using Mixed-Integer-Programming to assemble a heterogeneous architecture that optimizes quality under constraints such as throughput, latency and memory usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diffusion"&gt;Diffusion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2502.01776"&gt;Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2502.06155"&gt;EFFICIENT-VDIT: Efficient Video Diffusion Transformers with Attention Tile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2411.19108"&gt;Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="kv-cache"&gt;KV Cache&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2310.07240"&gt;CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2309.06180.pdf"&gt;Efficient Memory Management for Large Language Model Serving with PagedAttention&lt;/a&gt;: (vLLM) allows non-contiguous KV cache&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2312.07104.pdf"&gt;Efficiently Programming Large Language Models using SGLang&lt;/a&gt;: RadixAttention, which allows KV cache sharing across prompts in prompt engineering&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openreview.net/pdf?id=uNrFpDPMyo"&gt;Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs&lt;/a&gt;: dynamically drops cached KV tokens for long sequences.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2309.17453.pdf"&gt;Efficient Streaming Language Models with Attention Sinks&lt;/a&gt;: (StreamingLLM)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2205.14135"&gt;FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2311.01282"&gt;FlashDecoding++: Faster Large Language Model Inference on GPUs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/blog/flexattention/"&gt;FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2405.04437"&gt;vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="datasets"&gt;Datasets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2309.11998"&gt;LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset&lt;/a&gt;: real-world conversation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="ml-compilers"&gt;ML Compilers&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1802.04799.pdf"&gt;TVM: An Automated End-to-End Optimizing Compiler for Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1805.08166.pdf"&gt;Learning to Optimize Tensor Programs&lt;/a&gt;: facilitate efficient ML kernel search using ML.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1901.10008"&gt;The OoO VLIW JIT Compiler for GPU Inference&lt;/a&gt;: JIT Compiler to enable better GPU multiplexing.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2102.13267"&gt;LazyTensor: combining eager execution with domain-specific compilers&lt;/a&gt;: combining dynamic graph with JIT. &lt;a href="https://zhuanlan.zhihu.com/p/383547872"&gt;summary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cs.utexas.edu/~bornholt/papers/quantized-cgo20.pdf"&gt;Automatic Generation of High-Performance Quantized Machine Learning Kernels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://proceedings.mlsys.org/paper/2022/file/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Paper.pdf"&gt;DietCode: Automatic Optimization for Dynamic Tensor Programs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://proceedings.mlsys.org/paper/2022/file/d3d9446802a44259755d38e6d163e820-Paper.pdf"&gt;The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2002.03794.pdf"&gt;The Deep Learning Compiler: A Comprehensive Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gnnsys.github.io/papers/GNNSys21_paper_10.pdf"&gt;Graphiler: A Compiler for Graph Neural Networks&lt;/a&gt;: ML compiler specifically designed for GNN.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="graph-optimization"&gt;Graph Optimization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2009.13062.pdf"&gt;Accelerating Multi-Model Inference by Merging DNNs of Different Weights&lt;/a&gt;: combine multiple instances of a model into one computational graph&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="inference"&gt;Inference&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.usenix.org/system/files/atc21-romero.pdf"&gt;INFaaS: Automated Model-less Inference Serving&lt;/a&gt;: developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2207.00032.pdf"&gt;DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale&lt;/a&gt;: transformer-specific inference optimization done by DeepSpeed.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf"&gt;Clipper: A Low-Latency Online Prediction Serving System&lt;/a&gt;: a nice overview of inference systems. Not SOTA but a good starter.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2209.00159.pdf"&gt;Orloj: Predictably Serving Unpredictable DNNs&lt;/a&gt;: similar to Clipper, but targets models that may yield unpredictable performance.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2302.11665.pdf"&gt;AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving&lt;/a&gt;: how to multiplex devices to serve multiple models while meeting latency constraints.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2310.18481"&gt;MOSEL: Inference Serving Using Dynamic Modality Selection&lt;/a&gt;: dynamic modality selections for accuracy and SLO tradeoff&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2404.18322"&gt;BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models&lt;/a&gt;: finer-grained LLM serving&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.usenix.org/conference/osdi22/presentation/yu"&gt;Orca: A Distributed Serving System for Transformer-Based Generative Models&lt;/a&gt;: iteration-based LLM inference&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2406.01566"&gt;Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs&lt;/a&gt;: inference on heterogeneous accelerators using max-flow&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2401.09670v2"&gt;DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving&lt;/a&gt;: automatically configures inter- and intra-node parallelism separately for the prefill and decoding phases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="multitenancy"&gt;Multitenancy&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3542929.3563510"&gt;MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2203.09040.pdf"&gt;A Survey of Multi-Tenant Deep Learning Inference on GPU&lt;/a&gt;: efficient resource management for multi-tenant inference.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dl.acm.org/doi/pdf/10.1145/3458817.3476143"&gt;Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction&lt;/a&gt;: overlap DNN operators of different models in an online fashion&lt;/li&gt;
&lt;li&gt;&lt;a href="https://utns.cs.utexas.edu/papers/nsdi20-themis.pdf"&gt;THEMIS: Fair and Efficient GPU Cluster Scheduling&lt;/a&gt;: minimize the maximum finish time fairness across all ML apps while efficiently utilizing cluster GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="dynamic-neural-network"&gt;Dynamic Neural Network&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2102.04906.pdf"&gt;Dynamic Neural Networks: A Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2204.00102.pdf"&gt;Dynamic Multimodal Fusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2106.04426.pdf"&gt;Hash Layers For Large Sparse Models&lt;/a&gt;: Using hashing for MoE gating.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2205.12755.pdf"&gt;An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems&lt;/a&gt;: uses an evolutionary algorithm to update the model structure during the training phase.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2404.03865"&gt;FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping&lt;/a&gt;: dynamically skips FFN layers in Transformers using cosine similarity.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2307.02628.pdf"&gt;SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference&lt;/a&gt;: monotonically decrease LLM layers to ease KV cache management&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="auto-placement"&gt;Auto Placement&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2201.12023.pdf"&gt;Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1807.05358.pdf"&gt;Beyond Data and Model Parallelism for Deep Neural Networks&lt;/a&gt;: FlexFlow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="reasoning-llm"&gt;Reasoning LLM&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2501.12948"&gt;DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2505.00949"&gt;Llama-Nemotron: Efficient Reasoning Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="federated-learning"&gt;Federated Learning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1907.09693"&gt;A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="switch--ml"&gt;Switch &amp;amp; ML&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cl.cam.ac.uk/~nz247/publications/xiong2019dream.pdf"&gt;Do Switches Dream of Machine Learning? Toward In-Network Classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2002.08987.pdf"&gt;Taurus: A Data Plane Architecture for Per-Packet ML&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="memory-management"&gt;Memory Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2104.07857.pdf"&gt;ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="system-design"&gt;System Design&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2203.12533.pdf"&gt;Pathways: Asynchronous Distributed Dataflow for ML&lt;/a&gt;: Google&amp;rsquo;s new DL systems, specifically designed for TPU.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/1712.05889.pdf"&gt;Ray: A Distributed Framework for Emerging AI Applications&lt;/a&gt;: RiseLab&amp;rsquo;s new distributed system. Using shared memory for data communication.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2110.14883.pdf"&gt;Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2110.15032.pdf"&gt;OneFlow: Redesign the Distributed Deep Learning Framework from Scratch&lt;/a&gt;: shares many similarities with Google&amp;rsquo;s Pathways.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="trade-off"&gt;Trade-off&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2208.06102.pdf"&gt;Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training&lt;/a&gt;: find trade-offs between DNN training performance optimization and energy consumption, by configuring batch size and GPU power limit.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="structured-llm-generation"&gt;Structured LLM Generation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2307.09702"&gt;Efficient Guided Generation for Large Language Models&lt;/a&gt;: Outlines&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2212.06094"&gt;Prompting Is Programming: A Query Language for Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="async-training"&gt;Async Training&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2208.03306"&gt;Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models&lt;/a&gt;: training individual LMs before merging&lt;/li&gt;
&lt;li&gt;&lt;a href="DistBelief"&gt;Large Scale Distributed Deep Networks&lt;/a&gt;: parameter server&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="self-play"&gt;Self-play&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/GAIR-NLP/O1-Journey/blob/main/resource/report.pdf"&gt;O1 Replication Journey: A Strategic Progress Report&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="costs"&gt;Costs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2406.18665"&gt;RouteLLM: Learning to Route LLMs with Preference Data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="rag"&gt;RAG&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2404.16130"&gt;From Local to Global: A Graph RAG Approach to Query-Focused Summarization&lt;/a&gt;: GraphRAG, compressing information into graphs so retrieval may contain more information&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2407.08223"&gt;Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting&lt;/a&gt;: SpeculativeRAG: use smaller models with a subset of retrieved documents for faster draft generation, and use the larger LM for verification.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2311.09210"&gt;Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models&lt;/a&gt;: summarizes the retrieved documents instead of passing them along as is.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2401.05856"&gt;Seven Failure Points When Engineering a Retrieval Augmented Generation System&lt;/a&gt;: a study showing how different configurations in different retrieval stages may affect generation quality.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2305.06983"&gt;Active Retrieval Augmented Generation&lt;/a&gt;: determine when to perform retrieval, and how to induce the query for retrieval&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2409.05591"&gt;MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Megatron with FastMoE</title><link>https://www.bodunhu.com/blog/posts/megatron-with-fastmoe/</link><pubDate>Wed, 01 Dec 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/megatron-with-fastmoe/</guid><description>&lt;p&gt;This is a guide on setting up &lt;a href="https://github.com/NVIDIA/Megatron-LM"&gt;Megatron-LM&lt;/a&gt; with &lt;a href="https://github.com/laekov/fastmoe"&gt;FastMoE&lt;/a&gt;. Megatron is a transformer developed by the Applied Deep Learning Research team at NVIDIA. FastMoE enables PyTorch support for the Mixture of Experts (MoE) models. We use the FastMoE layer to replace the MLP layers in the transformer language model.&lt;/p&gt;
&lt;h2 id="prerequisites"&gt;Prerequisites&lt;/h2&gt;
&lt;h3 id="docker"&gt;Docker&lt;/h3&gt;
&lt;p&gt;We recommend using one of &lt;a href="https://ngc.nvidia.com/catalog/containers/nvidia:pytorch"&gt;NGC&amp;rsquo;s recent PyTorch containers&lt;/a&gt;. The Megatron-LM repo uses &lt;code&gt;pytorch:20.12-py3&lt;/code&gt;. We pull the image with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;docker pull nvcr.io/nvidia/pytorch:20.12-py3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note: it&amp;rsquo;s possible to use the &lt;a href="https://hub.docker.com/r/pytorch/pytorch"&gt;official PyTorch image&lt;/a&gt;. However, it is missing a few dependencies, which have to be installed manually. Also, PyTorch versions newer than 1.8 seem to have problems during the forward pass, so we don&amp;rsquo;t use the official PyTorch image here.&lt;/p&gt;
&lt;p&gt;After the image is pulled successfully, we want to start a container. The &lt;a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch"&gt;NGC site&lt;/a&gt; contains instructions on how to start a docker image. We use the following script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;docker run --gpus all -it --rm --ipc&lt;span class="o"&gt;=&lt;/span&gt;host -v /home/edwardhu/:/home/edwardhu/ --name pytorch-moe &amp;lt;image_id&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note: we might run into problems when starting up the docker container. Make sure to set up the GPG key and remote repo for the &lt;code&gt;nvidia-docker2&lt;/code&gt; package on the host and install the required packages:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;. /etc/os-release&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$ID$VERSION_ID&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey &lt;span class="p"&gt;|&lt;/span&gt; sudo apt-key add - &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl -s -L https://nvidia.github.io/nvidia-docker/&lt;span class="nv"&gt;$distribution&lt;/span&gt;/nvidia-docker.list &lt;span class="p"&gt;|&lt;/span&gt; sudo tee /etc/apt/sources.list.d/nvidia-docker.list
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo apt-get update
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo apt-get install -y nvidia-docker2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo systemctl restart docker
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="set-up-fastmoe"&gt;Set up FastMoE&lt;/h3&gt;
&lt;p&gt;After we spin up the container, we clone the &lt;a href="https://github.com/laekov/fastmoe"&gt;fastmoe repo&lt;/a&gt; and enter the project directory. There is a &lt;code&gt;setup.py&lt;/code&gt; file in the root of the project, so we execute:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;USE_NCCL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; python setup.py install
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to install FastMoE. In our case, compilation fails with an error saying that the definition of &lt;code&gt;broadcastUniqueNCCLID(&amp;amp;ncclID)&lt;/code&gt; cannot be found. There is a condition check right above the failing call:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-cpp" data-lang="cpp"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(TORCH_VERSION_MAJOR) &amp;amp;&amp;amp; (TORCH_VERSION_MAJOR &amp;gt; 1 || \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt; (TORCH_VERSION_MAJOR == 1 &amp;amp;&amp;amp; TORCH_VERSION_MINOR &amp;gt;= 8))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For some reason, the check fails even though the container has PyTorch version &lt;code&gt;1.8.0a0+1606899&lt;/code&gt;. &lt;a href="https://github.com/laekov/fastmoe/issues/93"&gt;According to the author&lt;/a&gt;, the &lt;code&gt;if&lt;/code&gt; macro is there to handle PyTorch&amp;rsquo;s API differences between v1.7.x and v1.8.x. For now, we simply comment out the &lt;code&gt;if&lt;/code&gt; check and force the four-argument call &lt;code&gt;broadcastUniqueNCCLID(&amp;amp;ncclID, c10d::OpType::SEND, &amp;quot;fastmoe_nccl_comm&amp;quot;, rank);&lt;/code&gt; to be used instead of the one-argument &lt;code&gt;broadcastUniqueNCCLID(&amp;amp;ncclID)&lt;/code&gt; call:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-cpp" data-lang="cpp"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;//#if defined(TORCH_VERSION_MAJOR) &amp;amp;&amp;amp; (TORCH_VERSION_MAJOR &amp;gt; 1 || \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// (TORCH_VERSION_MAJOR == 1 &amp;amp;&amp;amp; TORCH_VERSION_MINOR &amp;gt;= 8))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;broadcastUniqueNCCLID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ncclID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;c10d&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;OpType&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SEND&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s"&gt;&amp;#34;fastmoe_nccl_comm&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;//#else
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;//broadcastUniqueNCCLID(&amp;amp;ncclID);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;//#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ncclComm_t&lt;/span&gt; &lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;NCCL_SAFE_CALL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ncclCommInitRank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;getSize&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ncclID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;comm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, we need to download a vocab file for later use since the Megatron repo doesn&amp;rsquo;t include one. Here, we use the vocab file from the &lt;a href="https://github.com/microsoft/SDNet/tree/master/bert_vocab_files"&gt;SDNet repo&lt;/a&gt;. Feel free to use something else.&lt;/p&gt;
&lt;h3 id="megatron-lm-setup"&gt;Megatron-LM Setup&lt;/h3&gt;
&lt;p&gt;After we set up FastMoE, we clone the &lt;a href="https://github.com/NVIDIA/Megatron-LM"&gt;Megatron-LM&lt;/a&gt; repo into the container. &lt;a href="https://github.com/laekov/fastmoe/tree/master/examples/megatron"&gt;FastMoE&amp;rsquo;s example guide on Megatron&lt;/a&gt; uses the Megatron &lt;code&gt;v2.2&lt;/code&gt; release, so we need to check out the &lt;code&gt;v2.2&lt;/code&gt; tag in the Megatron repo.&lt;/p&gt;
&lt;p&gt;Next, we follow &lt;a href="https://github.com/laekov/fastmoe/tree/master/examples/megatron"&gt;FastMoE&amp;rsquo;s guide on Megatron&lt;/a&gt; and apply &lt;code&gt;clip-grad-v2.2.patch&lt;/code&gt; and &lt;code&gt;fmoefy-v2.2.patch&lt;/code&gt; accordingly. Instructions on how to apply patches in Linux are easy to find; here is &lt;a href="https://www.cyberciti.biz/faq/appy-patch-file-using-patch-command/"&gt;one example&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="race-dataset"&gt;RACE Dataset&lt;/h3&gt;
&lt;p&gt;After setting up Megatron-LM, we download the &lt;a href="https://www.cs.cmu.edu/~glai1/data/race/"&gt;RACE dataset&lt;/a&gt; for fine-tuning on a downstream task. RACE is used for BERT evaluation; the Megatron repo also has several other examples using GPT, but here we stick to BERT. The &lt;a href="https://github.com/NVIDIA/Megatron-LM"&gt;Megatron repo&lt;/a&gt; also provides instructions on how to acquire these datasets for evaluation. For now, we just want to get the fine-tuning process up and running, without caring much about accuracy. Therefore, we don&amp;rsquo;t need to pre-train the BERT model just yet. After the dataset finishes downloading, we simply need to decompress it.&lt;/p&gt;
&lt;h3 id="summury"&gt;Summary&lt;/h3&gt;
&lt;p&gt;The key change that converts a model to FastMoE style is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Initialize FastMoE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fmoefy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fmoe.megatron&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch_forward_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch_model_provider&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;forward_step_func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;patch_forward_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;forward_step_func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;patch_model_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;More information can be found in the fmoefy patch &lt;a href="https://github.com/laekov/fastmoe/blob/master/examples/megatron/fmoefy-v2.2.patch"&gt;file&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Set up Slurm across Multiple Machines</title><link>https://www.bodunhu.com/blog/posts/set-up-slurm-across-multiple-machines/</link><pubDate>Tue, 16 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/set-up-slurm-across-multiple-machines/</guid><description>&lt;p&gt;To install &lt;a href="https://slurm.schedmd.com/documentation.html"&gt;Slurm&lt;/a&gt;, we need admin access to the machines. This post explains how I got Slurm running across multiple Linux servers. All servers run Ubuntu 18.04 LTS.&lt;/p&gt;
&lt;h2 id="setup-munge"&gt;Setup Munge&lt;/h2&gt;
&lt;p&gt;First, we need to make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster. We need to create two users, &lt;code&gt;slurm&lt;/code&gt; and &lt;code&gt;munge&lt;/code&gt;, on all servers.&lt;/p&gt;
&lt;p&gt;Then, we install &lt;a href="https://linux.die.net/man/7/munge"&gt;Munge&lt;/a&gt; for authentication:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ apt install munge libmunge2 libmunge-dev
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To test if munge is installed successfully:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ munge -n &lt;span class="p"&gt;|&lt;/span&gt; unmunge &lt;span class="p"&gt;|&lt;/span&gt; grep STATUS
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;STATUS: Success &lt;span class="o"&gt;(&lt;/span&gt;0&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, we create a munge authentication key on one of the servers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ /usr/sbin/create-munge-key
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After we generate the munge authentication key, we copy the key &lt;code&gt;/etc/munge/munge.key&lt;/code&gt; from that server to all other servers (overwriting the existing &lt;code&gt;/etc/munge/munge.key&lt;/code&gt; on each of them).&lt;/p&gt;
&lt;p&gt;We need to set up the ownership and permissions for munge accordingly on every server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ chmod &lt;span class="m"&gt;0700&lt;/span&gt; /etc/munge/ /var/log/munge/ /var/lib/munge/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ chmod &lt;span class="m"&gt;0755&lt;/span&gt; /run/munge/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we enable and start the munge service with (remember to not use sudo when running munge):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; munge
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ systemctl start munge
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can then test whether munge works properly by executing:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;munge -n &lt;span class="c1"&gt;# Generate a credential on stdout&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;munge -n &lt;span class="p"&gt;|&lt;/span&gt; unmunge &lt;span class="c1"&gt;# Displays information about the MUNGE key &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;munge -n &lt;span class="p"&gt;|&lt;/span&gt; ssh somehost unmunge
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If everything is set up properly, you shouldn&amp;rsquo;t see any error messages.&lt;/p&gt;
&lt;h2 id="setup-slurm"&gt;Setup Slurm&lt;/h2&gt;
&lt;p&gt;Use &lt;code&gt;apt&lt;/code&gt; to install slurm on Ubuntu systems (make sure all nodes have the same slurm version):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ apt install slurm-wlm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, we need to configure slurm. Since we used the package manager to install slurm, the installed version is older than the latest release. Thus, it&amp;rsquo;s preferable not to use the official &lt;a href="https://slurm.schedmd.com/configurator.html"&gt;Slurm Configuration Tool&lt;/a&gt;, which targets the latest release. Instead, we can find the configuration tool matching the installed version at &lt;code&gt;/usr/share/doc/slurmctld/slurm-wlm-configurator.html&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After filling in the required fields in the form, we copy the generated file to &lt;code&gt;/etc/slurm-llnl/slurm.conf&lt;/code&gt; on all nodes. Then, you can execute &lt;code&gt;sinfo&lt;/code&gt; to check the status of all nodes. You can also launch a job to see if everything actually works, for example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;srun -N2 -l /bin/hostname
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This should print out the hostnames of the two nodes the job ran on.&lt;/p&gt;
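&lt;p&gt;For reference, the node and partition section of the generated &lt;code&gt;slurm.conf&lt;/code&gt; might look roughly like the sketch below. The cluster name, host names, CPU count, and partition name are hypothetical placeholders rather than values from my setup; the real file comes out of the configurator form and must be identical on every node. &lt;code&gt;ControlMachine&lt;/code&gt; is the option name used by older releases such as the one packaged for Ubuntu 18.04; newer releases use &lt;code&gt;SlurmctldHost&lt;/code&gt; instead.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Hypothetical values; replace with your own cluster and host names
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ClusterName=mycluster
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ControlMachine=node1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;NodeName=node[1-2] CPUs=16 State=UNKNOWN
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;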
&lt;h2 id="add-gpu-support"&gt;Add GPU support&lt;/h2&gt;
&lt;p&gt;To add GPU support, we first create a file &lt;code&gt;gres.conf&lt;/code&gt; in &lt;code&gt;/etc/slurm-llnl/&lt;/code&gt;. Here is an example on one node:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Name=gpu File=/dev/nvidia0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Name=gpu File=/dev/nvidia1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Name=gpu File=/dev/nvidia2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, we add &lt;code&gt;GresTypes=gpu&lt;/code&gt; into &lt;code&gt;/etc/slurm-llnl/slurm.conf&lt;/code&gt;. Next, we add the GPU information to &lt;code&gt;slurm.conf&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;NodeName=node1 Gres=gpu:3 State=UNKNOWN
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item><item><title>Paper Review - Dynamic Tensor Rematerialization</title><link>https://www.bodunhu.com/blog/posts/paper-review-dynamic-tensor-rematerialization/</link><pubDate>Tue, 09 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/paper-review-dynamic-tensor-rematerialization/</guid><description>&lt;p&gt;Dynamic Tensor Rematerialization (&lt;a href="https://arxiv.org/pdf/2006.09616.pdf"&gt;DTR&lt;/a&gt;) treats GPU memory as a large cache, where tensors can be evicted to save memory, and recomputed if needed later.&lt;/p&gt;
&lt;p&gt;DTR&amp;rsquo;s eviction policy relies on the heuristic \(h\). The heuristic assigns a value \(h(t)\) to each resident tensor \(t\), approximating the cost of evicting the tensor. DTR evicts the tensor with the lowest cost based on the value of \(h\). \(h\) can factor in arbitrary metadata.&lt;/p&gt;
&lt;p&gt;During every operator call in &lt;a href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt;, DTR intercepts the call and performs the following tasks:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/DTR/DTR-operator-intercept.png#center" alt="DTR-operator-intercept"&gt;&lt;/p&gt;
&lt;p&gt;In short, whenever we perform an operation, we first recursively recompute all the non-resident tensors the current operation depends on, evicting tensors we don&amp;rsquo;t need until there is enough GPU memory available. To decide which tensor to evict, DTR picks the one with the lowest value of \(h\):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/DTR/tensor-evict.png#center" alt="tensor-evict"&gt;&lt;/p&gt;
&lt;p&gt;The heuristic \(h\) scores tensors based on three properties: staleness, size, and compute cost. It favors evicting tensors that are least recently used, take up a lot of GPU memory, and are cheap to recompute. \(h _{DTR}\) is computed as:&lt;/p&gt;
&lt;p&gt;\[
h _{DTR}(s, m, c) (t) := \frac{c(t)}{m(t) \cdot s(t)}
\]&lt;/p&gt;
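&lt;p&gt;As a quick sanity check with made-up numbers (in arbitrary but consistent units, not measurements from the paper), take a tensor \(a\) with \(c(a) = 2\), \(m(a) = 100\), \(s(a) = 10\) and a tensor \(b\) with \(c(b) = 50\), \(m(b) = 10\), \(s(b) = 2\):&lt;/p&gt;
&lt;p&gt;\[
h _{DTR}(a) = \frac{2}{100 \cdot 10} = 0.002, \qquad h _{DTR}(b) = \frac{50}{10 \cdot 2} = 2.5
\]&lt;/p&gt;
&lt;p&gt;Since \(h _{DTR}(a)\) is much smaller, \(a\) (large, stale, and cheap to recompute) is evicted before \(b\).&lt;/p&gt;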
&lt;p&gt;Recomputing an evicted tensor \(t\) may require recomputing many more evicted tensors that \(t\) recursively depends on. This set of tensors is called the &lt;em&gt;evicted neighborhood \(e ^{*} (t)\)&lt;/em&gt;. The paper therefore proposes an improved heuristic that takes these recursive recomputations into account (at a higher maintenance cost).&lt;/p&gt;
&lt;p&gt;\[
h _{DTR-improved}(s, m, c) (t) := \frac{c(t) + \sum _{u \in e ^{*} (t)} c(u)}{m(t) \cdot s(t)}
\]&lt;/p&gt;
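&lt;p&gt;To illustrate with made-up numbers: suppose \(c(t) = 2\), \(m(t) = 100\), \(s(t) = 10\), and the evicted neighborhood \(e ^{*} (t)\) holds two tensors with costs 40 and 60. Counting only \(c(t)\) gives \(2 / (100 \cdot 10) = 0.002\), whereas including the neighborhood gives:&lt;/p&gt;
&lt;p&gt;\[
h _{DTR-improved}(t) = \frac{2 + 40 + 60}{100 \cdot 10} = 0.102
\]&lt;/p&gt;
&lt;p&gt;The score is 51 times higher, so \(t\) is much less likely to be evicted once the hidden cost of rebuilding its dependencies is counted.&lt;/p&gt;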
&lt;p&gt;This heuristic captures the recomputation costs for all tensors that \(t\) recursively depends on.&lt;/p&gt;</description></item><item><title>Paper Review - Capuchin: Tensor-based GPU Memory Management for Deep Learning</title><link>https://www.bodunhu.com/blog/posts/paper-review-capuchin-tensor-based-gpu-memory-management-for-deep-learning/</link><pubDate>Sun, 07 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/paper-review-capuchin-tensor-based-gpu-memory-management-for-deep-learning/</guid><description>&lt;p&gt;This &lt;a href="https://dl.acm.org/doi/pdf/10.1145/3373376.3378505"&gt;paper&lt;/a&gt; aims to reduce GPU memory usage during DNN training. Capuchin achieves this goal through &lt;em&gt;swapping&lt;/em&gt; and &lt;em&gt;recomputation&lt;/em&gt;, using the &lt;em&gt;tensor&lt;/em&gt; as the unit of operation. The major question is how to balance swapping and recomputation to maximize resource utilization.&lt;/p&gt;
&lt;h2 id="swap-and-recomputation-benefit"&gt;Swap and Recomputation Benefit&lt;/h2&gt;
&lt;p&gt;The ultimate goal of swapping and recomputation is to hide the overhead as much as possible to minimize the wait time of &lt;em&gt;back-access&lt;/em&gt; (a tensor evicted earlier being accessed again). For swapping, we should increase the overlap between swapping and computing; for recomputation, we should use cheap operations.&lt;/p&gt;
&lt;h3 id="determining-tensor-re-generation-cost"&gt;Determining Tensor Re-generation Cost&lt;/h3&gt;
&lt;p&gt;For swapping, it is usually not optimal to swap a tensor back in only when we access it. The reason is that copying a tensor from CPU memory to GPU memory usually introduces overhead greater than the computation itself. It&amp;rsquo;s thus better to swap in a tensor &lt;em&gt;earlier&lt;/em&gt; or &lt;em&gt;proactively&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The paper calls this an &lt;em&gt;in-trigger&lt;/em&gt;: another tensor access, located between the &lt;em&gt;evicted-access&lt;/em&gt; (the access that triggers the tensor&amp;rsquo;s own eviction once it has been used in the computation) and the &lt;em&gt;back-access&lt;/em&gt;, is used to start bringing the evicted tensor back a little earlier.&lt;/p&gt;
&lt;p&gt;Of course, this may raise two questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How do we know when &lt;em&gt;in-trigger&lt;/em&gt; should happen?&lt;/li&gt;
&lt;li&gt;How do we deal with PCIe lane interference? For example, a swap-in may start later than its in-trigger because a previous swap-in has not finished yet.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The answer is quite simple. We use runtime feedback at the back-access of a tensor. If the tensor is still being swapped in at that point, the in-trigger should be moved earlier. Note that this relies on the assumption of a &lt;strong&gt;regular tensor access pattern in deep learning training&lt;/strong&gt;, as illustrated in the paper.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/capuchin/tensor-access-pattern.png#center" alt="tensor-access-pattern"&gt;&lt;/p&gt;
&lt;p&gt;Recomputation, on the other hand, is performed only in an &lt;em&gt;on-demand&lt;/em&gt; manner. No in-trigger is used for recomputation.&lt;/p&gt;
&lt;p&gt;Capuchin relies on the principle that swapping can be largely overlapped with computation, while recomputation will certainly incur a performance penalty. Thus, it prefers swapping until no in-trigger can be chosen that fully hides the prefetching overhead.&lt;/p&gt;
&lt;p&gt;One thing to note here: if we select a tensor \(T\) to be recomputed but \(T\) depends on another tensor that has been evicted, we need to recompute starting from the parent of that evicted tensor instead. This can happen multiple times if more recomputations target tensor \(T\). In short, recomputation and swapping cannot occur at the same time.&lt;/p&gt;
&lt;p&gt;For more information, please refer to the original &lt;a href="https://dl.acm.org/doi/pdf/10.1145/3373376.3378505"&gt;paper&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Starting Out PhD</title><link>https://www.bodunhu.com/blog/posts/starting-out-phd/</link><pubDate>Fri, 05 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/starting-out-phd/</guid><description>&lt;p&gt;Today marks the third month of my PhD life. Things finally start to become a little bit clearer. I finally have some potentially concrete ideas to work on.&lt;/p&gt;
&lt;p&gt;Finding a research topic was the most difficult part. For several months, I was wandering around like a headless chicken, reading paper after paper: serverless, ML inference, compilers, pathlet routing, RDMA, you name it. The feeling of not having a topic was suffocating.&lt;/p&gt;
&lt;p&gt;Talking to other people, especially people outside my own research area, is extremely beneficial. In fact, I was able to narrow down what I want to work on after discussions with a friend of mine who was working on NLP, an area completely outside of networking. Chatting with lab mates and collaborators is also extremely helpful. They usually ask questions I would never have thought of, and save me from spending countless hours exploring cluelessly.&lt;/p&gt;
&lt;p&gt;To me, current systems research feels application-driven. Many research projects are designed to address a very specific challenge faced at the application level. Thus, one is quite likely to find an interesting systems problem at a non-systems conference like KDD or even ICLR.&lt;/p&gt;</description></item><item><title>Handle GitHub Password Authentication Deprecation</title><link>https://www.bodunhu.com/blog/posts/handle-github-password-authentication-deprecation/</link><pubDate>Tue, 19 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/handle-github-password-authentication-deprecation/</guid><description>&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;em&gt;using an &lt;a href="https://docs.github.com/en/github-ae@latest/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account"&gt;SSH key&lt;/a&gt; to access the repo is strongly recommended.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Recently, GitHub &lt;a href="https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/"&gt;deprecated&lt;/a&gt; password authentication for Git operations. You now have to &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token"&gt;generate GitHub tokens&lt;/a&gt; to access repos. It&amp;rsquo;s difficult for me to memorize a token without serious effort. Fortunately, it&amp;rsquo;s easy to work around the problem.&lt;/p&gt;
&lt;p&gt;After a repo is cloned, simply execute&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;git remote remove origin
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to remove the old remote. Then, add it back with the token embedded in the URL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;git remote add origin https://&amp;lt;TOKEN&amp;gt;@github.com/&amp;lt;GITHUB_USERNAME&amp;gt;/&amp;lt;REPO&amp;gt;.git
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, execute the following command to set up the upstream branch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;git push --set-upstream origin main
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After this, no password is needed for git operations inside this repo. Beware that you need to repeat these steps for every new repo.&lt;/p&gt;</description></item><item><title>Consensus Problem in Distributed Systems</title><link>https://www.bodunhu.com/blog/posts/consensus-problem-in-distributed-systems/</link><pubDate>Mon, 18 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/consensus-problem-in-distributed-systems/</guid><description>&lt;p&gt;In a distributed system, processes commonly need to reach consensus. When all non-faulty processes terminate, we must guarantee that everyone agrees on a specific value. Unfortunately, FLP 1985 &lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; proved that no deterministic algorithm can guarantee consensus in an asynchronous system if even one process may fail.&lt;/p&gt;
&lt;p&gt;Why is that an issue? The key lies in the fact that asynchronous communication doesn&amp;rsquo;t preserve the &lt;em&gt;order&lt;/em&gt; of message arrivals. To fully answer that question, we must introduce the concepts of &lt;em&gt;bi-valent&lt;/em&gt; and &lt;em&gt;\(v\)-valent&lt;/em&gt; states. The state of a system consists of the states of all processes and all message queues. A bi-valent state means the system can still reach either decision depending on message arrival times; a \(v\)-valent state means the system can only reach one decision \(v\) (uni-valent).&lt;/p&gt;
&lt;h3 id="initial-state-must-be-bi-valent"&gt;Initial State Must be Bi-valent&lt;/h3&gt;
&lt;p&gt;We will use these facts to show why an algorithm can run forever under asynchronous assumptions. First, we claim that some initial state is bi-valent. We prove this claim by contradiction. Suppose every initial state is uni-valent. We say two initial states are adjacent if they differ only in the value of one process. Each of two adjacent states must be either 0-valent or 1-valent (the global value it leads to is either 0 or 1).&lt;/p&gt;
&lt;p&gt;For example, let&amp;rsquo;s say we have five processes in an initial uni-valent state (0-valent). We modify only one process&amp;rsquo;s value from 0 to 1 between two adjacent states. At some point, the system state changes from 0-valent to 1-valent (we refer to this moment as the crossover point).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/bi-valent-state.png" alt="bi-valent-state"&gt;&lt;/p&gt;
&lt;p&gt;We see that the system state becomes 1-valent at the crossover point. Assume \(p_2\) fails before the crossover point. The two adjacent states then look identical to the remaining processes, so the same decision must be reached from both, which contradicts our assumption that all initial states are uni-valent and that these two lead to different values. Therefore, some initial state must be bi-valent.&lt;/p&gt;
&lt;h3 id="always-bi-valent"&gt;Always Bi-valent&lt;/h3&gt;
&lt;p&gt;Second, given the initial state is bi-valent, we then show that for any bi-valent state, we can always deliver the next message to process \(p\) while staying bi-valent. This means we can&amp;rsquo;t reach a certain decision no matter how much time has passed. More precisely, we will prove that no message can transform a system into a decided state by contradiction.&lt;/p&gt;
&lt;p&gt;Suppose we have a system. Initially, this system state is 0-valent. We execute an event \(e = (p,m)\) (message \(m\) is delivered to process \(p\)). This event can transform the system state \(C\) to either 0-valent or 1-valent at some point, while system state \(C\) can either be 0-valent or 1-valent.&lt;/p&gt;
&lt;p&gt;At some point, the system state will change from \(C_0\) to \(C _1\). Let&amp;rsquo;s assume a different event \(e&amp;rsquo; = (p&amp;rsquo;, m&amp;rsquo;)\) causes this transition. We know from before that event \(e\) can transform the system to 0-valent or 1-valent. Now we have two facts: (1) \(e\) transforms \(C_0\) to a 0-valent state; (2) \(e&amp;rsquo;\) transforms \(C _0\) to \(C _1\), and \(e\) transforms \(C _1\) to a 1-valent state. If \(p \neq p&amp;rsquo;\), then \(e\) and \(e&amp;rsquo;\) commute. So after \(e\) is applied to \(C_0\) and leads to a 0-valent state, we can apply \(e&amp;rsquo;\) and end up in the same 1-valent state as applying \(e&amp;rsquo;\) first and \(e\) second. A 0-valent state cannot lead to a 1-valent state, so this is a contradiction.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/uni-valent-contradiction.png#center" alt="uni-valent"&gt;&lt;/p&gt;
&lt;p&gt;To expand on this further, suppose event \(e\) leads to a uni-valent state. Then a run without \(e\) (a \(p\)-free run) will lead \(C _0\) to a decided state. However, we&amp;rsquo;ve shown that \(e\) and \(e&amp;rsquo;\) commute, and now the \(p\)-free run, \(e\), and \(e&amp;rsquo;\)
all commute, so the \(p\)-free run in fact leads the system to a &amp;lsquo;&amp;lsquo;decided&amp;rsquo;&amp;rsquo; state that is actually not decided, meaning we can&amp;rsquo;t reach a consensus. This scenario is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/not-uni-valent.png#center" alt="not-uni-valent"&gt;&lt;/p&gt;
&lt;p&gt;So, given the proof that we can&amp;rsquo;t reach consensus under asynchronous assumptions, what is the best we can do in practice?&lt;/p&gt;
&lt;h3 id="paxos-algorithm"&gt;Paxos Algorithm&lt;/h3&gt;
&lt;p&gt;We will back up a little bit from the asynchronous case and make a compromise: (1) we keep consistency/agreement under asynchrony; (2) we settle for the lack of guaranteed termination.&lt;/p&gt;
&lt;p&gt;In the Paxos algorithm, we assume the majority of processes are non-faulty. That means \(N &amp;gt; 2F\). Processes have three roles: proposer, acceptor, and learner. Proposers broadcast their proposals to all acceptors; acceptors accept a proposal (e.g. based on arrival times of proposals) and broadcast accept messages to all learners; learners decide which value to accept (e.g. the value accepted by a majority).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/paxos-basics.png#center" alt="Paxos-basics"&gt;&lt;/p&gt;
&lt;p&gt;The problem is: even if one acceptor fails, the total number of non-faulty processes will be \(N-1\) and we can&amp;rsquo;t say \(N-1 &amp;gt; 2F\) must hold.&lt;/p&gt;
&lt;p&gt;To fix this problem, we can reduce the number of proposers to one. At any moment, only one proposer is the leader. If a leader fails, a new leader becomes available and starts sending out proposals.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/paxos-leader.png#center" alt="paxos-leader"&gt;&lt;/p&gt;
&lt;p&gt;However, this solution still suffers from the same problem: if a new leader is elected, it cannot know whether the old leader&amp;rsquo;s proposal was accepted.&lt;/p&gt;
&lt;p&gt;Therefore, it is necessary for the new leader to establish a majority. The new leader must first broadcast a \(prepare\) message to all acceptors. The acceptors then reply with a \(prepared\) message. Once a majority is prepared, the new leader knows whether the old leader&amp;rsquo;s proposal &lt;em&gt;might&lt;/em&gt; have been accepted.&lt;/p&gt;
&lt;p&gt;To see why this is the case, suppose an old leader dies at some point, then a new leader comes in. This new leader broadcasts a \(prepare\) message to acceptors. Some acceptors reply back. We argue that if the old leader&amp;rsquo;s proposal is accepted by the &lt;em&gt;majority&lt;/em&gt; of acceptors, then this new leader will preserve the old proposal by evaluating the replies from acceptors and re-proposing the latest accepted value in the prepare phase.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consensus-problem/paxos.png#center" alt="paxos"&gt;&lt;/p&gt;
&lt;p&gt;The reason is: &lt;em&gt;if there exists a majority of acceptors in the previous round, then there must exist at least one acceptor in both the old and the current round&lt;/em&gt;, because the sum of the sizes of two majorities exceeds the total number of processes. Therefore, there must be at least one acceptor that can send its accepted value back to the new leader. Since there is only one leader at any time, we can safely say the old value will be sent to the new leader.&lt;/p&gt;
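&lt;p&gt;The counting argument can be written out explicitly. As a sketch, let \(Q_{old}\) and \(Q_{new}\) (names introduced here for illustration only) be the majorities of acceptors that took part in the old and the new round, out of \(N\) acceptors in total:&lt;/p&gt;
&lt;p&gt;\[
|Q_{old} \cap Q_{new}| = |Q_{old}| + |Q_{new}| - |Q_{old} \cup Q_{new}| \geq |Q_{old}| + |Q_{new}| - N &amp;gt; \frac{N}{2} + \frac{N}{2} - N = 0
\]&lt;/p&gt;
&lt;p&gt;For instance, with \(N = 5\), any two majorities have at least three acceptors each, so they share at least one acceptor.&lt;/p&gt;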
&lt;p&gt;In summary, if the new leader sees a \(prepared\) reply that carries a value, then that value must come from the previous round and the new leader can re-propose that value; otherwise, the new leader can safely propose a new value.&lt;/p&gt;
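&lt;p&gt;The rule in this summary can be sketched in a few lines of Python. This is a minimal illustration and not a full Paxos implementation: the &lt;code&gt;Prepared&lt;/code&gt; record, its fields, and the function &lt;code&gt;choose_value&lt;/code&gt; are hypothetical names, and the sketch assumes each reply reports the round in which the acceptor last accepted a value, if any.&lt;/p&gt;

```python
from typing import NamedTuple, Optional, Sequence


class Prepared(NamedTuple):
    """Reply from one acceptor to a prepare message (hypothetical shape)."""
    acceptor_id: int
    accepted_round: Optional[int]  # round in which a value was accepted, if any
    accepted_value: Optional[str]  # the accepted value, if any


def choose_value(replies: Sequence[Prepared], n_acceptors: int,
                 fresh_value: str) -> Optional[str]:
    """Pick the value a new leader should propose, or None without a majority."""
    # A majority of distinct acceptors must be prepared: 2 * replies > N.
    distinct = {r.acceptor_id for r in replies}
    if not 2 * len(distinct) > n_acceptors:
        return None
    # Keep only replies that carry a previously accepted value.
    accepted = [r for r in replies if r.accepted_round is not None]
    if not accepted:
        return fresh_value  # nothing was accepted before: free to propose
    # Re-propose the value accepted in the latest round.
    latest = max(accepted, key=lambda r: r.accepted_round)
    return latest.accepted_value
```

&lt;p&gt;With five acceptors, three replies form a majority. If any of them carries an accepted value, the leader re-proposes the one from the latest round; if none does, it proposes its own value.&lt;/p&gt;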
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf"&gt;Impossibility of Distributed Consensus with One Faulty Process&lt;/a&gt;&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Fault Tolerance in Distributed Systems</title><link>https://www.bodunhu.com/blog/posts/fault-tolerance-in-distributed-systems/</link><pubDate>Tue, 05 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/fault-tolerance-in-distributed-systems/</guid><description>&lt;p&gt;No system can provide fault-free guarantees, including distributed systems. However, failures in distributed systems are &lt;em&gt;independent&lt;/em&gt;, meaning only a subset of processes fail at once. We can exploit this feature and provide some degree of fault tolerance. The problem is, fault tolerance makes everything else much more difficult.&lt;/p&gt;
&lt;p&gt;The most common fault model is &lt;em&gt;fail-stop&lt;/em&gt;. It means a process completely &amp;ldquo;bricks&amp;rdquo;. When a process fail-stops, no messages can emerge from this process any more. We also don&amp;rsquo;t know if this process will ever restart. In addition, we must account for all possible states of the faulty process, including unsent messages in the process&amp;rsquo;s queue. On the other hand, it&amp;rsquo;s important to point out that a process that takes a long time to respond is indistinguishable from one that has fail-stopped. The intuition is that both slow processes and faulty ones may take an unknown amount of time before a message emerges.&lt;/p&gt;
&lt;p&gt;We use an example here to illustrate how and why a system fails to provide fault tolerance. We take a system that replicates data by broadcasting operations using logical timestamps. This system uses &lt;a href="https://en.wikipedia.org/wiki/Lamport_timestamp"&gt;Lamport clocks&lt;/a&gt; to update local clocks (our previous &lt;a href="https://www.bodunhu.com/blog/posts/lamport-distributed-mutual-exclusion/"&gt;post&lt;/a&gt; on Lamport Distributed Mutual Exclusion explains how a Lamport clock works). In short, we overwrite the value stored in a replica if an incoming message has a later timestamp \(ts\).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/fault-tolerance-in-distributed-systems/replication.png#center" alt="replication"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;This system is not fault-tolerant. Imagine one replica receives a write request (marked by the red incoming &amp;ldquo;write&amp;rdquo;) and then tries to write the value to the other replicas. This replica then fail-stops right after it writes to one replica and never writes to the other. In this case, not all replicas see the write, thus violating consistency.&lt;/p&gt;
&lt;p&gt;The solution to this problem is quite simple: reliable atomic broadcast. Under atomic broadcast, all writes are all-or-nothing. It eliminates the possibility for a process to fail-stop amid broadcasting.&lt;/p&gt;
&lt;p&gt;Now let&amp;rsquo;s take the above example and update it with additional requirements. Instead of overwriting existing values, we append writes to an array and want to ensure every replica has the same array eventually. The major difference is that each replica needs to wait for acks with higher timestamps before it can append to its array.&lt;/p&gt;
&lt;p&gt;This system is also not fault-tolerant. If one replica fail-stops, the others will wait forever on a write, because every replica relies on acks with higher timestamps before committing the append.&lt;/p&gt;
&lt;p&gt;Thus, we want to extend the atomic broadcast so that updates become ordered. Under ordered atomic broadcast, writes are all-or-nothing and everyone agrees on the order of updates. If we assume the system to be fully asynchronous, ordered atomic broadcasts are not possible: (1) we can&amp;rsquo;t guarantee termination under asynchrony; (2) we could lose order. Thus, we rely on the synchronous approach.&lt;/p&gt;
&lt;p&gt;Under the synchronous assumption, we can safely say a process fails after waiting for time \(t _{fail}\), where \(t _{fail} = 3 t _{msg} + 2 t _{turn}\). Here, \(t _{msg}\) is the message transfer latency, and \(t _{turn}\) is the max time to respond to a message.&lt;/p&gt;
&lt;p&gt;To see why \(t _{fail}\) is calculated this way, we use the following example to explain the process:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/fault-tolerance-in-distributed-systems/fail-time.png#center" alt="process-fail"&gt;&lt;center&gt;Source: Ken McMillan&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;Imagine process \(p\) sends a message to \(q\) and waits for an ack from \(q\). Before \(q\) is able to respond with the ack, it somehow crashes. The max time taken for \(p\) to see an ack from \(q\) would be two message transfer times plus the execution time of \(q\), which is \(2t _{msg} + t _{turn}\). We add \(t\) to indicate the elapsed time when \(p\) sends the request.&lt;/p&gt;
&lt;p&gt;Now imagine \(q\) is able to broadcast a request right before it fails. Later, another process is able to forward this request back to \(p\). Then \(p\) needs to wait for three message transfer times plus two message processing times before it can assume that it will no longer receive a message from \(q\).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/fault-tolerance-in-distributed-systems/time-fail.png#center" alt="process-fail"&gt;&lt;/p&gt;
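&lt;p&gt;Summing the hops in this second scenario gives the bound. As a sketch, write \(r\) for the process that forwards the request (a name introduced here for illustration only):&lt;/p&gt;
&lt;p&gt;\[
t_{fail} = \underbrace{t_{msg}}_{p \to q} + \underbrace{t_{turn}}_{q\ \text{responds}} + \underbrace{t_{msg}}_{q \to r} + \underbrace{t_{turn}}_{r\ \text{responds}} + \underbrace{t_{msg}}_{r \to p} = 3t_{msg} + 2t_{turn}
\]&lt;/p&gt;
&lt;p&gt;For instance, with assumed values \(t_{msg} = 10\) ms and \(t_{turn} = 5\) ms, \(p\) would wait \(t_{fail} = 3 \cdot 10 + 2 \cdot 5 = 40\) ms.&lt;/p&gt;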
&lt;p&gt;Under the synchronous assumption, ordered reliable atomic broadcast works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When a client sends a request to process \(p\), the process records logical time \(r _l(p,p)\) and physical time \(r _{t}(p,p)\). Then it broadcasts the request.&lt;/li&gt;
&lt;li&gt;When process \(p\) receives a message \(m\), and if message \(m\) contains a previously unseen timestamp \(t_m\), then we record the logical time we see message \(m\) at \(p\), denoted as \(r_l(p, p_m)\), as well as the physical time \(r _t(p, p_m)\). Then we broadcast the request. Finally, we send an ack back to the originator \(p_m\) without updated timestamp.&lt;/li&gt;
&lt;li&gt;When process \(p\) receives a message \(m\), \(p\) updates \(t _l(p, p_m)\). It means we update \(p\)&amp;rsquo;s notion of the latest timestamp of another process who just acked.&lt;/li&gt;
&lt;li&gt;Process \(p\) sets another process \(q\) as &amp;ldquo;failed&amp;rdquo; (denoted by \(f(p, q)\)) if \(t _l(p, q) \leq r _l(p, p&amp;rsquo;) &amp;lt; +\infty\) and \(t _t &amp;gt; r _t(p, p&amp;rsquo;) + t _{fail}\). In short, if we broadcast a message and don&amp;rsquo;t receive any response within time \(t _{fail}\), and our last recorded logical time of process \(q\) is before our broadcast, then we know process \(q\) must have failed.&lt;/li&gt;
&lt;li&gt;Then we perform updates when for all \(q \neq p\), \(r _l(p,p) &amp;lt; r _l (p, q)\) (meaning everyone else&amp;rsquo;s request time is later than mine) and \(t _l(p,q) &amp;gt; r _l(p,p) \lor f(p, q)\). Intuitively, it means we only perform updates when every other request is later than ours, and every other process either has sent us a message with a timestamp later than our broadcast or has failed.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Consistency Models Explained</title><link>https://www.bodunhu.com/blog/posts/consistency-models-explained/</link><pubDate>Thu, 23 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/consistency-models-explained/</guid><description>&lt;p&gt;In a distributed system, eventual consistency provides a weak guarantee that data updates will be reflected in all nodes eventually. However, the downside of eventual consistency is that clients could potentially observe awkward intermediate states. For example, appending numbers to a client may result in states like [10], [10,13], [10,12,13].&lt;/p&gt;
&lt;p&gt;Therefore, we need stronger consistency guarantees, which are easier to reason about. These consistency models provide varying degrees of consistency guarantees. However, it&amp;rsquo;s not always feasible to provide the strongest consistency guarantee. Usually, one needs to trade off consistency for availability and partition resilience (&lt;a href="https://en.wikipedia.org/wiki/CAP_theorem"&gt;CAP theorem&lt;/a&gt;). Much of the content here is credited to Prof. &lt;a href="https://mcmil.net/wordpress/"&gt;Ken McMillan&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="global-consistency"&gt;Global Consistency&lt;/h2&gt;
&lt;p&gt;The most primitive notion of consistency is &lt;em&gt;global consistency&lt;/em&gt;. It means we have some history of events that is totally ordered.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is (globally) consistent if it respects the intended sequential semantics of the events.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;em&gt;Kenneth McMillan&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Take read/write as an example: we have a sequence of write operations to the same memory location \(12\) at various timestamps, and we want to read the value from this location:&lt;/p&gt;
&lt;p&gt;\[
\texttt{write}(0, 12, 2) \rightarrow \texttt{write}(1, 12, 3) \rightarrow \texttt{read}(2, 12, 3)
\]&lt;/p&gt;
&lt;p&gt;If we have global consistency, every read at a given memory location \(x\) will yield the most recently written value to \(x\).&lt;/p&gt;
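&lt;p&gt;As a minimal sketch, this rule can be checked mechanically for a totally ordered history. The tuple layout (operation, timestamp, address, value) mirrors \(\texttt{write}(0, 12, 2)\) from the example; the function name &lt;code&gt;globally_consistent&lt;/code&gt; is hypothetical, and treating a read of a never-written address as inconsistent is an assumption of this sketch.&lt;/p&gt;

```python
from typing import Dict, Iterable, Tuple

# (operation, timestamp, address, value), mirroring write(0, 12, 2)
Event = Tuple[str, int, int, int]


def globally_consistent(history: Iterable[Event]) -> bool:
    """Check that every read returns the most recently written value."""
    memory: Dict[int, int] = {}
    # Replay the history in timestamp order, which is the total order.
    for op, _time, addr, value in sorted(history, key=lambda e: e[1]):
        if op == "write":
            memory[addr] = value
        elif op == "read":
            # A read of a never-written address is treated as inconsistent.
            if memory.get(addr) != value:
                return False
    return True


history = [("write", 0, 12, 2), ("write", 1, 12, 3), ("read", 2, 12, 3)]
consistent = globally_consistent(history)
```

&lt;p&gt;The history from the example passes the check; changing the read to return 2 would make it fail, because 3 is the most recently written value at that point.&lt;/p&gt;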
&lt;p&gt;In reality, it&amp;rsquo;s impossible to implement a global clock in a distributed system. Therefore, no node can observe the entire history of ordered events, making global consistency hard to implement.&lt;/p&gt;
&lt;h2 id="linearizability"&gt;Linearizability&lt;/h2&gt;
&lt;p&gt;In essence, &lt;a href="https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf"&gt;linearizability&lt;/a&gt; is an approximation of global consistency. The difference is that linearizability is based on &lt;em&gt;logical time&lt;/em&gt;, as opposed to the physical time used in global consistency. It means we don&amp;rsquo;t care in what order the events occur physically. We just want the ordering of events to be consistent with what we know about time, and what we know about time is based on &lt;em&gt;causality&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Linearizability has two assumptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clients don&amp;rsquo;t have a shared clock, but are able to send messages to each other. In other words, we want to create an illusion of global consistency by causality.&lt;/li&gt;
&lt;li&gt;If an event \(e_1\) ends before another event \(e_2\) begins in physical time, then \(e_1\) &lt;em&gt;happens-before&lt;/em&gt; \(e_2\). We define the &lt;em&gt;happen before&lt;/em&gt; relationship as \(hb(e_1, e_2)\). In simpler terms, is it possible for us to assume a &lt;em&gt;causal connection&lt;/em&gt; between two events?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Take the following scenario as an example. We can say \(hb(e_1, e_2)\) holds because we can establish a causality relation between these two events. We are able to assume \(P_1\) sent a message \(m_1\) to \(P_2\), which caused the execution of \(e_2\).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/causality.png#center" alt="causality"&gt;&lt;/p&gt;
&lt;p&gt;Here is another example. This time, we cannot establish a causal connection between \(e_1\) and \(e_2\) because we cannot assume \(m_1\) caused the execution of \(e_2\).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/non-causality.png#center" alt="non-causality"&gt;&lt;/p&gt;
&lt;p&gt;To say a set of events, denoted as \(E\), is &lt;em&gt;linearizable&lt;/em&gt;, the following conditions must be met:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There exists a total order \(&amp;lt;_{lin}\) over events \(E\) s.t.
&lt;ul&gt;
&lt;li&gt;\((E,\ &amp;lt;_{lin})\) is globally consistent.&lt;/li&gt;
&lt;li&gt;\(hb(e_1, e_2) \rightarrow e_1 &amp;lt;_{lin} e_2\)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, a set of events \(E\) is linearizable if it can be totally ordered in a way that respects the happen-before relationship and is globally consistent.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at one example. Suppose we have two processes \(P_1\) and \(P_2\). \(P_1\) writes 1 to location 0, and later reads from 0 and gets 1. \(P_2\) writes 2 to location 0 before \(P_1\) finishes its write.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/linearizable.png#center" alt="linearizable"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;These events are linearizable. We can order these three events as:&lt;/p&gt;
&lt;p&gt;\[
\texttt{write}(1, 0, 2) \rightarrow \texttt{write}(0, 0, 1) \rightarrow \texttt{read}(0, 0, 1)
\]&lt;/p&gt;
&lt;p&gt;We know \(\texttt{write}(0, 0, 1)\) happens before \(\texttt{read}(0, 0, 1)\). The read gets the most recently written value (most recent in this order, not in physical time), satisfying global consistency. Therefore, these events are linearizable.&lt;/p&gt;
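&lt;p&gt;For small histories, the definition can be checked by brute force: try every total order, keep those that respect happens-before, and test each for global consistency. The sketch here is illustrative only. The &lt;code&gt;Op&lt;/code&gt; record and &lt;code&gt;linearize&lt;/code&gt; function are hypothetical names, and the start and end times are assumed values chosen to match the description of the example (the second write finishes before the first one does), since the exact timings exist only in the figure.&lt;/p&gt;

```python
from itertools import permutations
from typing import NamedTuple, Optional, Sequence, Tuple


class Op(NamedTuple):
    kind: str     # "wr" or "rd"
    proc: int
    addr: int
    value: int
    start: float  # physical time the operation begins
    end: float    # physical time the operation ends


def linearize(ops: Sequence[Op]) -> Optional[Tuple[Op, ...]]:
    """Return a total order that is a valid linearization, or None."""
    n = len(ops)
    for order in permutations(ops):
        # hb(e1, e2) holds when e1 ends before e2 begins. An order is
        # rejected if an earlier-placed op starts after a later-placed op ends.
        respects_hb = all(
            not order[i].start > order[j].end
            for i in range(n) for j in range(i + 1, n)
        )
        if not respects_hb:
            continue
        # Global consistency: each read returns the latest written value.
        memory = {}
        ok = True
        for op in order:
            if op.kind == "wr":
                memory[op.addr] = op.value
            elif memory.get(op.addr) != op.value:
                ok = False
                break
        if ok:
            return order
    return None


# Assumed intervals: the write of 2 overlaps and finishes before the write of 1.
w1 = Op("wr", 0, 0, 1, start=0, end=4)
w2 = Op("wr", 1, 0, 2, start=1, end=3)
r1 = Op("rd", 0, 0, 1, start=5, end=6)
order = linearize([w1, w2, r1])
```

&lt;p&gt;Under these assumed intervals the search returns the order \(wr(1,0,2) \rightarrow wr(0,0,1) \rightarrow rd(0,0,1)\). The search is factorial in the number of events, so it is only suitable for reasoning about tiny examples.&lt;/p&gt;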
&lt;p&gt;Here is another example showing events that are not linearizable.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/not-linearizable.png#center" alt="not-linearizable"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;No matter how you order these four events, there will always be a contradiction. For example, \(rd(0,0,1)\) happens after \(wr(1,0,2)\) and \(wr(0,0,1)\). In order to satisfy global consistency requirement (reading the most recently written value), we must order these three events as&lt;/p&gt;
&lt;p&gt;\[
wr(1,0,2) \rightarrow wr(0,0,1)\rightarrow rd(0,0,1)\rightarrow\ ?
\]&lt;/p&gt;
&lt;p&gt;However, \(wr(0,0,1)\) happens before \(rd(1,0,2)\), so \(rd(1,0,2)\) must be put after \(wr(0,0,1)\), but that way the most recently written value would be 1 and it would be impossible to read value 2, thus violating global consistency.&lt;/p&gt;
&lt;!-- In summary, linearizability is about asking one question: can we use causality to order a set of events such that every one agrees on the order? --&gt;
&lt;h3 id="commit-points"&gt;Commit Points&lt;/h3&gt;
&lt;p&gt;A different and perhaps easier way of thinking about linearizability is using commit points. We say a set of events \(E\) is linearizable if every event can be assigned a physical commit time \(t_e\) such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;\(t_e\) occurs during the execution of an event \(e\).&lt;/li&gt;
&lt;li&gt;\((E,\lt _{lin})\) is globally consistent, where \(e &amp;lt; _{lin}\ d\) iff \(t_e &amp;lt; t_d\)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following picture presents a scenario where we set three commit points on three write/read operations.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/commit-point.png#center" alt="commit-points"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;We know these events are linearizable because the three commit points we picked respect the \(\lt_{\textrm{lin}}\) relationship. The commit point for \(wr(0,0,1)\) is set after the commit point for \(wr(1,0,2)\) and we know \(wr(1,0,2)&amp;lt; _{lin}wr(0,0,1)\).&lt;/p&gt;
&lt;h2 id="sequential-consistency"&gt;Sequential Consistency&lt;/h2&gt;
&lt;p&gt;We can relax the requirement of linearizability even more, which leads us to sequential consistency. Sequential consistency (&lt;a href="http://www.lamport.org/"&gt;Lamport&lt;/a&gt;) is based on slightly different assumptions compared to linearizability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assume clients don&amp;rsquo;t send messages to each other&lt;/li&gt;
&lt;li&gt;\(hb _{sc}(e_1, e_2)\) only holds if \(e _1\) executed before \(e _2\) in the &lt;em&gt;same&lt;/em&gt; process.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These assumptions indicate that each process doesn&amp;rsquo;t know the relative order of operations happening on other processes. Thus, we don&amp;rsquo;t have happen-before arcs between processes.&lt;/p&gt;
&lt;p&gt;Take the following example:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/sequential.png#center" alt="sequential"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;These events are sequentially consistent under the following order. The reason is that \(hb_{sc}(wr(0,0,1), wr(1, 0, 2))\) does not have to hold, since the two writes run on different processes. This example would not be linearizable because \(wr(0,0,1)\) happens before \(wr(1,0,2)\) in physical time.&lt;/p&gt;
&lt;p&gt;\[
wr(1, 0, 2) \rightarrow wr(0,0,1) \rightarrow rd(0,0,1)
\]&lt;/p&gt;
&lt;p&gt;Take another example that is not sequentially consistent:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/not-sequential.png#center" alt="not-sequential"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;For \(rd(0,0,2)\) to be valid, \(wr(0,0,1)\) must be ordered before \(wr(1,0,2)\); for \(rd(1,0,1)\) to be valid, \(wr(1,0,2)\) must be ordered before \(wr(0,0,1)\). Now we have a cycle of ordering constraints, thus reaching a contradiction.&lt;/p&gt;
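&lt;p&gt;The same conclusion can be reached by exhaustive search. The sketch here enumerates every interleaving that keeps each process&amp;rsquo;s own program order and checks whether all reads return the latest write. The function name &lt;code&gt;sequentially_consistent&lt;/code&gt; is hypothetical, and the per-process programs are an assumption read off the argument in the text (each process writes first and reads afterwards), since the figure itself is not reproduced here.&lt;/p&gt;

```python
from itertools import permutations
from typing import Dict, List, Tuple

Event = Tuple[str, int, int, int]  # (kind, process, address, value)


def sequentially_consistent(programs: List[List[Event]]) -> bool:
    """True if some interleaving keeps program order and satisfies every read."""
    # Identify each event by (process index, position in that program).
    events = [(p, i) for p, prog in enumerate(programs) for i in range(len(prog))]
    for order in permutations(events):
        next_index = [0] * len(programs)
        memory: Dict[int, int] = {}
        valid = True
        for p, i in order:
            if i != next_index[p]:
                valid = False  # violates hb_sc: program order within process p
                break
            next_index[p] += 1
            kind, _proc, addr, value = programs[p][i]
            if kind == "wr":
                memory[addr] = value
            elif memory.get(addr) != value:
                valid = False  # the read does not see the latest write
                break
        if valid:
            return True
    return False


# Assumed programs for the non-example: each process writes, then reads.
not_sc = [
    [("wr", 0, 0, 1), ("rd", 0, 0, 2)],
    [("wr", 1, 0, 2), ("rd", 1, 0, 1)],
]
result = sequentially_consistent(not_sc)
```

&lt;p&gt;For these assumed programs no interleaving works, which matches the cycle of ordering constraints described in the text. Like any brute-force check, this only scales to a handful of events.&lt;/p&gt;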
&lt;h2 id="causal-consistency"&gt;Causal Consistency&lt;/h2&gt;
&lt;p&gt;Causal consistency is an even weaker consistency model compared to sequential consistency. However, unlike all the consistency models we discussed before, causal consistency only applies to read/write operations. In the causal consistency model, we define a causal order on those read/write operations such that read operations must see writes in an order that respects causality.&lt;/p&gt;
&lt;p&gt;Precisely, we define a reads-from map, denoted as \(RF\). \(RF\) of a read event \(e\) produces the write operation that supplied the value that was read (there will be ambiguity if there are two writes writing the same value). For example, \(RF(rd(1,0,2))\) could be \(wr(0,0,2)\), a write that stores the value \(2\) at the same address. Putting \(RF\) in formal terms:&lt;/p&gt;
&lt;p&gt;\[
RF(rd(p,a,v)) = wr(p',a',v') \rightarrow a=a' \land v = v'
\]&lt;/p&gt;
&lt;p&gt;In addition, \(hb_{RF}\) is the least transitive relation such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;\(hb_{SC}(e, e&amp;rsquo;) \rightarrow hb_{RF}(e,e&amp;rsquo;)\)&lt;/li&gt;
&lt;li&gt;\(RF(e&amp;rsquo;) = e \rightarrow hb_{RF}(e,e&amp;rsquo;)\). It means whoever gave me the value must happen before me, which represents a notion of causality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We say a set of events \(E\) is causally consistent if there exists a \(RF\) map for \(E\) such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For all reads \(e\in E\), there is no write \(e&amp;rsquo; \in E\) to the same address such that both \(hb_{RF}(RF(e),e&amp;rsquo;)\) and \(hb_{RF}(e&amp;rsquo;,e)\) hold.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In layman&amp;rsquo;s terms, it says that if a write operation \(e\) gives us the value \(x\) that we read, there can&amp;rsquo;t be another write operation \(e&amp;rsquo;\) to the same address that happens after \(e\) and before the read. If there were such a write operation, the read would have to return the value written by \(e&amp;rsquo;\) instead of \(e\).&lt;/p&gt;
&lt;p&gt;Take a previous example here:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/not-linearizable.png#center" alt="causal-consistency"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;These events are causally consistent because \(RF(rd(0,0,1)) = wr(0,0,1)\) and \(RF(rd(1,0,2)) = wr(1,0,2)\). Thus \(hb_{RF}(wr(0,0,1), rd(0,0,1))\) and \(hb _{RF}(wr(1,0,2), rd(1,0,2))\). We also know we can&amp;rsquo;t say \(hb _{SC}(wr(0,0,1), wr(1,0,2))\) because sequential consistency assumes no communication between processes. Therefore, \(hb _{RF}(wr(0,0,1), wr(1,0,2))\) doesn&amp;rsquo;t hold, and we can safely say these events are causally consistent.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at another example:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/not-causal.png#center" alt="not-causally-consistent"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;These events are not causally consistent, because \(RF(rd(2,0,3)) = wr(0,0,3)\). However, there is a write operation \(wr(0,0,1)\) happening after \(wr(0,0,3)\) that writes 1 to location 0. Therefore, it can&amp;rsquo;t be that \(wr(0,0,3)\) causes \(rd(2,0,3)\) because \(wr(0,0,1)\) interferes and creates a contradiction. The easiest way to detect whether a set of events is causally consistent is to check whether there is a &lt;em&gt;cycle of dependencies&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id="s3-strong-consistency"&gt;S3 Strong Consistency&lt;/h2&gt;
&lt;p&gt;A consistency model widely used in production systems is the S3 consistency used in the &lt;a href="https://aws.amazon.com/s3/"&gt;Amazon S3&lt;/a&gt; storage service. The S3 consistency model requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if \(hb(w_1, w_2)\) and \(hb(w_2, r)\), then \(RF(r) \neq w_1\)&lt;/li&gt;
&lt;li&gt;Two reads must agree on the order of writes that &lt;em&gt;happen before&lt;/em&gt; both reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is an example that is causally consistent, but not S3 consistent:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/not-linearizable.png#center" alt="causal-consistency"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;The reason is that \(rd(1,0,2)\) sees the write operations as \(wr(0,0,1) \rightarrow wr(1,0,2)\) while \(rd(0,0,1)\) sees \(wr(1,0,2)\rightarrow wr(0,0,1)\).&lt;/p&gt;
&lt;p&gt;However, with slight adjustment to the example, we have S3 consistency.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/S3-consistent.png#center" alt="S3-consistent"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;The reason is that \(hb(wr(1,0,2),rd(0,0,1))\) doesn&amp;rsquo;t hold. So even if \(rd(1,0,2)\) sees the write operations as \(wr(0,0,1) \rightarrow wr(1,0,2)\) and \(rd(0,0,1)\) sees \(wr(1,0,2)\rightarrow wr(0,0,1)\), only \(wr(0,0,1)\) happens before both reads, so the reads agree on the ordering of the writes that precede both of them.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/consistency-models/consistency-hierarchy.png#center" alt="consistency-hierarchy"&gt;&lt;/p&gt;
&lt;center&gt;Source: Ken McMillan&lt;/center&gt;
&lt;p&gt;The consistency models discussed are only the tip of the iceberg. In fact, different storage service providers usually provide different consistency models. This may result in vendor lock-in because applications designed for one storage system may fall apart when deployed to another due to varying consistency implications.&lt;/p&gt;</description></item><item><title>Lamport Distributed Mutual Exclusion</title><link>https://www.bodunhu.com/blog/posts/lamport-distributed-mutual-exclusion/</link><pubDate>Tue, 21 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/lamport-distributed-mutual-exclusion/</guid><description>&lt;p&gt;Normally, having consistent event ordering in a distributed system is hard because we have no common clock. Since we don&amp;rsquo;t have a common clock to measure with, we rely on logical properties of time instead. Here we use the &lt;em&gt;causality&lt;/em&gt; relation between events.&lt;/p&gt;
&lt;p&gt;In essence, &lt;em&gt;causality&lt;/em&gt; requires that a clock \(C\), a map from events to times, satisfies: \(e\rightarrow e&amp;rsquo;\) implies \(C(e) &amp;lt; C(e&amp;rsquo;)\).&lt;/p&gt;
&lt;p&gt;We can synthesize a clock by a simple protocol, usually referred to as a &lt;em&gt;scalar clock&lt;/em&gt; or &lt;a href="https://en.wikipedia.org/wiki/Lamport_timestamp"&gt;&lt;em&gt;Lamport clock&lt;/em&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each process \(p\) has a local clock \(C(p)\).&lt;/li&gt;
&lt;li&gt;A message sent by a process is stamped with the sender&amp;rsquo;s local clock.&lt;/li&gt;
&lt;li&gt;On receiving \(M\), set the process&amp;rsquo;s local clock to be \(max(C(p), C(M)) + 1\).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives us an ordering of events that is consistent with causality; breaking timestamp ties by process ID turns it into a total order.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s take Lamport distributed mutual exclusion (DME) as an example. We use a scalar clock to agree on the order of access to critical sections. Each process broadcasts a request stamped with its local clock time. Each receiver stores the request time and responds with its updated local time (\(max(C(p), C(M)) + 1\)).&lt;/p&gt;
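&lt;p&gt;The three rules above can be sketched in Python (used here only for illustration; the protocol itself is specified in Ivy later in the post). The class and method names are hypothetical, and ticking the clock before each send is an assumption, since the rules only say that a message carries the local clock of its sender.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LamportClock:
    """Scalar (Lamport) clock for one process; all names are illustrative."""
    pid: int       # process identifier, used only to break timestamp ties
    time: int = 0  # local clock value C(p)

    def send(self) -> int:
        """Tick (an assumption) and return the timestamp to stamp on the message."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """On receiving M, set the local clock to max(C(p), C(M)) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time

    def stamp(self) -> Tuple[int, int]:
        """(time, pid) pairs compare lexicographically, which breaks ties."""
        return (self.time, self.pid)


p0, p1 = LamportClock(pid=0), LamportClock(pid=1)
sent = p1.send()             # p1 ticks to 1 and stamps the message with 1
received = p0.receive(sent)  # p0 moves to max(0, 1) + 1 = 2
```

&lt;p&gt;Comparing \((time, pid)\) pairs instead of bare times is one common way to order events that happen to carry the same timestamp.&lt;/p&gt;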
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/lamport-dme/lamport-dme.png#center" alt="lamport-dme"&gt;&lt;/p&gt;
&lt;p&gt;A process \(p\) can enter the critical section only when the condition \(W\) is met: \(W \equiv \forall q \neq p,\ t(p,q) &amp;gt; r(p,p) \land r(p,p) &amp;lt; r(p,q)\). Here \(t(p, q)\) is the latest time received by \(p\) from \(q\), and \(r(p, q)\) is the request time received by \(p\) from \(q\), or \(+\infty\) if there is none. Intuitively, if a process&amp;rsquo;s request time is smaller than every response time it has received and smaller than every other request time, then this process was the first to send out a request and thus should enter the critical section.&lt;/p&gt;
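&lt;p&gt;As a rough sketch, the condition \(W\) can be evaluated as a plain predicate over the two maps kept by \(p\). The function name and the dictionary layout below are assumptions made for illustration, not part of the Ivy specification.&lt;/p&gt;

```python
import math
from typing import Dict


def may_enter(p: int, t: Dict[int, float], r: Dict[int, float]) -> bool:
    """Evaluate W for process p.

    t[q] is the latest time p has received from q, i.e. t(p, q).
    r[q] is the request time p has stored for q, i.e. r(p, q),
    or math.inf if q has no pending request; r[p] is the request time of p.
    """
    own = r[p]
    # W: for every other process q, t(p,q) > r(p,p) and r(p,q) > r(p,p).
    return all(t[q] > own and r[q] > own for q in t if q != p)


# p0 requested at time 2; p1 replied at time 3 and has no pending request.
can_enter = may_enter(0, {1: 3}, {0: 2, 1: math.inf})

# If p1 had already requested at time 1, p0 has to wait for it.
must_wait = not may_enter(0, {1: 3}, {0: 2, 1: 1})
```

&lt;p&gt;A process that has not requested anything stores \(+\infty\) as its own request time in this sketch, so the predicate is false for it.&lt;/p&gt;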
&lt;p&gt;The reason why this protocol works is illustrated below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/lamport-dme/lamport-dme-process.png#center" alt="lamport-dme-process"&gt;&lt;/p&gt;
&lt;p&gt;When \(p_1\) sends a request at timestamp 2 and gets a response with timestamp 3, we know \(p_1\) had the greatest clock value and \(p_0\) updated its own clock based on the timestamp sent by \(p_1\). Once \(p_1\) sees the response message from \(p_0\) with timestamp 3, it knows any earlier request from \(p_0\) must have already been received, because the network channel is ordered: any request sent by \(p_0\) before the response arrives before the response.&lt;/p&gt;
&lt;p&gt;To see Lamport DME in action, we use &lt;a href="https://microsoft.github.io/ivy/language.html"&gt;Ivy&lt;/a&gt; to specify the protocol. The source file is borrowed from &lt;a href="https://mcmil.net/wordpress/"&gt;Ken&amp;rsquo;s&lt;/a&gt; presentation. The code is annotated and self-explanatory:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#lang ivy1.8
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# This is an implementation of Lamport&amp;#39;s distributed mutual exclusion
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# (DME) algorithm.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# We start the module with a &amp;#39;global&amp;#39; section. This contains the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# declarations of any resources that are used in common by all
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# processes. These usually include:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# - Data types
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# - Services, such as network services
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# - Immutable global parameters, such as network addresses
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# We can&amp;#39;t have mutable global variables, since processes, being
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# distributed, don&amp;#39;t have shared memory.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;global&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Our first global data type is the type of host identifiers. We
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# will have one process for each value of this type. Host
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# identifiers take on integer values from `0` to `node_max`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# We create the host identifier type by instantiating the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `host_iterable` module. It has a parameter `max` that gives the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# maximum value of the type (and is supplied at run time).
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="nl"&gt;host_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Since we have three kinds of messages in our protocol, we define
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# an enumerated type for the message kind with three symbolic
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# values.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="n"&gt;msg_kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request_kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;reply_kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;release_kind&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# In addition, we use a sequence type to represent timestamps. The
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `unbounded_sequence` template in the `order` library gives a
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# discrete totally ordered type with a least value `0` and a
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `next` operator.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="nl"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unbounded_sequence&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Our messages are structs with three fields: the message kind, the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# host identifier of the sender and a timestamp. We order messages
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# according to the timestamp. This ordering is useful in the proof
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# of correctness.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;msg_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="nl"&gt;kind&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg_kind&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="nl"&gt;sender_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# definition (M1:msg_t &amp;lt; M2:msg_t) = ts(M1) &amp;lt; ts(M2)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# A useful enumerated type to describe node state:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="kt"&gt;state_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Finally we instantiate a network service via which our processes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# will communicate. Here, `transport.net` is a template defined in the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `network` library that we included above. The template takes one
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# parameter, which is the type of messages to be sent. Our instance
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# of this template is an object called `net`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="nl"&gt;net&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;net&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# After the global section, we introduce some distributed processes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# A process with parameters has one instance for each value of the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# parameters. In this case we have one parameter of type `host_id`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# which means there is one process in the system for each value of
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# `host_id` in the range `0..host_id.max`. The parameter is named `self`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# This means that the process can refer to its own host identifier by
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# the name `self`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# A process usually begins by declaring an *interface*. This
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# consists of a set of *actions* that are either calls in from the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# environment (exports) or calls out to the environment (imports).
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Our first action is an export `request`, which our client uses to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# request to enter the critical section. It takes no parameters.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Our second action is an import `grant`. This is a callback to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# the client indicating that it is safe to enter the critical
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# section.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;import&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Our third action is an export `release`. This is called by the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# client when exiting the critical section, indicating it is safe for
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# another process to enter.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;common&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;H&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;state_t&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="nf"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;client_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Next we declare per-process objects. Each process needs a socket
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# on network `net` in order to communicate. We declare the socket
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# here. The socket `sock` is an instance of the template `socket`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# declared by the network service `net`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="nl"&gt;sock&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# We also declare some local (per-process) types and variables.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nl"&gt;state&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;state_t&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# We also keep track of the current timestamp
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Each process maintains a &amp;#39;request queue&amp;#39;, which is a map from host_ids to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# the timestamp of the current request from that host, or `0` if none.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# This map records the highest timestamp of a reply received from
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# each host.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nf"&gt;reply_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Having declared our variables, we initialize them. Code in an
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `after init` section runs on initialization of the process. You
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# aren&amp;#39;t allowed to do much here, just assign values to local
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# variables.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;state&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;reply_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Now we come to the implementation code. Here we implement our
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# exported actions, if any, and also any callback actions from the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# services we use (i.e., actions that these services import from
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# us).
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# We start with the `request` action. This builds a request message,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# appends it to the request queue, and broadcasts it. The action `broadcast` is
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# a local action (i.e., a subroutine) and is defined later.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nl"&gt;outgoing&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;kind&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request_kind&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;sender_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;state&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;waiting&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# BUG: should check waiting condition here, if host_id.max = 0
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Next we implement the callback `recv` from our network socket,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# indicating we have an incoming message. This is called
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `sock.recv`. It gives us as input parameters the network address
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# of the sending socket (not useful here) and the incoming
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# message.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# debug &amp;#34;recv&amp;#34; with self = self, src = src, msg = incoming;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# First, we update our timestamp to reflect the incoming
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# message.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# We partly construct an outgoing message
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nl"&gt;outgoing&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;sender_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# What we do here depends on the kind of message.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# When we receive a `request` message, we put it on our request queue,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# and return a reply message to the sender.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request_kind&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;kind&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reply_kind&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;unicast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# When we receive a `release` message, the sender&amp;#39;s request
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# must be at the head of our queue. We dequeue it.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;release_kind&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# On a reply, we update the highest timestamp received from
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# this sender. Because of in-order delivery, the timestamps
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# are received in increasing order, so the incoming one must
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# be the greatest so far.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reply_kind&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;reply_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# Having processed the incoming message, we might now be able
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# to enter our critical section. We do this if:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# - We are in the waiting state
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# - Our request message has the least timestamp in lexicographic order
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# - Every host has sent a reply later than our request
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# debug &amp;#34;waiting&amp;#34; with self = self, rq = request_ts(X), ts = reply_ts(X);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;waiting&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;forall&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;lexord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;reply_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;state&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;request_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="nl"&gt;outgoing&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;sender_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;ts&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;kind&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;release_kind&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nl"&gt;state&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# At the end, we have definitions of internal (non-interface)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# actions (in other words, subroutines) and functions (i.e., pure
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# functions).
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# This function takes two timestamp-host_id pairs and determines
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# whether (X1,Y1) &amp;lt; (X2,Y2) in lexicographic order.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;lexord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;X1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;Y1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;Y2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;X2&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;X1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;Y1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Y2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# The action `unicast` sends a message to just one process.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# To actually send a message to a socket, we call the `send` action
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# of our socket, giving it the receiving socket&amp;#39;s network address
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# and the message to be sent. Notice we can get the network
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# address of process with identifier `idx` with the expression
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `node(idx).sock.id`. This might seem odd, as we are asking for
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# the local state of an object in another process. This is allowed
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# because the network addresses of the sockets are immutable
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# parameters that are determined at initialization and are
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# provided to all processes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="nf"&gt;unicast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;dst_id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# debug &amp;#34;send&amp;#34; with dst = dst_id, msg = outgoing;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# The action `broadcast` sends a message to all processes with
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# identifiers not equal to `self`. We use a &amp;#39;for&amp;#39; loop to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# iterate over the type `host_id`. The &amp;#39;for&amp;#39; construct defines
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# two variables:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# - `it` is an &amp;#39;iterator&amp;#39; of type `host_id.iter`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# - `dst_id` is the value of the type the iterator refers to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# The reason we do it this way is that the finite subrange type
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# `host_id` has no value that is &amp;#39;past the end&amp;#39; of the type, so
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# you can&amp;#39;t write a traditional &amp;#39;for&amp;#39; loop over this type. The
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# iterator type, however, does have a value corresponding to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# &amp;#39;past the end&amp;#39;.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kt"&gt;msg_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dst_id&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cp"&gt;# do not send to self!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dst_id&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;unicast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outgoing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# To compile and run with 3 nodes:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# $ ivyc lamport_mutex.ivy
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# $ ivy_launch host_id.max=3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# To test:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# $ ivyc target=test lamport_mutex.ivy
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# $ ivy_launch host_id.max=3
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# Bounded model checking:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# TODO: As usual, we need the assumption that all endpoint id&amp;#39;s are
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# distinct.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;axiom&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# This says to try bounded model checking up to 20 steps (but Ivy
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# won&amp;#39;t actually get that far). The second parameter says to unroll the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# loops three times. This means that BMC ignores all executions in
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# which a loop is executed more than three times. We need this because of
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# the loop in `node.broadcast`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;attribute&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bmc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# Try adding a bug and see if you can find it with testing and bmc. Change
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# this definition above:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# X1 &amp;lt; X2 | X1 = X2 &amp;amp; Y1 &amp;lt; Y2
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# to this:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# function lexord(X1:timestamp,Y1:host_id,X2:timestamp,Y2:host_id) =
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# X1 &amp;lt;= X2 | X1 = X2 &amp;amp; Y1 &amp;lt; Y2
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# This mistake could allow two nodes with requests with the same timestamp
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# to enter the CS at the same time. Here&amp;#39;s a counter-example produced
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# by BMC (it takes a while!):
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.request(1)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.request(0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:1,msg_t.ts:1})
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:request,msg_t.sender_id:0,msg_t.ts:1})
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:0,msg_t.ts:2})
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;lt; node.enter_cs(1)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;gt; node.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},{msg_t.kind:reply,msg_t.sender_id:1,msg_t.ts:2})
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# &amp;lt; node.enter_cs(0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;# lamport_mutex_save.ivy: line 137: error: assertion failed
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item><item><title>Specifying Token Ring for Mutual Exclusion</title><link>https://www.bodunhu.com/blog/posts/specifying-token-ring-for-mutual-exclusion/</link><pubDate>Sat, 11 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/specifying-token-ring-for-mutual-exclusion/</guid><description>&lt;p&gt;Mutual exclusion is a common term appearing frequently in computer sciences. In essence, it&amp;rsquo;s a mechanism of concurrency control allowing exclusive access to some resource (or &amp;ldquo;critical region&amp;rdquo;). Token passing is an algorithm for distributed mutual exclusion (DME) and will be our focus in this post.&lt;/p&gt;
&lt;p&gt;DME specifications usually make the following assumptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The network delivers messages in order, e.g. TCP (sometimes)&lt;/li&gt;
&lt;li&gt;Every message is eventually delivered (usually)&lt;/li&gt;
&lt;li&gt;Messages are never duplicated. Duplication may result in granting a resource to multiple clients, which is exactly what mutual exclusion forbids (usually)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Things we might want to guarantee for DME specifications are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mutual exclusion: at most one client is in a critical section (always)&lt;/li&gt;
&lt;li&gt;Non-starvation: a requesting client eventually enters its critical section (usually)&lt;/li&gt;
&lt;li&gt;Non-overtaking: a client cannot enter its critical section more than once while another client waits (usually)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, we need to analyze the performance metrics of DME algorithms, which usually include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Message complexity, e.g. the number of messages sent per client served&lt;/li&gt;
&lt;li&gt;Response time, or the time between a request and entering the CS&lt;/li&gt;
&lt;li&gt;Throughput, or the rate of processing CS requests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&amp;rsquo;s take a token ring as an example. In a token ring, a client holds a token and then sends it to the next one after exiting its critical section. For a token ring, the assumptions become:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ordered delivery is not needed, because at any given time there is at most one message in transit in the ring.&lt;/li&gt;
&lt;li&gt;Every message must eventually be delivered. Otherwise, the system won&amp;rsquo;t make progress and we lose the non-starvation guarantee.&lt;/li&gt;
&lt;li&gt;Messages must not be duplicated. Otherwise, we violate the fundamental property of the protocol, mutual exclusion.&lt;/li&gt;
&lt;li&gt;Clients must not release spuriously. This will become clear later when we demonstrate what happens if a client releases multiple times.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/token-ring-protocol/token-ring.png#center" alt="token-ring"&gt;&lt;/p&gt;
&lt;p&gt;We want to guarantee that&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mutual exclusion holds.&lt;/li&gt;
&lt;li&gt;non-starvation holds.&lt;/li&gt;
&lt;li&gt;non-overtaking holds, because after a client passes the token on, the token visits every other client in the ring before it returns.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To analyze token ring performance, we use the metrics above (message complexity, response time, and throughput):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Message complexity: under low load, the message complexity is unbounded, because the token may be passed around the ring an arbitrary number of times while no client is in the critical section. Under high load, the message complexity is 1.&lt;/li&gt;
&lt;li&gt;Response time: under low load, a request may wait up to \(N\) message times (where \(N\) is the total number of clients). Under high load, the response time is 1 message time.&lt;/li&gt;
&lt;li&gt;Throughput: the maximum throughput is 1/(message time + CS time).&lt;/li&gt;
&lt;/ul&gt;
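&lt;p&gt;The low-load and high-load response times can be checked with a toy model. The Python sketch below is not part of the Ivy demo: it assumes an idealized ring in which every hop costs exactly one message time, and the name &lt;code&gt;message_times_until_grant&lt;/code&gt; is made up for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def message_times_until_grant(n_clients, last_sender, requester):
    """Message times until the token reaches the requester.

    Assumes clients 0..n_clients-1 form a ring, the token has just been
    sent by last_sender to its successor, and each hop costs one message
    time. If the requester itself just sent the token, the token must
    travel the whole ring before it comes back.
    """
    hops = (requester - last_sender) % n_clients
    return hops if hops != 0 else n_clients


n = 5
# Low load: the worst case is a request made right after the requester
# passed the token on, which costs N message times.
worst_case = max(message_times_until_grant(n, s, 0) for s in range(n))
# High load: the successor of the current holder is already waiting,
# so it is served one message time after the holder releases.
high_load = message_times_until_grant(n, 0, 1)
print(worst_case, high_load)  # prints: 5 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With five clients the worst case is five message times and the high-load case is one, matching the \(N\) and 1 figures in the list.&lt;/p&gt;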
&lt;p&gt;A naive specification for mutex in &lt;a href="https://microsoft.github.io/ivy/language.html"&gt;Ivy&lt;/a&gt; would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;action&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;action&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;X:&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;bool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;X&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To see token ring in action, we use the demo from &lt;a href="http://mcmil.net/wordpress/"&gt;Ken&lt;/a&gt;&amp;rsquo;s presentation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="n"&gt;ivy1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;include&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;include&lt;/span&gt; &lt;span class="n"&gt;numbers&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;global&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;type&lt;/span&gt; &lt;span class="n"&gt;host_id&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;process&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;host_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;action&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="n"&gt;grant&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="n"&gt;forall&lt;/span&gt; &lt;span class="kt"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;require&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;pass&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="kr"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="kr"&gt;else&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We put the above code into a file called &lt;code&gt;token_ring.ivy&lt;/code&gt; and compile it using &lt;code&gt;ivyc token_ring.ivy&lt;/code&gt;. Then we launch the program using &lt;code&gt;ivy_launch max=2 token_ring.ivy&lt;/code&gt;, which opens three terminal windows.&lt;/p&gt;
&lt;p&gt;If we type &lt;code&gt;host.release&lt;/code&gt; in host(1), host(2) outputs &lt;code&gt;host.grant&lt;/code&gt;, which seems to show that the token works properly. However, if we type &lt;code&gt;host.release&lt;/code&gt; again in host(1), &lt;code&gt;host.grant&lt;/code&gt; shows up again in host(2). Multiple tokens have been created, which violates the requirement that there is at most one token in the ring at any given time.&lt;/p&gt;
&lt;p&gt;If we execute &lt;code&gt;ivyc target=test token_ring.ivy &amp;amp;&amp;amp; ivy_launch max=2 token_ring.ivy&lt;/code&gt;, then we see the token ring work properly. The reason is that we have specified requirements for &lt;code&gt;grant&lt;/code&gt; and &lt;code&gt;release&lt;/code&gt; (&lt;code&gt;require forall X. ~host(X).lock&lt;/code&gt; for &lt;code&gt;grant&lt;/code&gt; and &lt;code&gt;require lock&lt;/code&gt; for &lt;code&gt;release&lt;/code&gt;).&lt;/p&gt;
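&lt;p&gt;The effect of the two &lt;code&gt;require&lt;/code&gt; clauses can be mimicked outside Ivy with a small monitor. The Python sketch below is only an analogy, not how Ivy implements testing, and &lt;code&gt;MutexMonitor&lt;/code&gt; is a hypothetical name. It replays the manual session with two releases from host(1) and rejects the second one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;class MutexMonitor:
    """Toy monitor for the grant/release specification."""

    def __init__(self, n_hosts):
        self.lock = [False] * n_hosts  # one lock flag per host

    def grant(self, host):
        # Mirrors: require forall X. ~host(X).lock
        if any(self.lock):
            raise AssertionError("grant while a lock is held")
        self.lock[host] = True

    def release(self, host):
        # Mirrors: require lock
        if not self.lock[host]:
            raise AssertionError("spurious release")
        self.lock[host] = False


monitor = MutexMonitor(3)
monitor.grant(1)      # host 1 receives the token
monitor.release(1)    # host 1 passes it on
monitor.grant(2)      # host 2 receives the token
try:
    monitor.release(1)  # host 1 releases again without holding the lock
    outcome = "accepted"
except AssertionError as err:
    outcome = "rejected: " + str(err)
print(outcome)  # prints: rejected: spurious release
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;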
&lt;p&gt;&lt;code&gt;grant&lt;/code&gt; is an action &lt;em&gt;imported&lt;/em&gt; from the environment, so its requirement is a guarantee the system must provide: whenever grant happens, no client in the network holds the lock. On the other hand, &lt;code&gt;release&lt;/code&gt; is an action &lt;em&gt;exported&lt;/em&gt; from the system, so its requirement is an assumption on the environment: the tester performs release only when the host holds the lock. So the tester won&amp;rsquo;t perform &lt;em&gt;release&lt;/em&gt; multiple times like we did above, because the tester cannot violate the &lt;code&gt;require lock&lt;/code&gt; requirement.&lt;/p&gt;</description></item><item><title>Writing Specifications for a Distributed System using Ivy</title><link>https://www.bodunhu.com/blog/posts/writing-specifications-for-a-distributed-system-using-ivy/</link><pubDate>Wed, 08 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/writing-specifications-for-a-distributed-system-using-ivy/</guid><description>&lt;p&gt;Before we jump into writing specifications in a distributed setting, we first define what a specification is. I take the definition from the magnificent &lt;a href="http://mcmil.net/wordpress/"&gt;Ken McMillan&lt;/a&gt;: a specification is a &lt;em&gt;statement&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;A statement describes an abstract view of a program. The view itself is often at an interface, which hides or abstracts internal states. A specification is stated in terms of two elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assumption: properties of the environment the system relies on&lt;/li&gt;
&lt;li&gt;Guarantee: properties that must hold &lt;em&gt;if&lt;/em&gt; the assumption(s) are met&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The way we write specifications is through an abstract program that observes or monitors all program events. This abstract program is able to remember the execution history of the program being monitored, and decides, at any given moment, whether an action is allowable according to the specification.&lt;/p&gt;
&lt;p&gt;One way to implement this abstract monitor program is to use guarded command form:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let \(A\) be a set of program actions&lt;/li&gt;
&lt;li&gt;An event \(e(x_1,\ &amp;hellip;,\ x_n)\) is an action \(e\in A\) with parameter values \(x_1,\ &amp;hellip;,\ x_n\) of the right types for \(e\).&lt;/li&gt;
&lt;li&gt;Let \(S\) be a set of states and \(s_0 \in S\) be the initial state.&lt;/li&gt;
&lt;li&gt;Guarded command set \(G\) is specified as:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;\[e(V):\ \gamma (S,V) \rightarrow \{S := \tau(S, V)\}\]&lt;/p&gt;
&lt;p&gt;It means that if the guard \(\gamma\) holds for an event \(e\) with parameters \(V\) in state \(S\), then we accept the event and deterministically update the state with the function \(\tau\).&lt;/p&gt;
&lt;p&gt;The observation \(E\) of the system is a finite sequence of events, which corresponds to the system behavior, denoted as \(e_0(V_0)&amp;hellip;e_{n-1}(V_{n-1})\). A run of \(E\) is a state sequence \(s_0\ &amp;hellip;s_n\) such that for \(i\in 0\ &amp;hellip; n- 1\), \(\gamma(s_i, V_i)\) is true and \(s_{i+1} = \tau(s_i, V_i)\). Observation \(E\) is accepted by the specification iff it has a run. We can test whether an observation is accepted by just executing the guarded commands. In layman&amp;rsquo;s terms, if every guard accepts its corresponding event in the state reached at that point, then the sequence of events satisfies our specification and is accepted.&lt;/p&gt;
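&lt;p&gt;The acceptance test is short enough to sketch directly. The Python below illustrates the definitions and is not Ivy code: &lt;code&gt;accepts&lt;/code&gt; and the FIFO-channel guards are hypothetical names chosen for the example, and the state is the tuple of messages currently in flight.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def accepts(guards, s0, observation):
    """Return True iff the observation has a run starting from s0.

    guards maps an action name to a pair (gamma, tau):
      gamma(state, args) says whether the event is allowed,
      tau(state, args) returns the next state.
    observation is a list of (action name, args) events.
    """
    state = s0
    for name, args in observation:
        gamma, tau = guards[name]
        if not gamma(state, args):
            return False  # no run exists: the event is rejected
        state = tau(state, args)
    return True


# Example specification: an ordered, non-duplicating channel.
guards = {
    # Sending is always allowed and enqueues the message.
    "send": (lambda s, v: True, lambda s, v: s + (v,)),
    # Receiving is allowed only for the oldest message in flight.
    "recv": (lambda s, v: s[:1] == (v,), lambda s, v: s[1:]),
}

in_order = [("send", "a"), ("send", "b"), ("recv", "a"), ("recv", "b")]
reordered = [("send", "a"), ("send", "b"), ("recv", "b")]
print(accepts(guards, (), in_order), accepts(guards, (), reordered))
# prints: True False
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;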
&lt;p&gt;Now let&amp;rsquo;s take a replicated file as an example. Our first informal attempt at a specification for the &amp;ldquo;append&amp;rdquo; operation would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assumption: the network is ordered and non-duplicating&lt;/li&gt;
&lt;li&gt;Guarantee: if there are no further append requests, the replicas eventually become equal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, the problem with this specification is that it is a liveness property, meaning that we can&amp;rsquo;t practically test it by observing a finite sequence of events. Therefore, we resort to a different, safety specification that we can test:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If all sent messages are delivered, the two replicas are identical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here we convert liveness to safety by explicitly defining the moment when the eventuality should hold.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A liveness property means a good thing eventually happens. A liveness property cannot be &lt;em&gt;refuted&lt;/em&gt; by any finite execution. A safety property means a bad thing never happens. A safety property can always be refuted by a finite execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To see the replicated file specification in action, we use the example given in &lt;a href="http://mcmil.net/wordpress/"&gt;Prof. McMillan&lt;/a&gt;&amp;rsquo;s presentation. The code is written in &lt;a href="http://microsoft.github.io/ivy/language.html"&gt;Ivy&lt;/a&gt; and is pretty self-explanatory. In this demo we only have two processes.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To install Ivy, simply execute &lt;code&gt;virtualenv ivyenv &amp;amp;&amp;amp; source ivyenv/bin/activate &amp;amp;&amp;amp; pip install ms-ivy&lt;/code&gt;. This was tested on Ubuntu 18.04 LTS and may vary slightly on other distros.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt; &lt;span class="n"&gt;ivy1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;include&lt;/span&gt; &lt;span class="n"&gt;numbers&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;include&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;include&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;global&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;type&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;process&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;action&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;instance&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;append&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;implement&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then we form our specification based on the guarantee that if all sent messages are delivered, the two replicas are identical. The specification is equivalent to the &lt;em&gt;guarded command&lt;/em&gt; we&amp;rsquo;ve talked about earlier.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nat&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="kt"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ensure&lt;/span&gt; &lt;span class="n"&gt;msg_count&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We save the above code in a file named &lt;code&gt;append.ivy&lt;/code&gt; and generate the testing code using &lt;code&gt;ivyc target=test append.ivy&lt;/code&gt;. Then we run the code using &lt;code&gt;ivy_launch append.ivy&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Interestingly, the program yields an error message:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;`ivy_shell`; ./append &amp;#34;[[0,{addr:0x7f000001,port:49124}],[1,{addr:0x7f000001,port:49125}]]&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.append(1,251)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(1,[251])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(0,[251])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.append(1,46)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(1,[251,46])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.append(0,183)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(0,[251,183])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(0,[251,183,46])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(1,[251,46,183])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;assertion_failed(&amp;#34;append.ivy: line 49&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;append.ivy: line 49: error: assertion failed
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What happens is that the generated tests randomize message arrival times, so a message may be delivered after its target has already appended a value of its own. In the trace above, host 1 appends 46 and host 0 appends 183 before either message arrives, so the two replicas apply the same appends in different orders and their contents diverge.&lt;/p&gt;
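&lt;p&gt;The race can be replayed outside Ivy with a small Python model. This is a simplified sketch, not a translation of the Ivy program: &lt;code&gt;simulate&lt;/code&gt; and the schedule format are hypothetical, and the network is modeled as one ordered queue per destination.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hypothetical two-replica model of the append protocol above.

def simulate(schedule):
    """Apply a schedule of ("append", host, val) and ("deliver", host) steps.

    Each host appends locally and queues the value for the other host;
    "deliver" pops the oldest queued value for that host (ordered network).
    """
    contents = {0: [], 1: []}
    inbox = {0: [], 1: []}
    for step in schedule:
        if step[0] == "append":
            _, host, val = step
            contents[host].append(val)
            inbox[1 - host].append(val)
        else:
            _, host = step
            contents[host].append(inbox[host].pop(0))
    return contents


# The interleaving from the failing trace: host 1 appends 46, host 0 appends
# 183 before receiving 46, then both messages are delivered.
replicas = simulate([
    ("append", 1, 251), ("deliver", 0),
    ("append", 1, 46), ("append", 0, 183),
    ("deliver", 0), ("deliver", 1),
])
print(replicas[0])  # [251, 183, 46]
print(replicas[1])  # [251, 46, 183]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Even with an ordered, non-duplicating network, two concurrent appends are enough to leave the replicas with different contents once all messages have been delivered.&lt;/p&gt;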
&lt;p&gt;Notice that here we are actually running on a real network to find counterexamples. The downside is that a test may run arbitrarily long, depending on the randomized test cases. Instead, we will use bounded model checking (BMC) to check whether the specification holds. This way we can rely purely on the logic of our specification instead of running on the real network. The Ivy checker uses the &lt;a href="https://en.wikipedia.org/wiki/Z3_Theorem_Prover"&gt;Z3 Theorem Prover&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BMC constructs a boolean formula that is satisfiable if and only if the underlying state transition system can realize a finite sequence of state transitions that reaches certain states of interest.&lt;/p&gt;
&lt;/blockquote&gt;
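&lt;p&gt;To get a feel for what a bounded search explores, here is a rough Python sketch over a simplified two-replica model. It is an explicit-state enumeration of action sequences, not a real BMC: it builds no formula and calls no solver, and all names (&lt;code&gt;successors&lt;/code&gt;, &lt;code&gt;violates&lt;/code&gt;, &lt;code&gt;bounded_search&lt;/code&gt;) and the two-value alphabet are assumptions made for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hypothetical explicit-state bounded search over a two-replica model.
# This enumerates action sequences directly; a real BMC would instead
# encode the bounded unrolling as a formula for a solver.

def successors(state, values):
    """Yield (action, next_state) pairs for every enabled action."""
    contents, inbox = state
    for host in (0, 1):
        for val in values:
            new_contents = list(contents)
            new_inbox = list(inbox)
            new_contents[host] = contents[host] + (val,)
            new_inbox[1 - host] = inbox[1 - host] + (val,)
            yield ("append", host, val), (tuple(new_contents), tuple(new_inbox))
        if inbox[host]:
            new_contents = list(contents)
            new_inbox = list(inbox)
            new_contents[host] = contents[host] + (inbox[host][0],)
            new_inbox[host] = inbox[host][1:]
            yield ("recv", host), (tuple(new_contents), tuple(new_inbox))


def violates(state):
    """Safety check: with nothing in flight, the replicas must be equal."""
    contents, inbox = state
    return not inbox[0] and not inbox[1] and contents[0] != contents[1]


def bounded_search(bound, values=(1, 2)):
    """Return the first trace of at most `bound` actions that violates safety."""
    initial = (((), ()), ((), ()))
    frontier = [((), initial)]
    for _ in range(bound):
        next_frontier = []
        for trace, state in frontier:
            for action, nxt in successors(state, values):
                new_trace = trace + (action,)
                if violates(nxt):
                    return new_trace
                next_frontier.append((new_trace, nxt))
        frontier = next_frontier
    return None


counterexample = bounded_search(4)
print(counterexample)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this model no violation exists within three steps, and a four-step counterexample (two concurrent appends followed by two deliveries) is found at bound 4. The frontier grows by up to a factor of six per step here, which hints at why larger bounds get expensive.&lt;/p&gt;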
&lt;p&gt;To tell Ivy to use bounded model checking, we add the following lines to &lt;code&gt;append.ivy&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;axiom&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;attribute&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bmc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Executing &lt;code&gt;ivy_check detailed=false append.ivy&lt;/code&gt;, we see an error message:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.append(1,80)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(1,[80])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.append(0,64)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(0,[64])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.sock.recv(0,{tcp.endpoint.addr:...,tcp.endpoint.port:...},80)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(0,[64,80])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; host.sock.recv(1,{tcp.endpoint.addr:...,tcp.endpoint.port:...},64)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt; host.show(1,[80,64])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;append.ivy: line 49: error: assertion failed
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Sometimes BMC can help us find the error faster because it is systematically checking all possible actions. However, increasing the number of steps for the BMC can result in the exploration space growing exponentially, so we are going to use some combination of BMC and randomized test cases.&lt;/p&gt;</description></item><item><title>Whiz: Data-Driven Analytics Execution</title><link>https://www.bodunhu.com/blog/posts/whiz-data-driven-analytics-execution/</link><pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/whiz-data-driven-analytics-execution/</guid><description>&lt;p&gt;This paper by &lt;a href="https://utns.cs.utexas.edu/"&gt;UTNS&lt;/a&gt; lab appeared in &lt;a href="https://www.usenix.org/conference/nsdi21"&gt;NSDI 2021&lt;/a&gt;. It presents a data-analytics framework that decouples intermediate data from computations.&lt;/p&gt;
&lt;p&gt;Whiz addresses several challenges posed by current analytics frameworks. The first one is data opacity. Most modern data analytics frameworks rely on a MapReduce-style execution engine. The developer specifies the map and reduce functions, which are then submitted to the analytics framework. The workflow can be expressed as a logical graph; the physical graph (which includes the cluster configuration, disk quota, etc.) is generated transparently. The workflow is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/whiz-data-driven-analytics-execution/limitation1.png#center" alt="mapreduce-limitation"&gt;&lt;/p&gt;
&lt;p&gt;The problem is in the region marked yellow. It shows the execution engine has limited runtime visibility into the intermediate data. Thus, adapting processing logic of tasks based on the states of intermediate data becomes challenging.&lt;/p&gt;
&lt;p&gt;In addition, task parallelism and the intermediate data partition strategy are often static. In the graph above, the intermediate data partition tasks and the final reduce tasks might be determined prematurely, without taking the characteristics of the intermediate data partitions into account. For example, &lt;a href="https://itnext.io/handling-data-skew-in-apache-spark-9f56343e58e8"&gt;data skew&lt;/a&gt; (unevenly distributed data) causes different reduce nodes to process different amounts of data. The graph below illustrates how the shuffle stage can result in disproportionate intermediate data partitions.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/whiz-data-driven-analytics-execution/data-skew.png#center" alt="data-skew"&gt;&lt;/p&gt;
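&lt;p&gt;A tiny Python example makes the skew concrete. It is a toy under stated assumptions, not anything from Whiz: keys are integers, the partitioner is a plain modulo, and one hypothetical hot key accounts for 90 of 100 records.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each reduce partition under modulo partitioning."""
    sizes = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        sizes[key % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


# Hypothetical workload: integer keys where key 7 dominates.
keys = [7] * 90 + list(range(10))
print(partition_sizes(keys, 4))  # [3, 3, 2, 92]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With a partitioning decided before the data is seen, one reducer ends up with 92 of the 100 records while the others sit nearly idle.&lt;/p&gt;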
&lt;p&gt;Finally, Whiz addresses the limitation posed by compute-driven scheduling. In compute-driven scheduling, a stage usually waits for its upstream tasks to complete. This may leave compute idle while waiting for the remaining data to become available, even if a subset of workers in the current stage is ready to execute. Decoupling data from computation enables the execution engine to treat intermediate data as a first-class citizen, allowing finer-grained control of data processing.&lt;/p&gt;
&lt;p&gt;In summary, Whiz solves two problems present in compute-centric execution engines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tight coupling between intermediate data and compute.&lt;/li&gt;
&lt;li&gt;Intermediate data agnosticity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thus, Whiz creates a feedback loop between the execution service and the data service, so that the execution service can dynamically adjust its policy based on information from the data service to optimize system performance.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/whiz-data-driven-analytics-execution/whiz-control-flow.png#center" alt="whiz-control-flow"&gt;&lt;/p&gt;
&lt;p&gt;Whiz classifies itself as a &lt;strong&gt;data-driven&lt;/strong&gt; execution engine, which drives execution based on intermediate data properties. Making intermediate data visible opens the door to optimizations and thus to better performance. For more technical details regarding the architecture and implementation of Whiz, please refer to the original &lt;a href="https://www.usenix.org/system/files/nsdi21-grandl.pdf"&gt;paper&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>In-Network Aggregation for Shared Machine Learning Clusters</title><link>https://www.bodunhu.com/blog/posts/in-network-aggregation-for-shared-machine-learning-clusters/</link><pubDate>Tue, 31 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/in-network-aggregation-for-shared-machine-learning-clusters/</guid><description>&lt;p&gt;This &lt;a href="https://proceedings.mlsys.org/paper/2021/file/eae27d77ca20db309e056e3d2dcd7d69-Paper.pdf"&gt;paper&lt;/a&gt; by Nadeen appeared in MLSys 2021. It presents an in-network aggregation framework called &lt;em&gt;PANAMA&lt;/em&gt; for distributed ML training tasks. &lt;em&gt;PANAMA&lt;/em&gt; has two components: (1) an in-network hardware accelerator with support for floating-point gradient aggregation; (2) a domain-specific load-balancing and congestion control protocol.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;The primary motivation behind &lt;em&gt;PANAMA&lt;/em&gt; is that &lt;em&gt;data-parallel&lt;/em&gt; training (in which the neural network is replicated across \(N\) workers, each processing a subset of the training data) requires workers to exchange local gradients at every iteration, which creates a huge amount of traffic.&lt;/p&gt;
&lt;p&gt;For example, for a training job with \(1000\) workers and a 1 GB DNN model requiring \(1000\) iterations, the total traffic will be about 2 PB.&lt;/p&gt;
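&lt;p&gt;As a rough sanity check on that figure, assume (this is my reading, not something spelled out above) that in every iteration each worker sends 1 GB of gradients and receives 1 GB of aggregated results:&lt;/p&gt;
&lt;p&gt;\[1000\ \text{workers} \times 1000\ \text{iterations} \times 2 \times 1\ \text{GB} = 2 \times 10^{6}\ \text{GB} = 2\ \text{PB}\]&lt;/p&gt;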
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/in-network-aggregation-for-shared-machine-learning-clusters/network-flow-size.png#center" alt="in-network-aggregation-traffic"&gt;&lt;/p&gt;
&lt;h2 id="network-design"&gt;Network Design&lt;/h2&gt;
&lt;p&gt;The paper assumes a traditional data center multi-tier folded &lt;a href="https://en.wikipedia.org/wiki/Clos_network"&gt;Clos topology&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/in-network-aggregation-for-shared-machine-learning-clusters/clos.png#center" alt="clos-topology"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;PANAMA&lt;/em&gt; uses multiple aggregation trees per training job to spread the traffic across multiple paths and avoid congestion hotspots. This differs from the equal-cost multi-path (ECMP) protocol because the aggregation flows are typically large. Binding such flows to a single aggregation tree would create network imbalance.&lt;/p&gt;
&lt;h2 id="congestion-control"&gt;Congestion Control&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;PANAMA&lt;/em&gt; uses &lt;strong&gt;implicit acknowledgments&lt;/strong&gt; instead of traditional point-to-point approaches. Because each aggregated packet is constructed on the fly, a one-to-one mapping between packets and acknowledgements is unnecessary: if a worker receives aggregation results, that automatically serves as an &lt;em&gt;implicit&lt;/em&gt; acknowledgement. This eliminates the need to keep per-flow congestion state at PSwitches.&lt;/p&gt;
&lt;p&gt;Similar to &lt;a href=""&gt;DCTCP&lt;/a&gt;, &lt;em&gt;PANAMA&lt;/em&gt; relies on ECN marks in the IP header to react to network congestion. Since aggregation packets are created on the switch, each hardware accelerator needs to perform a bitwise \(OR\) on the ECN fields of the received packets to mirror the traditional ECN bit.&lt;/p&gt;
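&lt;p&gt;The ECN handling can be sketched as follows. The packet layout (a dict with a gradient list and a one-bit &lt;code&gt;ecn&lt;/code&gt; flag) and the &lt;code&gt;aggregate&lt;/code&gt; function are hypothetical stand-ins for illustration, not the format used by &lt;em&gt;PANAMA&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Hypothetical packet layout: a dict with a gradient list and a 1-bit ECN flag.

def aggregate(packets):
    """Sum gradients element-wise and OR the ECN marks of the inputs."""
    total = [0.0] * len(packets[0]["grad"])
    ecn = 0
    for pkt in packets:
        total = [a + b for a, b in zip(total, pkt["grad"])]
        ecn |= pkt["ecn"]  # congestion seen by any input marks the output
    return {"grad": total, "ecn": ecn}


out = aggregate([
    {"grad": [1.0, 2.0], "ecn": 0},
    {"grad": [0.5, 0.5], "ecn": 1},
    {"grad": [1.5, 1.5], "ecn": 0},
])
print(out)  # {'grad': [3.0, 4.0], 'ecn': 1}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only one of the three inputs was marked, yet the aggregated packet carries the mark, so the congestion signal survives aggregation.&lt;/p&gt;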
&lt;h2 id="hardware-design"&gt;Hardware Design&lt;/h2&gt;
&lt;p&gt;The design of the aggregation accelerator in &lt;em&gt;PANAMA&lt;/em&gt; is straightforward: it uses a SIMD architecture in which the gradients are partitioned across adder trees. The adder trees operate in parallel, and their results are packed and sent to the output ports. The VID fields are used only to ensure correct aggregation.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/in-network-aggregation-for-shared-machine-learning-clusters/aggregater-arch.png#center" alt="aggregator-arch"&gt;&lt;/p&gt;
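&lt;p&gt;In software terms, the adder-tree idea looks roughly like the sketch below. It assumes, purely for illustration, one tree per gradient position and zero-padding for odd levels; the names &lt;code&gt;adder_tree&lt;/code&gt; and &lt;code&gt;aggregate_lanes&lt;/code&gt; are made up and say nothing about the actual hardware layout.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def adder_tree(values):
    """Reduce values pairwise, level by level, as a binary adder tree would."""
    level = list(values)
    if not level:
        raise ValueError("adder_tree needs at least one value")
    while len(level) != 1:
        if len(level) % 2 == 1:
            level.append(0.0)  # pad odd levels with the additive identity
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def aggregate_lanes(worker_grads):
    """Give each gradient position its own adder tree (one lane per position)."""
    return [adder_tree(lane) for lane in zip(*worker_grads)]


print(aggregate_lanes([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [9.0, 12.0]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each lane is independent of the others, which is what lets the trees run in parallel.&lt;/p&gt;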
&lt;p&gt;Overall, the workflow is really simple and illustrated below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/in-network-aggregation-for-shared-machine-learning-clusters/network-aggregation-workflow.png" alt="network-aggregation-workflow"&gt;&lt;/p&gt;</description></item><item><title>Deploy Hugo Site to GitHub Pages</title><link>https://www.bodunhu.com/blog/posts/deploy-hugo-site-to-github-pages/</link><pubDate>Fri, 27 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/deploy-hugo-site-to-github-pages/</guid><description>&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;em&gt;The &lt;a href="https://gohugo.io/hosting-and-deployment/hosting-on-github/"&gt;official guide from Hugo&lt;/a&gt; is for deploying from public repo. This post is intended for deploying from private repo&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This post assumes the user has already set up two separate repositories: a private repository for Hugo source files, and a public repository for &lt;a href="https://pages.github.com/"&gt;GitHub Pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: test Hugo site by executing &lt;code&gt;hugo server&lt;/code&gt; in the source code directory to make sure the site is generated properly.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Then, we need to generate a pair of keys by using the following command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ssh-keygen -t rsa -b &lt;span class="m"&gt;4096&lt;/span&gt; -C &lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;git config user.email&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt; -f deployment -N &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will create two files: &lt;code&gt;deployment&lt;/code&gt; and &lt;code&gt;deployment.pub&lt;/code&gt;, which correspond to a private key and a public key.&lt;/p&gt;
&lt;p&gt;Next, execute &lt;code&gt;cat deployment&lt;/code&gt; and copy the private key. Navigate to the private &lt;em&gt;source repository -&amp;gt; Settings -&amp;gt; Secrets -&amp;gt; New repository secret&lt;/em&gt;. Paste the private key and save the change.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/deploy-hugo-site-to-github-pages/github-actions-secrets.png" alt="Github-actions-secrets"&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve already added the private key to the source repository and named it &lt;code&gt;PRIVATE_KEY&lt;/code&gt;. You can name it however you want.&lt;/p&gt;
&lt;p&gt;Then, we go to the public repository that hosts our website. Navigate to the public &lt;em&gt;site repository -&amp;gt; Settings -&amp;gt; Deploy keys -&amp;gt; Add deploy key&lt;/em&gt;. Execute &lt;code&gt;cat deployment.pub&lt;/code&gt; and copy and paste the result. You should see an SSH key added:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/deploy-hugo-site-to-github-pages/github-keys.png" alt="Github-keys"&gt;&lt;/p&gt;
&lt;p&gt;Next, create a workflow file in the private repository at the following path: &lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;github pages&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;push&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="l"&gt;main &lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# Set a branch to deploy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ubuntu-20.04&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;actions/checkout@v2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;submodules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# Fetch Hugo themes (true OR recursive)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;fetch-depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;# Fetch all history for .GitInfo and .Lastmod&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Setup Hugo&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;peaceiris/actions-hugo@v2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;hugo-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;latest&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;extended&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Build&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;hugo --minify&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Deploy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;peaceiris/actions-gh-pages@v3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;deploy_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${{ secrets.PRIVATE_KEY }}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;external_repository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;your_username/public_repository_name&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;publish_branch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;branch_to_publish&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;publish_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;./public&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, make sure you create a file named &lt;code&gt;.nojekyll&lt;/code&gt; in the root directory of the public repository to prevent GitHub Pages from building the site using Jekyll.&lt;/p&gt;
&lt;p&gt;Every time you push commits to the private repository, the site will be automatically generated and published to the public repository.&lt;/p&gt;</description></item><item><title>Quantum State in a Nutshell</title><link>https://www.bodunhu.com/blog/posts/quantum-state-in-a-nutshell/</link><pubDate>Thu, 19 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/quantum-state-in-a-nutshell/</guid><description>&lt;p&gt;There are thousands of articles trying to explain what exactly a quantum state is. Many of them boil down to &amp;ldquo;the state of a qubit is 0, 1, or 0 and 1 at the same time&amp;rdquo;. This statement leads to both confusion and misinterpretation. The explanation I found in &lt;a href="https://quantum.country/qcvc"&gt;Quantum computing for the very curious&lt;/a&gt; is by far the simplest and most elegant:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The state of a qubit is a vector in a two-dimensional vector space. This vector space is known as state space.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I will borrow much of the great content from &lt;a href="https://quantum.country/qcvc"&gt;Quantum computing for the very curious&lt;/a&gt; to explain things.&lt;/p&gt;
&lt;h2 id="mapping-qubits-to-classical-bits"&gt;Mapping qubits to classical bits&lt;/h2&gt;
&lt;p&gt;We&amp;rsquo;ve described what a qubit state is, but provided no link between a qubit state and a classical bit state. There are two possible states for a classical bit: 0 and 1. The corresponding states for a qubit are slightly fancier: \(|0\rangle \) and \(|1\rangle \).&lt;/p&gt;
&lt;p&gt;The notation with \(|\) and \(\rangle\) is called \(ket\) notation. A label such as \(0\) or \(1\) wrapped between them forms a \(ket\). A \(ket\) is a fancy term for a vector. In fact, \(|0\rangle\) is really just \(
\begin{bmatrix}
1 \newline
0
\end{bmatrix}
\); \(|1\rangle\) is really just \(
\begin{bmatrix}
0 \newline
1
\end{bmatrix}
\).&lt;/p&gt;
&lt;h2 id="states-between-0-and-1"&gt;States between 0 and 1&lt;/h2&gt;
&lt;p&gt;The states \(|0\rangle\) and \(|1\rangle\) are called computational basis states, and they map to the classical 0 and 1 states. A qubit has more states than these two. We&amp;rsquo;ve already learned that a quantum state is a two-dimensional vector. Here is an example:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://quantum.country/qcvc#general-states-of-a-qubit"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/quantum-state-in-a-nutshell/quantum-state-eg.png#center" alt="quantum-state-example"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The state \(0.6|0\rangle + 0.8|1\rangle\) is just a combination of the \(|0\rangle\) vector and the \(|1\rangle\) vector. A state like this is a &lt;em&gt;superposition&lt;/em&gt; of \(|0\rangle\) and \(|1\rangle\), a fancy way of saying a linear combination of \(|0\rangle\) and \(|1\rangle\). \(0.6\) is the &lt;em&gt;amplitude&lt;/em&gt; for state \(|0\rangle\), and \(0.8\) is the &lt;em&gt;amplitude&lt;/em&gt; for state \(|1\rangle\).&lt;/p&gt;
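&lt;p&gt;As a minimal sketch of the idea above, the superposition can be written out as plain vector arithmetic. The names &lt;code&gt;ket0&lt;/code&gt;, &lt;code&gt;ket1&lt;/code&gt;, and &lt;code&gt;superpose&lt;/code&gt; are made up for this illustration and do not come from any quantum library.&lt;/p&gt;

```python
# Computational basis states written as plain two-element vectors.
ket0 = [1.0, 0.0]  # |0>
ket1 = [0.0, 1.0]  # |1>


def superpose(alpha, beta):
    """Return the linear combination alpha*|0> + beta*|1> as a list."""
    return [alpha * a + beta * b for a, b in zip(ket0, ket1)]


# The example state 0.6|0> + 0.8|1> from the text.
state = superpose(0.6, 0.8)
print(state)  # [0.6, 0.8]
```

&lt;p&gt;The amplitudes end up as the two components of the resulting vector, which is all a superposition is at this level.&lt;/p&gt;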
&lt;p&gt;Not all linear combinations of the vectors \(|0\rangle\) and \(|1\rangle\) are valid qubit states. There is one constraint: &lt;em&gt;the sum of the squares of the amplitudes must be 1&lt;/em&gt;. For example, we can compute \(0.6^2 + 0.8^2\) and verify the result is 1.&lt;/p&gt;
&lt;p&gt;For general quantum states, the amplitudes can be complex numbers as well, in which case we square their magnitudes. Denoting the two amplitudes as \(\alpha\) and \(\beta\), a quantum state can be formally written as:&lt;/p&gt;
&lt;p&gt;\[\alpha |0\rangle + \beta |1\rangle \quad \text{where} \quad |\alpha|^2 + |\beta|^2 = 1\]&lt;/p&gt;
&lt;p&gt;\(|\alpha|^2 + |\beta|^2 = 1\) is called the &lt;em&gt;normalization constraint&lt;/em&gt;. For real amplitudes it reduces to the sum-of-squares condition above.&lt;/p&gt;
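&lt;p&gt;The normalization constraint is easy to check numerically. The sketch below uses a hypothetical helper, &lt;code&gt;is_normalized&lt;/code&gt;, and squares the magnitude of each amplitude so that the same check also covers complex amplitudes. The tolerance is an assumption to absorb floating-point rounding.&lt;/p&gt;

```python
import math


def is_normalized(alpha, beta, tol=1e-9):
    """Check that the squared magnitudes of the two amplitudes sum to 1."""
    total = abs(alpha) ** 2 + abs(beta) ** 2
    return math.isclose(total, 1.0, abs_tol=tol)


# The example state 0.6|0> + 0.8|1> from the text.
print(is_normalized(0.6, 0.8))
# An arbitrary linear combination that is not a valid qubit state.
print(is_normalized(1.0, 1.0))
# A state with a complex amplitude.
print(is_normalized(1 / math.sqrt(2), 1j / math.sqrt(2)))
```

&lt;p&gt;Using &lt;code&gt;abs&lt;/code&gt; before squaring matters only for complex amplitudes. For real ones it gives the same result as squaring directly.&lt;/p&gt;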
&lt;p&gt;If we think of \(|0\rangle\) and \(|1\rangle\) as orthonormal vectors and restrict ourselves to real amplitudes, we can visualize the valid linear combinations of these two vectors as points on a circle of radius 1:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://quantum.country/qcvc#general-states-of-a-qubit"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/quantum-state-in-a-nutshell/quantum-states.png#center" alt="quantum-states"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since amplitudes can be complex numbers, the set of valid states is really a unit sphere in a complex vector space rather than a circle. Summing this up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the quantum state of a qubit is a vector of unit length in a two-dimensional complex vector space known as state space.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;em&gt;Quantum computing for the very curious&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="measuring-a-qubit"&gt;Measuring a qubit&lt;/h2&gt;
&lt;p&gt;Suppose we have a qubit in a quantum state \(\alpha |0\rangle + \beta |1\rangle\). We want to observe the state of this specific qubit. It turns out the laws of physics prohibit us from figuring out the amplitudes \(\alpha\) and \(\beta\) if they start out unknown. In short, the quantum state of any system is not directly observable.&lt;/p&gt;
&lt;p&gt;To learn anything about the quantum state, we rely on a process called &lt;em&gt;measurement in the computational basis&lt;/em&gt;. Suppose a qubit is in the state \(\alpha |0\rangle + \beta |1\rangle\). Measuring this qubit gives us the outcome \(0\) with probability \(|\alpha|^2\), or \(1\) with probability \(|\beta|^2\). The state of the qubit after the measurement is thus either \(|0\rangle\) or \(|1\rangle\). After the measurement, \(\alpha\) and \(\beta\) are gone.&lt;/p&gt;</description></item><item><title>Writing in the Sciences - Writing Process</title><link>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-writing-process/</link><pubDate>Mon, 09 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-writing-process/</guid><description>&lt;p&gt;This post covers the topics mentioned in &lt;a href="https://www.coursera.org/learn/sciwrite"&gt;Writing in the Sciences&lt;/a&gt; offered on &lt;a href="https://www.coursera.org/"&gt;Coursera&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="writing-process"&gt;Writing Process&lt;/h2&gt;
&lt;p&gt;The writing process includes three steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prewriting
&lt;ul&gt;
&lt;li&gt;Collect and organize information&lt;/li&gt;
&lt;li&gt;Brainstorm take-home messages&lt;/li&gt;
&lt;li&gt;Work out ideas away from the computer&lt;/li&gt;
&lt;li&gt;Develop a road map&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Writing the first draft
&lt;ul&gt;
&lt;li&gt;Putting ideas together in organized prose&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Revision
&lt;ul&gt;
&lt;li&gt;Read out loud&lt;/li&gt;
&lt;li&gt;Cut the clutter&lt;/li&gt;
&lt;li&gt;Verb check&lt;/li&gt;
&lt;li&gt;Get feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many people conflate steps 2 and 3. They try to write and revise at the same time, which is anything but efficient. It&amp;rsquo;s hard to resist the impulse to be perfect, but paying too much attention to details obscures the whole picture. Unsurprisingly, the class poll shows most people focus on the writing step:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/writing-in-the-sciences-writting-process/writing-poll.png#center" alt="writing-poll"&gt;&lt;/p&gt;
&lt;p&gt;A better time allocation would look like:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prewriting (70%)&lt;/li&gt;
&lt;li&gt;Writing the first draft (10%)&lt;/li&gt;
&lt;li&gt;Revision (20%)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="the-prewriting"&gt;The Prewriting&lt;/h2&gt;
&lt;p&gt;The key to prewriting is to get organized first. What it means is you shouldn&amp;rsquo;t try to write and gather information simultaneously. Instead, you should gather and organize information before writing the first draft. That means you need to have an organization system to help you keep track of various thoughts. I personally found writing this &lt;a href="https://www.bodunhu.com/blog/"&gt;blog&lt;/a&gt; a really good way to keep myself motivated but there are definitely alternatives.&lt;/p&gt;
&lt;h3 id="compositional-organization"&gt;Compositional Organization&lt;/h3&gt;
&lt;p&gt;Here are some simple tips to help organize ideas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Group like ideas/paragraphs together, which often reveals unnecessary repetition.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t &amp;lsquo;&amp;lsquo;bait-and-switch&amp;rsquo;&amp;rsquo; your readers. Switching arguments too many times leads to confusion.
&lt;ul&gt;
&lt;li&gt;When discussing a controversy, follow the arguments -&amp;gt; counter-arguments -&amp;gt; rebuttals pattern.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-writing"&gt;The Writing&lt;/h2&gt;
&lt;p&gt;This is the hardest step for most people. This is where you open a blank window and start writing. The biggest tip is to &lt;em&gt;not be a perfectionist&lt;/em&gt;. The first draft should aim to get the ideas down in complete sentences, in order. You should even purposefully set a low bar to get the first draft out quickly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Focus on logical organization more than sentence-level details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The recommended order for writing a manuscript is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tables and Figures&lt;/li&gt;
&lt;li&gt;Results
&lt;ul&gt;
&lt;li&gt;Summarize what the data show by (1) pointing out simple relationships; (2) describing big-picture trends; (3) citing figures or tables that present supporting data.&lt;/li&gt;
&lt;li&gt;Avoid simply repeating the numbers already available in tables and figures.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Methods&lt;/li&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Discussion&lt;/li&gt;
&lt;li&gt;Abstract&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Steps 1 to 3 involve the most concrete things to put down. They help you frame the introduction.&lt;/p&gt;
&lt;h3 id="tips-for-writing-results"&gt;Tips for Writing Results&lt;/h3&gt;
&lt;p&gt;Here are a few tips for writing results:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Break into subsections, with headings&lt;/li&gt;
&lt;li&gt;Complement the information that is already in tables and figures
&lt;ul&gt;
&lt;li&gt;Give precise values that are not available in the figure&lt;/li&gt;
&lt;li&gt;Report the percent change or percent difference if absolute values are given in the table&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Repeat/highlight only the most important numbers&lt;/li&gt;
&lt;li&gt;Talk about negative and control results&lt;/li&gt;
&lt;li&gt;Reserve the term &amp;lsquo;&amp;lsquo;significant&amp;rsquo;&amp;rsquo; for statistically significant&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t mix results with methods
&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t discuss the rationale for statistical analyses within the Results section&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reserve comments on the meaning of your results for the discussion section. (show vs meaning)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="writing-introductions"&gt;Writing Introductions&lt;/h3&gt;
&lt;p&gt;The good news is that the introduction is easier to write than you may realize. Typically, the recommended range for an introduction is 2 to 5 paragraphs long. The introduction forms a cone structure:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/writing-in-the-sciences-writting-process/cone.png#center" alt="introduction-structure-cone"&gt;&lt;/p&gt;
&lt;p&gt;The idea is to start from something general, then quickly narrow down to your specific study. So an introduction starts from some general background, then to what&amp;rsquo;s unknown. Then we narrow down to our hypothesis. In summary, the introduction is divided into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What&amp;rsquo;s known&lt;/li&gt;
&lt;li&gt;What&amp;rsquo;s unknown&lt;/li&gt;
&lt;li&gt;Your burning question&lt;/li&gt;
&lt;li&gt;Your experimental approach&lt;/li&gt;
&lt;li&gt;Why your experimental approach is new and different and important&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The structure corresponds to roughly 3 paragraphs: step 1 = paragraph 1; step 2 = paragraph 2; steps 3-5 = paragraph 3.&lt;/p&gt;
&lt;p&gt;Some of the tips for writing an introduction include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep paragraphs short&lt;/li&gt;
&lt;li&gt;Write for a general audience&lt;/li&gt;
&lt;li&gt;Known -&amp;gt; Unknown -&amp;gt; Hypothesis&lt;/li&gt;
&lt;li&gt;Emphasize the unknown&lt;/li&gt;
&lt;li&gt;Be explicit about your research hypothesis: &amp;lsquo;&amp;lsquo;We asked whether&amp;rsquo;&amp;rsquo;; &amp;lsquo;&amp;lsquo;Our aim/s were&amp;rsquo;&amp;rsquo;&lt;/li&gt;
&lt;li&gt;Do not answer the research question&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-revision"&gt;The Revision&lt;/h2&gt;
&lt;p&gt;To my surprise, the first big tip is to read your writing out loud, because the brain processes the spoken word differently than the written word.&lt;/p&gt;
&lt;p&gt;The second tip is to do a verb check. You should underline the main verb in each sentence, and watch out for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lackluster verbs (e.g. There &lt;u&gt;are&lt;/u&gt; &amp;hellip;)&lt;/li&gt;
&lt;li&gt;Passive verbs (e.g. Something &lt;u&gt;was observed&lt;/u&gt; by me.)&lt;/li&gt;
&lt;li&gt;Buried verbs (e.g. A careful monitoring of achievement levels before and after the introduction of computers in the teaching of our course &lt;u&gt;revealed&lt;/u&gt; no appreciable change in students&amp;rsquo; performances.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some words should also be cut out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dead weight words&lt;/li&gt;
&lt;li&gt;Empty words&lt;/li&gt;
&lt;li&gt;Long words that can be short&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, watch for&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unnecessary jargon and acronyms&lt;/li&gt;
&lt;li&gt;Repetitive words&lt;/li&gt;
&lt;li&gt;Adverbs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most of these tips are already covered in &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/"&gt;Cut the Clutter&lt;/a&gt; and &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/"&gt;Verbs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The next tip is to do an organizational review. For example, you can tag each paragraph with a phrase or sentence that sums up the main point in the margins of the paper. Then you can move paragraphs around to improve logical flow and bring similar ideas together.&lt;/p&gt;
&lt;p&gt;Another interesting tip is to get feedback, especially from people without any technical background. Ask whether they can grasp the main findings or significance of the work, and ask them to point out hard-to-read sentences and paragraphs. If an average Joe can understand your paper, chances are much higher that people in your field can understand it too.&lt;/p&gt;
&lt;h2 id="more-tips"&gt;More Tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Use past tense for completed actions (e.g. We &lt;u&gt;found&lt;/u&gt; that&amp;hellip;)&lt;/li&gt;
&lt;li&gt;Use the present tense for assertions that continue to be true, such as what the tables show, what you believe, and what the data suggest (e.g. Figure 2 &lt;u&gt;shows&lt;/u&gt;&amp;hellip;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other notes, including &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/"&gt;Cut the Clutter&lt;/a&gt;, &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/"&gt;Verbs&lt;/a&gt;, and &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-structure/"&gt;Structure&lt;/a&gt;, are also available.&lt;/p&gt;</description></item><item><title>Writing in the Sciences - Structure</title><link>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-structure/</link><pubDate>Sun, 08 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-structure/</guid><description>&lt;p&gt;This post covers how to improve sentence structure, and builds up to writing strong paragraphs. Most of the content comes from the &lt;a href="https://www.coursera.org/learn/sciwrite"&gt;Writing in the Sciences&lt;/a&gt; course offered on &lt;a href="https://www.coursera.org/"&gt;Coursera&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="punctuation"&gt;Punctuation&lt;/h2&gt;
&lt;p&gt;Here is a list of punctuation marks, ranked by their power to separate, from weakest to strongest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Comma (,)&lt;/li&gt;
&lt;li&gt;Colon (:)&lt;/li&gt;
&lt;li&gt;Dash (-)&lt;/li&gt;
&lt;li&gt;Parentheses ( () )&lt;/li&gt;
&lt;li&gt;Semicolon (;)&lt;/li&gt;
&lt;li&gt;Period (.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In terms of formality, these punctuation marks rank as follows, from least to most formal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dash (-)&lt;/li&gt;
&lt;li&gt;Parentheses ( () )&lt;/li&gt;
&lt;li&gt;The others (comma (,), colon (:), semicolon (;), period (.))&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;A dash is a mark of separation stronger than a comma, less formal than a colon, and more relaxed than parentheses.&lt;/p&gt;
&lt;p&gt;&amp;ndash; Strunk and White&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="semicolon"&gt;Semicolon&lt;/h3&gt;
&lt;p&gt;A semicolon &lt;em&gt;connects&lt;/em&gt; two independent clauses (a clause always contains a subject and a predicate; an independent clause can stand alone as a complete sentence).&lt;/p&gt;
&lt;p&gt;Here is an example: &amp;lsquo;&amp;lsquo;It was the best of times; it was the worst of times.&amp;rsquo;&amp;rsquo;&lt;/p&gt;
&lt;p&gt;Semicolons can also be used to separate items in lists that contain internal punctuation. If some clauses contain commas, the comma inside the clause is no longer sufficient to separate different items in a list, because we don&amp;rsquo;t know where the boundaries are.&lt;/p&gt;
&lt;h2 id="parenthesis"&gt;Parenthesis&lt;/h2&gt;
&lt;p&gt;Parentheses are used to insert an afterthought or explanation into a passage that is grammatically complete without it.&lt;/p&gt;
&lt;h2 id="colon"&gt;Colon&lt;/h2&gt;
&lt;p&gt;Colons are used after an independent clause to introduce a list, quote, explanation, conclusion, or amplification.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The colon has more effect than the comma, less power to separate than the semicolon, and more formality than the dash.&lt;/p&gt;
&lt;p&gt;&amp;ndash; Strunk and White&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="dash"&gt;Dash&lt;/h2&gt;
&lt;p&gt;A dash can add emphasis or insert an abrupt definition or description almost anywhere in the sentence.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use a dash only when a more common mark of punctuation seems inadequate.&lt;/p&gt;
&lt;p&gt;&amp;ndash; Strunk and White&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here is an example illustrating how a dash emphasizes and adds information: &amp;lsquo;&amp;lsquo;Researchers who study shipworms say these mislabeled animals&amp;ndash;they are clams, not worms&amp;ndash;are actually a scientific treasure&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;I like the example provided in the class to illustrate how to use dashes to join and condense sentences. The original passage is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Finally, the lessons of clinical epidemiology are not meant to be limited to academic physician-epidemiologists, &lt;u&gt;who&lt;/u&gt; sometimes have more interest in analyzing data than caring for patients. &lt;u&gt;Clinical&lt;/u&gt; epidemiology holds the promise of providing clinicians with the tools necessary to improve the outcomes of their patients.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;By using dashes, we can join these two sentences while keeping the description of physician-epidemiologists:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Finally, clinical epidemiology is not limited to academic physician-epidemiologists&amp;ndash;who are sometimes more interested in analyzing data than caring for patients&amp;ndash;but provides clinicians with tools to improve their patients&amp;rsquo; outcomes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="parallelism"&gt;Parallelism&lt;/h2&gt;
&lt;p&gt;It is often better&amp;ndash;in scientific writing&amp;ndash;to write pairs of ideas joined by &amp;lsquo;&amp;lsquo;and&amp;rsquo;&amp;rsquo;, &amp;lsquo;&amp;lsquo;or&amp;rsquo;&amp;rsquo;, or &amp;lsquo;&amp;lsquo;but&amp;rsquo;&amp;rsquo; in parallel form.&lt;/p&gt;
&lt;p&gt;Here is an example sentence with a list of things in parallel form: &amp;lsquo;&amp;lsquo;NASA&amp;rsquo;s intrepid Mars rover, Curiosity, has been through a lot in the past year. It flew 354 million miles, blasted through the Mars atmosphere, deployed a supersonic parachute, unfurled a giant sky crane, and touched down gently on the surface of Mars&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;h2 id="paragraph"&gt;Paragraph&lt;/h2&gt;
&lt;p&gt;There are several tips for writing paragraphs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 paragraph = 1 idea&lt;/li&gt;
&lt;li&gt;Give away the punch line early. Scientists like putting details, details, details, data, and conclusion, which is a nightmare for readers. Invert the order!&lt;/li&gt;
&lt;li&gt;Paragraph flow is helped by:
&lt;ul&gt;
&lt;li&gt;logical flow of ideas. Fewer pointers improve readability.&lt;/li&gt;
&lt;li&gt;parallel sentence structures&lt;/li&gt;
&lt;li&gt;&lt;em&gt;if necessary&lt;/em&gt;, transition words.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Readers remember the first and the last sentences best.&lt;/li&gt;
&lt;li&gt;Sequential in time&lt;/li&gt;
&lt;li&gt;From general to specific&lt;/li&gt;
&lt;li&gt;Logical arguments (if else)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="repetition"&gt;Repetition&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s ok to repeat a word. It&amp;rsquo;s important to ask yourself if the second instance of the word is necessary. If the word is needed, is a synonym really better than repeating the word? Using synonyms&amp;ndash;especially in scientific writing&amp;ndash;may lead readers to think you are referring to a different instrument, model, etc.&lt;/p&gt;
&lt;p&gt;Other notes, including &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/"&gt;Cut the Clutter&lt;/a&gt;, &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/"&gt;Verbs&lt;/a&gt;, and &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-writing-process/"&gt;Writing Process&lt;/a&gt;, are also available.&lt;/p&gt;</description></item><item><title>Writing in the Sciences - Verbs</title><link>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/</link><pubDate>Sat, 31 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/</guid><description>&lt;p&gt;This is an overview of the second chapter of &lt;a href="https://www.coursera.org/learn/sciwrite"&gt;Writing in the Sciences&lt;/a&gt; offered by &lt;a href="https://www.stanford.edu/"&gt;Stanford&lt;/a&gt;. This chapter focuses on writing with strong, active verbs. Lessons include how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;write in the active voice&lt;/li&gt;
&lt;li&gt;avoid turning verbs into nouns&lt;/li&gt;
&lt;li&gt;choose strong verbs&lt;/li&gt;
&lt;li&gt;get to the main verb of a sentence quickly&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="active-voice"&gt;Active Voice&lt;/h2&gt;
&lt;p&gt;There are three advantages of using active voice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emphasizes author responsibility&lt;/li&gt;
&lt;li&gt;Improves readability&lt;/li&gt;
&lt;li&gt;Reduces ambiguity&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="author-responsibility"&gt;Author responsibility&lt;/h3&gt;
&lt;p&gt;Here is an example sentence: &amp;lsquo;&amp;lsquo;No attempt &lt;u&gt;was made&lt;/u&gt; to contact non-responders because they &lt;u&gt;were deemed&lt;/u&gt; unimportant to the analysis&amp;rsquo;&amp;rsquo;. When we put it in the active voice, we get &amp;lsquo;&amp;rsquo;&lt;u&gt;We did not attempt to&lt;/u&gt; contact non-responders because &lt;u&gt;we deemed&lt;/u&gt; them unimportant to the analysis&amp;rsquo;&amp;rsquo;. The active voice version places more emphasis on the role of the authors in the decision making, subtly indicating human judgement and potential fallibility.&lt;/p&gt;
&lt;h3 id="readability"&gt;Readability&lt;/h3&gt;
&lt;p&gt;Putting sentences into active voice often leads us to be more direct. For example, putting &amp;lsquo;&amp;lsquo;a strong correlation was found between use of passive voice and other sins of writing&amp;rsquo;&amp;rsquo; into active voice yields &amp;lsquo;&amp;lsquo;We found a strong correlation between use of the passive voice and other sins of writing&amp;rsquo;&amp;rsquo;. Active voice tends to make sentences more &lt;strong&gt;direct&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="ambiguity"&gt;Ambiguity&lt;/h3&gt;
&lt;p&gt;The example sentence is: &amp;lsquo;&amp;lsquo;General dysfunction of the immune system at the leukocyte level &lt;u&gt;is suggested&lt;/u&gt; by both animal and human studies&amp;rsquo;&amp;rsquo;. Turning the sentence into active voice gives: &amp;lsquo;&amp;lsquo;Both human and animal studies suggest that &lt;u&gt;diabetics&lt;/u&gt; have general immune dysfunction at the leukocyte level&amp;rsquo;&amp;rsquo;. A sentence in the form of &lt;em&gt;agent - verb - recipient&lt;/em&gt; forces us to be more specific, thus reducing the ambiguity of the sentence.&lt;/p&gt;
&lt;p&gt;It is important to point out that passive voice may be appropriate in the methods section where what was done is more important than who did it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After all, human agents are responsible for
designing experiments, and they are present in the laboratory.
Writing awkward phrases to avoid admitting their responsibility and
their presence is an odd way of being objective.&lt;/p&gt;
&lt;p&gt;&amp;ndash; Jane J. Robinson, &lt;em&gt;Science&lt;/em&gt; 7 June 1957: 1160.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="write-with-verbs"&gt;Write with Verbs&lt;/h2&gt;
&lt;h3 id="verbs-with-embedded-meaning"&gt;Verbs with Embedded Meaning&lt;/h3&gt;
&lt;p&gt;For example, phrases like &amp;lsquo;&amp;lsquo;reports that approximately&amp;rsquo;&amp;rsquo; can be shortened to &amp;lsquo;&amp;rsquo;estimates&amp;rsquo;&amp;rsquo; with &amp;lsquo;&amp;lsquo;approximately&amp;rsquo;&amp;rsquo; as its embedded meaning. They can make a big difference in sentences.&lt;/p&gt;
&lt;h3 id="avoid-to-be-verbs"&gt;Avoid &amp;lsquo;&amp;rsquo;to be&amp;rsquo;&amp;rsquo; verbs&lt;/h3&gt;
&lt;p&gt;These verbs are rather boring. Replacing &amp;lsquo;&amp;rsquo;to be&amp;rsquo;&amp;rsquo; verbs with stronger ones makes the writing more engaging.&lt;/p&gt;
&lt;h3 id="dont-turn-verbs-into-nouns"&gt;Don&amp;rsquo;t Turn Verbs into Nouns&lt;/h3&gt;
&lt;p&gt;Nouns slow readers down because they lack action. Turning nouns back into verbs gives a clearer picture of what is going on. It has the bonus of avoiding ambiguity.&lt;/p&gt;
&lt;p&gt;Turning verbs into nouns sometimes leads to the use of &lt;em&gt;weaker&lt;/em&gt; verbs. For example, &amp;lsquo;&amp;lsquo;decide&amp;rsquo;&amp;rsquo; can be transformed into &amp;lsquo;&amp;lsquo;make a decision&amp;rsquo;&amp;rsquo;, where &amp;lsquo;&amp;lsquo;make&amp;rsquo;&amp;rsquo; is a much weaker verb than &amp;lsquo;&amp;lsquo;decide&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;h3 id="dont-bury-the-main-verb"&gt;Don&amp;rsquo;t Bury the Main Verb&lt;/h3&gt;
&lt;p&gt;The principle is to keep the predicate close to the subject. Here is a sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;lsquo;&amp;rsquo;&lt;u&gt;One study of 930 adults with multiple sclerosis (MS) receiving care in one of two managed care settings or in a fee-for-service setting&lt;/u&gt; &lt;u&gt;found that&lt;/u&gt; only two-thirds of those needing to contact a neurologist for an MS-related problem in the prior 6 months had done so&amp;rsquo;&amp;rsquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Readers struggle to understand the sentence due to the clutter between the subject and the predicate. Moving &amp;lsquo;&amp;lsquo;found&amp;rsquo;&amp;rsquo; to the front of the sentence gives us &amp;lsquo;&amp;lsquo;One study found that&amp;hellip;&amp;rsquo;&amp;rsquo;. Readers are less bothered by the descriptive details once they have the verb.&lt;/p&gt;
&lt;h2 id="example"&gt;Example&lt;/h2&gt;
&lt;p&gt;Here is a great example provided in the course:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Important studies to examine the descriptive epidemiology of autism, including the prevalence and changes in the characteristics of the population over time, have begun.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are multiple problems in this sentence: 1) the main verb appears at the very end of the sentence, while the main subject &amp;lsquo;&amp;lsquo;studies&amp;rsquo;&amp;rsquo; is placed at the beginning; 2) it contains fluff words like &amp;lsquo;&amp;lsquo;important&amp;rsquo;&amp;rsquo;; 3) it contains redundant phrases: &amp;lsquo;&amp;lsquo;changes&amp;rsquo;&amp;rsquo; almost always happen &amp;lsquo;&amp;lsquo;over time&amp;rsquo;&amp;rsquo;; 4) &amp;lsquo;&amp;lsquo;of the population&amp;rsquo;&amp;rsquo; sounds vague. After addressing those issues, the sentence becomes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Studies have begun to describe the epidemiology of autism, including recent changes in the disorder&amp;rsquo;s prevalence and characteristics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="grammar-tips"&gt;Grammar Tips&lt;/h2&gt;
&lt;p&gt;Data is/are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;lsquo;&amp;lsquo;Data&amp;rsquo;&amp;rsquo; is plural.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compared to/with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compared to: used to point out &lt;em&gt;similarities&lt;/em&gt; between two objects.&lt;/li&gt;
&lt;li&gt;Compared with: (used more often in science) used to point out &lt;em&gt;differences&lt;/em&gt; between similar things.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That/which:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;lsquo;&amp;lsquo;That&amp;rsquo;&amp;rsquo; is the restrictive (defining) pronoun (not set off by &lt;em&gt;commas&lt;/em&gt;). Eliminating the essential clause changes the meaning of the sentence.&lt;/li&gt;
&lt;li&gt;&amp;lsquo;&amp;lsquo;Which&amp;rsquo;&amp;rsquo; is the nonrestrictive (non-defining) pronoun (must be set off by commas). Eliminating the non-essential clause does not alter the basic meaning of the sentence.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Careful writers, watchful for small conveniences, go witch-hunting, remove the defining which-es, and by doing so improve their work.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;em&gt;Strunk and White&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Singular antecedents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not use &amp;lsquo;&amp;rsquo;they&amp;rsquo;&amp;rsquo; or &amp;lsquo;&amp;rsquo;their&amp;rsquo;&amp;rsquo; when the subject is singular. To avoid gender choice, turn to a plural.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other notes, including &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/"&gt;Cut the Clutter&lt;/a&gt;, &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-structure/"&gt;Structure&lt;/a&gt;, and &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-writing-process/"&gt;Writing Process&lt;/a&gt;, are also available.&lt;/p&gt;</description></item><item><title>Writing in the Sciences - Cut the Clutter</title><link>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/</link><pubDate>Fri, 30 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/writing-in-the-sciences-cut-the-clutter/</guid><description>&lt;p&gt;This is an overview of the first chapter of &lt;a href="https://www.coursera.org/learn/sciwrite"&gt;Writing in the Sciences&lt;/a&gt; offered by &lt;a href="https://www.stanford.edu/"&gt;Stanford&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The secret of good writing is to strip every sentence to its cleanest components.
Every word that serves no function, every long word that could be a short word,
every adverb that carries the same meaning that&amp;rsquo;s already in the verb,
every passive construction that leaves the reader unsure of who is doing what.
These are the thousand and one adulterants that weaken the strength of a sentence.
And they usually occur in proportion to the education and rank.&lt;/p&gt;
&lt;p&gt;&amp;ndash; William Zinsser in &lt;em&gt;On Writing Well,&lt;/em&gt; 1976&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="cutting-extra-words"&gt;Cutting Extra Words&lt;/h2&gt;
&lt;p&gt;Here are some common kinds of clutter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dead-weight words and phrases such as &amp;lsquo;&amp;lsquo;as it is well known&amp;rsquo;&amp;rsquo;, &amp;lsquo;&amp;lsquo;as it has been shown&amp;rsquo;&amp;rsquo;&lt;/li&gt;
&lt;li&gt;Empty words and phrases: &amp;lsquo;&amp;lsquo;important&amp;rsquo;&amp;rsquo;, &amp;lsquo;&amp;lsquo;methodologic&amp;rsquo;&amp;rsquo;, &amp;lsquo;&amp;lsquo;basic tenets of&amp;rsquo;&amp;rsquo;
&lt;ul&gt;
&lt;li&gt;Hedge words: appreciable changes. One may ask: &amp;lsquo;&amp;lsquo;what is an appreciable change?&amp;rsquo;&amp;rsquo; Hedge words introduce ambiguity, probability, or indecisiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Long words or phrases that could be short: a majority of -&amp;gt; most, a number of -&amp;gt; many, &amp;lsquo;&amp;rsquo;neonatal population&amp;rsquo;&amp;rsquo; -&amp;gt; &amp;lsquo;&amp;rsquo;newborns&amp;rsquo;&amp;rsquo;, etc.&lt;/li&gt;
&lt;li&gt;Unnecessary jargon and acronyms. No one wants to constantly look for what &amp;lsquo;&amp;lsquo;miR&amp;rsquo;&amp;rsquo; means&lt;/li&gt;
&lt;li&gt;Repetitive words or phrases: illustrate/demonstrate&lt;/li&gt;
&lt;li&gt;Adverbs: very, really, generally, basically&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;I have only made this letter rather
long because I have not had time to make it shorter&lt;/p&gt;
&lt;p&gt;&amp;ndash; Lettres provinciales, 16, Dec. 14, 1656&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="little-tricks"&gt;Little Tricks&lt;/h2&gt;
&lt;p&gt;Here are a few other small tricks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Get rid of negatives. The sentence usually becomes much clearer using the positive construction. &amp;lsquo;&amp;lsquo;Not honest -&amp;gt; dishonest&amp;rsquo;&amp;rsquo;, &amp;lsquo;&amp;lsquo;does not have -&amp;gt; lacks&amp;rsquo;&amp;rsquo;&lt;/li&gt;
&lt;li&gt;Eliminate superfluous uses of &amp;lsquo;&amp;rsquo;there is/are&amp;rsquo;&amp;rsquo;. For example, we can change the sentence &amp;lsquo;&amp;lsquo;There are few single genes that can cause autism in isolation&amp;rsquo;&amp;rsquo; to &amp;lsquo;&amp;lsquo;Few single genes cause autism in isolation&amp;rsquo;&amp;rsquo;.&lt;/li&gt;
&lt;li&gt;Omit needless prepositions. For example, &amp;lsquo;&amp;rsquo;that&amp;rsquo;&amp;rsquo; and &amp;lsquo;&amp;lsquo;on&amp;rsquo;&amp;rsquo; are often superfluous. Cutting them helps when an abstract has a word limit. For example, you can simplify &amp;lsquo;&amp;rsquo;they agreed that it was true&amp;rsquo;&amp;rsquo; to &amp;lsquo;&amp;rsquo;they agreed it was true&amp;rsquo;&amp;rsquo;.&lt;/li&gt;
&lt;li&gt;Use verbs rather than adjectives: protective for -&amp;gt; protect against.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="example"&gt;Example&lt;/h2&gt;
&lt;p&gt;Here is an example sentence: &amp;lsquo;&amp;lsquo;Clinical seizures have been estimated to occur in 0.5% to 2.3% of the neonatal populations&amp;rsquo;&amp;rsquo;. We can perform the first elimination: &amp;lsquo;&amp;lsquo;Clinical seizures &lt;del&gt;have been estimated to&lt;/del&gt; occur in 0.5% to 2.3% of the neonatal populations&amp;rsquo;&amp;rsquo;. The range of percentages presents possibilities of variance, making &amp;lsquo;&amp;rsquo;estimated&amp;rsquo;&amp;rsquo; unnecessary.&lt;/p&gt;
&lt;p&gt;At first glance, &amp;lsquo;&amp;rsquo;neonatal&amp;rsquo;&amp;rsquo; seems like an essential word. However, upon inspection, &amp;lsquo;&amp;rsquo;neonatal population&amp;rsquo;&amp;rsquo; is merely a fancy way of saying &amp;lsquo;&amp;rsquo;newborns&amp;rsquo;&amp;rsquo;. So the sentence can be stripped down to &amp;lsquo;&amp;lsquo;Clinical seizures occur in 0.5% to 2.3% of newborns&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;Other notes, including &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-verbs/"&gt;Verbs&lt;/a&gt;, &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-structure/"&gt;Structure&lt;/a&gt;, and &lt;a href="https://www.bodunhu.com/blog/posts/writing-in-the-sciences-writing-process/"&gt;Writing Process&lt;/a&gt;, are also available.&lt;/p&gt;</description></item><item><title>Unitary Matrix</title><link>https://www.bodunhu.com/blog/posts/unitary-matrix/</link><pubDate>Wed, 14 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/unitary-matrix/</guid><description>&lt;p&gt;Recently, I was trying to get the hang of quantum computing. I found that I had forgotten most of the linear algebra I learned in past semesters. So again, I decided to write it down in the hope that some of the knowledge here will stay in my memory a bit longer.&lt;/p&gt;
&lt;h2 id="general-single-qubit-gates"&gt;General single-qubit Gates&lt;/h2&gt;
&lt;p&gt;Trying to understand unitary matrices in the context of pure linear algebra is, I must admit, rather boring. Perhaps that is one reason why I brushed them off so quickly and so easily. However, explaining them in the context of quantum computing feels a lot more fun. Maybe it&amp;rsquo;s because I can associate a unitary matrix with a quantum gate, which is something a bit more concrete, or simply because the term &amp;lsquo;&amp;lsquo;quantum computing&amp;rsquo;&amp;rsquo; makes me sound smarter.&lt;/p&gt;
&lt;p&gt;Speaking of something concrete, here are two example unitary matrices: the NOT gate (\(X\)) and the &lt;a href="https://en.wikipedia.org/wiki/Quantum_logic_gate"&gt;Hadamard gate&lt;/a&gt; (\(H\)):&lt;/p&gt;
&lt;p&gt;\[
X =\begin{bmatrix}
0 &amp;amp; 1 \\ 1 &amp;amp; 0
\end{bmatrix}
;\
H = \frac{1}{\sqrt{2}} \begin{bmatrix}
1 &amp;amp; 1 \\ 1 &amp;amp; -1
\end{bmatrix}
\]&lt;/p&gt;
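&lt;p&gt;A matrix \(U\) is unitary if \(U^{\dagger}U = I\), where the adjoint \(U^{\dagger}\) is the conjugate transpose of \(U\). For example, take the Hadamard gate (\(H\)) and compute its adjoint \(H^{\dagger}\):&lt;/p&gt;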
&lt;p&gt;\[
H^{\dagger} = \begin{pmatrix} \begin{pmatrix}\frac{1}{\sqrt{2}} \begin{bmatrix}
1 &amp;amp; 1 \\ 1 &amp;amp; -1
\end{bmatrix} \end{pmatrix}^T \end{pmatrix}^{*}
\]&lt;/p&gt;
&lt;p&gt;We know the transpose of \(H\) is still \(H\), and taking the complex conjugate of \(H^T\) doesn&amp;rsquo;t do anything since \(H^T\) is a real matrix, so \(H^{\dagger} = H\). Multiplying out, we can verify that \(H^{\dagger}H = H^2 = I\).&lt;/p&gt;
&lt;p&gt;There are other single-qubit quantum gates such as the &lt;a href="https://en.wikipedia.org/wiki/Pauli_matrices"&gt;\(Y\) and \(Z\) matrices&lt;/a&gt; (Pauli matrices) introduced by physicist &lt;a href="https://en.wikipedia.org/wiki/Wolfgang_Pauli"&gt;Wolfgang Pauli&lt;/a&gt;. It&amp;rsquo;s a good exercise to verify they are also unitary matrices.&lt;/p&gt;
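&lt;p&gt;The exercise can also be checked numerically. The following is a minimal sketch in plain Python, not part of the original derivation: the helper names &lt;code&gt;dagger&lt;/code&gt;, &lt;code&gt;matmul&lt;/code&gt;, and &lt;code&gt;is_unitary&lt;/code&gt; are made up for illustration, and the shear matrix is an assumed counterexample. It tests whether the adjoint times the matrix equals the identity, up to floating-point tolerance.&lt;/p&gt;

```python
import cmath

def dagger(m):
    """Conjugate transpose (adjoint) of a square matrix stored as nested lists."""
    n = len(m)
    return [[m[j][i].conjugate() for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_unitary(m, tol=1e-12):
    """True if dagger(m) * m equals the identity up to the tolerance."""
    n = len(m)
    product = matmul(dagger(m), m)
    return all(
        cmath.isclose(product[i][j], 1 if i == j else 0, abs_tol=tol)
        for i in range(n)
        for j in range(n)
    )

s = 1 / 2 ** 0.5
X = [[0, 1], [1, 0]]            # NOT gate
H = [[s, s], [s, -s]]           # Hadamard gate
Y = [[0, -1j], [1j, 0]]         # Pauli Y
Z = [[1, 0], [0, -1]]           # Pauli Z
NOT_UNITARY = [[1, 1], [0, 1]]  # a shear, included only as a counterexample

for name, gate in [("X", X), ("H", H), ("Y", Y), ("Z", Z), ("shear", NOT_UNITARY)]:
    print(name, is_unitary(gate))
```

&lt;p&gt;The four gates pass the check while the shear does not, because a shear changes the length of some vectors.&lt;/p&gt;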
&lt;h2 id="what-does-it-mean-for-a-matrix-to-be-unitary"&gt;What does it mean for a matrix to be unitary&lt;/h2&gt;
&lt;p&gt;The most important property of unitary matrices is that they &lt;em&gt;preserve the length of inputs&lt;/em&gt;. That is, given a quantum state represented as a vector \(|\psi\rangle\), it must be that \( \left\lVert U|\psi\rangle \right\rVert = \left\lVert |\psi\rangle \right\rVert \).&lt;/p&gt;
&lt;p&gt;Proving that a unitary matrix is length-preserving is straightforward. We want to show that \( \left\lVert U |\psi\rangle \right\rVert_2 = \left\lVert |\psi\rangle \right\rVert_2 \):&lt;/p&gt;
&lt;p&gt;\[\begin{aligned} \left\lVert U |\psi\rangle \right\rVert_2^2 &amp;amp;= (U |\psi\rangle)^H(U |\psi\rangle) \\ &amp;amp;= |\psi\rangle^H U^H U |\psi\rangle \\ &amp;amp;=|\psi\rangle^H |\psi\rangle \\ &amp;amp;= \left\lVert |\psi\rangle \right\rVert_2^2 \end{aligned}\]&lt;/p&gt;
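&lt;p&gt;As a quick sanity check of this derivation, here is a small sketch in plain Python. The state &lt;code&gt;psi&lt;/code&gt; and the helper names &lt;code&gt;apply&lt;/code&gt; and &lt;code&gt;norm&lt;/code&gt; are assumptions chosen for illustration. It applies the Hadamard gate to a normalized state and compares the 2-norm before and after.&lt;/p&gt;

```python
import math

def apply(m, v):
    """Matrix-vector product for a matrix stored as nested lists."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def norm(v):
    """Vector 2-norm: the square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(c) ** 2 for c in v))

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]   # Hadamard gate
psi = [0.6, 0.8j]       # example state; 0.36 + 0.64 = 1, so its norm is 1

before = norm(psi)
after = norm(apply(H, psi))
print(before, after, math.isclose(before, after))
```

&lt;p&gt;The two norms agree up to floating-point rounding, as the proof predicts.&lt;/p&gt;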
&lt;h2 id="why-are-unitaries-the-only-matrices-that-preserve-length"&gt;Why are unitaries the only matrices that preserve length&lt;/h2&gt;
&lt;p&gt;Previously, we used the &lt;em&gt;ket&lt;/em&gt; notation for quantum state vectors. We can extend the two-dimensional quantum state vectors to more general vectors, and the properties of unitary matrices will still hold.&lt;/p&gt;
&lt;p&gt;Putting our question in formal terms, we want to show that if \(A \in \mathbb{C}^{m \times m}\) preserves length (\(\left\lVert A x \right\rVert_2 = \left\lVert x \right\rVert_{2}\ \forall x \in \mathbb{C}^m\)), then \(A\) is unitary.&lt;/p&gt;
&lt;p&gt;We first prove that \((Ax)^H(Ay) = x^Hy\) for all \(x\), \(y\) by considering that \( \left\lVert x - y \right\rVert_2^2 = \left\lVert A(x - y) \right\rVert_2^2 \). Then we will use the result to evaluate \(e_i^H A^HAe_j\).&lt;/p&gt;
&lt;p&gt;Let \(x\), \(y \in \mathbb{C}^m\). Applying the definition of the vector 2-norm (\(\left\lVert y \right\rVert_2^2 = y^Hy\)) to \( \left\lVert x - y \right\rVert_2^2 = \left\lVert A(x - y) \right\rVert_2^2 \) gives&lt;/p&gt;
&lt;p&gt;\[
(x-y)^H(x-y) = (A(x-y))^HA(x-y)
\]&lt;/p&gt;
&lt;p&gt;Using the Hermitian transpose rule \((Ax)^H = x^HA^H\), we get&lt;/p&gt;
&lt;p&gt;\[
(x-y)^H(x-y) = (x-y)^HA^HA(x-y)
\]&lt;/p&gt;
&lt;p&gt;Multiplying the above formula out,&lt;/p&gt;
&lt;p&gt;\[
\begin{align}
x^Hx - y^Hx - x^Hy + y^Hy &amp;amp;= x^HA^HAx - y^HA^HAx \\
&amp;amp;\quad - x^HA^HAy + y^HA^HAy
\end{align}
\]&lt;/p&gt;
&lt;p&gt;Since \(y^Hx = \overline{x^Hy}\), and likewise \(y^HA^HAx = \overline{x^HA^HAy}\), we can rewrite this as&lt;/p&gt;
&lt;p&gt;\[
\begin{align}
x^Hx - (\overline{x^Hy} + x^Hy) + y^Hy &amp;amp;= x^HA^HAx - (\overline{x^HA^HAy} + x^HA^HAy) \\
&amp;amp;\quad + y^HA^HAy
\end{align}
\]&lt;/p&gt;
&lt;p&gt;Because \(A\) preserves length, \(x^Hx = x^HA^HAx\) and \(y^Hy = y^HA^HAy\), so those terms cancel. Using \(\frac{\alpha + \overline{\alpha}}{2} = Re(\alpha)\), what remains simplifies to:&lt;/p&gt;
&lt;p&gt;\[
Re(x^Hy) = Re((Ax)^H(Ay))
\]&lt;/p&gt;
&lt;p&gt;It remains to show that \(A^HA = I\). We use the fact that the standard basis vectors have the property that&lt;/p&gt;
&lt;p&gt;\[
\begin{equation}
e_i^H e_j = \begin{cases} 1 &amp;amp; \text{if \(i = j\)}\\ 0 &amp;amp; \text{otherwise} \end{cases}
\end{equation}
\]&lt;/p&gt;
&lt;p&gt;Therefore, \(e_i^H M e_j\) extracts the \((i,\ j)\)th entry of matrix \(M\). Since \(A\) preserves length, we know that&lt;/p&gt;
&lt;p&gt;\[
e_i^H A^HA e_i = \left\lVert Ae_i \right\rVert^2 = \left\lVert e_i \right\rVert^2 = 1
\]&lt;/p&gt;
&lt;p&gt;We can conclude that all the diagonal elements of \(A^HA\) are \(1\).&lt;/p&gt;
&lt;p&gt;A side question remains: how do we prove that all the off-diagonal elements in \(A^HA\) are \(0\)? It turns out to be straightforward to illustrate the process if we go back to the ket notation for quantum state vectors.&lt;/p&gt;
&lt;p&gt;Suppose \(|\psi\rangle = |e_i\rangle + |e_j\rangle\) with \(i \neq j\). We already know that \(\left\lVert A |\psi\rangle \right\rVert^2 = \left\lVert |\psi\rangle \right\rVert^2 = 1 + 1 = 2\). Expanding \(\left\lVert A |\psi\rangle \right\rVert^2\) gives \(1 + e_i^H A^HA e_j + e_j^H A^HA e_i + 1\), so \(e_i^H A^HA e_j + e_j^H A^HA e_i = 0\).&lt;/p&gt;
&lt;p&gt;Then, suppose instead \(|\psi\rangle = |e_i\rangle + \mathrm{i}|e_j\rangle\), where \(\mathrm{i}\) is the imaginary unit. Following the same process, we get \(e_i^H A^HA e_j - e_j^H A^HA e_i = 0\). Combining this with the fact that \(e_i^H A^HA e_j + e_j^H A^HA e_i = 0\), both terms must be zero, which proves that the off-diagonal elements in \(A^HA\) are all \(0\). We can extend the vector \(\psi\) to higher-dimensional vectors and the proof will be similar.&lt;/p&gt;</description></item><item><title>BGP in a Nutshell</title><link>https://www.bodunhu.com/blog/posts/bgp-in-a-nutshell/</link><pubDate>Tue, 06 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/bgp-in-a-nutshell/</guid><description>&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol"&gt;Border Gateway Protocol (BGP)&lt;/a&gt; has a very simple purpose: choose the fastest and most efficient route to deliver a message from one &lt;a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"&gt;autonomous system (AS)&lt;/a&gt; to another. In layman&amp;rsquo;s terms, BGP is the GPS for the internet. Much of the content here is credited to Prof. &lt;a href="https://www.cs.utexas.edu/~gouda/"&gt;Mohamed G. Gouda&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In a nutshell, BGP informs each router \(R\) how to route packets to an IP prefix \(pf\) (i.e., a block of IP addresses) that is used in \(AS_i\), an AS different from \(AS_j\), where \(R\) is located:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.cs.utexas.edu/~gouda/CS%20356%20slides/CS%20356%20slides%20(53-86)/slides-53-86.pdf"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/how-bgp-works/bgp.png#center" alt="bgp-in-a-nutshell"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;BGP consists of two parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol"&gt;external BGP (eBGP)&lt;/a&gt;: informs each gateway&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol"&gt;internal BGP (iBGP)&lt;/a&gt;: informs each non-gateway router&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Gateway: A &lt;strong&gt;gateway&lt;/strong&gt; is defined as a router that is connected to computers in two or more ASes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Abstractly, each router has a BGP routing table in the form of:&lt;/p&gt;
&lt;p&gt;\[(\text{prefix in another AS},\ \text{best ngh (next gateway hop) to reach prefix})\]&lt;/p&gt;
&lt;h2 id="ebgp"&gt;eBGP&lt;/h2&gt;
&lt;p&gt;First we will go over eBGP. We know BGP uses &lt;a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol"&gt;TCP&lt;/a&gt; to send messages and eBGP is no exception. The TCP connection exists between:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;each two gateways in the same AS, and&lt;/li&gt;
&lt;li&gt;each two &amp;lsquo;&amp;lsquo;adjacent&amp;rsquo;&amp;rsquo; gateways in different ASes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These gateway pairs send &lt;strong&gt;route advertisements&lt;/strong&gt; in the following form (represented as a tuple):&lt;/p&gt;
&lt;p&gt;\[(prefix,\ AS\text{-}path,\ next\text{-}hop)\]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The BGP next-hop attribute is the next hop IP address that is going to be used to reach a certain destination.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here is an example illustrating how one AS lets other ASes know the route leading to itself. Assume we have three ASes. AS1 wants other ASes to know how to reach itself:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.cs.utexas.edu/~gouda/CS%20356%20slides/CS%20356%20slides%20(53-86)/slides-53-86.pdf"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/how-bgp-works/route-adv.jpg#center" alt="eBGP"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;To advertise the path to reach itself, AS1 sends the first route advertisement message (1) as:&lt;/p&gt;
&lt;p&gt;\[
(pf,\ (AS1),\ x)
\]&lt;/p&gt;
&lt;p&gt;When AS2 receives the message from AS1, it updates the AS-path by appending itself to the path list. It also needs to update the next-hop attribute, because an incoming message needs to find the best address to reach AS2 before reaching AS1. AS2 then broadcasts the message (2) as:&lt;/p&gt;
&lt;p&gt;\[(pf,\ (AS1, AS2),\ y)\]&lt;/p&gt;
&lt;p&gt;Each gateway \(A_i\) or \(B_j\) will also add an entry to its BGP routing table, thus each gateway in the picture will have its routing table like:
\(A_2:\ (pf,\ B_1)\), \(B_2:\ (pf,\ \text{ngh to }x)\), \(A_3:\ (pf,\ B_2)\).&lt;/p&gt;
&lt;p&gt;Notice \(B_2\) doesn&amp;rsquo;t have an explicit next-gateway-hop in its routing table. This is because so far we&amp;rsquo;ve only covered eBGP which works across different ASes. For a message to go from \(B_2\) to \(x\), it must go through routers inside AS2 internally, which brings up internal BGP (iBGP).&lt;/p&gt;
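&lt;p&gt;The two advertisement messages in the example can be sketched in a few lines of Python. This is only an illustration of the tuple update described in this post, not a BGP implementation: the class name &lt;code&gt;RouteAdvertisement&lt;/code&gt; and the function &lt;code&gt;readvertise&lt;/code&gt; are hypothetical, and the loop check (dropping an advertisement whose AS-path already contains the receiving AS) is an added assumption that the example itself does not need.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteAdvertisement:
    """The (prefix, AS-path, next-hop) tuple exchanged between gateways."""
    prefix: str
    as_path: tuple
    next_hop: str

def readvertise(adv, own_as, own_next_hop):
    """Pass an advertisement on: append own AS to the path, rewrite the next hop."""
    if own_as in adv.as_path:
        # Assumption for illustration: ignore routes that already passed through us.
        return None
    return RouteAdvertisement(adv.prefix, adv.as_path + (own_as,), own_next_hop)

# Message (1): AS1 announces its own prefix with next hop x.
msg1 = RouteAdvertisement("pf", ("AS1",), "x")

# Message (2): AS2 appends itself and advertises next hop y.
msg2 = readvertise(msg1, "AS2", "y")
print(msg1)
print(msg2)
```

&lt;p&gt;The prefix is carried through unchanged; only the AS-path grows and the next hop is replaced at each AS boundary.&lt;/p&gt;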
&lt;h2 id="ibgp"&gt;iBGP&lt;/h2&gt;
&lt;p&gt;For iBGP, there is a TCP connection between each two routers in the same AS, given that only one of them is a gateway.&lt;/p&gt;
&lt;p&gt;Here is an example. Suppose the gateway \(A\) in \(AS_j\) needs to broadcast its routing information to other routers in \(AS_j\):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/how-bgp-works/iBGP.png#center" alt="iBGP"&gt;&lt;/p&gt;
&lt;p&gt;The normal eBGP advertisement will be something like \((pf,\ (AS_1,\ \dots,\ AS_j),\ x)\).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For iBGP, the protocol states that the next hop advertised by eBGP should be carried into iBGP. Therefore, for all routers \(C_1,\ \dots,\ C_n\), the next hop to reach \(pf\) will be \(x\), which is provided by \(AS_i\). It&amp;rsquo;s your job to make sure \(x\) is reachable via &lt;a href="https://en.wikipedia.org/wiki/Interior_gateway_protocol"&gt;IGP&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For the iBGP advertisement, it will only be \((pf,\ x)\). Intuitively, this makes sense, because we only need to worry about the next best hop to reach \(pf\) without worrying about changing the AS. All the routers (\(C_1,\ \dots,\ C_n\)) receiving the advertisement from \(A\) will add the information to their routing table as:&lt;/p&gt;
&lt;p&gt;\[
(pf,\ \text{best ngh to }x)
\]&lt;/p&gt;
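&lt;p&gt;A rough sketch of this step in Python, under stated assumptions: the function names &lt;code&gt;to_ibgp&lt;/code&gt; and &lt;code&gt;install&lt;/code&gt; are hypothetical, and the IGP lookup tables below are made-up data standing in for whatever the AS&amp;rsquo;s IGP has computed. The gateway strips the AS-path but carries the next hop \(x\) unchanged, and each internal router records its best ngh toward \(x\).&lt;/p&gt;

```python
def to_ibgp(ebgp_adv):
    """Turn an eBGP tuple (prefix, as_path, next_hop) into an iBGP tuple (prefix, next_hop)."""
    prefix, _as_path, next_hop = ebgp_adv
    return (prefix, next_hop)  # the next hop is carried over unchanged

def install(bgp_table, igp_best_ngh, ibgp_adv):
    """Record (prefix, best ngh to the advertised next hop) in a router's BGP table."""
    prefix, next_hop = ibgp_adv
    bgp_table[prefix] = igp_best_ngh[next_hop]
    return bgp_table

# eBGP advertisement received by gateway A.
ebgp_adv = ("pf", ("AS_1", "AS_j"), "x")
ibgp_adv = to_ibgp(ebgp_adv)

# Hypothetical IGP results: for each internal router, the best ngh toward x.
igp_tables = {"C1": {"x": "A"}, "C2": {"x": "A"}}

bgp_tables = {name: install({}, igp, ibgp_adv) for name, igp in igp_tables.items()}
print(ibgp_adv)
print(bgp_tables)
```

&lt;p&gt;The lookup fails with a &lt;code&gt;KeyError&lt;/code&gt; if \(x\) is missing from a router&amp;rsquo;s IGP table, which mirrors the requirement that \(x\) be reachable via IGP.&lt;/p&gt;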
&lt;p&gt;That is it. There&amp;rsquo;s really nothing difficult about BGP in general.&lt;/p&gt;</description></item><item><title>From Autotools to CMake</title><link>https://www.bodunhu.com/blog/posts/from-autotools-to-cmake/</link><pubDate>Mon, 21 Jun 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/from-autotools-to-cmake/</guid><description>&lt;p&gt;Since my &lt;a href="https://ieeexplore.ieee.org/document/9238617"&gt;paper on GPU benchmarking&lt;/a&gt; was published, every once in a while, I have gotten emails asking me why &lt;a href="https://github.com/utcs-scea/altis"&gt;Altis&lt;/a&gt; doesn&amp;rsquo;t build on the sender&amp;rsquo;s platform. It almost always has something to do with a small &lt;a href="https://github.com/utcs-scea/altis/blob/deprecated/config/find_cuda_libs.sh"&gt;script&lt;/a&gt; that is responsible for finding CUDA dependencies. This script is invoked every single time &lt;code&gt;make&lt;/code&gt; is executed. For some reason, the regular expression in the script sometimes breaks, depending on the Linux distro, the kernel version, the host architecture, or even the CUDA version. After enough requests piled up in my inbox, I decided enough was enough and it was time to ditch the autotools shenanigans for CMake.&lt;/p&gt;
&lt;p&gt;When I started this project, it already had a skeleton build system based on autotools. The old build system generated automake and autoconf files. It worked fine for me on the server I used, so I never bothered to make big adjustments. However, problems soon arose when I upgraded some packages or switched to another server in our lab.&lt;/p&gt;
&lt;p&gt;Because our servers are constantly being used for GPU research, the execution environment is constantly changing. Sometimes the benchmark suite would build in the morning and stop working in the afternoon because of a CUDA downgrade. My directories were filled with strange files like &lt;code&gt;configure.ac&lt;/code&gt;, &lt;code&gt;Makefile.am&lt;/code&gt;, &lt;code&gt;Makefile.in&lt;/code&gt;, &amp;hellip; Autotools also relies on a macro language called M4, which I still don&amp;rsquo;t quite understand. Hand-made shell scripts are everywhere. automake has a million versions and you&amp;rsquo;ll never know why it doesn&amp;rsquo;t build on someone else&amp;rsquo;s OS. Getting OptiX to work is like having constipation because the LD flag doesn&amp;rsquo;t get placed in the right location.&lt;/p&gt;
&lt;p&gt;Switching to CMake was a lot smoother than I anticipated. It took me about two days to rewrite the entire build system from scratch. Compared with autotools, CMake&amp;rsquo;s syntax is easier to learn. The terminal output is easier to read by default, instead of throwing every single detail in my face. And, it&amp;rsquo;s colored! Debugging build issues no longer takes multiple hours.&lt;/p&gt;
&lt;p&gt;Perhaps the best part is that, since CMake 3.8, CUDA is natively supported. Compiling &lt;code&gt;.cu&lt;/code&gt; files is as easy as adding &lt;code&gt;CUDA&lt;/code&gt; to the project&amp;rsquo;s LANGUAGES. That alone should be reason enough to use CMake if there&amp;rsquo;s anything CUDA related. The only caveat is that there&amp;rsquo;s a small difference in how CMake handles CUDA architecture flags in different versions. &lt;code&gt;CMAKE_CUDA_ARCHITECTURES&lt;/code&gt; was introduced in CMake 3.18. Older versions still require &lt;code&gt;find_package(CUDA)&lt;/code&gt; to set CUDA compilation flags though.&lt;/p&gt;</description></item><item><title>How SAT Solver works</title><link>https://www.bodunhu.com/blog/posts/how-sat-solver-works/</link><pubDate>Fri, 21 May 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/how-sat-solver-works/</guid><description>&lt;p&gt;This is a summary of the high-level design of a &lt;a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem"&gt;SAT solver&lt;/a&gt; covered in Prof. &lt;a href="https://www.cs.utexas.edu/~isil/"&gt;Dillig&lt;/a&gt;&amp;rsquo;s &lt;a href="https://www.cs.utexas.edu/~isil/cs389L/"&gt;Automated Logical Reasoning&lt;/a&gt; class. It&amp;rsquo;s meant to cover the basic steps towards determining whether a given boolean formula is satisfiable or not.&lt;/p&gt;
&lt;h2 id="convert-to-nnf"&gt;Convert to NNF&lt;/h2&gt;
&lt;p&gt;The first step in a SAT solver is to convert a given boolean formula to &lt;a href="https://en.wikipedia.org/wiki/Negation_normal_form"&gt;Negation Normal Form (NNF)&lt;/a&gt;. A normal form of a formula \(F\) is another formula \(F'\) that is equivalent to \(F\) but obeys certain syntactic restrictions. NNF has two syntactic restrictions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The only logical connectives are \(\neg\), \(\land\), and \(\lor\).&lt;/li&gt;
&lt;li&gt;Negation appears only in literals. (i.e., no \(\neg(a \land b)\)).&lt;/li&gt;
&lt;/ul&gt;
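&lt;p&gt;One way to reach NNF is to rewrite implications and then push negations inward with De Morgan&amp;rsquo;s laws. The following is a minimal sketch, not the course&amp;rsquo;s implementation: the tuple encoding of formulas (strings for variables, tuples tagged &lt;code&gt;"not"&lt;/code&gt;, &lt;code&gt;"and"&lt;/code&gt;, &lt;code&gt;"or"&lt;/code&gt;, &lt;code&gt;"implies"&lt;/code&gt;) and the function name &lt;code&gt;to_nnf&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

```python
def to_nnf(f):
    """Convert a formula to negation normal form.

    Formulas are variables (strings) or tuples:
    ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g).
    """
    if isinstance(f, str):
        return f
    op = f[0]
    if op == "implies":
        # a implies b  is  (not a) or b
        return to_nnf(("or", ("not", f[1]), f[2]))
    if op in ("and", "or"):
        return (op, to_nnf(f[1]), to_nnf(f[2]))
    # op == "not": push the negation toward the literals
    g = f[1]
    if isinstance(g, str):
        return f                      # a negated variable is already a literal
    if g[0] == "not":
        return to_nnf(g[1])           # double negation cancels
    if g[0] == "and":                 # De Morgan
        return ("or", to_nnf(("not", g[1])), to_nnf(("not", g[2])))
    if g[0] == "or":                  # De Morgan
        return ("and", to_nnf(("not", g[1])), to_nnf(("not", g[2])))
    # g[0] == "implies": not (a implies b)  is  a and (not b)
    return ("and", to_nnf(g[1]), to_nnf(("not", g[2])))

print(to_nnf(("not", ("and", "a", "b"))))
print(to_nnf(("implies", "a", ("not", ("or", "b", "c")))))
```

&lt;p&gt;In the result, only \(\neg\), \(\land\), and \(\lor\) remain, and every negation sits directly on a variable.&lt;/p&gt;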
&lt;h2 id="why-not-dnf"&gt;Why not DNF?&lt;/h2&gt;
&lt;p&gt;A formula in &lt;a href="https://en.wikipedia.org/wiki/Disjunctive_normal_form"&gt;disjunctive normal form (DNF)&lt;/a&gt; is a disjunction of conjunctions of literals. It can be expressed as:&lt;/p&gt;
&lt;p&gt;\[
\begin{equation}
\bigvee_i \bigwedge_j l_{i,j} \mbox{ for literals }l_{i,j} \tag{1}
\end{equation}
\]&lt;/p&gt;
&lt;p&gt;That formula states that \(\lor\) can never appear inside \(\land\) or \(\neg\). If we take a look at a formula in DNF, we might claim it&amp;rsquo;s trivial to determine its satisfiability. This is because, if we can find one conjunction whose literals do not contradict each other (it never contains both \(p\) and \(\neg p\)), then due to the nature of \(\lor\), the whole formula is satisfiable.&lt;/p&gt;
&lt;p&gt;In practice, this doesn&amp;rsquo;t work, because DNF conversion causes an exponential blow-up in size. Take the formula \((F_1 \lor F_2) \land (F_3 \lor F_4)\) for example: its DNF becomes \((F_1 \land F_3) \lor (F_1 \land F_4) \lor (F_2 \land F_3) \lor (F_2 \land F_4)\). The main problem is that distributing \(\land\) over \(\lor\) introduces more elements to the formula.&lt;/p&gt;
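&lt;p&gt;To get a feel for the blow-up, here is a minimal Python sketch (my own illustration, not from the class). It assumes the input is a conjunction of disjunctions such as \((F_1 \lor F_2) \land (F_3 \lor F_4)\), represented as a list of tuples of literal names. The helper name &lt;code&gt;cnf_to_dnf&lt;/code&gt; is made up for this example.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from itertools import product


def cnf_to_dnf(clauses):
    """Distribute AND over OR: each DNF term picks one literal from every clause."""
    return [tuple(choice) for choice in product(*clauses)]


# (F1 or F2) and (F3 or F4): two clauses of two literals give 2**2 = 4 terms
small = [("F1", "F2"), ("F3", "F4")]
dnf_small = cnf_to_dnf(small)

# n clauses of two literals each give 2**n terms
sizes = [
    len(cnf_to_dnf([("a%d" % i, "b%d" % i) for i in range(n)]))
    for n in range(1, 11)
]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With only ten two-literal clauses, &lt;code&gt;sizes&lt;/code&gt; already ends at 1024 terms, and every extra clause doubles it.&lt;/p&gt;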
&lt;h2 id="cnf"&gt;CNF&lt;/h2&gt;
&lt;p&gt;The solution is to convert from NNF to &lt;a href="https://en.wikipedia.org/wiki/Conjunctive_normal_form"&gt;conjunctive normal form (CNF)&lt;/a&gt;. CNF is a conjunction of disjunctions of literals:&lt;/p&gt;
&lt;p&gt;\[
\begin{equation}
\bigwedge_i \bigvee_j l_{i,j} \mbox{ for literals }l_{i,j} \tag{2}
\end{equation}
\]&lt;/p&gt;
&lt;p&gt;This says that \(\land\) is not allowed inside \(\lor\) or \(\neg\). Each disjunction of literals is called a clause.&lt;/p&gt;
&lt;h3 id="cnf-vs-dnf"&gt;CNF vs DNF&lt;/h3&gt;
&lt;p&gt;Unlike DNF, it is not trivial to determine the satisfiability of a formula in CNF. Moreover, a direct conversion to CNF is just as expensive as a conversion to DNF. So why do most solvers convert to CNF although it&amp;rsquo;s easier to determine satisfiability in DNF?&lt;/p&gt;
&lt;h2 id="tseitins-transformation"&gt;Tseitin&amp;rsquo;s Transformation&lt;/h2&gt;
&lt;p&gt;The answer is &lt;a href="https://en.wikipedia.org/wiki/Tseytin_transformation"&gt;Tseitin&amp;rsquo;s Transformation&lt;/a&gt;. The most important thing about Tseitin&amp;rsquo;s Transformation is that &lt;strong&gt;it converts a formula \(F\) to an equisatisfiable formula \(F&amp;rsquo;\) in CNF with only a linear increase in size&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Tseitin&amp;rsquo;s Transformation has three major steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introduce a new variable \(p_G\) for every subformula \(G\) of \(F\).&lt;/li&gt;
&lt;li&gt;For each subformula \(G : G_1 \circ G_2\), stipulate that \(p_G\) is a representative of \(G\), i.e., that \(p_G \leftrightarrow p_{G_1} \circ p_{G_2}\).&lt;/li&gt;
&lt;li&gt;Convert \(p_G \leftrightarrow p_{G_1} \circ p_{G_2}\) to CNF.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Eventually, we obtain a new formula, where \(S_F\) denotes the set of subformulas of \(F\):&lt;/p&gt;
&lt;p&gt;\[
\begin{equation}
p_F \land \bigwedge_{(G = G_1 \circ G_2) \in S_F} CNF(p_G \leftrightarrow p_{G_1} \circ p_{G_2}) \tag{3}
\end{equation}
\]&lt;/p&gt;
&lt;p&gt;More precisely, the size of the resulting formula is bounded by \(30n + 2\), where \(n\) is the size of the original formula.&lt;/p&gt;
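&lt;p&gt;The three steps can be sketched in a few lines of Python. This is my own minimal illustration under some assumptions, not the version from the class: formulas are nested tuples such as &lt;code&gt;("and", f, g)&lt;/code&gt;, &lt;code&gt;("or", f, g)&lt;/code&gt; and &lt;code&gt;("not", f)&lt;/code&gt; over string atoms, literals are non-zero integers (a negative number is a negated variable), and the name &lt;code&gt;tseitin&lt;/code&gt; is made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from itertools import count


def tseitin(formula):
    """Return (root_variable, clauses, names) for an equisatisfiable CNF."""
    fresh = count(1)
    names = {}    # subformula to variable number
    clauses = []  # each clause is a list of integer literals

    def encode(f):
        if f in names:              # a shared subformula reuses its variable
            return names[f]
        p = next(fresh)             # step 1: a new variable for this subformula
        names[f] = p
        if isinstance(f, str):      # atoms need no defining clauses
            return p
        op = f[0]
        if op == "not":
            a = encode(f[1])
            # steps 2 and 3: CNF of (p iff not a)
            clauses.extend([[-p, -a], [p, a]])
        elif op == "and":
            a, b = encode(f[1]), encode(f[2])
            # CNF of (p iff (a and b))
            clauses.extend([[-p, a], [-p, b], [p, -a, -b]])
        elif op == "or":
            a, b = encode(f[1]), encode(f[2])
            # CNF of (p iff (a or b))
            clauses.extend([[-p, a, b], [p, -a], [p, -b]])
        else:
            raise ValueError("unknown connective: %r" % (op,))
        return p

    root = encode(formula)
    clauses.append([root])          # finally assert the whole formula
    return root, clauses, names


# (x and y) or (not x)
root, clauses, names = tseitin(("or", ("and", "x", "y"), ("not", "x")))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For this example the sketch produces nine clauses over five variables. Every connective contributes a constant number of clauses, which is where the linear bound comes from.&lt;/p&gt;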
&lt;h2 id="dpll"&gt;DPLL&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/DPLL_algorithm"&gt;Davis-Putnam-Logemann-Loveland (DPLL)&lt;/a&gt; algorithm can be expressed as:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/SAT_solver/dpll.png#center" alt="DPLL"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;BCP&lt;/em&gt; stands for &lt;a href="https://en.wikipedia.org/wiki/Unit_propagation"&gt;boolean constraint propagation&lt;/a&gt;. It applies whenever one of the clauses is a unit clause, i.e., a clause with a single literal. Performing BCP (or unit resolution) is the same as replacing that literal with true in the remaining clauses: clauses containing the literal are satisfied and disappear, and the negated literal is removed from the others.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;choose_variable&lt;/em&gt; can be implemented with various heuristics. For now, we can assume a variable is picked at random.&lt;/p&gt;
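&lt;p&gt;Here is a rough Python sketch of the basic loop, written by me as an illustration rather than taken from the figure. It assumes clauses are lists of non-zero integers (a negative number is a negated variable), and &lt;code&gt;choose_variable&lt;/code&gt; is replaced by simply taking the first variable of the first clause. The function names &lt;code&gt;bcp&lt;/code&gt; and &lt;code&gt;dpll&lt;/code&gt; are my own.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def bcp(clauses, assignment):
    """Unit propagation. Returns the simplified clauses, or None on a conflict."""
    clauses = [list(c) for c in clauses]
    if any(len(c) == 0 for c in clauses):
        return None                       # an empty clause can never be satisfied
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            return clauses                # no unit clause left
        assignment[abs(unit)] = unit == abs(unit)   # True for a positive literal
        simplified = []
        for c in clauses:
            if unit in c:
                continue                  # clause is satisfied, drop it
            rest = [lit for lit in c if lit != -unit]
            if not rest:
                return None               # clause became empty: conflict
            simplified.append(rest)
        clauses = simplified


def dpll(clauses, assignment=None):
    """Return a satisfying assignment (variable to bool) or None if unsatisfiable.

    Variables missing from the result can take either value.
    """
    assignment = dict(assignment or {})
    clauses = bcp(clauses, assignment)
    if clauses is None:
        return None
    if not clauses:
        return assignment                 # every clause is satisfied
    var = abs(clauses[0][0])              # stand-in for choose_variable
    for literal in (var, -var):           # try var = true, then var = false
        result = dpll(clauses + [[literal]], assignment)
        if result is not None:
            return result
    return None


model = dpll([[1, 2], [-1, 3], [-2, -3]])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Deciding a variable is done by adding it as a unit clause and letting BCP do the rest, which keeps the sketch short. Real solvers use far more careful data structures.&lt;/p&gt;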
&lt;p&gt;The red part is an optimization to the original DPLL: &lt;a href="https://en.wikipedia.org/wiki/DPLL_algorithm"&gt;Pure Literal Propagation (PLP)&lt;/a&gt;. If a variable \(p\) appears &lt;strong&gt;only&lt;/strong&gt; in the form of \(p\), or only in the form of \(\neg p\), in the entire formula, we set \(p\) to true or false respectively, so that every clause containing it is satisfied.&lt;/p&gt;</description></item><item><title>Experience on Dafny Programming</title><link>https://www.bodunhu.com/blog/posts/experience-on-dafny-programming/</link><pubDate>Sun, 16 May 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/experience-on-dafny-programming/</guid><description>&lt;p&gt;Because of Professor &lt;a href="https://www.cs.utexas.edu/~isil/"&gt;Dillig&lt;/a&gt;&amp;rsquo;s &lt;a href="https://www.cs.utexas.edu/~isil/cs389L/"&gt;class&lt;/a&gt;, I finally got the chance to try out &lt;a href="https://en.wikipedia.org/wiki/Dafny"&gt;Dafny&lt;/a&gt;, a language made by &lt;a href="https://www.microsoft.com/en-us/research/"&gt;Microsoft Research&lt;/a&gt;, with built-in support for formal specification through &lt;a href="https://en.wikipedia.org/wiki/Precondition"&gt;preconditions&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Postcondition"&gt;postconditions&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Loop_invariant"&gt;loop invariants&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Loop_variant"&gt;loop variants&lt;/a&gt;. I often wonder: if we wrote programs in a verification language, would there be far fewer bugs, and would it make our lives much easier than sitting in front of a screen for hours grinding at bugs? Here are some thoughts I want to put down before they are gone from my head.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://raw.githubusercontent.com/dafny-lang/dafny/master/dafny-banner.png#center" alt="Dafny-programming-language"&gt;&lt;/p&gt;
&lt;p&gt;Here is a simple example illustrating some basic concepts in Dafny. Suppose we want a function that reverses a sequence. For example, we want to return &lt;code&gt;[c, b, a]&lt;/code&gt; when given &lt;code&gt;[a, b, c]&lt;/code&gt;. The recursive implementation would look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;function reverse&amp;lt;T&amp;gt;(s : seq&amp;lt;T&amp;gt;) : seq&amp;lt;T&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; if (|s| == 0) then s
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; else reverse(s[1..]) + [s[0]]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This looks correct, right? We have a termination condition for the recursion, and the implementation is fairly straightforward. Nothing seems wrong. If the function is correct, it would suggest that the lemma&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;lemma reverseLemma&amp;lt;T&amp;gt;(s : seq&amp;lt;T&amp;gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ensures s == reverse(reverse(s));
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;must hold. If we simply reverse a sequence twice, the result should be the same as the original input. However, Dafny complains that a postcondition might not hold. This is really strange. If we simply eyeball the implementation of &lt;code&gt;reverse&lt;/code&gt;, it&amp;rsquo;s hard to imagine what could possibly go wrong.&lt;/p&gt;
&lt;p&gt;One thing you should constantly keep in mind is that the &lt;strong&gt;Dafny compiler doesn&amp;rsquo;t really understand the implementation of the code&lt;/strong&gt;. It instead uses &lt;em&gt;specification&lt;/em&gt; to reason about the correctness of the program.&lt;/p&gt;
&lt;p&gt;Imagine we are given a function called &lt;code&gt;func1&lt;/code&gt;, and we know only its signature and return type. Then we are told that calling this function twice, starting from a given input \(s\), is going to produce an output \(k\) matching \(s\) exactly. Why should we trust this claim if the function &lt;code&gt;func1&lt;/code&gt; is a black box? How do we even know that calling &lt;code&gt;func1&lt;/code&gt; with input \(s\) terminates?&lt;/p&gt;
&lt;p&gt;The solution is to annotate the function with certain properties so that the compiler knows what condition might hold before and after a function is executed.&lt;/p&gt;
&lt;p&gt;For example, &lt;code&gt;reverse&lt;/code&gt; takes a sequence with length greater than or equal to 0. Otherwise, it wouldn&amp;rsquo;t make sense to reverse a sequence with negative length. We could write this requirement as &lt;code&gt;requires |s| &amp;gt;= 0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We also need to claim that the length of the output of &lt;code&gt;reverse&lt;/code&gt; must be equal to the length of the input, denoted as &lt;code&gt;ensures |s| == |reverse(s)|&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Finally, and most importantly, what property must the output of &lt;code&gt;reverse&lt;/code&gt; have? The order of the output must be the reverse of the input. How do we express such a property? We should say something like: every element in the output sequence must match the element in the original input sequence at the mirrored location. It will look like: &lt;code&gt;ensures forall i :: 0 &amp;lt;= i &amp;lt; |s| ==&amp;gt; reverse(s)[i] == s[|s| - 1 - i]&lt;/code&gt;.&lt;/p&gt;
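&lt;p&gt;To make the annotations concrete, here is a small Python analogue that checks the same properties at runtime. This is only a sketch of mine: Dafny proves such properties for all inputs at verification time, while the code below merely tests a few examples. It assumes the mirrored location of index &lt;code&gt;i&lt;/code&gt; is &lt;code&gt;len(s) - 1 - i&lt;/code&gt;, and the helper name &lt;code&gt;check_reverse_spec&lt;/code&gt; is made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def reverse(s):
    """Recursive list reversal, mirroring the Dafny function."""
    if len(s) == 0:
        return s
    return reverse(s[1:]) + [s[0]]


def check_reverse_spec(s):
    """Runtime version of the postconditions, checked for one input only."""
    r = reverse(s)
    assert len(r) == len(s)                 # lengths match
    for i in range(len(s)):
        assert r[i] == s[len(s) - 1 - i]    # element i mirrors the input
    assert reverse(r) == s                  # reversing twice gives the input back


for example in ([], [1], ["a", "b", "c"], list(range(20))):
    check_reverse_spec(example)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The assertions play the role of &lt;code&gt;ensures&lt;/code&gt; clauses, except that nothing here rules out an input that was never tried.&lt;/p&gt;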
&lt;p&gt;After spending some good hours writing Dafny, my feelings towards it are mixed. The plus side is that writing down exactly and precisely how a program behaves forces programmers to be more careful when writing code. This would remove a lot of problems after deployment. Reusing verified APIs also removes many safety concerns.&lt;/p&gt;
&lt;p&gt;The tricky part is coming up with annotations that meet the specification requirements. On one side, this forces people to break down big tasks into smaller functions so that it&amp;rsquo;s much easier to come up with correct annotations. On the other side, debugging complex functions requires some magic. The debugging messages aren&amp;rsquo;t really helpful in terms of narrowing down possible problems. It takes some guesses and luck to find which annotations are wrong or lacking. The example &lt;code&gt;reverse&lt;/code&gt; function has only a few pre- and postconditions. However, more sophisticated functions involving multiple conditions as well as invariants are much harder to get right. Often, the programmer has to explore in the dark without much guidance before uncovering the solution.&lt;/p&gt;</description></item><item><title>Ethereum</title><link>https://www.bodunhu.com/blog/posts/ethereum/</link><pubDate>Fri, 14 May 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/ethereum/</guid><description>&lt;p&gt;In my previous &lt;a href="https://www.bodunhu.com/blog/posts/reflections-on-my-cs-phd-application-process/"&gt;post&lt;/a&gt;, we&amp;rsquo;ve gone over the high-level structure of blockchain and its attributes. This post covers &lt;a href="https://ethereum.org/en/"&gt;Ethereum&lt;/a&gt; and explores how blockchain can be used not only for money transfer but also for application development.&lt;/p&gt;
&lt;h2 id="more-than-money"&gt;More Than Money&lt;/h2&gt;
&lt;p&gt;The idea behind Ethereum was proposed by &lt;a href="https://vitalik.ca/"&gt;Vitalik Buterin&lt;/a&gt;. He wanted to apply the idea of decentralization to build applications without a central authority in control. He first proposed adding a scripting language to Bitcoin, but the proposal was rejected. Later, &lt;a href="https://www.linkedin.com/in/gavin-wood-88843316/"&gt;Dr. Gavin Wood&lt;/a&gt; released the &lt;a href="https://ethereum.github.io/yellowpaper/paper.pdf"&gt;Ethereum yellow paper&lt;/a&gt;, covering the Ethereum Virtual Machine (EVM), which is capable of executing smart contracts on the network.&lt;/p&gt;
&lt;p&gt;As with any blockchain, all computers across the Ethereum network have a full copy (ledger) of the application code and data. That means the platform can provide services at any time, without censorship or third-party interference (however, this doesn&amp;rsquo;t necessarily mean extreme scalability or low application response time).&lt;/p&gt;
&lt;h2 id="ethereum-architecture"&gt;Ethereum Architecture&lt;/h2&gt;
&lt;p&gt;To understand the high-level architecture of Ethereum, we compare it with the traditional client/server architecture. I really like &lt;a href="https://www.zastrin.com/courses/ethereum-primer/lessons/2-1"&gt;Zastrin&lt;/a&gt;&amp;rsquo;s illustration.&lt;/p&gt;
&lt;p&gt;Here is what a traditional client/server architecture looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ethereum/Webapp-Architecture.png#center" alt="client-server-architecture"&gt;&lt;/p&gt;
&lt;p&gt;The main point is that the web application is deployed on centralized platforms or hosting services like AWS. All client applications need to access the centralized component for requested services.&lt;/p&gt;
&lt;p&gt;The architecture of Ethereum is shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ethereum/Ethereum-Architecture.png#center" alt="ethereum-architecture"&gt;&lt;/p&gt;
&lt;p&gt;The major difference here is that each client interacts with a decentralized application instead. Each computer holds a replica of all the data as well as the code. You can think of your computer as a dedicated mini server handling only your requests. Of course, what this implies is you will need to download the entire Ethereum blockchain to be able to use the application.&lt;/p&gt;
&lt;p&gt;This is obviously infeasible. In practice, there are solutions like &lt;a href="https://metamask.io/"&gt;Metamask&lt;/a&gt; to mitigate the problem so your storage space won&amp;rsquo;t be saturated. But the high-level concept holds.&lt;/p&gt;
&lt;p&gt;An Ethereum blockchain consists of two components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Database: this is similar to the database we use in the traditional client/server architecture. The difference lies in how the data is represented and stored. Ethereum stores events as transactions, which are similar to BTC transactions but much more general, as they can store anything ranging from a git commit message to a blog post. They can even be used to represent something more abstract, such as ownership of artwork in the form of non-fungible tokens. Because Ethereum is a public blockchain, all the data stored in the blockchain is visible to the public.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Smart Contract: this is a fancy name for application code. The code is compiled into EVM bytecode and then executed by the EVM.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The interesting part is the association between contract and code. &amp;lsquo;&amp;lsquo;Contract&amp;rsquo;&amp;rsquo; implies some form of agreement enforced by certain rules. &amp;lsquo;&amp;lsquo;Code&amp;rsquo;&amp;rsquo;, on the other hand, can be much more flexible. The two fit together because code deployed on the Ethereum blockchain can enforce the agreement: &lt;strong&gt;once deployed, it cannot be terminated or modified&lt;/strong&gt;. The beauty behind this is that the smart contract is deployed in a distributed network. A successful tampering attempt would require agreement from the majority of the network, which is hard to achieve.&lt;/p&gt;
&lt;h3 id="how-is-identity-established"&gt;How is identity established?&lt;/h3&gt;
&lt;p&gt;In a traditional client/server application such as GitHub, a user account is identified by its username. To manage an account, a user must have the corresponding password to access and modify the account&amp;rsquo;s information. The notion of an account in Ethereum is similar. An account has an Ethereum address, which is visible to the entire network just like a GitHub username. The private key is used to control the account and is not meant to be shared publicly; it corresponds to your GitHub account password. The exception is the contract account, which has no private key and is controlled by its code instead. Intuitively, this makes sense because it is meant to be accessed by all parties in the contract.&lt;/p&gt;
&lt;h2 id="not-a-silver-bullet"&gt;Not a Silver Bullet&lt;/h2&gt;
&lt;p&gt;There has been a lot of hype around Ethereum and blockchain in the business world. However, Ethereum is not meant to be the solution for all problems.&lt;/p&gt;
&lt;p&gt;The root of Ethereum lies in decentralization. The benefit of decentralization comes at a cost. Because any request on the Ethereum blockchain has to go through multiple stages of verification, validation, and consensus, the response latency is unbearable for any application that needs to meet stringent time requirements.&lt;/p&gt;
&lt;p&gt;For example, posting a tweet can be done in a matter of seconds. All it takes is for the client to send the tweet to the server. The server adds the tweet to its database and pushes the update to other clients. Doing this in Ethereum means we first need to seek agreement from multiple other nodes with completely different network conditions, then wait for consensus to be achieved (which takes some time), before the tweet becomes valid.&lt;/p&gt;
&lt;p&gt;In fact, we have a simple &lt;a href="https://github.com/alexissa32/EthereumVoting"&gt;prototype&lt;/a&gt; using Ethereum to provide a platform (like &lt;a href="https://www.patreon.com/"&gt;Patreon&lt;/a&gt;) for artists to showcase their artwork and receive financial support directly from their community without any third-party agencies. The advantage is that, by avoiding third-party agencies, we reduce the cost of transactions. However, the problem is that whenever a user clicks the &amp;ldquo;like&amp;rdquo; button, the information is not reflected until minutes later. This would be unacceptable for products like Facebook or Instagram.&lt;/p&gt;</description></item><item><title>Reflections on my CS PhD Application Process</title><link>https://www.bodunhu.com/blog/posts/reflections-on-my-cs-phd-application-process/</link><pubDate>Mon, 10 May 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/reflections-on-my-cs-phd-application-process/</guid><description>&lt;p&gt;I&amp;rsquo;m glad it&amp;rsquo;s over.&lt;/p&gt;
&lt;p&gt;I applied for CS Ph.D. programs this past fall and had interviews with schools from late December all the way to March. Now that the semester has ended, I decided to put down some reflections on this process. This post is not intended to be the most comprehensive CS Ph.D. application tutorial in the world, but merely a half-guide, half-memoir of my journey towards a PhD. Of course, you should take this post with a grain of salt, since I don&amp;rsquo;t work on admission committees, and am nowhere near an expert in the application process.&lt;/p&gt;
&lt;h2 id="on-research"&gt;On Research&lt;/h2&gt;
&lt;p&gt;The single most important factor in a Ph.D. application is perhaps your research experience. You should absolutely start getting involved in research as early as possible. Through this experience, you will eventually find out if research is your thing and whether you want to continue down the path in academia.&lt;/p&gt;
&lt;p&gt;The general rule of thumb is &amp;ldquo;the earlier, the better&amp;rdquo;. Practically speaking, starting early means you will likely have something to put on your resume or your statement of purpose by the time you start Ph.D. applications. Since grad school is all about research, having more research experience is only going to make your profile look stronger.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The earlier, the better&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;More importantly, grad school applications, especially in Computer Science, have become extremely competitive over the years. The admission rate for most top CS departments is well below 10 percent. Having publications, or even just research experience, is likely to decide whether you get accepted. To some extent, &amp;ldquo;publish or perish&amp;rdquo; also applies to Ph.D. applications.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/cs-phd-application/publish-or-perish.jpg" alt="publish-or-perish"&gt;&lt;/p&gt;
&lt;p&gt;I had friends who asked me whether they should wait until junior or senior year to get started in research, because by then they will have taken enough courses and be better equipped with the background. My suggestion is no. First, part of research is about learning: you settle on a problem and you find ways to solve it. Learning happens throughout this process. My advisor Chris always says you learn more by doing research. My personal experience has reinforced this statement. Second, there simply might not be enough time for you to finish the research project by the time the application process starts. It&amp;rsquo;s possible to pull a rabbit out of the hat in some areas like theory or machine learning. But for areas like systems, the sheer amount of work makes the submission cycle really long. This is exacerbated by the relatively small number of top conferences. In addition, it takes several months before you are notified of the final decision.&lt;/p&gt;
&lt;p&gt;For example, my first GPU project took more than a year to finish. It was then rejected, and it took another several months before it was published. By the time my application started, my paper was still under submission and I didn&amp;rsquo;t hear anything back until late January.&lt;/p&gt;
&lt;p&gt;Another question people ask me is how to pick a research area in the first place. I use a small trick I learned from a post on Quora: figure out what you spend the most time on. For a really long time, I thought I would eventually do something related to deep learning, because, well, that&amp;rsquo;s what everybody is doing. However, I found myself spending much more time browsing through &lt;a href="https://lwn.net/"&gt;LWN.net&lt;/a&gt; or Linux kernel code than actual deep learning. So I decided doing systems suits me better.&lt;/p&gt;
&lt;h2 id="picking-schools"&gt;Picking Schools&lt;/h2&gt;
&lt;p&gt;Once I decided to apply for a PhD, it was time to pick which schools to apply to. A Ph.D. is equivalent to a research apprenticeship. Therefore, the application process is very much &lt;em&gt;faculty-oriented&lt;/em&gt;. I highly recommend &lt;a href="https://csrankings.org/"&gt;csrankings&lt;/a&gt;. It filters out a lot of unwanted information and lets you focus on your research interests quickly. It&amp;rsquo;s much quicker and more convenient to see all names in one place than to browse through every person&amp;rsquo;s name in every university. But you should still take a look at each university&amp;rsquo;s faculty list because csrankings might not contain the most up-to-date information.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/cs-phd-application/phdcomics.gif#center" alt="phd-comics"&gt;&lt;/p&gt;
&lt;p&gt;In terms of how many schools to apply to, my suggestion is &amp;ldquo;the more, the better&amp;rdquo;. Statistically speaking, more applications increase your chance of getting accepted. The major problems are the application fees and the need to tailor each SOP to a different school. I applied to 20 programs, all in the U.S. This is way above the average number and I don&amp;rsquo;t really recommend most people do it. However, applying to more programs did give me several interviews from schools I thought weren&amp;rsquo;t the best fit. For example, there was one assistant professor who had just joined the university and had research interests closely matching mine, but his information was not yet reflected in either &lt;a href="https://csrankings.org/"&gt;csrankings&lt;/a&gt; or the school&amp;rsquo;s website. Another interview I got was from a professor who comes from a more theoretical background but needs students with more systems skills to build the underlying infrastructure.&lt;/p&gt;
&lt;h2 id="statement-of-purpose"&gt;Statement of Purpose&lt;/h2&gt;
&lt;p&gt;You will need to submit an SOP to every school you apply to. There are many tutorials online on how to come up with the best SOP so I won&amp;rsquo;t go over them. In general, I think there are three things to consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Don&amp;rsquo;t be creative, be simple.&lt;/li&gt;
&lt;li&gt;Focus on research.&lt;/li&gt;
&lt;li&gt;Get as many people as possible to read your draft.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you google &amp;ldquo;what to avoid in a PhD SOP&amp;rdquo;, you will be advised to not say something like how you started programming at age 4. This is not the same as your college application essays so don&amp;rsquo;t be smart about it. Simply stating your goals and interests is more than enough. My SOP opening goes like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;After spending three years in academic research, I see my goal to seek a Ph.D. in Computer Science,
more specifically in systems, as a continuation of my increasing involvement with the field, and as a
requirement to pursue my research interests and solve related problems in systems as a professor. In
particular, I am interested in &lt;strong&gt;operating systems, heterogeneity, networking, machine learning systems, and architecture&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Avoid writing down every accomplishment in your life. I get it, everyone wants to show off their proud moments. I made this mistake in my first draft by going over all my research projects, including those outside of systems, as well as my side projects. My essay turned into a hodgepodge with no real focus. It also went way over the page limit. In the end, I had to cut everything that was not &amp;ldquo;system-related&amp;rdquo; to make room for more important content. If you want a sample, I&amp;rsquo;d be glad to share a copy with you upon request.&lt;/p&gt;
&lt;h2 id="the-interview"&gt;The Interview&lt;/h2&gt;
&lt;p&gt;Now that you have submitted all the applications, interviews start to come in. My general suggestion for preparing for an interview is to keep things simple. First, there are only 30 minutes. Your priority, as a student, is to make the most of these 30 minutes. It&amp;rsquo;s important to discuss your research and know what you are talking about, but don&amp;rsquo;t go too deep into the details because&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Nobody is able to understand all the technical details in under 30 minutes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The high-level idea is generally much more important. If you can&amp;rsquo;t describe it in a few minutes, you likely don&amp;rsquo;t really know the project very well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the professor is interested in digging more into the project, he/she will ask more questions. This is good because it can turn an interview into a discussion, which is both more interactive and less intimidating. Nobody likes a 30-minute monologue.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The purpose of the interview is to find whether there is a match of interests and whether this interest is mutual. You should also reserve some time to ask your own questions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="other-thoughts"&gt;Other Thoughts&lt;/h2&gt;
&lt;p&gt;I don&amp;rsquo;t really recommend taking too many classes while doing research projects. Research is very time-consuming and can take away a lot of sleep. During my first semester working on my GPU project, I was trying to double major, taking all the hard classes, TAing an OS course, and juggling all sorts of stuff. Anxiety became an issue and I wasn&amp;rsquo;t able to get enough sleep. My productivity took a hit, which resulted in my first paper being rejected. If I could travel back in time, I&amp;rsquo;d rather do less to achieve more.&lt;/p&gt;
&lt;p&gt;If you decide to email potential advisors, keep the message short and go straight to the point. Avoid writing a novel. Assistant professors, or professors who explicitly invite students to reach out on their pages, are much more likely to reply to your messages, because they are constantly looking for students to expand their groups.&lt;/p&gt;
&lt;p&gt;Keep up with the application status, but do not check &lt;a href="https://www.thegradcafe.com/"&gt;TheGradCafe&lt;/a&gt; too often because it can get addictive:)&lt;/p&gt;
&lt;p&gt;May the force be with you &amp;#x1f918;.&lt;/p&gt;</description></item><item><title>Blockchain</title><link>https://www.bodunhu.com/blog/posts/blockchain/</link><pubDate>Mon, 19 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/blockchain/</guid><description>&lt;p&gt;The first time I heard the term &amp;ldquo;blockchain&amp;rdquo; was around 2014. Since then, its popularity has grown rapidly. However, I never actually understood what blockchain is exactly, until recently. In fact, I didn&amp;rsquo;t really understand the difference between blockchain and bitcoin. For me, blockchain was lumped together with cryptocurrencies. So here is a short summary of what blockchain is and why people use blockchain.&lt;/p&gt;
&lt;h2 id="what-is-blockchain"&gt;What is Blockchain&lt;/h2&gt;
&lt;p&gt;I tried reading articles about blockchain before, and it didn&amp;rsquo;t take long before I was completely overwhelmed by technical terms: consensus, asymmetric crypto, consistency, etc. It&amp;rsquo;s hard to combine all these little pieces together and form a big picture. Instead, it&amp;rsquo;s much easier to understand blockchain from a top-down view. Even better, a small step-by-step example can clarify much of the confusion. I like Prof. Anand&amp;rsquo;s example given in the class slides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Suppose we own a comic book store, and we want to sell comic books to some customers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Every time we sell a book, 10 of our friends will record the action. Traditionally, we refer to such a record as a &amp;ldquo;ledger&amp;rdquo;. In the world of blockchain, we call this a &amp;lsquo;&amp;lsquo;distributed ledger&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once we sell enough comic books, the 10 records (ledgers) will be collected into a book, and all of our friends will get a copy of the book. This is very important because we use duplication to achieve consensus.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To make things more secure, all these books are stored in a secure vault. In the digital world, we achieve this through encryption, digital signatures, and so on. An attacker needs to tamper with many copies of such a book to disrupt our selling records, which tends to be extremely hard in real-world scenarios.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now we have this secure vault, which is effectively an immutable block. This block (vault) stores the record of us selling a comic book.
If we decide to sell more books, each one will generate an additional block (vault). Each block is appended after the previous block, forming what we call a &amp;lsquo;&amp;lsquo;blockchain&amp;rsquo;&amp;rsquo;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, a blockchain is a series of immutable blocks, each storing the information of an event(s) whose validity is approved by a majority of other participants. Simple as that.&lt;/p&gt;
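&lt;p&gt;The &amp;ldquo;chain&amp;rdquo; part can be sketched with nothing but hashes. The Python below is my own toy illustration, not the class example: it assumes each block stores the SHA-256 hash of the previous block, and it leaves out consensus, signatures, and networking entirely. The function names and the customer names are made up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import hashlib
import json


def block_hash(block):
    """SHA-256 over the block serialized as JSON with sorted keys."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, records):
    """Append a block that commits to the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "records": records, "prev_hash": prev})


def is_valid(chain):
    """Every block must store the hash of the block right before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["sold comic #1 to Alice"])
append_block(chain, ["sold comic #2 to Bob"])
append_block(chain, ["sold comic #3 to Carol"])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Changing a record in an early block changes its hash, so the next block no longer points at it and &lt;code&gt;is_valid&lt;/code&gt; fails. An attacker would have to rewrite every later block, on most of the copies, to get away with it.&lt;/p&gt;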
&lt;p&gt;I like using the term &amp;lsquo;&amp;lsquo;distributed ledger&amp;rsquo;&amp;rsquo; to characterize blockchain. In Prof. Anand&amp;rsquo;s slides, this graph summarizes how a distributed ledger differs from traditional centralized ledger:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/blockchain/ledger.png" alt="hyper-ledger"&gt;&lt;/p&gt;
&lt;p&gt;The main difference is how consensus is achieved. In a centralized ledger, we have a single entity that decides the &amp;lsquo;&amp;lsquo;golden record&amp;rsquo;&amp;rsquo;. In a distributed ledger, consensus is achieved if everybody agrees with it. To give an example, we would pay a 45-dollar electricity bill each month to a Texas electricity company because the price is set by the company alone. In a distributed ledger world, we might pay 32 dollars instead, if every single resident living in the building agrees this is the best price. Essentially, we eliminate the centralized entity and distribute the ability to make decisions evenly across individuals.&lt;/p&gt;
&lt;p&gt;In super-simple terms, a blockchain is just a computer file for storing data. The reason why it&amp;rsquo;s so secure is because there doesn&amp;rsquo;t exist a single central point of attack for hackers to target.&lt;/p&gt;
&lt;h2 id="sketch-on-blockchain"&gt;Sketch on Blockchain&lt;/h2&gt;
&lt;p&gt;Now that we understand what a blockchain is, it&amp;rsquo;s time to find out how blockchain enables the development of digital currencies such as Bitcoin.
There are many great articles talking about Bitcoin in detail, but I found the original &lt;a href="https://nakamotoinstitute.org/bitcoin/"&gt;paper&lt;/a&gt; extremely helpful for understanding the motivation behind Bitcoin. In essence, Bitcoin was introduced to eliminate one problem: the need for a trusted third party to process electronic payments. More abstractly, it is a shift from a trust-based system to a cryptographic-proof-based system.&lt;/p&gt;
&lt;p&gt;The paper claims that the trust-based model suffers from a fundamental weakness: the need for mediation. The logic is simple: mediation is required whenever there are disputes. Because disputes must be mediated, completely non-reversible transactions are not really possible, and thus comes the possibility of reversal. The possibility of reversal causes the need for trust to spread. To establish trust, a higher price needs to be paid, in the form of money, personal information, etc. Essentially, the need for trust creates a centralized component that participants must rely on. In theory, Bitcoin resorts to a cryptographic-proof-based system to replace the trust-based system, with the difference being that a cryptographic-proof-based system is distributed in nature.&lt;/p&gt;
&lt;p&gt;Imagine a distributed system as a fully connected graph with \(n\) nodes where each node represents a buyer/seller. A transaction represents an edge connecting two nodes together. We denote by \(T\) the set of currently ongoing transactions. There could be as many as \(n(n-1)\) transactions going on concurrently, and each transaction \(t \in T\) is independent of the others. This enables 1) extreme scalability; 2) no reliance on central components. If a transaction is committed to a Bitcoin network, it suggests that the transaction has already gained approval from both the buyer and the seller side (why this is the case is more technical, and you should Google how symmetric and asymmetric encryption work).&lt;/p&gt;
&lt;p&gt;Now imagine a centralized system, where every buyer node is connected to one node \(c\). Node \(c\) in turn is connected to every seller node \(s\). Assume the centralized component, or node \(c\), has a fixed capacity limiting the amount of traffic flowing through it at any given moment. To achieve the same level of information flow as in a distributed system, we need to increase node \(c\)&amp;rsquo;s capacity, which represents the increased costs of mediation. If a buyer node \(b\)&amp;rsquo;s output value is different from a seller node \(s\)&amp;rsquo;s input value (a dispute), extra information flow will be required from seller node \(s\), creating the need for more capacity at node \(c\), thus driving up the cost.&lt;/p&gt;
&lt;p&gt;From an abstract point of view, I&amp;rsquo;d like to imagine a normal transaction in a Bitcoin network as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Transaction is initiated&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The buyer, the seller, and all witnesses agree on the validity of the transaction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With everyone satisfied as the precondition, transaction completes. If there exists a disagreement, transaction doesn&amp;rsquo;t happen.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On the other hand, in a centralized system, the transaction happens as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Transaction is initiated&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The seller receives payment, but there&amp;rsquo;s a mismatch&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now the centralized component must be engaged to mitigate the issue.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More shit happens, the centralized component must constantly nag both the buyer and the seller until the problem is solved.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In short, I think trust is not &amp;lsquo;&amp;lsquo;removed&amp;rsquo;&amp;rsquo;; it is merely achieved in a different way. I&amp;rsquo;d like to modify Prof. Anand&amp;rsquo;s summary of Bitcoin: Bitcoin is an engineering solution to trust issues.&lt;/p&gt;
&lt;h2 id="blockchain-structure"&gt;Blockchain Structure&lt;/h2&gt;
&lt;p&gt;The structure of a blockchain is surprisingly simple. A blockchain consists of a series of blocks, each holding batches of valid transactions that are hashed and encoded into a &lt;a href="https://en.wikipedia.org/wiki/Merkle_tree"&gt;Merkle tree&lt;/a&gt;, with only the root of the tree included in the block&amp;rsquo;s hash. Each block also includes the hash value of the prior block in the blockchain. This essentially forms a linked list, except that we replace each pointer with the hash value of a block.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/blockchain/block_chain_structure.png" alt="blockchain_structure"&gt;&lt;/p&gt;
&lt;p&gt;Using hash values to link blocks has another benefit: it protects the integrity of all previous blocks. For example, if an attacker modifies the data in one block, the action will change that block&amp;rsquo;s hash value, which invalidates the hash value stored in the next block. The attacker needs to modify hash values starting from the modified block all the way to the latest block. In addition, the modified blockchain would differ from the copies stored on other nodes across the internet, making an attack even more difficult.&lt;/p&gt;
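&lt;p&gt;To make the linked-list-of-hashes idea concrete, here is a minimal sketch in Python. It is not how Bitcoin serializes blocks: the names &lt;code&gt;block_hash&lt;/code&gt;, &lt;code&gt;build_chain&lt;/code&gt;, and &lt;code&gt;is_valid&lt;/code&gt; are made up for illustration, each block holds a plain string instead of a Merkle root, and the first block is assumed to point at an all-zero hash.&lt;/p&gt;

```python
import hashlib

GENESIS_PREV = "0" * 64  # assumed placeholder "previous hash" for the first block


def block_hash(data, prev_hash):
    # The hash covers both the block's data and the previous block's hash,
    # so changing either one changes this block's hash.
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()


def build_chain(records):
    chain, prev = [], GENESIS_PREV
    for data in records:
        h = block_hash(data, prev)
        chain.append({"data": data, "prev_hash": prev, "hash": h})
        prev = h
    return chain


def is_valid(chain):
    # Walk the chain and recompute every link.
    prev = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        if block["hash"] != block_hash(block["data"], prev):
            return False
        prev = block["hash"]
    return True


chain = build_chain(["sold comic #1", "sold comic #2", "sold comic #3"])
print(is_valid(chain))
chain[1]["data"] = "sold comic #2 for free"  # tamper with the middle block
print(is_valid(chain))
```

&lt;p&gt;Even if the attacker recomputes the tampered block&amp;rsquo;s own hash, the next block still stores the old value, so the check keeps failing until every later block is rewritten as well.&lt;/p&gt;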
&lt;h3 id="mining"&gt;Mining&lt;/h3&gt;
&lt;p&gt;After explaining the basic concepts behind blocks, it becomes easy to understand the purpose of mining. Mining, in its essence, is using proof-of-work to implement a distributed timestamp server on a P2P basis.&lt;/p&gt;
&lt;p&gt;What a miner does is increment the nonce in a block until it finds a value that gives the block&amp;rsquo;s hash the required number of leading zero bits. It is that simple &amp;#x1f604;. The more zero bits required, the more work is needed to derive such a hash value.&lt;/p&gt;
&lt;p&gt;One thing one might ask is: multiple miners could produce blocks with hash values satisfying the requirement. In that case, how should we determine which miner&amp;rsquo;s blocks get accepted? The original &lt;a href="https://nakamotoinstitute.org/bitcoin/"&gt;paper&lt;/a&gt; explains that proof-of-work also solves the problem of determining representation in majority decision making. The majority decision is represented by the &lt;strong&gt;longest&lt;/strong&gt; chain. That&amp;rsquo;s it! The longest chain is the one with the most blocks satisfying the zero-bit requirement, and thus has the greatest proof-of-work effort invested in it.&lt;/p&gt;
&lt;p&gt;What this suggests, in layman&amp;rsquo;s terms, is that whoever has the most computational power has a higher probability of generating new blocks and thus getting rewarded with bitcoins. That&amp;rsquo;s why people are craving GPUs, FPGAs, and other accelerators: they are much better at parallel computing and have higher throughput than CPUs.&lt;/p&gt;
&lt;p&gt;Personally, I have doubts about the way proof-of-work is implemented. Normally, proof of work can be, for example, providing technical support to customers, or helping clean your neighbor&amp;rsquo;s backyard. The work you did has created positive value for society. In the bitcoin case, the work is simply spending electricity to derive a value, and it is hard to argue that this is valuable in itself. One way to justify it might be that it provides a fundamental service so that the blockchain can function properly and smoothly. Even then, it still feels like a bubble, not to mention the massive amount of resources wasted. Could there be another way to implement proof-of-work? If calculating the nonce value takes a long time, could we use waiting time instead to mimic the same result while saving resources at the same time?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Recently, a new cryptocurrency called &lt;a href="https://www.chia.net/"&gt;Chia&lt;/a&gt; was introduced and caught my attention. It was developed by the inventor of BitTorrent, &lt;a href="https://en.wikipedia.org/wiki/Bram_Cohen"&gt;Bram Cohen&lt;/a&gt;. It uses proof of space and time to replace the energy-hungry proof-of-work approach. In short, the way it works is: whenever the blockchain broadcasts a challenge for the next block, farmers scan their plots to see if they have the hash closest to the puzzle. The probability of winning a block is roughly proportional to the total space a farmer has compared to the entire network.&lt;/p&gt;
&lt;p&gt;Obviously, the demand for storage devices will increase dramatically. In fact, according to &lt;a href="https://www.tomshardware.com/news/chia-network-now-uses-more-than-1-exabyte-for-storage"&gt;Tom&amp;rsquo;s Hardware&lt;/a&gt;, in about a month&amp;rsquo;s time the storage space allocated to the Chia network increased from 120PB all the way to 1143PB, or 1.14 exabytes. 1.14EB equals 1,140,000TB, or 57,000 20TB hard drives. Looking back at proof-of-work, it feels like choosing between one evil and another.&lt;/p&gt;
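&lt;p&gt;The nonce search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Bitcoin&amp;rsquo;s actual block format: the header is an arbitrary byte string, the nonce is appended as 8 big-endian bytes, a single SHA-256 is used, and the helper names &lt;code&gt;has_leading_zero_bits&lt;/code&gt; and &lt;code&gt;mine&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import hashlib
from itertools import count


def has_leading_zero_bits(digest, bits):
    # A 256-bit digest starts with `bits` zero bits exactly when
    # its top `bits` bits, read as an integer, are all zero.
    return int.from_bytes(digest, "big") // 2 ** (256 - bits) == 0


def mine(header, bits):
    # Increment the nonce until the hash meets the difficulty requirement.
    for nonce in count():
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if has_leading_zero_bits(digest, bits):
            return nonce, digest.hex()


nonce, digest_hex = mine(b"previous-hash|merkle-root", 12)
print(nonce, digest_hex)
```

&lt;p&gt;Each extra zero bit doubles the expected number of hashes a miner has to try, while verifying a claimed nonce still takes a single hash.&lt;/p&gt;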
&lt;p&gt;&lt;img src="https://cdn.mos.cms.futurecdn.net/3sV9LasTQiSqw6scuN5uQg-970-80.png" alt="chia-storage"&gt;&lt;/p&gt;
&lt;h3 id="transaction"&gt;Transaction&lt;/h3&gt;
&lt;p&gt;Transactions are the most important part of the bitcoin system. They are represented as data structures that encode the transfer of value between participants. There are many fields in a transaction structure, but the most important components are: &lt;em&gt;input and output&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The best way to understand how a transaction works is through an example. Suppose we have sender \(A\) and receiver \(B\). To send some BTC to receiver \(B\), \(A\) signs a transaction with specific details using their private key. This message is sent to the bitcoin network and contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;input: the source transaction sent to \(A\) at an earlier time.&lt;/li&gt;
&lt;li&gt;amount: amount of BTC to send to \(B\).&lt;/li&gt;
&lt;li&gt;output: \(B\)&amp;rsquo;s public address.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here, the miners will verify whether \(A\) actually has access to the funds he/she claims to control using \(A\)&amp;rsquo;s public key. Upon verification, new blocks will be created.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: to actually understand how public and private keys work, please refer to &lt;a href="https://en.wikipedia.org/wiki/Public-key_cryptography"&gt;public-key cryptography&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange"&gt;Diffie-Hellman algorithm&lt;/a&gt;, and the use of &lt;a href="https://brilliant.org/wiki/public-key-cryptography/"&gt;number theory&lt;/a&gt; in encryption.&lt;/p&gt;</description></item><item><title>Hoare Logic</title><link>https://www.bodunhu.com/blog/posts/hoare-logic/</link><pubDate>Sat, 17 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/hoare-logic/</guid><description>&lt;p&gt;Hoare logic forms the basis of all deductive verification. To illustrate Hoare logic, we first consider a small imperative programming
language &lt;strong&gt;IMP&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In IMP, we have three program constructs: expressions, conditionals, and statements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Expression takes the form \( E := Z\ |\ V\ |\ e_1 + e_2\ |\ e_1 \times e_2 \)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Conditional is self-explanatory: \( C := true\ |\ false\ |\ e_1 = e_2\ |\ e_1 \leq e_2 \)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Statement consists of several different forms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;\(S := V := E\) (Assignment)&lt;/li&gt;
&lt;li&gt;\(S_1; S_2\) (Composition)&lt;/li&gt;
&lt;li&gt;if \(C\) then \(S_1\) else \(S_2\) (If)&lt;/li&gt;
&lt;li&gt;while \(C\) do \(S\) (While)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="hoare-triple"&gt;Hoare Triple&lt;/h2&gt;
&lt;p&gt;In Hoare logic, we specify partial correctness of programs using Hoare triples:&lt;/p&gt;
&lt;p&gt;\[\{P\} S \{Q\}\]&lt;/p&gt;
&lt;p&gt;Here \(P\) is the precondition and \(Q\) is the post-condition. \(S\) is a statement in IMP.&lt;/p&gt;
&lt;p&gt;The interpretation of Hoare triple is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;if \(S\) is executed in a state satisfying \(P\)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;and if execution of \(S\) terminates&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;then the program state after \(S\) terminates satisfies \(Q\)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here is an example: \(\{x = 0 \}\ while\ true\ do\ x := 0\ \{x = 1 \}\) is a valid Hoare triple because the execution of the statement never terminates, so the requirement posed by the Hoare triple is satisfied vacuously.&lt;/p&gt;
&lt;p&gt;Thus the specification \(\{P\} S \{Q\}\) is called &lt;em&gt;partial&lt;/em&gt; correctness spec, because it doesn&amp;rsquo;t require \(S\) to terminate.&lt;/p&gt;
&lt;p&gt;There is also a stronger requirement called &lt;em&gt;total&lt;/em&gt; correctness. The total correctness specification is written as:&lt;/p&gt;
&lt;p&gt;\[ [P] S [Q]\]&lt;/p&gt;
&lt;p&gt;Total correctness requires that if \(P\) is satisfied when executing \(S\), then \(S\) must terminate, and the post-condition \(Q\) must be satisfied after \(S\) terminates.&lt;/p&gt;
&lt;p&gt;Thus the example \([x = 0]\ while\ true\ do\ x := 0\ [x = 1]\) is not valid under total correctness because the loop never terminates.&lt;/p&gt;
&lt;p&gt;In summary, we can say that Total correctness \(=\) Partial correctness \(+\) termination.&lt;/p&gt;
&lt;h2 id="proving-partial-correctness"&gt;Proving Partial Correctness&lt;/h2&gt;
&lt;p&gt;We use \(\vDash \{P\} S \{Q\} \) to say a Hoare triple is valid and we use \(\vdash \{P\} S \{Q\} \) to indicate we can prove validity of a Hoare triple.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say we are given an assignment \(x := y \) with post-condition \(x &amp;gt; 2\). The question is, what do we need to know before the assignment happens so that the post-condition, \(x &amp;gt; 2\), holds afterwards?&lt;/p&gt;
&lt;p&gt;To prove \(Q\) holds after the assignment \(x := E\), we need to show that &lt;strong&gt;\(Q\) with \(E\) substituting \(x\) holds before the assignment&lt;/strong&gt;. Formally, we write it as:&lt;/p&gt;
&lt;p&gt;\[\vdash \{Q[E / x]\}\ x := E \{Q\}\]&lt;/p&gt;
&lt;p&gt;For example, given \( \{ x+1 = n\}\ x := x+1 \ \{x=n\} \), we know this triple is provable: take \(Q\), which is \(x=n\), and substitute \(E = x+1\) for \(x\). This converts \(x=n\) to \(x+1 = n\), which matches the precondition.&lt;/p&gt;
&lt;p&gt;Here is another interesting example: the Hoare triple \( \{z = 2\}\ y:= x\ \{y = x\} \) is valid but not provable with the assignment rule alone. The substitution procedure above yields the precondition \(x=x\), which is always true but differs from the stated precondition \(z=2\).&lt;/p&gt;
&lt;p&gt;Intuitively, we can prove the post-condition \(y = x\) for the statement \(y := x\) without any assumptions, so even if we do have assumptions like \(z=2\), we should still be able to prove it. This is where the proof rule for precondition strengthening comes in.&lt;/p&gt;
&lt;h2 id="proof-rule-for-precondition-strengthening"&gt;Proof Rule for Precondition Strengthening&lt;/h2&gt;
&lt;p&gt;Formally, we define precondition strengthening as:&lt;/p&gt;
&lt;p&gt;\[ \frac{ \vdash \{P'\} S \{Q\}\ \ P \Rightarrow P' }{\vdash \{P\} S \{Q\}} \]&lt;/p&gt;
&lt;p&gt;Now, with the original formula \( \{z = 2\}\ y:= x\ \{y = x\} \), the assignment rule gives the precondition \( x= x \equiv true \), and since \(z=2 \rightarrow true\) is valid, we can now prove the formula!&lt;/p&gt;
&lt;h2 id="a-dual-post-condition-weakening"&gt;A Dual: Post-Condition Weakening&lt;/h2&gt;
&lt;p&gt;Formally, we define post-condition weakening as:&lt;/p&gt;
&lt;p&gt;\[ \frac{ \vdash \{P\} S \{Q'\}\ \ Q' \Rightarrow Q }{\vdash \{P\} S \{Q\}} \]&lt;/p&gt;
&lt;p&gt;What this means is that if we can prove a post-condition \(Q'\), we can always relax it to something &lt;strong&gt;weaker&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For example, given that \(\vdash \{true\}S\{x=y \land z=2\}\), we can prove \(\{true\}S\{x=y\}\) because \(x=y\) is weaker than \( x=y \land z=2 \).&lt;/p&gt;
&lt;h2 id="proof-rule-for-composition"&gt;Proof Rule for Composition&lt;/h2&gt;
&lt;p&gt;For composition, we define the rule as:&lt;/p&gt;
&lt;p&gt;\[ \frac{ \vdash \{P\}S_1\{Q\}\ \ \vdash \{Q\}S_2 \{R\} }{ \vdash \{P\}S_1;S_2\{R\} }\]&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t show why this is true, so this will be left as an exercise.&lt;/p&gt;
&lt;h2 id="proof-rule-for-if-statements"&gt;Proof Rule for If Statements&lt;/h2&gt;
&lt;p&gt;Naturally, we define the rule for if statement as:&lt;/p&gt;
&lt;p&gt;\[ \frac{ \vdash \{P \land C\} S_1 \{Q\}\ \ \vdash \{P \land \neg C\} S_2 \{Q\} }{ \vdash \{P\}\ if\ C\ then\ S_1\ else \ S_2 \ \{Q\} } \]&lt;/p&gt;
&lt;p&gt;In summary, this means given we know \(P\) is true, no matter what \(C\) evaluates to, we will come to the same post-condition \(Q\). If you still don&amp;rsquo;t understand it, just stare at it for five minutes and you should figure out why this is the case:)&lt;/p&gt;
&lt;h2 id="proof-rule-for-while"&gt;Proof Rule for While&lt;/h2&gt;
&lt;p&gt;To understand the proof rule for the while statement, we first need to understand a simple concept: the loop invariant.&lt;/p&gt;
&lt;h3 id="loop-invariant"&gt;Loop Invariant&lt;/h3&gt;
&lt;p&gt;Loop invariant \(I\) has two properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;\(I\) holds initially before the loop&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;\(I\) holds after each loop iteration&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For example, given a loop&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;i := 0;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;j := 0;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;n := 0;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;while i &amp;lt; n do
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; i := i + 1;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; j := i + j
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, \(i \leq n \) is a loop invariant but \(i &amp;lt; n \) isn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Now, we put the properties of a loop invariant \(I\) in formal terms. Let \(C\) be the loop condition. By the first property, \(I\) holds before the loop, and the body only runs when \(C\) is true, so \(I \land C\) holds whenever an iteration is about to start.&lt;/p&gt;
&lt;p&gt;The second property says that \(I\) holds after each loop iteration, which means \(\{ I \land C \} S \{I\} \) holds. Writing the invariant as \(P\), we formally express this as \( \vdash \{P \land C\} S \{P\} \).&lt;/p&gt;
&lt;p&gt;Now, we know if a loop terminates, it must be that condition \(C\) no longer holds, meaning \( P \land \neg C \) must be true after loop terminates. This is because \(P\) is a loop invariant and always holds after each loop iteration, including termination.&lt;/p&gt;
&lt;p&gt;Putting all this together, we form the proof rule for while loop:&lt;/p&gt;
&lt;p&gt;\[ \frac{ \vdash \{P \land C\} S \{P\} }{ \vdash \{P\} while \ C \ do \ S\{P \land \neg C\} }\]&lt;/p&gt;
&lt;h3 id="inductive-loop-invariant"&gt;Inductive Loop Invariant&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s not always the case that we can prove a loop invariant is valid. Here is a counterexample:&lt;/p&gt;
&lt;p&gt;Consider the candidate invariant \( I = j \geq 1 \) and the code:&lt;/p&gt;
&lt;p&gt;\[i := 1; j := 1; while \ i &amp;lt; n\ do\ \{j := j+i; i := i + 1\}\]&lt;/p&gt;
&lt;p&gt;We know that the invariant is \(I = j \geq 1\) and \(C\) (the loop condition) is \(i &amp;lt; n\). So we have a Hoare triple:&lt;/p&gt;
&lt;p&gt;\[ \{ j \geq 1 \land i &amp;lt; n \}\ j := j + i;\ i := i + 1 \ \{j \geq 1\} \]&lt;/p&gt;
&lt;p&gt;This triple is not valid, because the precondition says nothing about the sign of \(i\). Take a state with \(j = 1\), \(i = -100\), and \(n = 0\): it satisfies the precondition, but executing the body once gives \(j = -99\), so the post-condition \(j \geq 1\) fails.&lt;/p&gt;
&lt;p&gt;However, if we use a &lt;strong&gt;strengthened invariant&lt;/strong&gt; such as \(j \geq 1 \land i \geq 1\), the new Hoare triple is valid. The strengthened \(I\) is an inductive invariant because we can prove that the loop body preserves it.&lt;/p&gt;
&lt;p&gt;To put everything in action, here is an example showing how to find inductive loop invariant to prove the following Hoare triple:&lt;/p&gt;
&lt;p&gt;\[ \{i = 0 \land j = 0 \land n = 5\} \]
\[while\ i &amp;lt; n\ do\ i := i + 1; \ j := j + i \]
\[\{j = 15\} \]&lt;/p&gt;
&lt;p&gt;If we have \( j = \frac{i(i+1)}{2} \), this is a loop invariant because we can prove that:&lt;/p&gt;
&lt;p&gt;\[\{j = \frac{i(i+1)}{2} \land i &amp;lt; n\}\ i := i + 1;\ j := j+ i\ \{j = \frac{i(i+1)}{2}\} \]&lt;/p&gt;
&lt;p&gt;If we conjoin this invariant with the exit condition \(i \geq n\), however, we still can&amp;rsquo;t show that \(j = 15\) for the given Hoare triple, because nothing pins down the final value of \(i\).&lt;/p&gt;
&lt;p&gt;If we also add the conditions \(n = 5\) and \(i \leq n\) to the invariant, and conjoin it with the exit condition \( i \geq n\), we get \( i = n = 5\), and thus \(j = \frac{5 \cdot 6}{2} = 15\) for the given Hoare triple.&lt;/p&gt;
&lt;p&gt;How we get \(j = \frac{i(i+1)}{2}\) is, however, not trivial to solve, and requires some human effort in program verification.&lt;/p&gt;
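&lt;p&gt;As a sanity check (not a proof), the candidate invariant can be tested by running the loop and asserting it before the loop and after every iteration. The sketch below is plain Python; the function name &lt;code&gt;run_loop&lt;/code&gt; is made up, and the comparisons use the standard &lt;code&gt;operator&lt;/code&gt; module purely as a spelling choice.&lt;/p&gt;

```python
from operator import le, lt


def run_loop(n):
    i, j = 0, 0
    # The invariant must hold before the loop starts.
    assert 2 * j == i * (i + 1) and le(i, n)
    while lt(i, n):  # loop condition: i is less than n
        i = i + 1
        j = j + i
        # The invariant must be re-established by every iteration.
        assert 2 * j == i * (i + 1) and le(i, n)
    # On exit the loop condition is false, so i equals n and j is n(n+1)/2.
    return i, j


print(run_loop(5))
```

&lt;p&gt;Testing only covers the inputs we try. The Hoare-logic proof is what shows the invariant holds for every starting state that satisfies the precondition.&lt;/p&gt;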
&lt;h2 id="basic-idea-behind-program-verification"&gt;Basic Idea behind Program Verification&lt;/h2&gt;
&lt;h3 id="automating-reasoning-in-hoare-logic"&gt;Automating Reasoning in Hoare Logic&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s reasonable to automate the tedious parts of program verification: proving correctness. The basic idea is to assume an oracle (a human or another program) provides the loop invariants, and to automate the rest of the reasoning.&lt;/p&gt;
&lt;p&gt;Automating Hoare logic is based on generating verification conditions (VC). Essentially, a verification condition is a formula \(\phi\) such that the program is correct iff \(\phi\) is valid.&lt;/p&gt;
&lt;p&gt;There are two ways to generate verification conditions: forwards and backwards.&lt;/p&gt;
&lt;p&gt;As their name suggests, a forwards analysis starts from precondition and generates formulas to prove post-condition. Forwards technique computes &lt;strong&gt;strongest post-conditions (sp)&lt;/strong&gt;. In contrast, backwards analysis starts from post-condition and tries to prove precondition. Backwards technique computes &lt;strong&gt;weakest preconditions (wp)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Here, we start from the backwards method.&lt;/p&gt;
&lt;h3 id="weakest-preconditions"&gt;Weakest Preconditions&lt;/h3&gt;
&lt;p&gt;Formally, we define the weakest precondition of \(Q\) with respect to \(S\) as \(wp(S, Q)\).&lt;/p&gt;
&lt;p&gt;\(wp(S, Q)\) has the property that it is the weakest condition (least amount of information we need to have) that guarantees \(Q\) holds after \(S\) in any execution.&lt;/p&gt;
&lt;p&gt;Thus, Hoare triple \( \{P\}S\{Q\} \) is valid iff \( P\Rightarrow wp(S, Q) \).&lt;/p&gt;
&lt;p&gt;Weakest preconditions are defined inductively and follow Hoare&amp;rsquo;s proof rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;\(wp(x := E, Q) = Q[E/x]\)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;\( wp(s_1 ; s_2, Q) = wp(s_1, wp(s_2, Q) ) \)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;\(wp(if \ C\ then \ s_1\ else \ s_2, Q) = (C \rightarrow wp(s_1, Q)) \land (\neg C \rightarrow wp(s_2, Q)) \)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
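&lt;p&gt;The three rules above can be sketched as predicate transformers. This is a semantic toy model rather than a real verifier: predicates are Python functions over a state dictionary instead of formulas, so substitution \(Q[E/x]\) becomes &amp;ldquo;evaluate \(Q\) in the updated state&amp;rdquo;. The names &lt;code&gt;assign&lt;/code&gt;, &lt;code&gt;seq&lt;/code&gt;, and &lt;code&gt;ite&lt;/code&gt; are chosen for illustration only.&lt;/p&gt;

```python
# A predicate maps a state (dict of variable values) to a bool.
# Each statement is a function from a post-condition to its weakest precondition.

def assign(var, expr):
    # wp(var := expr, Q) = Q[expr/var]: evaluate Q in the state after the assignment.
    return lambda post: lambda s: post({**s, var: expr(s)})


def seq(s1, s2):
    # wp(s1; s2, Q) = wp(s1, wp(s2, Q))
    return lambda post: s1(s2(post))


def ite(cond, s1, s2):
    # wp(if C then s1 else s2, Q) = (C implies wp(s1, Q)) and (not C implies wp(s2, Q)),
    # which for a single state means: use the wp of whichever branch is taken.
    return lambda post: lambda s: s1(post)(s) if cond(s) else s2(post)(s)


# wp(x := x + 1, x = n) should behave like x + 1 = n.
pre = assign("x", lambda s: s["x"] + 1)(lambda s: s["x"] == s["n"])
print(pre({"x": 4, "n": 5}))  # a state where x + 1 = n
print(pre({"x": 5, "n": 5}))  # a state where x + 1 differs from n
```

&lt;p&gt;A real tool would keep \(Q\) symbolic and hand the resulting formula to an SMT solver. Evaluating on concrete states only checks one state at a time.&lt;/p&gt;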
&lt;p&gt;However, for loops, we might not be able to compute the weakest precondition exactly because there might be cases where we simply don&amp;rsquo;t know the number of loop iterations executed.&lt;/p&gt;
&lt;p&gt;Thus, we relax our requirement by computing \(awp(S,Q)\) (\(a\) stands for approximate) instead, hoping that \(awp(S, Q)\) is weak enough to be implied by \(P\) although it may not be the weakest.&lt;/p&gt;
&lt;p&gt;Now, assuming all loops are annotated with invariants as \(while \ C \ do \ [I]\ S\), we simply define \(awp(while \ C \ do \ [I]\ S, Q) \equiv I\).&lt;/p&gt;
&lt;p&gt;However, there is another problem: since \(awp\) is only an approximate condition, \(P \Rightarrow awp(S, Q)\) does not necessarily mean that \( \{P\}S\{Q\} \) is valid. There are two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We don&amp;rsquo;t know if the loop invariant \(I\) provided by the oracle is correct, since it might be provided by a human and we know humans make mistakes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Even if \(I\) is correct, we don&amp;rsquo;t know if \(I \land \neg C\) is sufficient to establish \(Q\).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Thus, for each statement \(S\), we need to generate verification condition (VC) \( VC(S,Q) \) which encodes additional conditions to prove.&lt;/p&gt;
&lt;h3 id="verification-conditions"&gt;Verification Conditions&lt;/h3&gt;
&lt;p&gt;So how do we formulate VC generation rules for loops?&lt;/p&gt;
&lt;p&gt;\[ VC(while\ C\ do\ [I]\ S,Q) = ?\]&lt;/p&gt;
&lt;p&gt;First, we need to ensure that \(Q\) is satisfied after loop, which means \( I \land \neg C \Rightarrow Q \).&lt;/p&gt;
&lt;p&gt;To show that \(I\) is actually correct, we also need \( \{I \land C\} S \{I\} \).&lt;/p&gt;
&lt;p&gt;This implies that we need to show \( I \land C \Rightarrow awp(S, I) \). In case \(S\) contains nested loops, we also add \(VC(S, I)\).&lt;/p&gt;
&lt;p&gt;In summary, to show that the loop invariant \(I\) provided by the oracle is correct, we need to show \( I \land C \Rightarrow awp(S,I) \land VC(S, I) \).&lt;/p&gt;
&lt;p&gt;To show \(I\) is strong enough to establish \(Q\), we need to show \( I \land \neg C \Rightarrow Q \).&lt;/p&gt;
&lt;p&gt;Putting this together, and addressing the two reasons why \(P \Rightarrow awp(S, Q)\) might not imply that \( \{P\}S\{Q\} \) is valid, the VC for a while loop \( S' = while \ C \ do \ [I]\ S \) is expressed as:&lt;/p&gt;
&lt;p&gt;\[ VC(S', Q) = (I \land C \Rightarrow awp(S, I) \land VC(S, I) ) \land (I \land \neg C \Rightarrow Q) \]&lt;/p&gt;
&lt;p&gt;In essence, a verification condition simply stands for the additional checks we need to discharge before we can claim that \(P \Rightarrow awp(S, Q)\) implies \( \{P\} S \{Q\} \) is valid.&lt;/p&gt;
&lt;p&gt;The verification condition for other statements is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;For assignment, we don&amp;rsquo;t need any additional checks for precondition because if \( P \Rightarrow wp(S, Q) \), it implies that \( \{P\} S \{Q\} \) is valid. Thus, \( VC(x:= E, Q) = true \).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For composition, we have \( VC(s_1 ; s_2, Q) = VC(s_2, Q) \land VC(s_1, awp(s_2 , Q)) \).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For if statement, we have \( VC(if \ C \ then \ s_1\ else \ s_2, Q) = VC(s_1, Q) \land VC(s_2, Q) \).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;Quick question: for the if statement, why don&amp;rsquo;t we instead use the verification condition generation rule \( (C \Rightarrow VC(s_1, Q)) \land (\neg C \Rightarrow VC(s_2, Q)) \)?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here is a counterexample. Suppose we have \( S = if\ (x &amp;gt; 0) \ while (*) x - -; else \ skip\), and we are given the loop invariant \(x \geq 0\).&lt;/p&gt;
&lt;p&gt;If we use the original rule \( VC(s_1, Q) \land VC(s_2, Q) \), then according to the verification condition generation rule for while loops, we have to verify that the loop invariant \(I\) is correct, and thus \(VC(S, I) \equiv \{ x \geq 0 \} x - - \{ x \geq 0 \} \). Obviously, this is not valid (take \(x = 0\)), so the VC fails and the bogus invariant is correctly rejected.&lt;/p&gt;
&lt;p&gt;However, if we instead use the rule \( (C \Rightarrow VC(s_1, Q)) \land (\neg C \Rightarrow VC(s_2, Q)) \), the VC would become \( x &amp;gt; 0 \Rightarrow (\{ x \geq 0 \} x - - \{ x \geq 0 \}) \), which looks valid, so we would accept the wrong invariant. The guard \(x &amp;gt; 0\) only holds on entry to the branch, not at every loop iteration, so it must not be used to weaken the loop&amp;rsquo;s VC. Thus we can&amp;rsquo;t use this VC generation rule.&lt;/p&gt;
&lt;h2 id="verification-of-hoare-triple"&gt;Verification of Hoare Triple&lt;/h2&gt;
&lt;p&gt;Thus, to show validity of Hoare triple \( \{P\} S \{ Q \} \), we need to compute:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;\( awp(S, Q) \)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;\( VC(S, Q) \)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Therefore, a Hoare triple is valid if the following formula holds:&lt;/p&gt;
&lt;p&gt;\begin{equation}\tag{*}
VC(S, Q) \land (P \rightarrow awp(S, Q))
\end{equation}&lt;/p&gt;
&lt;p&gt;Thus, if we can prove the validity of the above equation *, we know the program obeys specification.&lt;/p&gt;</description></item><item><title>Congruence Closure</title><link>https://www.bodunhu.com/blog/posts/congruence-closure/</link><pubDate>Sat, 27 Mar 2021 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/congruence-closure/</guid><description>&lt;p&gt;This is a summary of how to compute congruence closure. I implemented the algorithm to compute congruence closure and thought I&amp;rsquo;d never forget it. But my memory starts to get blurry just after two days. So I figured I&amp;rsquo;d put things down so I don&amp;rsquo;t have to watch the entire lecture again the next time I need it.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="equivalence-relation"&gt;Equivalence Relation&lt;/h2&gt;
&lt;p&gt;An equivalence relation has three properties: it is reflexive, symmetric, and transitive.
(E.g. \(\geq\) is not an equivalence relation because it breaks the symmetric property: \(6 \geq 4\) does not imply that \(4 \geq 6\).)
Formally, a binary relation $R$ over a set $S$ meeting these three properties can be expressed as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reflexive: $\forall s \in S.\ sRs$&lt;/li&gt;
&lt;li&gt;Symmetric : $\forall s_1, s_2 \in S.\ s_1 R s_2 \rightarrow s_2 R s_1$&lt;/li&gt;
&lt;li&gt;Transitive: $\forall s_1, s_2, s_3 \in S.\ s_1 R s_2 \land s_2 R s_3 \rightarrow s_1 Rs_3$&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="congruence-relation"&gt;Congruence Relation&lt;/h2&gt;
&lt;p&gt;Given a set $S$ equipped with functions $F = \{f_1, \ldots, f_n\}$, a relation $R$ over $S$ is a congruence relation if $R$ is an equivalence relation and for every $n$-ary function $f \in F$ we have:&lt;/p&gt;
&lt;p&gt;\[\forall \overset{\rightarrow}{s}, \overset{\rightarrow}{t}.\ \bigwedge\limits_{i=1}^{n}s_i R t_i \rightarrow f(\overset{\rightarrow}{s}) R f(\overset{\rightarrow}{t})\]&lt;/p&gt;
&lt;p&gt;A counterexample is $R(x, y)$ defined as $|x| = |y|$ on all integers, together with $f(x) = x + 1$ (the successor function). We have $R(2, -2)$, but $f(2) = 3$ and $f(-2) = -1$, and $|3| \neq |-1|$. So $R$ is an equivalence relation that violates the congruence condition above.&lt;/p&gt;
&lt;h2 id="equivalence-closure"&gt;Equivalence Closure&lt;/h2&gt;
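&lt;p&gt;In short, the equivalence closure $R^E$ is the smallest equivalence relation that includes $R$.
This is illustrated through an example. Given a set $S = \{a, b, c, d\}$ and binary relation $R = \{\langle a, b \rangle , \langle b, c \rangle, \langle d, d \rangle\}$, $R^E$ contains every pair that follows from $R$ by the three properties of an equivalence relation: reflexivity adds $\langle a, a \rangle, \langle b, b \rangle, \langle c, c \rangle$, symmetry adds $\langle b, a \rangle, \langle c, b \rangle$, and transitivity (with symmetry) adds $\langle a, c \rangle, \langle c, a \rangle$.&lt;/p&gt;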
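&lt;p&gt;A naive way to compute an equivalence closure is to keep adding the pairs forced by reflexivity, symmetry, and transitivity until nothing changes. The sketch below does exactly that in Python. The function name &lt;code&gt;equivalence_closure&lt;/code&gt; is made up for illustration, and the element set is taken to be every element that appears in $R$ from the example.&lt;/p&gt;

```python
def equivalence_closure(elements, relation):
    closure = set(relation)
    # Reflexive: every element is related to itself.
    closure |= {(x, x) for x in elements}
    changed = True
    while changed:
        # Symmetric: flip every pair.
        new = {(y, x) for (x, y) in closure}
        # Transitive: chain pairs that share a middle element.
        new |= {(x, z) for (x, y) in closure for (w, z) in closure if y == w}
        changed = not new.issubset(closure)
        closure |= new
    return closure


R = {("a", "b"), ("b", "c"), ("d", "d")}
for pair in sorted(equivalence_closure("abcd", R)):
    print(pair)
```

&lt;p&gt;The fixpoint loop is quadratic per round and only meant for tiny examples. Union-find, which the congruence closure algorithm uses, represents the same information as equivalence classes.&lt;/p&gt;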
&lt;h2 id="congruence-closure"&gt;Congruence Closure&lt;/h2&gt;
&lt;p&gt;Naturally, the congruence closure $R^C$ is the smallest congruence relation that includes $R$. This means $R^C$ contains $R^E$ (the equivalence closure we derived before), plus every pair obtained by applying a function to arguments that are already related, as the congruence condition requires. For example, given $S = \{a, b, c\}$, a function $f$ such that $f(a) = b$, $f(b) = c$, $f(c) = c$, and a relation $R$ that relates $a$ and $b$, the congruence closure contains nine elements in total. First, we
use the procedure above to generate the equivalence closure. Then, because $a$ and $b$ are related, congruence forces $f(a) = b$ and $f(b) = c$ to be related as well, so $b = c$. Now we apply the procedure for generating the equivalence closure again.&lt;/p&gt;
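&lt;p&gt;As a sketch of that derivation, assuming the starting relation is $R = \{\langle a, b \rangle\}$ (an assumption; the argument above only needs $a$ and $b$ to be related):&lt;/p&gt;
&lt;p&gt;\[a R b \ \rightarrow\ f(a) R^C f(b) \ \rightarrow\ b R^C c\]&lt;/p&gt;
&lt;p&gt;\[a R^C b \land b R^C c \ \rightarrow\ a R^C c\]&lt;/p&gt;
&lt;p&gt;\[R^C = S \times S,\quad |R^C| = 3^2 = 9\]&lt;/p&gt;
&lt;p&gt;All three elements end up in one class, and applying $f$ again produces no new pairs because $f(a)$, $f(b)$, and $f(c)$ are already related to each other.&lt;/p&gt;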
&lt;h2 id="algorithm-to-compute-congruence-closure"&gt;Algorithm to Compute Congruence Closure&lt;/h2&gt;
&lt;p&gt;The high-level description of the algorithm is as follows:&lt;/p&gt;
&lt;p&gt;To decide the satisfiability of a $T_{=}$ (theory of equality) formula:&lt;/p&gt;
&lt;p&gt;\[F\ : \ s_1 = t_1 \land \ldots \land s_m = t_m \land s_{m+1} \neq t_{m+1} \land \ldots \land s_n \neq t_n\]&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Compute all subterms and construct the initial DAG (each node’s representative is itself).&lt;/li&gt;
&lt;li&gt;For each $i \in [1,m]$, process the equality $s_i = t_i$: merge the congruence classes of $s_i$ and $t_i$, then recursively merge any parents that become congruent. (Essentially, process all equalities first.)&lt;/li&gt;
&lt;li&gt;For each $i \in [m + 1,n]$, check whether $Rep(s_i) = Rep(t_i)$. (That is, check whether any disequality contradicts the processed equalities.)&lt;/li&gt;
&lt;li&gt;If there exists some $i \in [m + 1, n]$ for which $Rep(s_i) = Rep(t_i)$, return UNSAT.&lt;/li&gt;
&lt;li&gt;If $Rep(s_i) \neq Rep(t_i)$ for all $i \in [m + 1, n]$, return SAT.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is an example for illustration purposes, borrowed from Prof. &lt;a href="https://www.cs.utexas.edu/~isil/"&gt;Dillig&lt;/a&gt;&amp;rsquo;s slides:&lt;/p&gt;
&lt;p&gt;Given formula $F\ : \ f^3(a) = a \land f^5(a) = a \land f(a) \neq a$&lt;/p&gt;
&lt;p&gt;The initial DAG would be:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/congruence_algorithm/DAG.png#center" alt="congruence-closure-DAG"&gt;&lt;/p&gt;
&lt;p&gt;Processing the equality $f^3(a) = a$ gives us:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/congruence_algorithm/DAG_1.png#center" alt="congruence-closure-DAG-2"&gt;&lt;/p&gt;
&lt;p&gt;Recursively merging the parents results in:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/congruence_algorithm/DAG_2.png#center" alt="congruence-closure-DAG-3"&gt;&lt;/p&gt;
&lt;p&gt;Processing the equality $f^5(a) = a$ gives us:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/congruence_algorithm/DAG_3.png#center" alt="congruence-closure-DAG-4"&gt;&lt;/p&gt;
&lt;p&gt;At this point, $f^2(a)$ and $a$ are in the same congruence class, so we perform the same operation on their parents, processing the equality $f^3(a) = f(a)$:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/congruence_algorithm/DAG_5.png#center" alt="congruence-closure-DAG-5"&gt;&lt;/p&gt;
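&lt;p&gt;The merges in this example can also be tracked as a partition of the subterms into congruence classes. This is a sketch of the bookkeeping only; it does not depend on which node ends up as the representative of each class:&lt;/p&gt;
&lt;p&gt;\[\text{initially: } \{a\},\ \{f(a)\},\ \{f^2(a)\},\ \{f^3(a)\},\ \{f^4(a)\},\ \{f^5(a)\}\]&lt;/p&gt;
&lt;p&gt;\[\text{after } f^3(a) = a \text{ and parent merges: } \{a, f^3(a)\},\ \{f(a), f^4(a)\},\ \{f^2(a), f^5(a)\}\]&lt;/p&gt;
&lt;p&gt;\[\text{after } f^5(a) = a \text{: } \{a, f^2(a), f^3(a), f^5(a)\},\ \{f(a), f^4(a)\}\]&lt;/p&gt;
&lt;p&gt;\[\text{after } f^3(a) = f(a) \text{: } \{a, f(a), f^2(a), f^3(a), f^4(a), f^5(a)\}\]&lt;/p&gt;
&lt;p&gt;Every subterm ends up in a single class.&lt;/p&gt;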
&lt;p&gt;Now $f(a) \neq a$ produces a conflict: node $a$&amp;rsquo;s representative is $f(a)$, so the two terms are in the same congruence class and the equalities force $f(a) = a$.
Thus the formula is UNSAT.&lt;/p&gt;</description></item><item><title>Program Loading and Memory Mapping in Linux</title><link>https://www.bodunhu.com/blog/posts/program-loading-and-memory-mapping-in-linux/</link><pubDate>Tue, 03 Nov 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/program-loading-and-memory-mapping-in-linux/</guid><description>&lt;p&gt;This is a summary of program loading, dynamic paging, signal handling, and memory mapping in Linux.&lt;/p&gt;
&lt;h2 id="execve-syscall"&gt;execve Syscall&lt;/h2&gt;
&lt;p&gt;One of an operating system&amp;rsquo;s basic services is loading programs into memory for execution. Programs rely on the &lt;code&gt;execve&lt;/code&gt; syscall to get the OS to load a program into memory and start executing it as a process. The kernel version we used for testing is 5.4.0. A quick search in &lt;a href="https://elixir.bootlin.com/linux/v5.4/source/fs/exec.c#L1956"&gt;Elixir&lt;/a&gt; gives us:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;SYSCALL_DEFINE3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;envp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;do_execve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;envp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Following the call chain, we eventually reach &lt;code&gt;__do_execve_file&lt;/code&gt;. The comment on this function says &amp;ldquo;sys_execve() executes a new program&amp;rdquo;, which is pretty straightforward. The function first checks the &lt;code&gt;filename&lt;/code&gt; pointer. Then it checks the flags of the current process to make sure the limit on the number of running processes is not exceeded:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;IS_ERR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;PTR_ERR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/*
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * We move the actual failure in case of RLIMIT_NPROC excess from
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * set*uid() to execve() because too many poorly written programs
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * don&amp;#39;t check setuid() return code. Here we additionally recheck
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * whether NPROC limit is still exceeded.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;PF_NPROC_EXCEEDED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;atomic_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nf"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;processes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;rlimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RLIMIT_NPROC&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;EAGAIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out_ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* We&amp;#39;re below the limit (still or again), so we don&amp;#39;t want to make
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * further execve() calls fail. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;PF_NPROC_EXCEEDED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The next important task is to allocate the &lt;code&gt;struct linux_binprm&lt;/code&gt; structure defined &lt;a href="https://elixir.bootlin.com/linux/v5.4/source/include/linux/binfmts.h#L17"&gt;here&lt;/a&gt;. This structure is used to hold the arguments that are used when loading binaries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;bprm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kzalloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;GFP_KERNEL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out_files&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, the function performs a series of tasks to prepare the &lt;code&gt;bprm&lt;/code&gt; struct. Refer to the &lt;a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html"&gt;linux-insides&lt;/a&gt; book for more information on exactly how the &lt;code&gt;bprm&lt;/code&gt; structure is filled in.&lt;/p&gt;
&lt;p&gt;The most important function that &lt;code&gt;__do_execve_file&lt;/code&gt; ends up calling is &lt;code&gt;search_binary_handler&lt;/code&gt;. According to its &lt;a href="https://elixir.bootlin.com/linux/v5.4/source/fs/exec.c"&gt;comment&lt;/a&gt;, this function cycles through the list of binary format handlers until one recognizes the image. One section of the code is guarded by &lt;code&gt;binfmt_lock&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;list_for_each_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;formats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;try_module_get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;read_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;binfmt_lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;recursion_depth&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;load_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;recursion_depth&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;read_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;binfmt_lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;put_binfmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;mm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* we got to flush_old_exec() and failed after it */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;read_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;binfmt_lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;force_sigsegv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SIGSEGV&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retval&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ENOEXEC&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;read_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;binfmt_lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retval&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can see it calls into &lt;code&gt;load_binary&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;load_binary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;load_binary&lt;/code&gt; is a function pointer in the &lt;code&gt;linux_binfmt&lt;/code&gt; struct. For the ELF format, it is set &lt;a href="https://elixir.bootlin.com/linux/v5.4/source/fs/binfmt_elf.c#L94"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;linux_binfmt&lt;/span&gt; &lt;span class="n"&gt;elf_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;THIS_MODULE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_binary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_elf_binary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_shlib&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_elf_library&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;core_dump&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elf_core_dump&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_coredump&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ELF_EXEC_PAGESIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;load_elf_binary&lt;/code&gt; function is defined in the &lt;a href="https://elixir.bootlin.com/linux/v5.4/source/fs/binfmt_elf.c#L673"&gt;&lt;code&gt;fs/binfmt_elf.c&lt;/code&gt;&lt;/a&gt; file. It first checks the magic number in the ELF file header. The ELF format is described on &lt;a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format"&gt;Wikipedia&lt;/a&gt;.
For both 32-bit and 64-bit ELF files, the &lt;code&gt;e_ident&lt;/code&gt; field must start with the ELF magic number.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* Get the exec-header */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;elf_ex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;elfhdr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ENOEXEC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* First of all, some simple consistency checks */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;memcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;elf_ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;e_ident&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ELFMAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SELFMAG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, &lt;code&gt;load_elf_binary&lt;/code&gt; will do some tasks to prepare for the executable file. After that, it will try to load the program header table:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;elf_phdata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_elf_phdrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;elf_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;elf_phdata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then it traverses the program header table to find the interpreter, which is responsible for setting up the stack and mapping the ELF binary into the correct location in memory. Once the interpreter is found, the function performs simple consistency checks on it and loads the interpreter&amp;rsquo;s program headers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* Load the interpreter program headers */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;interp_elf_phdata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_elf_phdrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;interp_elf_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;interp_elf_phdata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out_free_dentry&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This function will call &lt;code&gt;setup_arg_pages&lt;/code&gt; to finalize the stack vm_area_struct:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* Do this so that we can load the interpreter, if need be. We will
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; change some of these later */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_arg_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;randomize_stack_top&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STACK_TOP&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;executable_stack&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out_free_dentry&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It will also mmap the elf image into the correct location in memory. The bss and brk sections are prepared for the executable file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/* Now we do a little grungy work by mmapping the ELF image into
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; the correct location in memory. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elf_ppnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elf_phdata&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;elf_ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;e_phnum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elf_ppnt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* There was a PT_LOAD segment with p_memsz &amp;gt; p_filesz
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; before this one. Map anonymous pages, if needed,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; and clear the area. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set_brk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elf_bss&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;load_bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;elf_brk&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;load_bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bss_prot&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;goto&lt;/span&gt; &lt;span class="n"&gt;out_free_dentry&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;nbyte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ELF_PAGEOFFSET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elf_bss&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nbyte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;nbyte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ELF_MIN_ALIGN&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nbyte&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nbyte&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;elf_brk&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elf_bss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;nbyte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;elf_brk&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elf_bss&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;clear_user&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;__user&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;elf_bss&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;load_bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nbyte&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The loop also calls &lt;code&gt;elf_map&lt;/code&gt; to map each segment&amp;rsquo;s file contents at the page-aligned range [vaddr, vaddr + file size], and then performs some checks on the result:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;elf_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_bias&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vaddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elf_ppnt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;elf_prot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elf_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The interpreter is then loaded:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;elf_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_elf_interp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;interp_elf_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;interp_map_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;load_bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interp_elf_phdata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Finally, the ELF tables (the argument, environment, and auxiliary vectors on the new process&amp;rsquo;s stack) are created:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;retval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_elf_tables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;elf_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;load_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interp_load_addr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After everything is prepared, we can call the &lt;code&gt;start_thread&lt;/code&gt; function, which prepares the new task&amp;rsquo;s registers and segments for execution. We pass it the set of registers for the new task, the address of the task&amp;rsquo;s entry point, and the address of the top of the task&amp;rsquo;s stack.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;start_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elf_entry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bprm&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A lot of the information here can also be found in the &lt;a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html"&gt;linux-insides&lt;/a&gt; book, which I found very helpful in clearing up my confusion.&lt;/p&gt;
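&lt;p&gt;To make the BSS arithmetic in the loader loop above concrete, here is a minimal user-space sketch of the same computation. It is not kernel code: the 4 KiB alignment, the &lt;code&gt;SKETCH_*&lt;/code&gt; macros, the &lt;code&gt;bss_tail_bytes&lt;/code&gt; helper, and the addresses are all assumptions made for illustration. It only shows how many bytes get zeroed between the end of the file-backed data and the next page boundary, capped at &lt;code&gt;elf_brk&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

/* Assumption: 4 KiB alignment. The kernel&amp;#39;s ELF_MIN_ALIGN depends on the
 * architecture; these SKETCH_* names are hypothetical stand-ins. */
#define SKETCH_MIN_ALIGN 4096UL
#define SKETCH_PAGEOFFSET(v) ((v) &amp;amp; (SKETCH_MIN_ALIGN - 1))

/* Number of bytes to zero starting at elf_bss: up to the end of the page
 * that contains elf_bss, but never past elf_brk. */
static unsigned long bss_tail_bytes(unsigned long elf_bss, unsigned long elf_brk)
{
    unsigned long nbyte = SKETCH_PAGEOFFSET(elf_bss);

    if (nbyte) {
        nbyte = SKETCH_MIN_ALIGN - nbyte;
        if (nbyte &amp;gt; elf_brk - elf_bss)
            nbyte = elf_brk - elf_bss;
    }
    return nbyte;
}

int main(void)
{
    /* Hypothetical addresses, chosen only to exercise each branch. */
    printf(&amp;#34;%lu\n&amp;#34;, bss_tail_bytes(0x4010UL, 0x6000UL)); /* 4096 - 16 = 4080 */
    printf(&amp;#34;%lu\n&amp;#34;, bss_tail_bytes(0x4010UL, 0x4020UL)); /* capped at 16 */
    printf(&amp;#34;%lu\n&amp;#34;, bss_tail_bytes(0x5000UL, 0x6000UL)); /* already aligned: 0 */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only the tail of the partially file-backed page needs explicit clearing; the anonymous pages that follow it start out zeroed.&lt;/p&gt;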
&lt;p&gt;In our own implementation, we do not call the loaded program&amp;rsquo;s &lt;code&gt;main&lt;/code&gt; function directly. Instead, our loader transfers control to the entry point of the loaded program via the &lt;code&gt;jmp&lt;/code&gt; instruction. This differs from calling &lt;code&gt;main&lt;/code&gt; in two major ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jumping to the entry point means the glibc startup functions run before &lt;code&gt;main&lt;/code&gt; is called, which includes setting up thread-local storage. Calling &lt;code&gt;main&lt;/code&gt; directly would simply run it with the loader&amp;rsquo;s TLS; no other setup is involved.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jmp&lt;/code&gt; doesn&amp;rsquo;t push a return address onto the stack. When the loaded program finishes execution, it exits the loader program instead of giving control back to the caller.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Scheduler Activation</title><link>https://www.bodunhu.com/blog/posts/scheduler-activation/</link><pubDate>Sat, 24 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/scheduler-activation/</guid><description>&lt;p&gt;This is a summary of scheduler activation. To discuss scheduler activation, we must first understand what a thread is. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.&lt;/p&gt;
&lt;h2 id="kernel-level-threads-proscons"&gt;Kernel Level Threads Pros/Cons&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Good functionality, system wide integration&lt;/li&gt;
&lt;li&gt;Threads are seen and scheduled only by the kernel. A lot of kernel information is invisible to user threads yet useful for scheduling&lt;/li&gt;
&lt;li&gt;Poor performance: every thread-related call traps into the kernel. This was a lot worse in the 1990s than it is now, mainly due to clock speed.
The scheduling quanta are roughly the same, but because clock speeds are much faster today, you can execute orders of magnitude more instructions
per quantum than you could in 1990. Even if a trap costs, say, 10 cycles to complete, that would be a much bigger fraction of the quantum in 1990 than
it is today.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="user-level-threads-proscons"&gt;User Level Threads Pros/Cons&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Good performance (most thread operations don&amp;rsquo;t involve the kernel)&lt;/li&gt;
&lt;li&gt;Good scheduling policy flexibility: scheduling is done by the thread library&lt;/li&gt;
&lt;li&gt;Poor system-wide integration&lt;/li&gt;
&lt;li&gt;Multi-programmed workloads are hard to schedule&lt;/li&gt;
&lt;li&gt;I/O, page faults invisible&lt;/li&gt;
&lt;li&gt;Potential for incorrect behavior
&lt;ul&gt;
&lt;li&gt;The user-level scheduler may not be cooperative. With user threads running on kernel threads, a kernel thread may block whenever the user thread it runs blocks, so an application can run out of kernel threads to run its user threads. May be gilding the lily.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="some-problems-about-user-level-threads-on-kernel-interface"&gt;Some Problems about User-Level Threads on Kernel Interface&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Insufficient visibility between the kernel and the user-level thread library&lt;/li&gt;
&lt;li&gt;Kernel events such as preemption or I/O are not visible to the user-level library
&lt;ul&gt;
&lt;li&gt;For example, if a user-level thread blocks, the kernel thread serving it also blocks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Kernel threads are scheduled obliviously with respect to the user-level thread library, so the two schedulers can interfere with each other.&lt;/li&gt;
&lt;li&gt;Kernel time-slicing of threads
&lt;ul&gt;
&lt;li&gt;For example, a user-level thread holding a spin-lock can be preempted, which can potentially cause all other user threads to wait.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="scheduler-activation"&gt;Scheduler Activation&lt;/h2&gt;
&lt;p&gt;The basic principle of scheduler activation is to expose revocation: the kernel tells the application when it takes something away. This is basically the same idea as in the exokernel. Example interfaces include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;add_processor()&lt;/li&gt;
&lt;li&gt;has_blocked()&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The basics of scheduler activation are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-threaded programs are still given an address space&lt;/li&gt;
&lt;li&gt;Facilitate flow of kernel information between user and kernel threads&lt;/li&gt;
&lt;li&gt;Kernel explicitly vectors kernel events to the user-level thread
&lt;ul&gt;
&lt;li&gt;via scheduler activation (upcall)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Extended kernel interface for processor allocation-related events
&lt;ul&gt;
&lt;li&gt;Essentially exchanging information&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
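&lt;p&gt;The upcall idea above can be sketched as a toy, single-processor simulation. This is not the interface from any real kernel: the &lt;code&gt;sa_event&lt;/code&gt; values, the &lt;code&gt;upcall&lt;/code&gt; handler, and the ready queue are hypothetical names invented for illustration. The sketch also ignores the thread that was running on the processor used to deliver the notification. What it does show is that every kernel event arrives as a fresh call into the thread library, which then decides what to run.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Hypothetical event kinds that the kernel vectors to the thread library. */
enum sa_event { SA_ADD_PROCESSOR, SA_PREEMPTED, SA_BLOCKED, SA_UNBLOCKED };

#define MAX_READY 8

/* Toy user-level ready queue of thread ids (FIFO ring buffer). */
static int ready[MAX_READY];
static int head, count;

static void enqueue(int tid)
{
    assert(count &amp;lt; MAX_READY);
    ready[(head + count) % MAX_READY] = tid;
    count++;
}

static int dequeue(void)
{
    if (count == 0)
        return -1;              /* nothing runnable */
    int tid = ready[head];
    head = (head + 1) % MAX_READY;
    count--;
    return tid;
}

/* Upcall handler: invoked in a fresh activation for every kernel event.
 * Returns the thread id the library chooses to run next, or -1 if none. */
static int upcall(enum sa_event ev, int tid)
{
    if (ev == SA_PREEMPTED || ev == SA_UNBLOCKED)
        enqueue(tid);           /* tid is runnable again */
    /* SA_BLOCKED: tid waits in the kernel. SA_ADD_PROCESSOR: no thread. */
    return dequeue();
}

int main(void)
{
    enqueue(1);
    enqueue(2);
    printf(&amp;#34;run %d\n&amp;#34;, upcall(SA_ADD_PROCESSOR, -1)); /* picks thread 1 */
    printf(&amp;#34;run %d\n&amp;#34;, upcall(SA_BLOCKED, 1));        /* thread 1 blocked, picks 2 */
    printf(&amp;#34;run %d\n&amp;#34;, upcall(SA_UNBLOCKED, 1));      /* thread 1 is ready again */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that the kernel never resumes thread 1 itself: it only reports that the thread can run again, and the library puts it back on its own queue.&lt;/p&gt;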
&lt;h2 id="scheduler-activation-vs-kernel-threads"&gt;Scheduler Activation vs Kernel Threads&lt;/h2&gt;
&lt;p&gt;Key differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Preempted threads are never resumed by the kernel directly.
&lt;ul&gt;
&lt;li&gt;Essentially, every new SA is a brand new context.&lt;/li&gt;
&lt;li&gt;For example, if you do blocking I/O, the kernel will provide a new scheduler activation and vector into the application&amp;rsquo;s address space. There isn&amp;rsquo;t a notion of &amp;ldquo;resume&amp;rdquo;. The kernel simply uses some new scheduler activation to notify you that the work has unblocked. In modern kernels, you would do something like stack unwinding to get back into user space.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An important problem is what happens if a user thread is forced to be descheduled while it&amp;rsquo;s in the scheduler. The user thread will hold a lock on the user-level run queue. That means no other user thread can be scheduled to run, because none of them can acquire the lock. Because there&amp;rsquo;s no notion of &amp;ldquo;resume&amp;rdquo; in scheduler activation, we can&amp;rsquo;t really resume the execution in the scheduler. Thus, we run into a deadlock situation.&lt;/p&gt;
&lt;p&gt;One solution is to detect whether the preempted thread holds a lock and let it keep executing until it leaves the locked region. Of course, there are too many gotchas in this solution.&lt;/p&gt;
&lt;p&gt;Another solution is that the kernel can make a copy of the critical section and execute the critical section itself, regardless of what the user thread chooses to do. Therefore, we can guarantee that by the time you vector back into user space the lock is no longer held. So the kernel is basically executing the user code! Crazy, right?
Now we run into more gotchas. What if the code is written in Java? How do we find a locked region in user space? What if &amp;hellip;&lt;/p&gt;
&lt;p&gt;Another thing we want to mention is page faults. A page fault indicates that part of your address space is missing. So there will be a notification with a new scheduler activation. Once you do something with it, you will likely touch that same piece of the address space and fault again.&lt;/p&gt;
&lt;p&gt;What is the solution?&lt;/p&gt;</description></item><item><title>Add MathJax Support to Jekyll and Hugo</title><link>https://www.bodunhu.com/blog/posts/add-mathjax-support-to-jekyll-and-hugo/</link><pubDate>Thu, 22 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/add-mathjax-support-to-jekyll-and-hugo/</guid><description>&lt;p&gt;I was using MathJax v2 for a while and heard that v3 performs significantly faster than v2. Many great tutorials explain how to add MathJax support to &lt;a href="https://jekyllrb.com/"&gt;Jekyll&lt;/a&gt; websites, but some of them only cover MathJax v2. So here is a brief summary of how to add MathJax v3 support to your Jekyll website (I&amp;rsquo;ve recently migrated to &lt;a href="https://gohugo.io/"&gt;Hugo&lt;/a&gt;, but adding support to Hugo is pretty similar).&lt;/p&gt;
&lt;!--description--&gt;
&lt;ul&gt;
&lt;li&gt;In the &lt;code&gt;_config.yml&lt;/code&gt; located in your root directory, add this line:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;markdown: kramdown
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;Create a file called &lt;code&gt;mathjax.html&lt;/code&gt; inside &lt;code&gt;_includes/&lt;/code&gt; and add these lines (these settings come from the MathJax &lt;a href="https://docs.mathjax.org/en/latest/web/configuration.html"&gt;documentation&lt;/a&gt;):&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;script&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;MathJax = {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; tex: {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; inlineMath: [ [&amp;#39;$&amp;#39;, &amp;#39;$&amp;#39;], [&amp;#39;\\(&amp;#39;, &amp;#39;\\)&amp;#39;] ]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; },
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; svg: {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; fontCache: &amp;#39;global&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;/script&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; type=&amp;#34;text/javascript&amp;#34; id=&amp;#34;MathJax-script&amp;#34; async
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; src=&amp;#34;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-svg.js&amp;#34;&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;/script&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For a Hugo website, the script is exactly the same. The only difference is that instead of putting &lt;code&gt;mathjax.html&lt;/code&gt; into &lt;code&gt;_includes/&lt;/code&gt;, you put it inside &lt;code&gt;layouts/partials&lt;/code&gt;. For example, I put my &lt;code&gt;mathjax.html&lt;/code&gt; into the theme directory &lt;code&gt;themes/mini/layouts/partials&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For Jekyll, add this line in your &lt;code&gt;_includes/head.html&lt;/code&gt; before &lt;code&gt;&amp;lt;/head&amp;gt;&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{% include mathjax.html %}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For Hugo, add the following line before &lt;code&gt;&amp;lt;/head&amp;gt;&lt;/code&gt; in your theme&amp;rsquo;s &lt;code&gt;layouts/partials/header.html&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{{ partial &amp;#34;mathjax.html&amp;#34; . }}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you can write in-line math equations in your markdown file like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;\\(f(x) = x^2\\)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$f(x) = x^2$
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The above text will be rendered as:
\(f(x) = x^2\)&lt;/p&gt;
&lt;p&gt;If you are already using MathJax v2 and just wish to convert your configuration to v3, you may also try this configuration &lt;a href="https://mathjax.github.io/MathJax-demos-web/convert-configuration/convert-configuration.html"&gt;converter&lt;/a&gt;. There is a much more detailed &lt;a href="https://docs.mathjax.org/en/latest/upgrading/v2.html"&gt;guide&lt;/a&gt; on how to migrate from MathJax v2 to v3, but it may contain information unnecessary to average Hugo or Jekyll users. The most useful resource is the official MathJax &lt;a href="https://docs.mathjax.org/en/latest/"&gt;documentation&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Linux Program Measurement and mmap</title><link>https://www.bodunhu.com/blog/posts/linux-program-measurement-and-mmap/</link><pubDate>Wed, 23 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/linux-program-measurement-and-mmap/</guid><description>&lt;p&gt;This is a summary of Linux kernel program measurement and mmap. The specs of our experiment environment are listed below. For more details regarding the CPU, please refer to &lt;a href="http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7%20i7-6800K.html"&gt;cpu world&lt;/a&gt;. This is the system spec:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th style="text-align: center"&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Processor name (BIOS)&lt;/td&gt;
&lt;td style="text-align: center"&gt;Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cores&lt;/td&gt;
&lt;td style="text-align: center"&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logical processors&lt;/td&gt;
&lt;td style="text-align: center"&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLB/Cache details&lt;/td&gt;
&lt;td style="text-align: center"&gt;64-byte Prefetching&lt;br /&gt;Data TLB: 1-GB pages, 4-way set associative, 4 entries&lt;br /&gt;Data TLB: 4-KB pages, 4-way set associative, 64 entries&lt;br /&gt;Instruction TLB: 4-KB pages, 8-way set associative, 64 entries&lt;br /&gt;L2 TLB: 1-MB, 4-way set associative, 64-byte line size&lt;br /&gt;Shared 2nd-Level TLB: 4-KB / 2-MB pages, 6-way associative, 1536 entries; plus 1-GB pages, 4-way, 16 entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td style="text-align: center"&gt;32GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operating System&lt;/td&gt;
&lt;td style="text-align: center"&gt;Ubuntu 20.04.1 LTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel Version&lt;/td&gt;
&lt;td style="text-align: center"&gt;5.4.0-47-generic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;blockquote&gt;&lt;p&gt;8-way set associative means the CPU cache is made up of sets that can each hold 8 blocks (cache lines).&lt;/p&gt;&lt;/blockquote&gt;
&lt;!--description--&gt;
&lt;p&gt;Here are the details for the CPU cache, which we will need later:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache&lt;/th&gt;
&lt;th&gt;L1 data&lt;/th&gt;
&lt;th&gt;L1 instruction&lt;/th&gt;
&lt;th&gt;L2&lt;/th&gt;
&lt;th&gt;L3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;6 x 32 KB&lt;/td&gt;
&lt;td&gt;6 x 32 KB&lt;/td&gt;
&lt;td&gt;6 x 256 KB&lt;/td&gt;
&lt;td&gt;15 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Associativity&lt;/td&gt;
&lt;td&gt;8-way set associative&lt;/td&gt;
&lt;td&gt;8-way set associative&lt;/td&gt;
&lt;td&gt;8-way set associative&lt;/td&gt;
&lt;td&gt;20-way set associative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Line size&lt;/td&gt;
&lt;td&gt;64 bytes&lt;/td&gt;
&lt;td&gt;64 bytes&lt;/td&gt;
&lt;td&gt;64 bytes&lt;/td&gt;
&lt;td&gt;64 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comments&lt;/td&gt;
&lt;td&gt;Direct-mapped&lt;/td&gt;
&lt;td&gt;Direct-mapped&lt;/td&gt;
&lt;td&gt;Non-inclusive Direct-mapped&lt;/td&gt;
&lt;td&gt;Inclusive Shared between all cores&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
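&lt;p&gt;As a worked example of what associativity implies, the number of sets in a cache is its capacity divided by (ways x line size). The sketch below applies that standard formula to the sizes in the table above; the &lt;code&gt;cache_sets&lt;/code&gt; helper is a name made up for this example, and the L1 and L2 figures are per core.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

/* sets = capacity / (ways * line size), all sizes in bytes */
static unsigned long cache_sets(unsigned long size_bytes,
                                unsigned long ways,
                                unsigned long line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

int main(void)
{
    const unsigned long KB = 1024UL, MB = 1024UL * 1024UL;

    printf(&amp;#34;L1d: %lu sets\n&amp;#34;, cache_sets(32 * KB, 8, 64));  /* 64 */
    printf(&amp;#34;L2:  %lu sets\n&amp;#34;, cache_sets(256 * KB, 8, 64)); /* 512 */
    printf(&amp;#34;L3:  %lu sets\n&amp;#34;, cache_sets(15 * MB, 20, 64)); /* 12288 */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With 64-byte lines and 64 sets, for instance, addresses that are 4 KB apart fall into the same L1 set and compete for its 8 ways.&lt;/p&gt;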
&lt;h2 id="memory-map"&gt;Memory Map&lt;/h2&gt;
&lt;p&gt;To print the &lt;code&gt;/proc/self/maps&lt;/code&gt; file for a process, we use &lt;code&gt;sprintf&lt;/code&gt; to construct the file name and then call &lt;code&gt;system&lt;/code&gt; from stdlib to cat the file, which lists the running process&amp;rsquo;s address space. Executing the program shows (also available on &lt;a href="https://gist.github.com/BDHU/9ad2f0b6353b789cfb7c29c804a6088a#file-proc_mem_map"&gt;gist&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;address perms offset dev inode pathname
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;559e3e51f000-559e3e520000 r--p 00000000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;559e3e520000-559e3e521000 r-xp 00001000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;559e3e521000-559e3e522000 r--p 00002000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;559e3e522000-559e3e523000 r--p 00002000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;559e3e523000-559e3e524000 rw-p 00003000 00:31 1199787 /mnt/hdd1/Desktop/CS/CS380L/Lab1/a.out
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c477000-7faf5c49c000 r--p 00000000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c49c000-7faf5c614000 r-xp 00025000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c614000-7faf5c65e000 r--p 0019d000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c65e000-7faf5c65f000 ---p 001e7000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c65f000-7faf5c662000 r--p 001e7000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c662000-7faf5c665000 rw-p 001ea000 08:22 11932543 /usr/lib/x86_64-linux-gnu/libc-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c665000-7faf5c66b000 rw-p 00000000 00:00 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c685000-7faf5c686000 r--p 00000000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c686000-7faf5c6a9000 r-xp 00001000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c6a9000-7faf5c6b1000 r--p 00024000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c6b2000-7faf5c6b3000 r--p 0002c000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c6b3000-7faf5c6b4000 rw-p 0002d000 08:22 11932535 /usr/lib/x86_64-linux-gnu/ld-2.31.so
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7faf5c6b4000-7faf5c6b5000 rw-p 00000000 00:00 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7ffcddb8d000-7ffcddbae000 rw-p 00000000 00:00 0 [stack]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7ffcddbe0000-7ffcddbe3000 r--p 00000000 00:00 0 [vvar]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7ffcddbe3000-7ffcddbe4000 r-xp 00000000 00:00 0 [vdso]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;According to the &lt;a href="https://man7.org/linux/man-pages/man5/proc.5.html"&gt;linux man page&lt;/a&gt;, each column has its own definition. The &lt;em&gt;address&lt;/em&gt; field is the address range in the process&amp;rsquo;s address space that the mapping occupies. The &lt;em&gt;perms&lt;/em&gt; field is a set of permissions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;r = read&lt;/li&gt;
&lt;li&gt;w = write&lt;/li&gt;
&lt;li&gt;x = execute&lt;/li&gt;
&lt;li&gt;s = shared&lt;/li&gt;
&lt;li&gt;p = private (copy on write)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;em&gt;offset&lt;/em&gt; field is the offset into the mapped file (or other backing object); &lt;em&gt;dev&lt;/em&gt; is the device (major:minor); &lt;em&gt;inode&lt;/em&gt; is the inode on that device. An inode of 0 indicates that no inode is associated with the memory region, as would be the case with BSS (uninitialized data).&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;pathname&lt;/em&gt; field will usually be the file that is backing the mapping. For ELF files, you can easily correlate the &lt;em&gt;offset&lt;/em&gt; field with the Offset field in the ELF program headers (readelf -l). In addition, we can see a few other pseudo-paths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;[stack]&lt;/em&gt;: the initial process&amp;rsquo;s (also known as the main thread&amp;rsquo;s) stack.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;[vdso]&lt;/em&gt;: The virtual dynamically linked shared object. More detailed descriptions can be found on &lt;a href="https://lwn.net/Articles/615809/"&gt;lwn&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;[vvar]&lt;/em&gt;: location of kernel space variables mapped in user space needed by virtual system calls. Essentially, a kernel-space physical address is mapped into the userspace.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;[vsyscall]&lt;/em&gt;: similar to vDSO, vsyscall is another segment used to accelerate certain system calls in Linux. Vsyscall has some limitations; among other things, there is only space for a handful of virtual system calls. More detailed descriptions can be found on &lt;a href="https://lwn.net/Articles/446528/"&gt;lwn&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
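&lt;p&gt;The column layout described above can be checked with a short C sketch that reads &lt;code&gt;/proc/self/maps&lt;/code&gt; and splits each line into its fields. Note the swap: rather than shelling out to &lt;code&gt;cat&lt;/code&gt; via &lt;code&gt;system&lt;/code&gt; as the program above does, this version opens the file directly. The &lt;code&gt;map_entry&lt;/code&gt; struct, the &lt;code&gt;parse_maps_line&lt;/code&gt; helper, and the buffer sizes are assumptions made for this example, and it presumes a 64-bit Linux system.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

/* One parsed line of /proc/[pid]/maps (hypothetical helper type). */
struct map_entry {
    unsigned long start, end, offset, inode;
    char perms[5];   /* e.g. r-xp */
    char dev[16];    /* major:minor */
    char path[256];  /* empty for anonymous mappings */
};

/* Parse a single maps line. Returns 1 on success, 0 otherwise.
 * The pathname column is optional, so six converted fields are enough. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    e-&amp;gt;path[0] = &amp;#39;\0&amp;#39;;
    int n = sscanf(line, &amp;#34;%lx-%lx %4s %lx %15s %lu %255[^\n]&amp;#34;,
                   &amp;amp;e-&amp;gt;start, &amp;amp;e-&amp;gt;end, e-&amp;gt;perms, &amp;amp;e-&amp;gt;offset,
                   e-&amp;gt;dev, &amp;amp;e-&amp;gt;inode, e-&amp;gt;path);
    return n &amp;gt;= 6;
}

int main(void)
{
    FILE *f = fopen(&amp;#34;/proc/self/maps&amp;#34;, &amp;#34;r&amp;#34;);
    if (!f) {
        perror(&amp;#34;fopen&amp;#34;);
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof line, f)) {
        struct map_entry e;
        if (parse_maps_line(line, &amp;amp;e))
            printf(&amp;#34;%lx-%lx %s %s\n&amp;#34;, e.start, e.end, e.perms,
                   e.path[0] ? e.path : &amp;#34;(anonymous)&amp;#34;);
    }
    fclose(f);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Pseudo-paths such as [stack] and [vdso] show up in the pathname column like any other name, while mappings with inode 0 and no name are printed as anonymous.&lt;/p&gt;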
&lt;p&gt;One interesting thing here is what happens when we execute the same program twice. After the first run, the output ends with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7fffbc92f000-7fffbc930000 r-xp 00000000 00:00 0 [vdso]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Type the same command again:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;7ffd6a94d000-7ffd6a94e000 r-xp 00000000 00:00 0 [vdso]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that the vDSO area has moved, while the vsyscall page remains at the same location. The location of the vsyscall page is nailed down in the kernel ABI, but the vDSO area, like most other areas in the user-space memory layout, has its location randomized every time it is mapped. The vsyscall page is a legacy implementation of user-space system call acceleration. Because it sits at a fixed address, it is an easy target for attackers. And because existing applications depend on the existence and exact address of that page, it could not simply be removed; instead, most of its functions were replaced by a special trap instruction that lets the kernel emulate the call. A more detailed explanation can be found on &lt;a href="https://lwn.net/Articles/446528/"&gt;lwn.net&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Another interesting observation is that the base address of the executable (the start of the text section) and the start address of libc are quite different. This is also a result of ASLR, which makes attacks such as return-to-libc harder.&lt;/p&gt;
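&lt;p&gt;To repeat this comparison programmatically, the sketch below parses lines in the format shown above and looks up a named mapping in &lt;code&gt;/proc/self/maps&lt;/code&gt;. It is a minimal illustration rather than code from the original program: the helper names &lt;code&gt;parse_maps_line&lt;/code&gt; and &lt;code&gt;find_mapping_start&lt;/code&gt; are hypothetical, and it assumes a 64-bit Linux system.&lt;/p&gt;

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/maps into its start/end addresses and its
 * optional name (e.g. "[vdso]"). Returns 0 on success, -1 on a bad line. */
int parse_maps_line(const char *line, unsigned long *start,
                    unsigned long *end, char *name, size_t name_len)
{
    char perms[5];
    int consumed = 0;

    if (name_len == 0)
        return -1;
    /* Layout: start-end perms offset dev inode [name] */
    if (sscanf(line, "%lx-%lx %4s %*x %*s %*u %n",
               start, end, perms, &consumed) < 3 || consumed == 0)
        return -1;

    /* Whatever follows the inode is the name; it may be empty. */
    strncpy(name, line + consumed, name_len - 1);
    name[name_len - 1] = '\0';
    name[strcspn(name, "\n")] = '\0';
    return 0;
}

/* Start address of a named mapping in the current process, or 0 if the
 * name is absent or /proc/self/maps cannot be read. */
unsigned long find_mapping_start(const char *wanted)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512], name[256];
    unsigned long start, end, found = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f)) {
        if (parse_maps_line(line, &start, &end, name, sizeof name) == 0 &&
            strcmp(name, wanted) == 0) {
            found = start;
            break;
        }
    }
    fclose(f);
    return found;
}
```

&lt;p&gt;Printing &lt;code&gt;find_mapping_start("[vdso]")&lt;/code&gt; from two separate runs should give different values when ASLR is enabled, while &lt;code&gt;"[vsyscall]"&lt;/code&gt;, if the kernel maps it at all, should stay at the same address.&lt;/p&gt;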
&lt;h2 id="getrusage"&gt;getrusage&lt;/h2&gt;
&lt;p&gt;Then, we call &lt;code&gt;getrusage&lt;/code&gt; at the end of our program and print out the fields. We will need &lt;code&gt;getrusage&lt;/code&gt; later. Here is a sample output for some fields inside &lt;code&gt;struct rusage&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;utime: 1306
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;stime: 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;maxrss: 2692
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;minflt: 76
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;majflt: 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;inblock: 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;oublock: 0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;nvcsw: 2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;nivcsw: 0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here is a short description of each of these fields. More detailed information can be found on the &lt;a href="https://www.gnu.org/software/libc/manual/html_node/Resource-Usage.html"&gt;GNU website&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;utime&lt;/strong&gt;: time spent executing user instructions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stime&lt;/strong&gt;: time spent in operating system code on behalf of processes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;maxrss&lt;/strong&gt;: the maximum resident set size used, in kilobytes. That is, the maximum number of kilobytes of physical memory that processes used simultaneously.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;minflt&lt;/strong&gt;: the number of page faults which were serviced without requiring any I/O.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;majflt&lt;/strong&gt;: the number of page faults which were serviced by doing I/O.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;inblock&lt;/strong&gt;: the number of times the file system had to read from the disk on behalf of processes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;oublock&lt;/strong&gt;: the number of times the file system had to write to the disk on behalf of processes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nvcsw&lt;/strong&gt;: the number of times processes voluntarily invoked a context switch (usually to wait for some service).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;nivcsw&lt;/strong&gt;: the number of times an involuntary context switch took place (because a time slice expired, or another process of higher priority was scheduled).&lt;/li&gt;
&lt;/ul&gt;
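&lt;p&gt;For reference, here is a minimal sketch of how such output could be produced. The helper names &lt;code&gt;timeval_to_us&lt;/code&gt; and &lt;code&gt;print_rusage&lt;/code&gt; are hypothetical, and printing the two time fields in microseconds is an assumption; the original program may format them differently.&lt;/p&gt;

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to microseconds. */
long long timeval_to_us(struct timeval tv)
{
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/* Fetch resource usage for the calling process and print the fields
 * described above. Returns 0 on success, -1 if getrusage fails. */
int print_rusage(FILE *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    /* Assumption: times are reported in microseconds. */
    fprintf(out, "utime: %lld\n", timeval_to_us(ru.ru_utime));
    fprintf(out, "stime: %lld\n", timeval_to_us(ru.ru_stime));
    fprintf(out, "maxrss: %ld\n", ru.ru_maxrss);   /* kilobytes on Linux */
    fprintf(out, "minflt: %ld\n", ru.ru_minflt);
    fprintf(out, "majflt: %ld\n", ru.ru_majflt);
    fprintf(out, "inblock: %ld\n", ru.ru_inblock);
    fprintf(out, "oublock: %ld\n", ru.ru_oublock);
    fprintf(out, "nvcsw: %ld\n", ru.ru_nvcsw);
    fprintf(out, "nivcsw: %ld\n", ru.ru_nivcsw);
    return 0;
}
```

&lt;p&gt;Calling &lt;code&gt;print_rusage(stdout)&lt;/code&gt; as the last statement of &lt;code&gt;main&lt;/code&gt; captures the usage of the whole run.&lt;/p&gt;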
&lt;h2 id="perf_event_open"&gt;perf_event_open&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;perf_event_open&lt;/code&gt; interface is useful for measuring numerous system events. However, glibc doesn&amp;rsquo;t provide a wrapper for this system call. Instead, we need to use &lt;code&gt;syscall&lt;/code&gt; directly.&lt;/p&gt;
&lt;p&gt;To use &lt;code&gt;perf_event_open&lt;/code&gt;, we create a wrapper function that performs the actual syscall for us. Take the example from the &lt;a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html"&gt;Linux man page&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;perf_event_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;perf_event_attr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;pid_t&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;group_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;syscall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__NR_perf_event_open&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;group_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;__NR_perf_event_open&lt;/code&gt; is the syscall number. On our local machine, &lt;code&gt;/usr/include/x86_64-linux-gnu/sys/syscall.h&lt;/code&gt; pulls in the header that defines &lt;code&gt;__NR_perf_event_open&lt;/code&gt;. In our case, the definition lives in &lt;code&gt;/usr/include/x86_64-linux-gnu/asm/unistd_64.h&lt;/code&gt;, where its value on x86-64 is 298 (&lt;code&gt;0x12a&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;If we run &lt;code&gt;objdump -d&lt;/code&gt; on the binary file, we will see something like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;000000000000119a &amp;lt;perf_event_open&amp;gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 119a: 55 push %rbp
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 119b: 48 89 e5 mov %rsp,%rbp
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 119e: 48 83 ec 30 sub $0x30,%rsp
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11a2: 48 89 7d e8 mov %rdi,-0x18(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11a6: 89 75 e4 mov %esi,-0x1c(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11a9: 89 55 e0 mov %edx,-0x20(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11ac: 89 4d dc mov %ecx,-0x24(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11af: 4c 89 45 d0 mov %r8,-0x30(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11b3: 48 8b 7d d0 mov -0x30(%rbp),%rdi
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11b7: 8b 75 dc mov -0x24(%rbp),%esi
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11ba: 8b 4d e0 mov -0x20(%rbp),%ecx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11bd: 8b 55 e4 mov -0x1c(%rbp),%edx
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11c0: 48 8b 45 e8 mov -0x18(%rbp),%rax
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11c4: 49 89 f9 mov %rdi,%r9
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11c7: 41 89 f0 mov %esi,%r8d
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11ca: 48 89 c6 mov %rax,%rsi
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11cd: bf 2a 01 00 00 mov $0x12a,%edi
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11d2: b8 00 00 00 00 mov $0x0,%eax
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11d7: e8 84 fe ff ff callq 1060 &amp;lt;syscall@plt&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11dc: 89 45 fc mov %eax,-0x4(%rbp)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11df: 8b 45 fc mov -0x4(%rbp),%eax
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11e2: 48 98 cltq
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11e4: c9 leaveq
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 11e5: c3 retq
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;One line stands out:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;callq 1060 &amp;lt;syscall@plt&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;plt&lt;/code&gt; stands for Procedure Linkage Table. This line is a call to the &lt;code&gt;syscall&lt;/code&gt; entry in the PLT. The PLT lets calls to functions in shared libraries be resolved to their absolute addresses at runtime.&lt;/p&gt;
&lt;p&gt;Looking at the &lt;code&gt;&amp;lt;syscall@plt&amp;gt;&lt;/code&gt; entry in the disassembly of the .plt section, we see:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000001060 &amp;lt;syscall@plt&amp;gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 1060: ff 25 62 2f 00 00 jmpq *0x2f62(%rip) #3fc8&amp;lt;syscall@GLIBC_2.2.5&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 1066: 68 03 00 00 00 pushq $0x3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; 106b: e9 b0 ff ff ff jmpq 1020 &amp;lt;.plt&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Notice that the first jump is indirect: it jumps through a pointer stored in the GOT (Global Offset Table). That GOT slot will eventually hold the absolute address of &lt;code&gt;syscall&lt;/code&gt;. On the first call, however, the slot points back to the instruction right after the jump in the PLT, &lt;code&gt;0x1066&lt;/code&gt;. That instruction pushes a relocation index, and the jump that follows enters the runtime linker code, which resolves &lt;code&gt;syscall&lt;/code&gt; in its shared library and stores the resolved address in the GOT slot.&lt;/p&gt;
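&lt;p&gt;The addresses that &lt;code&gt;objdump&lt;/code&gt; prints for these jumps can be checked by hand: on x86-64, a relative displacement is added to the address of the next instruction. The sketch below redoes that arithmetic with the bytes from the disassembly above; the helper names &lt;code&gt;rel_target&lt;/code&gt; and &lt;code&gt;le32&lt;/code&gt; are hypothetical.&lt;/p&gt;

```c
#include <stdint.h>

/* Target of a RIP-relative reference or a relative jump/call on x86-64:
 * the displacement is relative to the address of the NEXT instruction. */
uint64_t rel_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}

/* Decode a little-endian signed 32-bit displacement from instruction bytes. */
int32_t le32(const uint8_t b[4])
{
    return (int32_t)((uint32_t)b[0] | (uint32_t)b[1] << 8 |
                     (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24);
}

/* Worked examples, using bytes copied from the listings above:
 *
 *   1060: ff 25 62 2f 00 00   jmpq *0x2f62(%rip)
 *         rel_target(0x1060, 6, 0x2f62)  == 0x3fc8  (the GOT slot)
 *
 *   106b: e9 b0 ff ff ff      jmpq 1020 <.plt>
 *         le32({b0,ff,ff,ff}) == -0x50
 *         rel_target(0x106b, 5, -0x50)   == 0x1020
 *
 *   11d7: e8 84 fe ff ff      callq 1060 <syscall@plt>
 *         le32({84,fe,ff,ff}) == -0x17c
 *         rel_target(0x11d7, 5, -0x17c)  == 0x1060
 */
```

&lt;p&gt;The first result, &lt;code&gt;0x3fc8&lt;/code&gt;, is the address that &lt;code&gt;objdump&lt;/code&gt; annotates next to the indirect jump.&lt;/p&gt;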
&lt;p&gt;We also see the comment for the first jump instruction&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#3fc8&amp;lt;syscall@GLIBC_2.2.5&amp;gt;
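&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Using &lt;code&gt;objdump -R&lt;/code&gt;, we can see the dynamic relocation entries in the file. The GOT slot at offset &lt;code&gt;3fc8&lt;/code&gt; shows up as the &lt;code&gt;R_X86_64_JUMP_SLOT&lt;/code&gt; entry for &lt;code&gt;syscall&lt;/code&gt;:&lt;/p&gt;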
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;DYNAMIC RELOCATION RECORDS
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;OFFSET TYPE VALUE
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003d98 R_X86_64_RELATIVE *ABS*+0x0000000000001190
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003da0 R_X86_64_RELATIVE *ABS*+0x0000000000001150
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000004008 R_X86_64_RELATIVE *ABS*+0x0000000000004008
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fd8 R_X86_64_GLOB_DAT _ITM_deregisterTMCloneTable
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fe0 R_X86_64_GLOB_DAT __libc_start_main@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fe8 R_X86_64_GLOB_DAT __gmon_start__
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003ff0 R_X86_64_GLOB_DAT _ITM_registerTMCloneTable
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003ff8 R_X86_64_GLOB_DAT __cxa_finalize@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fb0 R_X86_64_JUMP_SLOT getpid@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fb8 R_X86_64_JUMP_SLOT __stack_chk_fail@GLIBC_2.4
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fc0 R_X86_64_JUMP_SLOT system@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fc8 R_X86_64_JUMP_SLOT syscall@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0000000000003fd0 R_X86_64_JUMP_SLOT sprintf@GLIBC_2.2.5
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;!-- If we take a look at instruction ```1020```, we see
```text
0000000000001020 &lt;.plt&gt;:
1020: ff 35 7a 2f 00 00 pushq 0x2f7a(%rip) # 3fa0 &lt;_GLOBAL_OFFSET_TABLE_+0x8&gt;
1026: ff 25 7c 2f 00 00 jmpq *0x2f7c(%rip) # 3fa8 &lt;_GLOBAL_OFFSET_TABLE_+0x10&gt;
102c: 0f 1f 40 00 nopl 0x0(%rax)
``` --&gt;
&lt;h2 id="monitor-events"&gt;Monitor Events&lt;/h2&gt;
&lt;p&gt;Next, we are going to look at L1 data cache metrics. We are interested in L1 data cache accesses, misses, and data TLB misses. The code below is what we will measure in our experiment. &lt;code&gt;CACHE_LINE_SIZE&lt;/code&gt; is defined as 64 to match our CPU&amp;rsquo;s cache line size.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// p points to a region that is 1GB (ideally)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;do_mem_access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;locality&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;max_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;outer&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;outer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;simplerand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;max_base&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// Pick a starting offset
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;opt_random_access&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_base&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;locality&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;locality&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// Working set of 512 cache lines, 32KB
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws_base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CACHE_LINE_SIZE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In essence, this routine picks a working set of 512 cache lines (32KB) and touches each line once, writing on every eighth access and reading otherwise. This pass is repeated 16 times during each outer iteration. Each read or write access within a pass operates on a new cache line, so the innermost loop sweeps a region the size of the entire L1 data cache.&lt;/p&gt;
&lt;p&gt;When &lt;code&gt;opt_random_access&lt;/code&gt; is true, the starting cache line of the working set is picked at random. Otherwise, it advances by 512 cache lines (one working set) on each outer iteration. The main difference is that with &lt;code&gt;opt_random_access&lt;/code&gt; set to true, the starting address can&amp;rsquo;t be predicted by the hardware, which likely increases the miss rate.&lt;/p&gt;
&lt;p&gt;To measure L1 data cache metrics, we will use the &lt;code&gt;perf_event_open&lt;/code&gt; interface discussed above. For example, to count L1 data cache read accesses, we configure our &lt;code&gt;struct perf_event_attr&lt;/code&gt; as follows:&lt;/p&gt;
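&lt;p&gt;It helps to know how many accesses the routine issues before looking at counter values. The sketch below derives those numbers from the loop bounds in &lt;code&gt;do_mem_access&lt;/code&gt;; the constant and function names are hypothetical. Measured counts will usually be higher, since loop variables and other stack traffic also go through the L1 data cache.&lt;/p&gt;

```c
#include <stdint.h>

/* Loop bounds copied from do_mem_access. */
enum {
    LINE_BYTES  = 64,   /* CACHE_LINE_SIZE */
    WS_LINES    = 512,  /* cache lines per working set */
    PASSES      = 16,   /* "locality" loop */
    WRITE_EVERY = 8     /* every eighth access is a write */
};
#define OUTER_ITERS (1ULL << 20)

/* Size of one working set in bytes: 512 * 64 = 32768 (32KB). */
uint64_t working_set_bytes(void)
{
    return (uint64_t)WS_LINES * LINE_BYTES;
}

/* Explicit accesses through the pointer a: 2^20 * 16 * 512 = 2^33. */
uint64_t total_accesses(void)
{
    return OUTER_ITERS * PASSES * WS_LINES;
}

/* Writes happen when i % 8 == 0, i.e. 64 times per pass: 2^30 in total. */
uint64_t total_writes(void)
{
    return OUTER_ITERS * PASSES * (WS_LINES / WRITE_EVERY);
}

/* Everything else is a read: 7 * 2^30. */
uint64_t total_reads(void)
{
    return total_accesses() - total_writes();
}
```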
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#define CALC_CONFIG(perf_hw_cache_id, perf_hw_cache_op_id, perf_hw_cache_op_result_id) \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;((perf_hw_cache_id) | (perf_hw_cache_op_id &amp;lt;&amp;lt; 8) | (perf_hw_cache_op_result_id &amp;lt;&amp;lt; 16))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PERF_TYPE_HW_CACHE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;perf_event_attr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// disable at init time
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;CALC_CONFIG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PERF_COUNT_HW_CACHE_L1D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_HW_CACHE_OP_READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_HW_CACHE_RESULT_ACCESS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The exact details can be found in the &lt;a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html"&gt;Linux man page&lt;/a&gt;. The important part is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hw_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;CALC_CONFIG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PERF_COUNT_HW_CACHE_L1D&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_HW_CACHE_OP_READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_COUNT_HW_CACHE_RESULT_ACCESS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
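&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This configuration counts L1 data cache read accesses; replacing &lt;code&gt;PERF_COUNT_HW_CACHE_RESULT_ACCESS&lt;/code&gt; with &lt;code&gt;PERF_COUNT_HW_CACHE_RESULT_MISS&lt;/code&gt; counts read misses instead. The arguments passed to &lt;code&gt;perf_event_open&lt;/code&gt; are:&lt;/p&gt;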
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;pid_t&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;group_fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The choice of these parameters is also explained on the &lt;a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html"&gt;Linux man page&lt;/a&gt;: &lt;code&gt;pid = 0&lt;/code&gt; with &lt;code&gt;cpu = -1&lt;/code&gt; measures the calling process/thread on any CPU. After &lt;code&gt;perf_event_open&lt;/code&gt; is called, we reset and enable the counter by calling&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;ioctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_EVENT_IOC_RESET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;ioctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PERF_EVENT_IOC_ENABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The first call resets the event count behind the file descriptor to zero, and the second enables the event. After &lt;code&gt;do_mem_access(p, size)&lt;/code&gt; has executed, we call &lt;code&gt;ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)&lt;/code&gt; to disable the event and then read the result with &lt;code&gt;read(fd, &amp;amp;result, sizeof(long long))&lt;/code&gt;. The layout of the result depends on which &lt;code&gt;PERF_FORMAT_*&lt;/code&gt; flags were specified. You can also check &lt;a href="https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L1833"&gt;lxr&lt;/a&gt; to see how &lt;code&gt;__perf_event_read_size&lt;/code&gt; calculates the size of the data that is read. In our case, it&amp;rsquo;s simply a &lt;code&gt;u64&lt;/code&gt;.&lt;/p&gt;
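&lt;p&gt;Putting the pieces together, the whole measurement can be sketched as below. This is a minimal sketch rather than the exact code behind the numbers in this post: &lt;code&gt;l1d_read_miss_config&lt;/code&gt; and &lt;code&gt;count_l1d_read_misses&lt;/code&gt; are hypothetical helper names, passing &lt;code&gt;pid = 0&lt;/code&gt; and setting &lt;code&gt;exclude_kernel&lt;/code&gt;/&lt;code&gt;exclude_hv&lt;/code&gt; are assumptions, and &lt;code&gt;read_format&lt;/code&gt; is left at 0 so that &lt;code&gt;read&lt;/code&gt; returns a single &lt;code&gt;u64&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#define _GNU_SOURCE
#include &amp;lt;linux/perf_event.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;sys/ioctl.h&amp;gt;
#include &amp;lt;sys/syscall.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

/* Encode &amp;#34;L1 data cache, read, miss&amp;#34; as documented in perf_event_open(2). */
static uint64_t l1d_read_miss_config(void) {
    return PERF_COUNT_HW_CACHE_L1D |
           (PERF_COUNT_HW_CACHE_OP_READ &amp;lt;&amp;lt; 8) |
           (PERF_COUNT_HW_CACHE_RESULT_MISS &amp;lt;&amp;lt; 16);
}

/* Hypothetical helper: count L1D read misses while work(arg) runs.
 * Returns 0 on success, -1 if the counter cannot be opened or read. */
static int count_l1d_read_misses(void (*work)(void *), void *arg, uint64_t *misses) {
    struct perf_event_attr attr;
    memset(&amp;amp;attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = l1d_read_miss_config();
    attr.disabled = 1;        /* start disabled; enabled explicitly below */
    attr.exclude_kernel = 1;  /* assumption: count user-space misses only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper, so use syscall(2): pid 0, cpu -1, group_fd -1, flags 0 */
    int fd = (int)syscall(SYS_perf_event_open, &amp;amp;attr, 0, -1, -1, 0UL);
    if (fd == -1)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0 the result is a single u64 */
    ssize_t n = read(fd, misses, sizeof(*misses));
    close(fd);
    return n == (ssize_t)sizeof(*misses) ? 0 : -1;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Passing a small wrapper around &lt;code&gt;do_mem_access&lt;/code&gt; as &lt;code&gt;work&lt;/code&gt; follows the same reset, enable, disable, read sequence. A return value of -1 means the counter could not be opened or read.&lt;/p&gt;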
&lt;blockquote&gt;
&lt;p&gt;Be aware that simply executing the binary might cause &lt;code&gt;perf_event_open&lt;/code&gt; to fail (in which case it returns -1). Using &lt;code&gt;sudo&lt;/code&gt; is one workaround. Run &lt;code&gt;cat /proc/sys/kernel/perf_event_paranoid&lt;/code&gt; and see what it prints. &lt;code&gt;-1&lt;/code&gt; means unprivileged users can use nearly all events, including raw kernel tracepoints. With a higher value, you might have trouble accessing the performance counters without root privileges. Check this &lt;a href="https://unix.stackexchange.com/questions/14227/do-i-need-root-admin-permissions-to-run-userspace-perf-tool-perf-events-ar"&gt;stackexchange post&lt;/a&gt; for more details.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To make the results more repeatable, we should flush the level 1 data cache before enabling the performance counters. We do this by touching every byte of a memory buffer larger than the per-core L1 data cache (here 32 KiB plus 1 KiB of slack):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;buff&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We also pin the process to a single CPU with the &lt;code&gt;sched_setaffinity&lt;/code&gt; function, so that it cannot migrate between cores during the measurement. In our example we pin it to CPU 7:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;cpu_set_t&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;CPU_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;CPU_SET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;aff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sched_setaffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;cpu_set_t&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We perform each of the above experiments 5 times. First, we turn on random cache line base address generation. On average, we see around 1010665367 L1 data cache read misses with a standard deviation of 61010967 misses. When random access is disabled, we see on average 964420324 read misses with a standard deviation of 65787193 misses. We can also measure the number of L1 data cache write misses by using the &lt;code&gt;PERF_COUNT_HW_CACHE_OP_WRITE&lt;/code&gt; config instead, and &lt;code&gt;PERF_COUNT_HW_CACHE_OP_PREFETCH&lt;/code&gt; gives us prefetch misses. In our case, both of these metrics are unavailable, which we can confirm by checking &lt;code&gt;/arch/x86/events/intel/core.c&lt;/code&gt; in &lt;a href="https://elixir.bootlin.com/linux/v5.6/source/arch/x86/events/intel/core.c"&gt;lxr&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We can also use the &lt;code&gt;PERF_COUNT_HW_CACHE_DTLB&lt;/code&gt; config option to measure the data TLB. With random access turned on, reads incur on average 3390719 misses (std dev 17579), while writes incur 1486451 misses (std dev 13455). The prefetch metrics for the TLB are unavailable in our case. To find out which metrics are supported, check the constant &lt;code&gt;static __initconst const u64 skl_hw_cache_event_ids&lt;/code&gt; for your kernel version.&lt;/p&gt;
&lt;p&gt;With random cache line access turned off, we see 517335 data TLB read misses with a standard deviation of 3820 misses. For writes we see on average 809671 misses with a standard deviation of 9580 misses. This is a significant reduction compared to the random access case.&lt;/p&gt;
&lt;p&gt;To calculate the L1 cache miss rate and data TLB miss rate, we use &lt;code&gt;100.0 * cache_misses / cache_accesses&lt;/code&gt; and &lt;code&gt;100.0 * tlb_misses / cache_accesses&lt;/code&gt;. With random access turned off, we get an L1 read miss rate of $$miss_{cache} = 1.5\%$$ and a TLB read miss rate of $$miss_{tlb} \approx 0$$. When random access is turned on, we have $$miss_{cache} = 1.4\%$$ and $$miss_{tlb} \approx 0$$. The miss rate is really low in all scenarios. This is mainly because the innermost loop in our routine operates on a working set that is already present in the L1 cache and TLB. The read/write operations use contiguous cache lines, which means there will be almost no misses while we access the 512 cache lines. If one miss causes the entire new working set to be cached, then there are no subsequent misses until the entire working set has been iterated over.&lt;/p&gt;
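&lt;p&gt;The two formulas above translate directly into a small helper. This is a sketch, assuming the miss and access counts come from two separate counters (the access counter would use &lt;code&gt;PERF_COUNT_HW_CACHE_RESULT_ACCESS&lt;/code&gt; in place of &lt;code&gt;PERF_COUNT_HW_CACHE_RESULT_MISS&lt;/code&gt;); &lt;code&gt;miss_rate_percent&lt;/code&gt; is a hypothetical name.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdint.h&amp;gt;

/* Miss rate in percent; returns 0.0 when no accesses were counted,
 * which avoids a division by zero. */
static double miss_rate_percent(uint64_t misses, uint64_t accesses) {
    if (accesses == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)accesses;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The same helper works for both the L1 and the TLB rate, since both divide by the number of cache accesses.&lt;/p&gt;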
&lt;p&gt;If we use &lt;code&gt;getrusage&lt;/code&gt; we can see the metrics listed below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;std dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;utime&lt;/td&gt;
&lt;td&gt;868629&lt;/td&gt;
&lt;td&gt;126044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stime&lt;/td&gt;
&lt;td&gt;253586&lt;/td&gt;
&lt;td&gt;20112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maxrss&lt;/td&gt;
&lt;td&gt;1049691&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minflt&lt;/td&gt;
&lt;td&gt;262214&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;majflt&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;inblock&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;oublock&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvcsw&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nivcsw&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;h2 id="mmap"&gt;mmap&lt;/h2&gt;
&lt;p&gt;Next we explore the behavior of &lt;code&gt;mmap&lt;/code&gt;. Previously, we used &lt;code&gt;malloc&lt;/code&gt; for data allocation; now we use &lt;code&gt;mmap&lt;/code&gt; instead and see what happens. We only use read accesses as the benchmark metric here, since they are available for both the L1 and TLB counters.&lt;/p&gt;
&lt;p&gt;First, we pass the &lt;code&gt;MAP_ANONYMOUS&lt;/code&gt; flag to &lt;code&gt;mmap&lt;/code&gt;. This flag means the mapping is not backed by any file; its contents are initialized to zero. The complete call is&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;mmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PROT_READ&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PROT_WRITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;MAP_PRIVATE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MAP_ANONYMOUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd_ignore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For more details, refer to the &lt;a href="https://man7.org/linux/man-pages/man2/mmap.2.html"&gt;mmap man page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When we turn on random access and use the &lt;code&gt;perf_event_open&lt;/code&gt; interface to collect metrics, we see 956148031 L1 data cache read misses (std dev 84631843) and 3370309 data TLB read misses (std dev 17792). This is not really different from the malloc approach we used before. A simple &lt;code&gt;strace&lt;/code&gt; shows that &lt;code&gt;malloc&lt;/code&gt; calls &lt;code&gt;mmap&lt;/code&gt;. The memory that backs &lt;code&gt;malloc()&lt;/code&gt; allocations is handled by the kernel in much the same way as the memory that backs private anonymous mappings created with &lt;code&gt;mmap()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Then, we use &lt;code&gt;mmap()&lt;/code&gt; to create a mapping in the virtual address space that is backed by a file instead of using &lt;code&gt;MAP_ANONYMOUS&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We first test &lt;code&gt;mmap&lt;/code&gt; with &lt;code&gt;MAP_PRIVATE&lt;/code&gt;. According to the man page, this flag creates a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that we should call &lt;code&gt;fallocate()&lt;/code&gt; on the newly created file; otherwise accessing the mapping triggers a bus error (SIGBUS).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When we measure the L1 data cache misses, the count is around 946128512 (std dev 956148031); nothing special happens. With the &lt;code&gt;MAP_SHARED&lt;/code&gt; flag, the result is similar. The results seem to fluctuate over time, but overall they are not much different. After all, we are just reading from memory; whether the address is backed by a file or not doesn&amp;rsquo;t play a big role in the cache miss rate. The L1 data cache misses are shown below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;PRIVATE&lt;/th&gt;
&lt;th&gt;PRIVATE+POPULATE&lt;/th&gt;
&lt;th&gt;SHARED&lt;/th&gt;
&lt;th&gt;SHARED+POPULATE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;783864673&lt;/td&gt;
&lt;td&gt;769314361&lt;/td&gt;
&lt;td&gt;842915231&lt;/td&gt;
&lt;td&gt;816749524&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Std dev&lt;/td&gt;
&lt;td&gt;77816766&lt;/td&gt;
&lt;td&gt;53913082&lt;/td&gt;
&lt;td&gt;54613278&lt;/td&gt;
&lt;td&gt;60580595&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;p&gt;If we look at data TLB misses, the results are&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;PRIVATE&lt;/th&gt;
&lt;th&gt;PRIVATE+POPULATE&lt;/th&gt;
&lt;th&gt;SHARED&lt;/th&gt;
&lt;th&gt;SHARED+POPULATE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;3372303&lt;/td&gt;
&lt;td&gt;3370740&lt;/td&gt;
&lt;td&gt;3381755&lt;/td&gt;
&lt;td&gt;3377370&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Std dev&lt;/td&gt;
&lt;td&gt;9884&lt;/td&gt;
&lt;td&gt;13567&lt;/td&gt;
&lt;td&gt;17626&lt;/td&gt;
&lt;td&gt;11776&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;p&gt;Still, there doesn&amp;rsquo;t seem to be any significant fluctuation in the number of data TLB misses. This pattern also applies to sequential access, except that the number of data TLB misses is a lot lower with sequential access.&lt;/p&gt;
&lt;p&gt;Now if we instead use &lt;code&gt;getrusage()&lt;/code&gt;, we get something like this&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;PRIVATE&lt;/th&gt;
&lt;th&gt;PRIVATE+POPULATE&lt;/th&gt;
&lt;th&gt;SHARED&lt;/th&gt;
&lt;th&gt;SHARED+POPULATE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;utime sec/std dev&lt;/td&gt;
&lt;td&gt;20/0&lt;/td&gt;
&lt;td&gt;20/0&lt;/td&gt;
&lt;td&gt;20/0&lt;/td&gt;
&lt;td&gt;20/0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utime usec/std dev&lt;/td&gt;
&lt;td&gt;801512/78346&lt;/td&gt;
&lt;td&gt;793452/143556&lt;/td&gt;
&lt;td&gt;872342/124124&lt;/td&gt;
&lt;td&gt;671957/229314&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stime sec/std dev&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stime usec/std dev&lt;/td&gt;
&lt;td&gt;475977/ 54355&lt;/td&gt;
&lt;td&gt;475678/ 134253&lt;/td&gt;
&lt;td&gt;445467/ 99345&lt;/td&gt;
&lt;td&gt;536041/ 98797&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;oublock/std dev&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;td&gt;0/0&lt;/td&gt;
&lt;td&gt;2997152/ 82256&lt;/td&gt;
&lt;td&gt;2097152/ 19760&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;p&gt;The most interesting part is that when &lt;code&gt;MAP_SHARED&lt;/code&gt; is used, &lt;code&gt;oublock&lt;/code&gt; immediately changes. As mentioned previously, &lt;code&gt;oublock&lt;/code&gt; specifies the number of times the file system had to write to the disk on behalf of the process. Because the address range is now backed by a file, write operations cause the file system to write the contents back to the file.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mmap()&lt;/code&gt; creates a new mapping in the virtual address space of the
calling process. However, it doesn&amp;rsquo;t allocate RAM. If we call &lt;code&gt;memset()&lt;/code&gt; followed by &lt;code&gt;msync()&lt;/code&gt; with the &lt;code&gt;MS_SYNC&lt;/code&gt; flag, we get some interesting results from &lt;code&gt;getrusage&lt;/code&gt;. These observations are summarized here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kernel-space time is much higher. It usually takes 1 sec (no std dev) as opposed to 0. Synchronizing to files on disk requires more kernel participation.&lt;/li&gt;
&lt;li&gt;minflt (the number of page faults serviced without requiring any I/O) is much higher, around 540782 (std dev 3). With more memory already mapped, faults that need I/O become less likely.&lt;/li&gt;
&lt;li&gt;oublock is much higher, around 4196512 (std dev 1). The sync operation means approximately double the amount of writes to disk.&lt;/li&gt;
&lt;li&gt;nvcsw is higher, meaning there are more voluntary context switches. Writing results to disk takes time, so the process likely needs to context switch while waiting for the I/O to finish.&lt;/li&gt;
&lt;/ul&gt;
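&lt;p&gt;The experiment behind these observations can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that produced the numbers above: &lt;code&gt;dirty_and_sync&lt;/code&gt; is a hypothetical helper, the scratch file lives under &lt;code&gt;/tmp&lt;/code&gt;, and the file is sized with &lt;code&gt;ftruncate()&lt;/code&gt; rather than &lt;code&gt;fallocate()&lt;/code&gt; to keep the example portable.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#define _GNU_SOURCE
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;sys/mman.h&amp;gt;
#include &amp;lt;sys/resource.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

/* Hypothetical helper: dirty a shared file mapping of `length` bytes, force it
 * to disk with msync(MS_SYNC), and report the getrusage() counters.
 * Returns 0 on success, -1 on any failure. */
static int dirty_and_sync(size_t length, struct rusage *usage) {
    char path[] = &amp;#34;/tmp/mmap-sync-XXXXXX&amp;#34;;  /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */

    /* Size the file first; touching pages past end-of-file raises SIGBUS */
    if (ftruncate(fd, (off_t)length) == -1) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(p, 1, length);                /* first touch faults the pages in */
    int rc = msync(p, length, MS_SYNC);  /* block until dirty pages are written */
    if (rc == 0)
        rc = getrusage(RUSAGE_SELF, usage);

    munmap(p, length);
    close(fd);
    return rc;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After a successful call, &lt;code&gt;usage-&amp;gt;ru_minflt&lt;/code&gt;, &lt;code&gt;usage-&amp;gt;ru_oublock&lt;/code&gt; and &lt;code&gt;usage-&amp;gt;ru_nvcsw&lt;/code&gt; hold the counters discussed in the list above. The exact values depend on the file system and kernel.&lt;/p&gt;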
&lt;p&gt;We may notice that the number of data TLB misses is lower than the total number of pages the application uses. One obvious explanation is the use of huge pages: one huge page covers many small pages. Also, because we have a prefetching TLB and the working set access pattern is contiguous, the TLB hit rate will be high. Because we have a set-associative TLB and we access memory in a fairly deterministic way, it&amp;rsquo;s easy to predict where the next access will land. For example, if the replacement policy is FIFO, then each cache line will remain untouched for exactly the same number of clock cycles before it is replaced. This also applies to other policies. One way to determine the replacement algorithm is to use P-Chase.&lt;/p&gt;
&lt;h2 id="strace"&gt;strace&lt;/h2&gt;
&lt;p&gt;We then use &lt;code&gt;strace&lt;/code&gt; to trace the syscalls of our application. The output contains some interesting information, for example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/etc/ld.so.preload&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R_OK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;ENOENT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;No&lt;/span&gt; &lt;span class="n"&gt;such&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;arch_prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARCH_SET_FS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x7fdc6ad83540&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;According to the &lt;a href="https://man7.org/linux/man-pages/man2/arch_prctl.2.html"&gt;arch_prctl man page&lt;/a&gt;, &lt;code&gt;arch_prctl()&lt;/code&gt; sets architecture-specific process or thread state. The &lt;code&gt;ARCH_SET_FS&lt;/code&gt; option sets the 64-bit base for the FS register to addr, which in our case is 0x7fdc6ad83540. Let&amp;rsquo;s set a breakpoint at &lt;code&gt;arch_prctl&lt;/code&gt; and backtrace from there:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#0 0x00007ffff7febb55 in ?? () from /lib64/ld-linux-x86-64.so.2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#1 0x00007ffff7fd104c in ?? () from /lib64/ld-linux-x86-64.so.2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#2 0x00007ffff7fd0108 in ?? () from /lib64/ld-linux-x86-64.so.2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#3 0x0000000000000001 in ?? ()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#4 0x00007fffffffe2fa in ?? ()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#5 0x0000000000000000 in ?? ()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can see the FS segment base is set by &lt;code&gt;ld-linux&lt;/code&gt;, which is part of glibc, during program loading. A quick Google search tells us &lt;code&gt;/lib64/ld-linux-x86-64.so.2&lt;/code&gt; is the dynamic linker. A more detailed description can be found in this &lt;a href="https://unix.stackexchange.com/questions/400621/what-is-lib64-ld-linux-x86-64-so-2-and-why-can-it-be-used-to-execute-file"&gt;post&lt;/a&gt; and on &lt;a href="https://lwn.net/Articles/631631/"&gt;lwn.net&lt;/a&gt;. During startup, the loader initializes TLS (thread-local storage). This includes allocating memory and setting the FS base value to point to the beginning of the TLS, which is done via the &lt;code&gt;arch_prctl&lt;/code&gt; syscall. More can be found &lt;a href="https://unix.stackexchange.com/questions/453749/what-sets-fs0x28-stack-canary/453772"&gt;here&lt;/a&gt;. &lt;code&gt;init_tls()&lt;/code&gt; is called &lt;a href="https://git.launchpad.net/glibc/tree/elf/rtld.c?id=916124ed841745b7a1e0fbc43f9909340b47d373#n1397"&gt;here&lt;/a&gt;, which subsequently makes the actual &lt;a href="https://git.launchpad.net/glibc/tree/sysdeps/x86_64/nptl/tls.h#n153"&gt;syscall&lt;/a&gt; in &lt;code&gt;tls.h&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The /etc/ld.so.preload file is similar to LD_PRELOAD, but it doesn&amp;rsquo;t suffer from the security limitations imposed on LD_PRELOAD (&lt;a href="https://superuser.com/questions/1183037/what-is-does-ld-so-preload-do"&gt;explanation here&lt;/a&gt;). This is a feature of &lt;em&gt;glibc&lt;/em&gt;.&lt;/p&gt;
&lt;!-- ## Measuring memory access behavior
First we thought having more background processes running would have an effect on the application behavior. The GNOME desktop environment consists of dozens of processes running at the same time. However, there doesn't seem to be any difference when we turn off the DE. This is mostly because the workload is not heavy enough to contend for resources with our application. Inspired by the contention experiment, we instead use a kernel compilation as the ''background activity''. We use the -j flag to spawn more processes than the number of physical cores. This creates more opportunities for context switches and higher memory utilization. The most significant change we see is in nivcsw. Previously, the number was very often 0. However, as we create more background processes the value of nivcsw increases dramatically. Our case shows around 234 switches on average with a standard deviation of 123. --&gt;
&lt;h2 id="competing-for-memory"&gt;Competing for Memory&lt;/h2&gt;
&lt;p&gt;Next we fork another process that competes for memory with our process under test. We use the following code snippet, which is executed by both the parent and the child process:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;compete_for_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;unused&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;mem_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_mem_size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;page_sz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_SC_PAGE_SIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;Total memsize is %3.2f GBs&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;mem_size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;fflush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PROT_READ&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PROT_WRITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;MAP_NORESERVE&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;MAP_PRIVATE&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;MAP_ANONYMOUS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;off_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
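&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MAP_FAILED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;perror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;Failed anon MMAP competition&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;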
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;simplerand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem_size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;page_sz&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;mem_size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;page_sz&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// One read and write per page
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;//a = p + i * page_sz; // sequential access
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_sz&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;get_mem_size()&lt;/code&gt; is implemented using this &lt;a href="https://stackoverflow.com/questions/22670257/getting-ram-size-in-c-linux-non-precise-result"&gt;portable code&lt;/a&gt;, where it is named &lt;code&gt;getMemorySize()&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(_WIN32)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;Windows.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) &amp;amp;&amp;amp; defined(__MACH__))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;unistd.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/types.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/param.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(BSD)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/sysctl.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#else
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#error &amp;#34;Unable to define getMemorySize( ) for an unknown OS.&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt;/**
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; * Returns the size of physical memory (RAM) in bytes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cm"&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="nf"&gt;getMemorySize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(_WIN32) &amp;amp;&amp;amp; (defined(__CYGWIN__) || defined(__CYGWIN32__))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* Cygwin under Windows. ------------------------------------ */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* New 64-bit MEMORYSTATUSEX isn&amp;#39;t available. Use old 32.bit */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;MEMORYSTATUS&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dwLength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;GlobalMemoryStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dwTotalPhys&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(_WIN32)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* Windows. ------------------------------------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* Use new 64-bit MEMORYSTATUSEX, not old 32-bit MEMORYSTATUS */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;MEMORYSTATUSEX&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dwLength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;GlobalMemoryStatusEx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ullTotalPhys&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) &amp;amp;&amp;amp; defined(__MACH__))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* UNIX variants. ------------------------------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* Prefer sysctl() over sysconf() except sysctl() HW_REALMEM and HW_PHYSMEM */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(CTL_HW) &amp;amp;&amp;amp; (defined(HW_MEMSIZE) || defined(HW_PHYSMEM64))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CTL_HW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(HW_MEMSIZE)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HW_MEMSIZE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* OSX. --------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(HW_PHYSMEM64)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HW_PHYSMEM64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* NetBSD, OpenBSD. --------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int64_t&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* 64-bit */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nf"&gt;sysctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0L&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* Failed? */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(_SC_AIX_REALMEM)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* AIX. ----------------------------------------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_AIX_REALMEM&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="mi"&gt;1024L&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(_SC_PHYS_PAGES) &amp;amp;&amp;amp; defined(_SC_PAGESIZE)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* FreeBSD, Linux, OpenBSD, and Solaris. -------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PHYS_PAGES&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PAGESIZE&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(_SC_PHYS_PAGES) &amp;amp;&amp;amp; defined(_SC_PAGE_SIZE)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* Legacy. -------------------------------------------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PHYS_PAGES&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PAGE_SIZE&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(CTL_HW) &amp;amp;&amp;amp; (defined(HW_PHYSMEM) || defined(HW_REALMEM))
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="cm"&gt;/* DragonFly BSD, FreeBSD, NetBSD, OpenBSD, and OSX. -------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CTL_HW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#if defined(HW_REALMEM)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HW_REALMEM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* FreeBSD. ----------------- */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#elif defined(HW_PHYSMEM)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HW_PHYSMEM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* Others. ------------------ */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* 32-bit */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nf"&gt;sysctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;mib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0L&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* Failed? */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif &lt;/span&gt;&lt;span class="cm"&gt;/* sysctl and sysconf variants */&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#else
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0L&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="cm"&gt;/* Unknown OS. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The important line is&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PHYS_PAGES&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;sysconf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;_SC_PAGESIZE&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;One thing to notice in the routine that competes for memory is that we call &lt;code&gt;fflush&lt;/code&gt; after the &lt;code&gt;printf&lt;/code&gt;. &lt;code&gt;fflush(stream)&lt;/code&gt; forces the C library to write any buffered output for that stream to the underlying file. This matters because stdout is buffered: when it is attached to a terminal it is typically line-buffered, so output is not written until a newline is printed. Calling &lt;code&gt;fflush&lt;/code&gt; forces the write even in the absence of a newline. stderr is unbuffered, so &lt;code&gt;fflush&lt;/code&gt; would not be necessary there.&lt;/p&gt;
&lt;p&gt;We ran this experiment in a VM, because the contending process takes all available RAM and would completely halt the machine if run on the host. To ensure the VM has enough swap space, we followed this &lt;a href="https://wiki.crowncloud.net/?adding_swap_kvm"&gt;tutorial&lt;/a&gt; to create a 4GB swap area (we allocated 2GB of RAM to the VM).&lt;/p&gt;
&lt;p&gt;One thing we observe is that the program takes significantly longer to run. In our experiment we had to reduce the number of iterations from &lt;code&gt;1 &amp;lt;&amp;lt; 20&lt;/code&gt; to &lt;code&gt;1 &amp;lt;&amp;lt; 8&lt;/code&gt; to get sensible results without running for days.&lt;/p&gt;
&lt;p&gt;With the MAP_PRIVATE and MAP_ANONYMOUS options and random access turned on, the number of data TLB misses is 335009 (std dev 7298). We could not collect L1 data cache numbers because the session is automatically logged out whenever the L1D counter is used. Here are some interesting things to notice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;MAP_PRIVATE + MAP_ANONYMOUS&lt;/em&gt;: TLB misses: 335009 (std dev 17298)&lt;br&gt;
minflt: 4220 (std dev 231)&lt;br&gt;
oublock: 8 (std dev 4)&lt;br&gt;
nivcsw: 19 (std dev 10)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;MAP_SHARED&lt;/em&gt;: TLB misses: 251284 (std dev 103292)&lt;br&gt;
minflt: 2784 (std dev 231)&lt;br&gt;
majflt: 247 (std dev 65)&lt;br&gt;
oublock: 18200 (std dev 2987)&lt;br&gt;
nivcsw: 8 (std dev 7)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most important difference here is that oublock is triggered much more easily because of the constant swapping. With file-backed memory, majflt is also much higher: because pages constantly travel between the swap area and memory, the page fault rate rises considerably. oublock also follows the earlier pattern, since file-backed memory requires filesystem involvement.&lt;/p&gt;
&lt;p&gt;Finally, we also modify the kernel&amp;rsquo;s head (or more precisely, its LRU page replacement algorithm). In &lt;code&gt;mm/vmscan.c&lt;/code&gt; there is a function called &lt;code&gt;shrink_page_list&lt;/code&gt;. It contains a switch statement with a &lt;code&gt;PAGEREF_ACTIVATE&lt;/code&gt; case, which is taken when the kernel sees that the page has been recently accessed. In that case the kernel jumps to &lt;code&gt;activate_locked&lt;/code&gt;; we change it to do the same thing as the &lt;code&gt;PAGEREF_RECLAIM&lt;/code&gt; case, simply by moving the case label down so that it falls through to &lt;code&gt;PAGEREF_RECLAIM&lt;/code&gt;. After that, we recompile the kernel for the VM. The most interesting results are summarized below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;MAP_PRIVATE + MAP_ANONYMOUS&lt;/em&gt;: TLB misses: 308031 (std dev 17298)&lt;br&gt;
minflt: 4223 (std dev 791)&lt;br&gt;
oublock: 8 (std dev 1)&lt;br&gt;
nivcsw: 11 (std dev 5)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;MAP_SHARED&lt;/em&gt;: TLB misses: 251284 (std dev 103292)&lt;br&gt;
minflt: 2724 (std dev 231)&lt;br&gt;
majflt: 0 (std dev 0)&lt;br&gt;
oublock: 18200 (std dev 2987)&lt;br&gt;
nivcsw: 8 (std dev 7)&lt;/li&gt;
&lt;/ul&gt;
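&lt;p&gt;To make the kernel change described earlier concrete, here is a simplified, standalone model of the switch statement. It is a sketch and not actual kernel code: the &lt;code&gt;PAGEREF_*&lt;/code&gt; names follow the kernel&amp;rsquo;s &lt;code&gt;enum page_references&lt;/code&gt; as far as we know, while &lt;code&gt;enum page_action&lt;/code&gt; and both policy functions are hypothetical names introduced only for illustration.&lt;/p&gt;

```c
/* Simplified model of the reference check in shrink_page_list().
 * The action enum and both functions are hypothetical and exist
 * only to illustrate the control flow. */
enum page_references {
    PAGEREF_RECLAIM,
    PAGEREF_RECLAIM_CLEAN,
    PAGEREF_KEEP,
    PAGEREF_ACTIVATE
};

enum page_action { ACTION_RECLAIM, ACTION_KEEP, ACTION_ACTIVATE };

/* Stock behavior: a recently referenced page goes back to the active list. */
enum page_action stock_policy(enum page_references refs)
{
    switch (refs) {
    case PAGEREF_ACTIVATE:
        return ACTION_ACTIVATE;   /* kernel: goto activate_locked */
    case PAGEREF_KEEP:
        return ACTION_KEEP;       /* kernel: goto keep_locked */
    case PAGEREF_RECLAIM:
    case PAGEREF_RECLAIM_CLEAN:
        break;                    /* kernel: fall out and try to reclaim */
    }
    return ACTION_RECLAIM;
}

/* Modified behavior: PAGEREF_ACTIVATE is moved down so that it shares
 * the reclaim path, i.e. recent access no longer protects a page. */
enum page_action modified_policy(enum page_references refs)
{
    switch (refs) {
    case PAGEREF_KEEP:
        return ACTION_KEEP;
    case PAGEREF_ACTIVATE:        /* moved: now treated like RECLAIM */
    case PAGEREF_RECLAIM:
    case PAGEREF_RECLAIM_CLEAN:
        break;
    }
    return ACTION_RECLAIM;
}
```

&lt;p&gt;The only difference between the two functions is where the &lt;code&gt;PAGEREF_ACTIVATE&lt;/code&gt; label sits, which mirrors how small the actual patch is.&lt;/p&gt;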
&lt;p&gt;Most of the patterns follow the previous results after the modified kernel is installed. The main difference is that majflt for the shared mapping drops back to zero.&lt;/p&gt;</description></item><item><title>Memory Resource Management in VMware ESX Server</title><link>https://www.bodunhu.com/blog/posts/memory-resource-management-in-vmware-esx-server/</link><pubDate>Mon, 21 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/memory-resource-management-in-vmware-esx-server/</guid><description>&lt;p&gt;&lt;a href="https://www.vmware.com/products/esxi-and-esx.html"&gt;VMware ESX Server&lt;/a&gt; is a software layer designed to multiplex hardware resources among virtual machines running unmodified commodity operating systems. Unlike &lt;a href="https://www.vmware.com/products/workstation-pro.html"&gt;VMware Workstation&lt;/a&gt;, ESX Server is a type 1 hypervisor, which means it runs directly on bare metal. ESX Server focuses on running guest VMs without modifying the guest OSes at all, which is challenging.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Memory virtualization is done by interposing an extra abstraction layer between a &lt;code&gt;physical address&lt;/code&gt;, which is physical only from the VM&amp;rsquo;s point of view, and a &lt;code&gt;machine address&lt;/code&gt;, which represents the actual hardware memory. ESX Server maintains a &lt;em&gt;pmap&lt;/em&gt; data structure for each VM to translate physical page numbers (PPNs) to machine page numbers (MPNs). A separate &lt;em&gt;shadow page table&lt;/em&gt;, kept consistent with the physical-to-machine mappings, holds the virtual-to-machine page mappings. This avoids additional overhead, as the hardware TLB caches direct virtual-to-machine address translations read from the shadow page table.&lt;/p&gt;
&lt;/blockquote&gt;
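&lt;p&gt;The relationship between the guest page table, the pmap, and the shadow page table can be sketched as a simple composition of two mappings. This is a toy model under loud assumptions: the flat arrays, the function name &lt;code&gt;build_shadow&lt;/code&gt;, and its parameters are hypothetical, and real page tables are multi-level and carry permission bits.&lt;/p&gt;

```c
/* Sketch: composing a guest page table with the per-VM pmap to fill a
 * shadow page table. All names and the flat-array layout are hypothetical.
 *
 * guest_pt[vpn] = PPN chosen by the guest OS (virtual to "physical").
 * pmap[ppn]     = MPN chosen by the hypervisor ("physical" to machine).
 * shadow[vpn]   = MPN the hardware actually uses (virtual to machine). */
void build_shadow(const unsigned guest_pt[], const unsigned pmap[],
                  unsigned shadow[], unsigned num_pages)
{
    unsigned vpn;
    for (vpn = 0; vpn != num_pages; vpn++)
        shadow[vpn] = pmap[guest_pt[vpn]];
}
```

&lt;p&gt;Because the hardware walks only the shadow table, a TLB miss costs one lookup instead of two; the price is that the hypervisor has to keep the shadow entries in sync whenever either input mapping changes.&lt;/p&gt;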
&lt;h2 id="key-features"&gt;Key features&lt;/h2&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ESXServer/ballooning.png" alt="vmware-esx-server-memory"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ballooning&lt;/strong&gt; is a technique used by the server to reclaim memory. As the name suggests, the hypervisor inflates the balloon by instructing the balloon driver module in the guest to allocate pinned physical pages, and deflates it by instructing the driver to deallocate previously allocated pages. The idea behind this technique is that the hypervisor is unaware of the specific usage patterns or policies of its guests, so page replacement decisions are best made inside the guest VM. When the hypervisor overcommits memory, it needs some way to reclaim memory from the VMs. The balloon does this by consuming some of the memory that the guest OS believes is physically present in the virtual machine. Under memory pressure, the guest OS then swaps memory to disk, reducing the load on the host&amp;rsquo;s physical memory. The host can then reallocate that memory to other VMs. A detailed description of ballooning can be found in this &lt;a href="https://www.vladan.fr/what-is-vmware-memory-ballooning/"&gt;post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.vmware.com/pdf/usenix_resource_mgmt.pdf"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ESXServer/pghash.png#center" alt="vmware-esx-server-page-coloring"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Page coloring can be used to reduce cache misses or to partition resources, but it might complicate memory management, especially in the presence of huge pages. Because coloring enforces ownership, it might result in distinct L2 cache entries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sharing memory&lt;/strong&gt; is achieved by comparing the contents of pages, since modifying guest operating system internals is not possible. Because comparing every page against every other page would be \(O(n^2)\), hashing is used to identify candidate pages more efficiently. By letting VMs share pages based on their contents, the host can potentially save a dramatic amount of space. For example, zero pages are a great opportunity for page sharing: one zero page can be mapped into multiple VMs. A hash hit is only a hint; it doesn&amp;rsquo;t guarantee that the content of the page hasn&amp;rsquo;t changed at that moment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idle Memory&lt;/strong&gt; presents a problem in pure proportional-share algorithms because they do not incorporate any information about active memory usage, even though memory demand may change dynamically. ESX Server collects an &lt;em&gt;idle memory tax&lt;/em&gt; from VMs to mitigate this issue: a client is charged more for an idle page than for an active one, with the cost of idle memory inflated by the tax rate. Idle page metrics are collected by the hypervisor without the guests&amp;rsquo; involvement, by periodically sampling randomly chosen pages of each VM.&lt;/p&gt;
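&lt;p&gt;A minimal sketch of the hash-then-compare idea behind content-based sharing. The page size, the FNV-1a hash, and all function names are assumptions chosen for illustration and are not what ESX Server actually uses; the full comparison after a hash match is what guards against collisions and against pages that changed after they were hashed.&lt;/p&gt;

```c
/* Sketch of content-based page matching. PAGE_SIZE, the FNV-1a hash, and
 * the function names are illustrative choices, not the ESX implementation. */
#define PAGE_SIZE 4096u

/* 32-bit FNV-1a over one page (assumes unsigned int is 32 bits wide). */
unsigned int page_hash(const unsigned char *page)
{
    unsigned int h = 2166136261u;
    unsigned int i;
    for (i = 0; i != PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/* Full byte-by-byte comparison; returns 1 if the pages are identical. */
int pages_identical(const unsigned char *a, const unsigned char *b)
{
    unsigned int i;
    for (i = 0; i != PAGE_SIZE; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* A matching hash is only a hint, so confirm with a full comparison
 * before mapping both pages to one copy-on-write frame. */
int can_share(const unsigned char *a, const unsigned char *b)
{
    if (page_hash(a) != page_hash(b))
        return 0;
    return pages_identical(a, b);
}
```

&lt;p&gt;In a real system the hashes would live in a table keyed by hash value, so each page is compared only against the few candidates that share its hash rather than against every other page.&lt;/p&gt;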
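&lt;p&gt;As we read the paper, the tax is applied through an adjusted shares-per-page ratio \(\rho = S / (P \cdot (f + k \cdot (1 - f)))\), where \(S\) is the number of shares, \(P\) the number of allocated pages, \(f\) the fraction of pages that are active, and \(k = 1 / (1 - \tau)\) the cost of an idle page for tax rate \(\tau\). The sketch below just evaluates that arithmetic; the function and parameter names are hypothetical.&lt;/p&gt;

```c
/* Sketch of the idle memory tax arithmetic. Function and parameter
 * names are hypothetical; tax_rate is assumed to lie in [0, 1). */

/* Cost of an idle page relative to an active one: k = 1 / (1 - tax). */
double idle_page_cost(double tax_rate)
{
    return 1.0 / (1.0 - tax_rate);
}

/* Adjusted shares-per-page ratio for a VM holding `pages` pages, of
 * which the fraction `active_fraction` is actively used. */
double shares_per_page(double shares, double pages,
                       double active_fraction, double tax_rate)
{
    double k = idle_page_cost(tax_rate);
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)));
}
```

&lt;p&gt;For example, with a tax rate of 0.75 an idle page costs four times as much as an active one, so a fully idle VM ends up with a quarter of the ratio of a fully active VM that holds the same shares and pages, which makes it the preferred victim when memory has to be reclaimed.&lt;/p&gt;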
&lt;h2 id="questions"&gt;Questions&lt;/h2&gt;
&lt;p&gt;a. What is the overhead of ballooning? Triggering memory management in the VM by &amp;ldquo;tricking&amp;rdquo; it into thinking the the memory resource is scarce/plentiful may have unexpected behaviors.&lt;br&gt;
b. Do content-based sharing pose security vulnerabilities?&lt;br&gt;
c. Remapping hot I/O pages to low memory can be a bottleneck if the page number is high. How does modern hypervisor solution cope with this issue?&lt;/p&gt;</description></item><item><title>Xen and the Art of Virtualization</title><link>https://www.bodunhu.com/blog/posts/xen-and-the-art-of-virtualization/</link><pubDate>Wed, 16 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/xen-and-the-art-of-virtualization/</guid><description>&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Xen/Xen.png#center" alt="xen-architecture"&gt;&lt;/p&gt;
&lt;p&gt;Xen is an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource-managed fashion, without sacrificing either performance or functionality. Xen is a type 1 hypervisor, which runs directly on bare metal. We will summarize what Xen is and what its attributes are.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;paravirtualization - presents a virtual machine abstraction that is similar but not identical to the underlying hardware.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="the-virtual-machine-interface"&gt;The Virtual Machine Interface&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; is hard to virtualize mostly because x86 doesn&amp;rsquo;t have a software-managed TLB; TLB misses are serviced by the processor walking the page tables in hardware. A tagged TLB would let the guest OS and the hypervisor coexist efficiently, because each entry can be associated with an address-space identifier. The x86 TLB is not tagged, so switching address spaces typically requires flushing the TLB. To achieve better performance, guest OSes are therefore responsible for managing the hardware page tables, with the hypervisor validating their updates. A guest OS can batch its update requests to amortize the cost of entering the hypervisor, for example when new processes are created.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; virtualization has implications for guest OSes. Normally, the OS is the most privileged entity running on the hardware. Placing a hypervisor underneath it means the guest OSes must be modified to run at a lower privilege level. On x86 this is feasible because OSes typically execute in ring 0 while applications execute in ring 3, leaving rings 1 and 2 unused, so a guest OS can be moved to ring 1.
Privileged instructions issued by the guest generally have to be validated by the hypervisor. For performance reasons, system-call exceptions can be dispatched by the processor directly to a handler registered by the guest, without passing through the hypervisor. Page faults, however, must go through the hypervisor because only code in ring 0 can read the faulting address from &lt;code&gt;CR2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Device I/O&lt;/strong&gt; is implemented by transferring data between the guest and Xen through shared-memory, asynchronous buffer-descriptor rings. Event delivery is achieved by the hypervisor sending notifications to the guest asynchronously. When and whether to hold off these callbacks is at the discretion of the guest.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Xen/xen_ringbug.png#center" alt="xen-ring-buffer"&gt;&lt;/p&gt;
&lt;p&gt;Essentially, the virtualization interface design is based on a number of factors. The hypervisor acts as a security guard that validates guest requests which would normally go directly to the hardware if the guest were running in ring 0. The bottom line is that the hypervisor shouldn&amp;rsquo;t be involved unless there are hardware limitations, or resource validation or management is required. The goal is to separate policy from mechanism wherever possible. This is similar to an exokernel in that the hypervisor merely provides basic functionality without understanding higher-level issues.&lt;/p&gt;
&lt;h2 id="questions"&gt;Questions&lt;/h2&gt;
&lt;p&gt;a. Why does x86 make it hard to support efficient virtualization?&lt;br&gt;
b. How does placing Xen in a 64MB section at the top of every address space avoid TLB flushes when entering and leaving the hypervisor?&lt;/p&gt;</description></item><item><title>Start Linux Kernel Hacking</title><link>https://www.bodunhu.com/blog/posts/start-linux-kernel-hacking/</link><pubDate>Mon, 14 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/start-linux-kernel-hacking/</guid><description>&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/linux_kernel_hacking/linux.jpg#center" alt="Linux"&gt;&lt;/p&gt;
&lt;aside class="toc"&gt;
&lt;h4&gt;Table of Contents&lt;/h4&gt;
&lt;nav id="TableOfContents"&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#getting-the-vm-running-in-kvm"&gt;Getting the VM running in KVM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#building-the-kernel"&gt;Building the Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#build-and-install-kernel-modules"&gt;Build and Install Kernel Modules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#booting-kvm-with-the-new-kernel"&gt;Booting KVM with the new Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#booting-process"&gt;Booting Process&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#debugging-kernel"&gt;Debugging Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#set-breakpoints"&gt;Set Breakpoints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#syscall"&gt;Syscall&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/nav&gt;
&lt;/aside&gt;
&lt;p&gt;This is a summary of how to compile and boot the Linux kernel in a KVM/QEMU virtual machine. It covers how to get a VM running in KVM, how to build a customized kernel, and how to use GDB with the Linux kernel. The experiment is conducted on an amd64 CPU. We use Ubuntu as our testing environment, but the steps covered here should apply to other distros as well.&lt;/p&gt;
&lt;h2 id="getting-the-vm-running-in-kvm"&gt;Getting the VM running in KVM&lt;/h2&gt;
&lt;p&gt;The Ubuntu ISO image is downloaded from the &lt;a href="https://ubuntu.com/download/desktop"&gt;Canonical website&lt;/a&gt;. The kernel is downloaded directly from &lt;a href="https://www.kernel.org/"&gt;kernel.org&lt;/a&gt;. The specs of our test environment are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz&lt;/li&gt;
&lt;li&gt;RAM: 32 GB&lt;/li&gt;
&lt;li&gt;Host and Guest OS: Ubuntu 20.04.1 LTS&lt;/li&gt;
&lt;li&gt;Host Kernel Version: 5.4.0-47-generic&lt;/li&gt;
&lt;li&gt;GCC: 7.5.0&lt;/li&gt;
&lt;li&gt;QEMU emulator version: 4.2.0&lt;/li&gt;
&lt;li&gt;Guest Kernel Version: 5.8.6&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After obtaining the Ubuntu ISO image, we use the virt-manager GUI to install the OS. Note that the default directory for virtual disks is &lt;code&gt;/var/lib/libvirt/images&lt;/code&gt;. Since my system partition is located on a separate SSD with limited space, I changed the virtual disk directory to my &lt;code&gt;/home&lt;/code&gt; directory instead.&lt;/p&gt;
&lt;p&gt;We also create the new virtual disk inside virt-manager, choosing the raw format instead of qcow2. A new image file can also be created from the command line using:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;qemu-img create -f raw -o &lt;span class="nv"&gt;preallocation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full vmdisk.img 40G
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Preallocation can be turned on or off depending on personal preference. After the disk image is created, we proceed in virt-manager to install Ubuntu on the newly allocated virtual disk. We enabled storage for this virtual machine so that we don&amp;rsquo;t need to repeat the installation process every time we launch the VM. Note that we don&amp;rsquo;t need a swap area inside the virtual machine; we can simply use the whole virtual disk for the &lt;code&gt;/&lt;/code&gt; partition.&lt;/p&gt;
&lt;p&gt;To start the VM from the command line, you might need to change the owner of the disk image. We add the user to both the &lt;code&gt;kvm&lt;/code&gt; and &lt;code&gt;libvirt&lt;/code&gt; groups. Images created or accessed by virt-manager seem to have their owner changed to libvirt-qemu, which may cause problems when starting the VM from the command line.&lt;/p&gt;
&lt;p&gt;After the installation is finished, we can simply launch the virtual machine from the virt-manager GUI. We can also start the VM from the command line:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kvm -accel kvm -m 8G -smp &lt;span class="m"&gt;6&lt;/span&gt; --snapshot -drive &lt;span class="nv"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;/home/ed/virtimg/ubuntu20.04
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The argument &lt;code&gt;-accel kvm&lt;/code&gt; enables Kernel-based Virtual Machine (KVM) acceleration, which relies on the CPU&amp;rsquo;s hardware virtualization support. Without this option the VM becomes extremely slow. &lt;code&gt;-m 8G&lt;/code&gt; assigns the given amount of memory to the VM. &lt;code&gt;-smp 6&lt;/code&gt; gives the guest the specified number of virtual CPUs. &lt;code&gt;--snapshot&lt;/code&gt; ensures that no changes are made to your image during execution, so you can do something dangerous and still have the original image file preserved. The &lt;code&gt;-drive&lt;/code&gt; option specifies the location of the virtual disk and its format. We will use some of these options again later.&lt;/p&gt;
&lt;p&gt;To confirm the VM has internet access, simply execute &lt;code&gt;apt install pkg-name&lt;/code&gt; in the guest terminal. The absence of error messages indicates that network access from the guest VM is working. For example, when we execute &lt;code&gt;sudo apt install llvm&lt;/code&gt; it shows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Reading package lists... Done
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Building dependency tree
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Reading state information... Done
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;The following additional packages will be installed:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; llvm-runtime
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;The following NEW packages will be installed:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; llvm llvm-runtime
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Need to get 6,796 B of archives.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;After this operation, 128 kB of additional disk space will be used.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Do you want to continue? [Y/n]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="building-the-kernel"&gt;Building the Kernel&lt;/h2&gt;
&lt;p&gt;We can use our customized kernel for our newly created VM. After we obtain the Linux kernel from &lt;a href="https://www.kernel.org/"&gt;kernel.org&lt;/a&gt;, we extract the source into &amp;lt;kernel dir&amp;gt; and create a separate build directory &amp;lt;kbuild&amp;gt; (outside &amp;lt;kernel dir&amp;gt;).&lt;/p&gt;
&lt;p&gt;Then we enter the &amp;lt;kbuild&amp;gt; directory and run&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;yes &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; make -C /home/ed/Desktop/linux_kernel/kbuild &lt;span class="nv"&gt;O&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt; config
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here &lt;code&gt;-C&lt;/code&gt; tells &lt;code&gt;make&lt;/code&gt; which directory to run in, &lt;code&gt;O=$(pwd)&lt;/code&gt; places the build output in the current directory, and piping &lt;code&gt;yes &amp;#34;&amp;#34;&lt;/code&gt; into the command accepts the default answer at every prompt. This will create a &lt;code&gt;.config&lt;/code&gt; file inside &amp;lt;kbuild&amp;gt; with the default options selected. We then open the configuration file and ensure &lt;code&gt;CONFIG_SATA_AHCI=y&lt;/code&gt;, which builds the SATA disk driver into the kernel. That allows the kernel to boot off a (virtual) SATA drive without having to load a module first.&lt;/p&gt;
&lt;p&gt;Next we build the kernel by running &lt;code&gt;make&lt;/code&gt; in &amp;lt;kbuild&amp;gt;. We pass the &lt;code&gt;-j 6&lt;/code&gt; option to speed up the build by using multiple processor cores. This process can still take a long time.&lt;/p&gt;
&lt;h2 id="build-and-install-kernel-modules"&gt;Build and Install Kernel Modules&lt;/h2&gt;
&lt;p&gt;To stage the kernel modules locally on the host, we create a separate &amp;lt;install_mod_dir&amp;gt; directory to install them into. Then in &amp;lt;kbuild&amp;gt;, execute&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;make &lt;span class="nv"&gt;INSTALL_MOD_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/ed/Desktop/linux_kernel/install_mod_dir modules_install
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now there is a &lt;code&gt;lib&lt;/code&gt; directory inside &lt;code&gt;/home/ed/Desktop/linux_kernel/install_mod_dir&lt;/code&gt;, which holds all the kernel modules we are about to install.&lt;/p&gt;
&lt;p&gt;The list of modules that were built into the kernel can be shown using &lt;code&gt;cat modules.builtin&lt;/code&gt; inside &lt;code&gt;lib/modules/5.8.6&lt;/code&gt;. Here is a &lt;a href="https://gist.github.com/BDHU/4d31d18ad106a13caceac4a961d04a44"&gt;link&lt;/a&gt; to all the modules being built. We didn&amp;rsquo;t modify anything in the configuration.&lt;/p&gt;
&lt;p&gt;Then we use guestmount to mount the virtual disk at a mount point on the host:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;guestmount -a /home/ed/virtimg/ubuntu20.04 -i ~/vm/linux/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In Ubuntu this step yields the following message:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;libguestfs: error: /usr/bin/supermin exited with error status 1.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;To see full error messages you may need to enable debugging.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Do:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;and run the command again. For further information, read:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;You can also run &amp;#39;libguestfs-test-tool&amp;#39; and post the *complete* output
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;into a bug report or message to the libguestfs mailing list.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;According to this &lt;a href="https://askubuntu.com/questions/1046828/how-to-run-libguestfs-tools-tools-such-as-virt-make-fs-without-sudo"&gt;post&lt;/a&gt; and the &lt;a href="https://bugs.launchpad.net/fuel/+bug/1467579"&gt;bug report&lt;/a&gt; on Ubuntu Launchpad, the underlying problem is that the host kernel image in &lt;code&gt;/boot&lt;/code&gt; cannot be read by a non-root user.&lt;/p&gt;
&lt;p&gt;To fix the issue, we need to run&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo chmod +r /boot/vmlinuz-*
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After mounting again, we can verify the contents of &lt;code&gt;~/vm/linux&lt;/code&gt; by simply changing into that directory.&lt;/p&gt;
&lt;p&gt;To install the modules we just built, we can copy &lt;code&gt;&amp;lt;install_mod_dir&amp;gt;/lib/modules&lt;/code&gt; into the mounted filesystem at &lt;code&gt;&amp;lt;mount_point&amp;gt;/lib/modules&lt;/code&gt;.&lt;/p&gt;
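&lt;p&gt;The copy itself can be done with &lt;code&gt;cp&lt;/code&gt;. Below is a minimal sketch that assumes GNU &lt;code&gt;cp&lt;/code&gt;; the function name &lt;code&gt;install_modules&lt;/code&gt; is made up for this example, and depending on how the image was mounted you may need extra permissions to write into it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;# install_modules INSTALL_MOD_DIR MOUNT_POINT
# Copies everything under INSTALL_MOD_DIR/lib/modules into
# MOUNT_POINT/lib/modules, keeping symlinks and permissions intact.
install_modules() {
    mkdir -p "$2/lib/modules"
    cp -a "$1/lib/modules/." "$2/lib/modules/"
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With the paths used in this post, the call would be &lt;code&gt;install_modules /home/ed/Desktop/linux_kernel/install_mod_dir ~/vm/linux&lt;/code&gt;.&lt;/p&gt;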
&lt;p&gt;Finally, we unmount the filesystem by doing&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;fusermount -u /mnt/hdd1/vm/linux
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="booting-kvm-with-the-new-kernel"&gt;Booting KVM with the new Kernel&lt;/h2&gt;
&lt;p&gt;To boot up the VM with the new kernel, we add a few extra command line options to kvm. For convenience, we put the command into a script. It&amp;rsquo;s also available as a &lt;a href="https://gist.github.com/BDHU/8c6ab518ab37571a1cae132d79ac9a9e"&gt;gist&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#!/bin/bash
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kvm &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -s &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -display gtk &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -cpu host &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -vga qxl &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -accel kvm &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -kernel &lt;span class="s2"&gt;&amp;#34;/home/ed/Desktop/linux_kernel/kbuild/arch/x86/boot/bzImage&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -append &lt;span class="s2"&gt;&amp;#34;root=/dev/sda1 console=ttyS0,115200n8 nokaslr&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -drive &lt;span class="nv"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;raw,file&lt;span class="o"&gt;=&lt;/span&gt;/home/ed/virtimg/ubuntu20.04 &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -m 8G &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -smp &lt;span class="m"&gt;6&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; --snapshot &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -S
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Aside from the command line arguments we discussed before, there are a few new ones here. The &lt;code&gt;-s&lt;/code&gt; switch is shorthand for &lt;code&gt;-gdb tcp::1234&lt;/code&gt;, which opens a gdbserver on TCP port 1234. &lt;code&gt;-display gtk&lt;/code&gt; is optional; it shows the guest display in a GTK window. &lt;code&gt;-cpu host&lt;/code&gt; exposes the host processor model to the guest. &lt;code&gt;-vga qxl&lt;/code&gt; selects the paravirtualized QXL graphics card; &lt;code&gt;-vga virtio&lt;/code&gt; also offers good performance in our case. &lt;code&gt;-kernel&lt;/code&gt; makes QEMU load the given kernel image directly instead of going through the guest&amp;rsquo;s bootloader. &lt;code&gt;-append&lt;/code&gt; supplies the kernel command line: &lt;code&gt;root=&lt;/code&gt; specifies where the root partition of the hard disk is, &lt;code&gt;console=&lt;/code&gt; adds a serial console at boot so you can see boot messages, and &lt;code&gt;nokaslr&lt;/code&gt; disables kernel address space layout randomization so that addresses match the symbols gdb reads from the kernel image. With &lt;code&gt;--snapshot&lt;/code&gt;, QEMU writes changes to a temporary file (redirect-on-write) instead of modifying the original image. &lt;code&gt;-S&lt;/code&gt; means the guest CPU won&amp;rsquo;t start executing until it is explicitly resumed, for example from an attached debugger. We only use it later in the debugging stage.&lt;/p&gt;
&lt;p&gt;Again, we can verify that the new kernel has internet access by running &lt;code&gt;apt update&lt;/code&gt;. No errors are shown, which indicates the network is functioning correctly.&lt;/p&gt;
&lt;h2 id="booting-process"&gt;Booting Process&lt;/h2&gt;
&lt;p&gt;Now that we are able to boot up the VM successfully, we can first measure how much time the kernel spends booting. Running &lt;code&gt;dmesg -d&lt;/code&gt; shows the timestamp of each message and the time delta between messages. The final line shows &lt;code&gt;[10.842998]&lt;/code&gt;. If we use &lt;code&gt;systemd-analyze&lt;/code&gt;, it outputs&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Startup finished in 795ms (kernel) + 5.451s (userspace) = 6.247s
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;graphical.target reached after 5.439s in userspace
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;There is a gap between these two measurements because &lt;code&gt;dmesg&lt;/code&gt; is not a reliable measure of how long the boot-up process takes. &lt;code&gt;dmesg&lt;/code&gt; itself merely collects information. Drivers and other system processes can output messages at any point in time, and there may or may not be processes spawning between those messages.&lt;/p&gt;
&lt;p&gt;Next, we are going to look at how PCI devices are involved in kernel startup. &lt;code&gt;lspci&lt;/code&gt; outputs the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:02.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We can use the PCI addresses here to search for the corresponding information in &lt;code&gt;dmesg&lt;/code&gt;. For example, if we use the domain value &lt;code&gt;0000:&lt;/code&gt; as the query, we get something like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.295026] PCI host bridge to bus 0000:00
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.299055] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.300133] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.301163] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.311006] pci 0000:00:02.0: [1af4:1050] type 00 class 0x030000
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[ 0.319650] pci 0000:00:03.0: [8086:100e] type 00 class 0x020000
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The full result is also available as &lt;a href="https://gist.github.com/BDHU/4d31d18ad106a13caceac4a961d04a44#file-dmesg_output"&gt;gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;lspci&lt;/code&gt; command shows the type of each device right after its address. For example, the first one is a host bridge. We specifically selected the &lt;code&gt;dmesg&lt;/code&gt; messages in the &lt;em&gt;type 00 class&lt;/em&gt; format here, because the class value tells us the type of the corresponding device. We can look up the matching macro in &lt;a href="https://github.com/torvalds/linux/blob/master/include/linux/pci_ids.h"&gt;include/linux/pci_ids.h&lt;/a&gt;. For example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#define PCI_CLASS_NETWORK_ETHERNET 0x0200
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This line shows that the value 0x0200 corresponds to an Ethernet network device. It matches the upper 16 bits of the class value 0x020000 that &lt;code&gt;dmesg&lt;/code&gt; reports for device 00:03.0, and it agrees with the &lt;code&gt;lspci&lt;/code&gt; result.&lt;/p&gt;
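&lt;p&gt;The &lt;code&gt;class&lt;/code&gt; value printed by &lt;code&gt;dmesg&lt;/code&gt; is 24 bits wide: base class, subclass, and programming interface, one byte each. The macros in &lt;code&gt;pci_ids.h&lt;/code&gt; such as the one above cover only the upper 16 bits, so the low byte has to be dropped before comparing. A small sketch of that conversion (the function name &lt;code&gt;pci_class16&lt;/code&gt; is made up for this example):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;# pci_class16 CLASS
# Drops the low byte (programming interface) of a 24-bit PCI class value
# and prints the remaining 16 bits in the style used by pci_ids.h.
pci_class16() {
    printf '0x%04x\n' $(( $1 / 256 ))
}

pci_class16 0x020000   # Ethernet controller at 00:03.0, prints 0x0200
pci_class16 0x060000   # host bridge at 00:00.0, prints 0x0600
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;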
&lt;h2 id="debugging-kernel"&gt;Debugging Kernel&lt;/h2&gt;
&lt;p&gt;To build a KVM+GDB-friendly kernel, we need to have the proper CONFIG_DEBUG* options set in the .config file. More specifically, we need the following settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CONFIG_DEBUG_INFO y: compile the kernel with debug info. The full list of definitions can be found &lt;a href="https://cateee.net/lkddb/web-lkddb/DEBUG_INFO.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_DEBUG_INFO_DWARF4 y: generate dwarf4 debug info. Definition can be found &lt;a href="https://cateee.net/lkddb/web-lkddb/DEBUG_INFO_DWARF4.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_GDB_SCRIPTS y: creates the required links to GDB helper scripts in the build directory. Full definition can be found &lt;a href="https://cateee.net/lkddb/web-lkddb/GDB_SCRIPTS.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_DEBUG_INFO_REDUCED n: disable reduced debug info, so that full debug information is generated.&lt;/li&gt;
&lt;li&gt;CONFIG_KGDB y: enable KGDB, the kernel debugger. Full list of definitions found &lt;a href="https://cateee.net/lkddb/web-lkddb/KGDB.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_FRAME_POINTER y: compile the kernel with frame pointers. Full list of definitions found &lt;a href="https://cateee.net/lkddb/web-lkddb/FRAME_POINTER.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_SATA_AHCI y: this option enables support for AHCI Serial ATA. Definition found &lt;a href="https://cateee.net/lkddb/web-lkddb/SATA_AHCI.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_KVM_GUEST y: this option enables various optimizations for running under the KVM hypervisor. Definition found &lt;a href="https://cateee.net/lkddb/web-lkddb/KVM_GUEST.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;CONFIG_RANDOMIZE_BASE n: drop support for Kernel Address Space Layout Randomization (KASLR). Definition found &lt;a href="https://cateee.net/lkddb/web-lkddb/RANDOMIZE_BASE.html"&gt;here&lt;/a&gt;. We also added &lt;code&gt;nokaslr&lt;/code&gt; in our qemu arguments.&lt;/li&gt;
&lt;li&gt;CONFIG_SMP y: enable Symmetric multi-processing support. Definition found &lt;a href="https://cateee.net/lkddb/web-lkddb/SMP.html"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
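&lt;p&gt;Before rebuilding, it can help to check which of the options that should be &lt;code&gt;y&lt;/code&gt; are still missing from &lt;code&gt;.config&lt;/code&gt;. The following is a minimal sketch; the function name &lt;code&gt;missing_debug_options&lt;/code&gt; is made up for this example, and it only looks for &lt;code&gt;=y&lt;/code&gt; lines (it does not verify the options that should be off).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;# missing_debug_options CONFIG_FILE
# Prints each option from the list above that is not set to "y".
missing_debug_options() {
    for opt in DEBUG_INFO DEBUG_INFO_DWARF4 GDB_SCRIPTS KGDB \
               FRAME_POINTER SATA_AHCI KVM_GUEST SMP; do
        # -x matches the whole line, so "=y" must be the exact value.
        if ! grep -qx "CONFIG_${opt}=y" "$1"; then
            echo "CONFIG_${opt}"
        fi
    done
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Running &lt;code&gt;missing_debug_options .config&lt;/code&gt; inside &amp;lt;kbuild&amp;gt; prints nothing when all of these options are already enabled.&lt;/p&gt;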
&lt;p&gt;Now we can recompile the kernel and attach gdb to it. We simply add the &lt;code&gt;-S&lt;/code&gt; option to kvm so that the VM only starts running once gdb is attached. Then we enter our &amp;lt;kbuild&amp;gt; directory and execute:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gdb vmlinux
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;gdb&lt;span class="o"&gt;)&lt;/span&gt; target remote:1234
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The step is also documented in the kernel community &lt;a href="https://www.kernel.org/doc/html/latest/dev-tools/gdb-kernel-debugging.html"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="set-breakpoints"&gt;Set Breakpoints&lt;/h2&gt;
&lt;p&gt;Spin locks are used throughout the kernel, so they are easy to hit. Therefore, we will set breakpoints on &lt;code&gt;spin_lock&lt;/code&gt;. For kernel 5.8.6, we see that &lt;code&gt;spin_lock&lt;/code&gt; is defined in &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/include/linux/spinlock.h#L351"&gt;https://elixir.bootlin.com/linux/v5.8.6/source/include/linux/spinlock.h#L351&lt;/a&gt; as an inline function. If we trace the function, we can see that the actual function we should break on is &lt;code&gt;_raw_spin_lock&lt;/code&gt;, defined &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/kernel/locking/spinlock.c#L149"&gt;here&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#ifndef CONFIG_INLINE_SPIN_LOCK
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;__lockfunc&lt;/span&gt; &lt;span class="nf"&gt;_raw_spin_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;raw_spinlock_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;__raw_spin_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If we need to break only when a given program is executing, we can use the program&amp;rsquo;s PID as the breakpoint condition. The problem is, how do we get the program&amp;rsquo;s PID if it doesn&amp;rsquo;t run for long?&lt;/p&gt;
&lt;p&gt;We could instead first set a breakpoint on &lt;code&gt;fork&lt;/code&gt;, or more precisely on its kernel-side implementation &lt;code&gt;_do_fork&lt;/code&gt;, which is defined &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/kernel/fork.c#L2416"&gt;here&lt;/a&gt;. After that, we can simply continue executing the kernel until we run the program.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: we need to compile the program and open a new terminal before setting the breakpoint, since both involve forking new processes and would hit &lt;code&gt;_do_fork&lt;/code&gt; before our program runs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then we print the process PID using &lt;code&gt;p $lx_current().pid&lt;/code&gt;. We then use this value as the condition for &lt;code&gt;b _raw_spin_lock if $lx_current().pid == pid_value&lt;/code&gt; inside gdb.&lt;/p&gt;
&lt;p&gt;If we want &lt;code&gt;_raw_spin_lock&lt;/code&gt; to break under different contexts, we can simply condition the breakpoint on different PIDs. We can also set breakpoints in functions from different contexts that call &lt;code&gt;spin_lock&lt;/code&gt; and see what they do. For example, we can set a breakpoint at &lt;code&gt;expand_downwards&lt;/code&gt;, defined &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/mm/mmap.c#L2428"&gt;here&lt;/a&gt;. If we backtrace from this function, we get a series of calls; the important ones are:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#1 0xffffffff81284c4e in expand_stack
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#3 0xffffffff813843db in load_elf_binary
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#8 do_execve
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#12 0xffffffff81b1f658 in do_syscall_64
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We also added a helper script in .gdbinit to print out the name of the current process, which is &amp;lsquo;&amp;lsquo;anacron&amp;rsquo;&amp;rsquo; in this case.
In short, this process executes commands periodically. Here it performs a system call that loads an ELF binary, which requires stack expansion.&lt;/p&gt;
&lt;p&gt;Another example is the timer path. &lt;code&gt;get_next_timer_interrupt&lt;/code&gt; calls &lt;code&gt;_raw_spin_lock&lt;/code&gt;. Selected frames from the backtrace:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#1 0xffffffff8113b224 in get_next_timer_interrupt
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#2 0xffffffff8114d52e in tick_nohz_next_event
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#4 tick_nohz_idle_stop_tick ()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#5 0xffffffff810df567 in cpuidle_idle_call ()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In short, this is timer code that gets called when the CPU is idle.&lt;/p&gt;
&lt;p&gt;The last example is &lt;code&gt;hrtimer_interrupt&lt;/code&gt;. The selected messages are:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#4 0xffffffff8114d80c in tick_sched_timer
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#7 0xffffffff8113c8e7 in hrtimer_interrupt
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#12 run_on_irqstack_cond
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;#14 0xffffffff81c00cc2 in asm_sysvec_apic_timer_interrupt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In summary, &lt;code&gt;hrtimer_interrupt&lt;/code&gt; is called as an event handler. This function is responsible for selecting all timers that have expired and either moving them to the expiration list (if they may be processed in softIRQ context) or calling the handler function directly.&lt;/p&gt;
&lt;!-- ## Extra credit
If we are to set conditional breakpoint of spin_lock which does not stop within the timer interrupt. We can use the gdb [convenience functions](https://sourceware.org/gdb/current/onlinedocs/gdb/Convenience-Funs.html#Convenience-Funs) ```$_any_caller_is(name[, number_of_frames]) == 0```. We can trace the timer interrupt call that uses the ```_raw_spin_lock``` and make sure it doesn't break whenever the caller represents an interrupt. For example, we will instead get
```
#3 0xffffffff813130b7 in vfs_poll
#5 do_poll
#9 __x64_sys_poll
#10 0xffffffff81b1f658 in do_syscall_64
```
which represents file system operations. --&gt;
&lt;h2 id="syscall"&gt;Syscall&lt;/h2&gt;
&lt;p&gt;Essentially, the processor switches from user mode to kernel mode and starts executing the system call entry, &lt;code&gt;entry_SYSCALL_64&lt;/code&gt;, which is defined &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/arch/x86/entry/entry_64.S#L94"&gt;here&lt;/a&gt;. This is the only entry point used for 64-bit system calls. We can set a breakpoint here. When the breakpoint is hit, we use &lt;code&gt;info registers&lt;/code&gt; in gdb to get the value of cr3. In our case, it is 0x22a6d5806. Then we simply step from this breakpoint, and will likely reach &lt;code&gt;SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp&lt;/code&gt;. After this macro executes, the value in cr3 changes to 0x22a6d4006. The macro is defined &lt;a href="https://elixir.bootlin.com/linux/v5.8.6/source/arch/x86/entry/entry_32.S#L165"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We can see that whenever the processor switches from user mode to kernel mode, the value of cr3 changes. The root cause is &lt;a href="https://www.kernel.org/doc/html/latest/x86/pti.html"&gt;Page Table Isolation (PTI)&lt;/a&gt;. It is a countermeasure against attacks on the shared user/kernel address space, such as &amp;lsquo;&amp;lsquo;Meltdown&amp;rsquo;&amp;rsquo;. To mitigate this class of attacks, two independent copies of the page tables are created: one used in kernel mode and one used in user mode. The cr3 register lets the processor translate linear addresses into physical addresses by locating the page directory and page tables for the current task. So whenever the process enters kernel mode, the address of the kernel copy&amp;rsquo;s page directory must be loaded into cr3.&lt;/p&gt;
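&lt;p&gt;A quick way to see what the switch actually changes is to XOR the two cr3 values observed above. The sketch below does only that arithmetic; the helper name is made up for illustration. My reading of the kernel source is that bit 12 selects the user copy of the page tables and bit 11 is the user PCID bit, but treat that interpretation as an assumption rather than something the debugger output proves.&lt;/p&gt;

```python
# cr3 values observed in the post, before and after SWITCH_TO_KERNEL_CR3.
USER_CR3 = 0x22A6D5806
KERNEL_CR3 = 0x22A6D4006


def differing_bits(a, b):
    """Return the positions of the bits that differ between a and b."""
    diff = a ^ b
    # Integer division by a power of two isolates each bit in turn.
    return [i for i in range(64) if (diff // 2 ** i) % 2 == 1]


bits = differing_bits(USER_CR3, KERNEL_CR3)
print(hex(USER_CR3 ^ KERNEL_CR3), bits)
```

&lt;p&gt;Only two low bits differ, so both values point at the same pair of adjacent page-table pages; the macro flips bits rather than loading an unrelated address.&lt;/p&gt;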
&lt;p&gt;If we add &lt;code&gt;nopti&lt;/code&gt; to &lt;code&gt;-append&lt;/code&gt; in the QEMU command-line arguments and repeat the same steps, we get 0x231466005 both before and after &lt;code&gt;SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp&lt;/code&gt; is executed. According to the description in the &lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/kernel-parameters.txt?h=v5.1.3#L3656"&gt;Linux kernel tree&lt;/a&gt;, &lt;code&gt;nopti&lt;/code&gt; on X86_64 is equivalent to pti=off, which explains the constant value of cr3.&lt;/p&gt;</description></item><item><title>Performance Anomaly of 802.11b</title><link>https://www.bodunhu.com/blog/posts/performance-anomaly-of-802.11b/</link><pubDate>Sun, 13 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/performance-anomaly-of-802.11b/</guid><description>&lt;p&gt;This research was conducted by Martin Heusse, Franck Rousseau, Gilles Berger-Sabbatel, and Andrzej Duda to analyze the performance of IEEE 802.11b
wireless local area networks. The degraded transmission rate is caused by the CSMA/CA channel access method.&lt;/p&gt;
&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;IEEE 802.11b wireless local area networks suffer degraded performance when some mobile hosts use a lower bit rate than the others, which is caused by the CSMA/CA channel access method. When one host changes its modulation type and thereby lowers its bit rate, it occupies the channel for a longer time, penalizing the other hosts that still use the higher bit rate. The paper &lt;a href="https://ieeexplore.ieee.org/document/1208921"&gt;Performance Anomaly of 802.11b&lt;/a&gt; analyzes how this anomaly arises.&lt;/p&gt;
&lt;h2 id="transmission-overhead"&gt;Transmission Overhead&lt;/h2&gt;
&lt;p&gt;Consider a single host in an 802.11b cell transmitting a single data frame. The overall transmission time is expressed as:&lt;/p&gt;
&lt;p&gt;$$T = t_{tr} + t_{ov}$$&lt;/p&gt;
&lt;p&gt;where the constant overhead&lt;/p&gt;
&lt;p&gt;$$t_{ov} = DIFS + t_{pr} + SIFS + t_{pr} + t_{ack}$$&lt;/p&gt;
&lt;p&gt;The transmission process can be represented by the graph&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ieeexplore.ieee.org/document/1208921"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ieee_anomaly/transmission_single_frame.png#center" alt="wireless-transmission-single-frame"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;!-- The takeaway here is that $t_{pr}$ varies based on the bit rate used by the host. If the bit rate changes from 1 Mb/s to 2, $5.5$, or $11$ Mb/s, the value of $t_{pr}$ changes from $192 \mu s$ to $96 \mu s$, resulting in less than maximum throughput. --&gt;
&lt;p&gt;When there are multiple hosts attempting to transmit, a host will execute the exponential backoff algorithm - it waits for a random interval to avoid saturating the channel, resulting in extra time spent in the contention procedure:&lt;/p&gt;
&lt;p&gt;$$T(N) = t_{tr} + t_{ov} + t_{cont}(N)$$&lt;/p&gt;
&lt;p&gt;Finally, the proportion of useful throughput obtained by a host depends on the number of hosts:&lt;/p&gt;
&lt;p&gt;$$p(N) = t_{tr} / T(N)$$&lt;/p&gt;
&lt;p&gt;This indicates the useful throughput is smaller than the nominal bit rate and largely depends on the number of competing hosts.&lt;/p&gt;
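&lt;p&gt;To get a feel for the size of the overhead, here is a minimal sketch of the single-host case (no contention). The timing constants are assumptions of mine, not figures taken from this post: standard 802.11b values with a long PLCP preamble, a 14-byte ACK sent at the data rate, and a hypothetical 1500-byte frame.&lt;/p&gt;

```python
# Illustrative 802.11b timing values in microseconds (assumptions, not
# figures quoted from the post): long PLCP preamble and header, 14-byte ACK.
DIFS_US = 50.0
SIFS_US = 10.0
T_PR_US = 192.0
ACK_BITS = 14 * 8


def useful_proportion(frame_bytes, rate_mbps, t_cont_us=0.0):
    """Return (p, T) where p = t_tr / T and T = t_tr + t_ov + t_cont."""
    t_tr = frame_bytes * 8 / rate_mbps      # bits / (Mb/s) gives microseconds
    t_ack = ACK_BITS / rate_mbps            # assumes the ACK uses the data rate
    t_ov = DIFS_US + T_PR_US + SIFS_US + T_PR_US + t_ack
    total = t_tr + t_ov + t_cont_us
    return t_tr / total, total


p, total = useful_proportion(frame_bytes=1500, rate_mbps=11.0)
print(f"T = {total:.0f} us, p = {p:.2f}, useful throughput = {p * 11.0:.2f} Mb/s")
```

&lt;p&gt;Under these assumptions roughly 70% of the nominal rate is useful even with a single host, and any positive contention time passed as &lt;code&gt;t_cont_us&lt;/code&gt; lowers it further.&lt;/p&gt;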
&lt;h2 id="anomaly"&gt;Anomaly&lt;/h2&gt;
&lt;p&gt;Assume there are \(N\) hosts: \(N-1\) hosts use the high transmission rate \(R=11\) Mb/s, and one host transmits at rate \(r=5.5\), \(2\), or \(1\) Mb/s. We can deduce the transmission time of the fast ones:&lt;/p&gt;
&lt;p&gt;$$T_f = t_{ov}^{R} + \frac{s_d}{R} + t_{cont}$$&lt;/p&gt;
&lt;p&gt;The transmission time of the slow host is:&lt;/p&gt;
&lt;p&gt;$$T_s = t_{ov}^{r} + \frac{s_d}{r} + t_{cont}$$&lt;/p&gt;
&lt;p&gt;Although the short-term behavior of CSMA/CA is not fair, in the long term every host gets the same channel access probability, so the channel utilization of a fast host is&lt;/p&gt;
&lt;p&gt;$$U_f = \frac{T_f}{(N-1)T_f + T_s + P_c(N)\times t_{jam} \times N}$$&lt;/p&gt;
&lt;p&gt;\(t_{jam}\) is the average time spent in collisions, calculated over all possible pairs between the fast hosts and the slow one:&lt;/p&gt;
&lt;p&gt;$$t_{jam} = \frac{2}{N}T_s + (1 - \frac{2}{N})T_f$$&lt;/p&gt;
&lt;p&gt;The throughput at the MAC layer of each fast host is:&lt;/p&gt;
&lt;p&gt;$$X_f = U_f \times p_f(N) \times R$$&lt;/p&gt;
&lt;p&gt;given that:&lt;/p&gt;
&lt;p&gt;$$p_f(N) = \frac{s_d}{RT_f}$$&lt;/p&gt;
&lt;p&gt;Applying the same process to the slow host, with \(p_s(N) = \frac{s_d}{rT_s}\), we eventually get:&lt;/p&gt;
&lt;p&gt;$$X_f=X_s = X$$&lt;/p&gt;
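&lt;p&gt;To see why the two throughputs coincide, a short derivation helps. Write \(D\) for the denominator of \(U_f\), and assume (by analogy with \(U_f\); this is not stated above) that the slow host&amp;rsquo;s channel utilization is \(U_s = T_s / D\). Then:&lt;/p&gt;
&lt;p&gt;$$D = (N-1)T_f + T_s + P_c(N)\times t_{jam} \times N$$&lt;/p&gt;
&lt;p&gt;$$X_f = U_f \times p_f(N) \times R = \frac{T_f}{D} \times \frac{s_d}{RT_f} \times R = \frac{s_d}{D}$$&lt;/p&gt;
&lt;p&gt;$$X_s = U_s \times p_s(N) \times r = \frac{T_s}{D} \times \frac{s_d}{rT_s} \times r = \frac{s_d}{D}$$&lt;/p&gt;
&lt;p&gt;Both rates cancel, so the per-host throughput depends only on the frame size \(s_d\) and the cycle time \(D\), and the slow host inflates that cycle time through \(T_s\).&lt;/p&gt;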
&lt;p&gt;&lt;em&gt;The key point here is that the fast hosts transmitting at the higher rate R obtain the same throughput as the slow host transmitting at the lower rate.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="simulation-and-measurement-results"&gt;Simulation and Measurement Results&lt;/h2&gt;
&lt;p&gt;In general, the experimental value of \(P_c(N)\) seems to match the theoretical model. One thing the paper could illustrate better is how the experimental value matches the equation as the number of hosts increases. The average and cumulative throughput values also seem reasonable compared to the expressions discussed before.&lt;/p&gt;
&lt;p&gt;The throughput is measured using three different tools: &lt;em&gt;netperf&lt;/em&gt;, &lt;em&gt;tcpperf&lt;/em&gt;, and &lt;em&gt;udpperf&lt;/em&gt;. This idea of duplication makes the data collected more reliable and persuasive, which is especially useful in benchmarking since the results can be sensitive to environmental variable changes.&lt;/p&gt;
&lt;p&gt;The presented results justify the statement made in the paper. For example, the measured TCP throughput for two hosts is shown to degrade as time passes:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ieeexplore.ieee.org/document/1208921"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/ieee_anomaly/TCP_degrade.png#center" alt="TCP-degrade-performance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;One thing the paper could articulate more is how this seemingly periodic pattern is related to the model. Another concern is the number of devices used to conduct these experiments, which seems to be much smaller than what a real-world scenario would involve. It would be interesting to see how performance is affected when many devices compete for a channel. This could be further extended to measuring performance with multiple devices at a lower bit rate, which is more likely to capture real-world use cases. The potential performance impact is not clear from the present measurements.&lt;/p&gt;
&lt;p&gt;The paper also claims the useful throughput strongly depends on the number of competing hosts. More data on how the number of hosts relates to the performance impact would make this paper more interesting. This may be hard to achieve, as many papers resort to simulation.&lt;/p&gt;
&lt;p&gt;This paper improves on previous work in that it studies the performance of 802.11 WLANs with one host at a lower bit rate, whereas many others assume that all hosts communicate using the same bit rate. This is a step toward capturing more realistic situations. Overall, the paper does a good job of proving its point. It captures the most critical information and the concept is easy to follow. However, the neat structure means readers without sufficient background may need more time to catch up, since the background section may not be enough for beginners.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Overall, this paper brings a novel approach to analyzing the performance of 802.11 WLANs with varying bit rates. It brings new insights into studying the 802.11 standard. The paper focuses on the TCP and UDP protocols. Applying the method discussed in the paper to a lesser-known protocol such as DCTCP could yield more insight into how different protocols affect throughput. Another direction is to generalize this model to multiple hosts with degraded bit rates and study their behavior.&lt;/p&gt;
&lt;p&gt;The bit rate used in the paper also seems to be pretty low compared to modern standards. With the introduction of 5G networks, bit rates have become much higher, and it will be interesting to see how extremely high bit rates affect the performance of 802.11.&lt;/p&gt;</description></item><item><title>Exokernel</title><link>https://www.bodunhu.com/blog/posts/exokernel/</link><pubDate>Tue, 01 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/exokernel/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Exokernel"&gt;Exokernel&lt;/a&gt; is a term every systems researcher has heard of at some point in life. However, according to the &lt;a href="https://pdos.csail.mit.edu/"&gt;PDOS&lt;/a&gt; group at MIT, there aren&amp;rsquo;t any exokernel-based operating systems in active use today. It&amp;rsquo;s interesting to discover what ideas exokernels brought to high-level OS design and some potential drawbacks of this design choice.&lt;/p&gt;
&lt;p&gt;Perhaps the most important thing to keep in mind is that the exokernel architecture pushes management of physical resources to the application level, contrary to what most monolithic kernels do: provide hardware resource management through some form of abstraction, usually hiding hardware-related details.&lt;/p&gt;
&lt;h2 id="limitations-of-traditional-approaches"&gt;Limitations of Traditional Approaches&lt;/h2&gt;
&lt;p&gt;Monolithic kernels usually enforce centralized resource management via a set of abstractions. In microkernel-based systems, these abstractions are usually provided by some form of trusted user-level servers. There are several drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Too general. Overgeneralizing can limit application diversity and hurt performance (domain- or application-specific approaches usually perform better, at the cost of, well, being more &amp;ldquo;specific&amp;rdquo;). For example, in UNIX, two applications exhibiting rather different memory access patterns are subject to the same general-purpose OS scheduler and page replacement policy. Letting applications define such policies can open doors for performance improvements since applications have better knowledge of their own behavior.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hidden information. This expands on the previous point. Applications tend to have better &amp;ldquo;self-awareness&amp;rdquo; and can implement custom policies that outclass the general-purpose ones provided by the kernel.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Limited functionality. Having limited resources in hand can inhibit implementation of new ideas.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, generalization may not be a bad thing. As discussed in the &lt;em&gt;UNIX the Timesharing System&lt;/em&gt; paper, having a generalized and unified yet limited file system API can simplify programming efforts. Both ordinary files and I/O devices are accessed through a unified interface. Nobody today wants to implement a different set of policies just for character devices or block devices.&lt;/p&gt;
&lt;h2 id="design"&gt;Design&lt;/h2&gt;
&lt;p&gt;Essentially, an exokernel consists of a thin veneer that multiplexes and exports physical resources through a set of primitives. Libraries running in application space use these primitives to implement special-purpose functionality at a higher level of abstraction. The architecture is shown in the paper:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/exokernel/exokernel_arch.png#center" alt="exokernel"&gt;&lt;/p&gt;
&lt;p&gt;There are three major tasks in separating protection from management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Track ownership of resources.&lt;/li&gt;
&lt;li&gt;Ensure protection by guarding resource usage.&lt;/li&gt;
&lt;li&gt;Revoke access.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The paper presents three techniques to achieve these goals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;secure binding&lt;/em&gt;: lib OS can securely bind to machine resources.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;visible revocation&lt;/em&gt;: lib OS can participate in a resource revocation protocol. (Keep in mind why the revocation needs to be visible)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;abort protocol&lt;/em&gt;: exokernel itself can break secure binding of uncooperative lib OS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In general, exokernel should expose hardware resources such as disk memory, CPU, interrupts, through low-level primitives with as few abstractions as possible. The resource management policy should be enforced by the library OS instead. &lt;strong&gt;The policy control boils down to whether the exokernel permits resource allocation&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="secure-binding"&gt;Secure Binding&lt;/h2&gt;
&lt;p&gt;One of the primary tasks of an exokernel is to multiplex resources securely, providing protection for mutually distrustful applications. Secure binding allows the kernel to protect resources without understanding them.&lt;/p&gt;
&lt;p&gt;There are three techniques to implement secure bindings:
hardware mechanisms, software caching, and downloading application code.&lt;/p&gt;
&lt;h2 id="understanding-secure-binding-through-examples"&gt;Understanding Secure Binding through Examples&lt;/h2&gt;
&lt;p&gt;Secure binding is a rather abstract and hard-to-comprehend concept without concrete examples. Here are some examples illustrating how secure multiplexing is achieved through secure binding.&lt;/p&gt;
&lt;p&gt;Take memory allocation as an example. When a library OS tries to allocate a physical memory page, the exokernel creates a secure binding for that page by recording the owner and the capabilities specified by the library OS. Essentially, access to memory resources is granted through capabilities. The exokernel acts as a doorkeeper that checks the validity of the capability presented by the library OS.&lt;/p&gt;
&lt;p&gt;I personally like to think of the exokernel&amp;rsquo;s role in the memory system as a security guard protecting resources that the library OS can access through some form of interface. For example, if the hardware defines a page-table interface that the lib OS can access, the exokernel must guard the page table. If the lib OS tries to enter a new virtual-to-physical memory mapping, the exokernel must check the corresponding memory capability.&lt;/p&gt;
&lt;p&gt;In summary, privileged machine operations must be guarded by the exokernel.&lt;/p&gt;
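&lt;p&gt;The memory example above can be sketched as a toy model. This is only an illustration of the idea of recording an owner and a capability at allocation time and checking them at use time; the class and method names (&lt;code&gt;ToyExokernel&lt;/code&gt;, &lt;code&gt;alloc_page&lt;/code&gt;, &lt;code&gt;map_page&lt;/code&gt;) are hypothetical and are not Aegis interfaces.&lt;/p&gt;

```python
# Toy model of secure bindings for physical pages. All names here
# (ToyExokernel, alloc_page, map_page) are hypothetical, not Aegis APIs.
import secrets


class ToyExokernel:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.bindings = {}   # physical page: (owner, capability)
        self.tlb = {}        # (owner, virtual page): physical page

    def alloc_page(self, owner):
        """Bind a free page to owner and hand back (page, capability)."""
        page = self.free_pages.pop(0)
        capability = secrets.token_hex(8)
        self.bindings[page] = (owner, capability)
        return page, capability

    def map_page(self, owner, vpage, page, capability):
        """Install a mapping only if the capability matches the binding."""
        if self.bindings.get(page) != (owner, capability):
            raise PermissionError("capability does not match the secure binding")
        self.tlb[(owner, vpage)] = page


kernel = ToyExokernel(num_pages=4)
page, cap = kernel.alloc_page("libos-a")
kernel.map_page("libos-a", vpage=0x10, page=page, capability=cap)   # accepted
try:
    kernel.map_page("libos-b", vpage=0x10, page=page, capability="guess")
except PermissionError as err:
    print("rejected:", err)
```

&lt;p&gt;Note that the toy kernel never interprets what the page is used for; it only compares the presented capability with the recorded binding, which is the sense in which the kernel protects a resource without understanding it.&lt;/p&gt;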
&lt;h3 id="aegis-the-exokernel"&gt;Aegis: the exokernel&lt;/h3&gt;
&lt;p&gt;Up to this point I still find it hard to fully understand what an exokernel is capable of. Having a concrete system to study is much more helpful. So here comes Aegis.&lt;/p&gt;
&lt;p&gt;Here is a subset of Aegis&amp;rsquo;s primitives and sys call interfaces that encapsulate these exported primitives. Having a concrete list feels so much better than reading a list of abstract terms!&lt;/p&gt;
&lt;p&gt;Here is a sublist of primitives:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Primitive Operations&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: center"&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TLBwr&lt;/td&gt;
&lt;td style="text-align: center"&gt;Insert mapping into TLB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLBvadelete&lt;/td&gt;
&lt;td style="text-align: center"&gt;Delete virtual address from TLB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And here is a sublist of system call interfaces:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;System Call&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: center"&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Yield&lt;/td&gt;
&lt;td style="text-align: center"&gt;Yield processor to named process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alloc&lt;/td&gt;
&lt;td style="text-align: center"&gt;Allocation of resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scall&lt;/td&gt;
&lt;td style="text-align: center"&gt;Synchronous protected control transfer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="address-translation"&gt;Address Translation&lt;/h3&gt;
&lt;p&gt;It&amp;rsquo;s important to first mention that Aegis provides a small number of guaranteed mappings by partitioning an application&amp;rsquo;s virtual address space into two segments. The first segment holds normal application data; the other has guaranteed mappings and holds exception code and page tables. (Guaranteed mapping is sort of a safe lock.)&lt;/p&gt;
&lt;p&gt;When a TLB miss happens, there are several steps happening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Aegis checks which segment the virtual address resides in. If it&amp;rsquo;s in the standard user segment, the exception is dispatched to the application. Otherwise, the exokernel handles the exception or forwards it to the application, depending on whether there&amp;rsquo;s a guaranteed mapping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The application looks up the address in its page table, constructs a TLB entry and the associated capability, then invokes an Aegis system routine.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Aegis validates the capability. Upon approval, the mapping is installed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Control returns from the kernel and the application resumes execution.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
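&lt;p&gt;The steps above can be sketched as a small simulation. Everything in it is hypothetical: the class names, the segment boundary, and the string capabilities are placeholders chosen for illustration, not details from Aegis or the paper.&lt;/p&gt;

```python
# Toy walk-through of the TLB-miss steps. Every name and constant below is
# hypothetical; the segment boundary in particular is made up.
GUARANTEED_BASE = 0x8000   # addresses at or above this are in the second segment


class ToyAegis:
    def __init__(self):
        self.tlb = {}
        self.capabilities = {}   # physical page: capability issued at allocation
        self.guaranteed = {}     # mappings the kernel keeps for the second segment

    def tlb_miss(self, app, vaddr):
        if vaddr >= GUARANTEED_BASE and vaddr in self.guaranteed:
            self.tlb[vaddr] = self.guaranteed[vaddr]   # handled by the exokernel
            return "kernel"
        app.handle_miss(self, vaddr)                   # dispatched to the application
        return "application"

    def install_mapping(self, vaddr, ppage, capability):
        # The kernel only validates the capability; it does not pick the mapping.
        if self.capabilities.get(ppage) != capability:
            raise PermissionError("invalid capability")
        self.tlb[vaddr] = ppage


class ToyLibOS:
    def __init__(self):
        self.page_table = {}     # vaddr: (physical page, capability)

    def handle_miss(self, kernel, vaddr):
        ppage, capability = self.page_table[vaddr]     # application-level lookup
        kernel.install_mapping(vaddr, ppage, capability)


aegis, libos = ToyAegis(), ToyLibOS()
aegis.capabilities[7] = "cap-7"
libos.page_table[0x1000] = (7, "cap-7")
print(aegis.tlb_miss(libos, 0x1000), aegis.tlb)
```

&lt;p&gt;The page-table format lives entirely in &lt;code&gt;ToyLibOS&lt;/code&gt;, so a different library OS could swap in another structure without touching the kernel side.&lt;/p&gt;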
&lt;p&gt;The key takeaway here is that the exokernel itself is involved in only a few privileged operations, such as interacting directly with the hardware via low-level primitives. The bulk of the work is done at the application level.&lt;/p&gt;
&lt;p&gt;Because the kernel contains minimal functionalities, it can be extremely fast compared to a monolithic kernel. However, does that mean the overhead is shifted to the library OS instead?&lt;/p&gt;
&lt;h3 id="exos-the-library-os"&gt;ExOS: the Library OS&lt;/h3&gt;
&lt;p&gt;The most prominent feature of a library OS is that it manages operating system abstractions at the application level.&lt;/p&gt;
&lt;p&gt;The GEMM operation doesn&amp;rsquo;t show much difference between ExOS and Ultrix (a monolithic kernel OS), since GEMM doesn&amp;rsquo;t use any special abilities of either OS. It does indicate that the performance gain from the minimal design of the exokernel is somewhat cancelled out by the application-space overhead.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The exokernel paper mentions that, in the context of networking, the major reason for ExOS to download code is that the network buffers on the authors&amp;rsquo; machines cannot easily be mapped into application space in a secure way. Downloading code into the kernel allows applications to integrate operations such as checksumming into the copy of the message from these buffers to user space. However, I&amp;rsquo;m a little bit skeptical of this statement today. Usually a highly performant TCP stack will be implemented in userspace, along with some polling (DPDK for example). But it will be interesting to compare the exokernel approach to the gigantic Linux TCP stack. The second reason is that the runtime of downloaded code is bounded, so it can run even when a full context switch to an unscheduled application would be impractical.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I do find the graph in the exokernel paper interesting. It shows that when application-level message handlers are downloaded into the kernel, the roundtrip latency is almost unaffected by the number of processes. Since the operation is performed inside the kernel upon message arrival, no handling is needed from the application. This is an advantage over a handler that runs in the application, which is subject to scheduling and the performance implications that come with it. (The choice of scheduler is the key bottleneck here.)&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="https://pdos.csail.mit.edu/6.828/2008/readings/engler95exokernel.pdf"&gt;
&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/exokernel/throughput.png" width="80%"&gt;
&lt;/a&gt;
&lt;/p&gt;
&lt;h2 id="modularity"&gt;Modularity&lt;/h2&gt;
&lt;p&gt;Modularity is a natural property of the exokernel, since the exokernel itself is simple. Operating system abstractions can be redefined simply by changing the library OS, so applications have finer-grained control over resources. However, I think it comes at a cost. In a monolithic kernel, applications are subject to a general-purpose scheduler. Having modular domain-specific schedulers can indeed improve performance; however, it might also lead to contention among multiple schedulers, which is not covered in the paper.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Exokernel does offer some new insights into system design. The simple design of the exokernel itself has major performance benefits, and its limited set of primitives gives much freedom to the application. However, that means the library OS has to take more responsibility. The paper didn&amp;rsquo;t cover enough analysis of more general use cases. The performance gain seems to come from some highly specialized, exokernel-specific implementations of OS abstractions (such as IPC, VM, etc.). The more general case, such as GEMM, seems to gain much less compared to traditional approaches. It would be good to see how exokernel performs under more diverse workloads.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also heard that one reason microkernels never took off was the performance slowdown compared to monolithic kernels. Since exokernels share many similarities with microkernels (an exokernel seems like a more stripped-down version of a microkernel since it barely has an OS core), they will likely fall into the same trap. However, there doesn&amp;rsquo;t seem to be a comprehensive benchmark comparing all major types of kernels.&lt;/p&gt;</description></item><item><title>Sketch on the UNIX Timesharing System</title><link>https://www.bodunhu.com/blog/posts/sketch-on-the-unix-timesharing-system/</link><pubDate>Thu, 27 Aug 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/sketch-on-the-unix-timesharing-system/</guid><description>&lt;p&gt;Unix is a general-purpose, multi-user, interactive operating system. It offers
several new features hardly found in other, larger operating systems back in
the day. These features include (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and inter-process I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="simplicity-at-its-core"&gt;Simplicity at its Core&lt;/h2&gt;
&lt;p&gt;Simplicity was engraved into the gene of Unix since its birth, as the paper states: &amp;ldquo;Perhaps the most important achievement of UNIX is to demonstrate that a powerful operating system for interactive use need not be expensive either in equipment or in human effort&amp;rdquo;. Therefore, it is important to keep in mind how simplicity
is reflected in the design of Unix.&lt;/p&gt;
&lt;h2 id="the-file-system"&gt;The File System&lt;/h2&gt;
&lt;p&gt;The file system is perhaps the single most important part of Unix. Its &amp;ldquo;everything is a file&amp;rdquo; concept has influenced all modern system designs. Here is a short description of each major file type.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ordinary Files&lt;/strong&gt;: no particular structuring is expected by the system. The structure of files is controlled by the programs which use them, not by the system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directories&lt;/strong&gt; provide the mapping between the names of files and the files themselves, inducing a structure on the file system. The only difference between a directory and an ordinary file is that a directory can&amp;rsquo;t be written to by unprivileged programs, meaning the contents of directories are controlled by the system.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;linking&lt;/em&gt; allows the same non-directory file to appear in several directories under possibly different names; a directory entry for a file is sometimes called a link. All links to a file have equal rights. A directory entry for a file consists merely of its name and a pointer to the file metadata. Therefore a file exists independently of any
directory entry. In that sense, a directory entry can be thought of as a link.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Special Files&lt;/strong&gt;: perhaps the most prominent feature of the &amp;ldquo;everything is a file&amp;rdquo; principle. They are read and written just like ordinary disk files, but requests to read and write will result in activation of the I/O device. It blurs the line between file and device I/O since they share identical interfaces and are subject to the same protection mechanism.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="removable-file-system"&gt;Removable File System&lt;/h2&gt;
&lt;p&gt;The Unix file system has a &lt;em&gt;mount&lt;/em&gt; system request which, in effect, replaces a leaf of the hierarchy tree (the ordinary file) by a whole new subtree (the hierarchy stored on the removable volume). It provides a unified abstraction of the file system hierarchy where the underlying storage components become transparent to the user.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is one exception to the identical treatment of files on different devices: no link may exist between one file system hierarchy and another. Otherwise, some form of bookkeeping would be required to remove the links when a removable volume is dismounted.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="protection"&gt;Protection&lt;/h2&gt;
&lt;p&gt;Each user is assigned a unique user ID. A file, upon its creation, is marked with the user ID of its owner. New files are also given a set of seven protection bits. Six of these independently specify read, write, and execute permission for the owner of the file and for all other users; the seventh is the set-user-ID bit. This is a perfect example of an ACL (access control list) system.&lt;/p&gt;
&lt;h2 id="io-calls"&gt;I/O Calls&lt;/h2&gt;
&lt;p&gt;Once again, we see how Unix tries to provide a unified interface such that performing I/O on different devices does not require different access patterns or styles. There is no distinction between &amp;ldquo;random&amp;rdquo; and sequential I/O, nor is any logical record size imposed by the system. Calls like &lt;em&gt;open&lt;/em&gt;, &lt;em&gt;seek&lt;/em&gt;, &lt;em&gt;read&lt;/em&gt;, and &lt;em&gt;write&lt;/em&gt; can be found in all major Unix-like systems today.&lt;/p&gt;
&lt;p&gt;I found it interesting that the authors argue why there are no user-visible locks in the file system. The first argument says: &amp;ldquo;they are unnecessary because we are not faced with large, single-file data bases maintained by independent processes&amp;rdquo;. Things might be different on modern systems, so I have some doubts about that argument. The next one is &amp;ldquo;they are insufficient because locks in the ordinary sense, whereby one user is prevented from writing on a file which another user is reading, cannot prevent confusion when, for example, both users are editing a file with an editor which makes a copy of the file being edited.&amp;rdquo; This is certainly true: the copies are separate files with distinct metadata during editing, but once editing is finished, writing the updated content back to the original file becomes tricky without some form of synchronization or ordering.&lt;/p&gt;
&lt;p&gt;The paper further explains the the system has sufficient internal interlocks to prevent these situations from happening. The exact details of how it works is not quite clear at this stage.&lt;/p&gt;
&lt;h2 id="implementation"&gt;Implementation&lt;/h2&gt;
&lt;p&gt;As we already know, a directory entry contains only a name for the associated file and a pointer to the file itself. This pointer is an integer called the &lt;em&gt;i-number&lt;/em&gt;. When the file is accessed, its i-number is used as an index into a system table (the i-list) stored in a known part of the device on which the directory resides.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Directory entry -&amp;gt; (File Name, i-number) -&amp;gt; i-list -&amp;gt; i-node -&amp;gt; description of the file&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Because the file is described by its corresponding i-node, operations such as linking and deleting revolve around modifying a directory entry or the link-count field of the i-node, without actually touching the bulk of the file itself.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to distinguish between a file descriptor and an inode. By definition, files are represented by inodes. The inode of a file is a structure kept by the filesystem which holds information about the file, like its type, owner, permissions, link count and so on. A file descriptor, on the other hand, is the value returned by an open call and is essentially an index into an array of open files kept by the kernel. There is a single inode for a file in the i-list, but every process can have its own file descriptor for that file.&lt;/p&gt;
&lt;/blockquote&gt;
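&lt;p&gt;The chain from directory entry to i-node can be sketched in a few lines of Haskell. This is only a toy model under my own assumptions, not how the UNIX kernel is written: the names &lt;code&gt;INode&lt;/code&gt;, &lt;code&gt;lookupFile&lt;/code&gt;, &lt;code&gt;addLinks&lt;/code&gt;, &lt;code&gt;link&lt;/code&gt; and &lt;code&gt;unlink&lt;/code&gt; are made up for illustration, and an i-node is reduced to a link count plus the file body. It shows that linking and deleting only touch directory entries and link counts.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;import qualified Data.Map as Map

type INumber   = Int
type Directory = Map.Map String INumber  -- file name -&amp;gt; i-number
type IList     = Map.Map INumber INode   -- i-number  -&amp;gt; i-node

-- Hypothetical, heavily simplified i-node: a link count plus the file body.
data INode = INode { linkCount :: Int, body :: String }
  deriving (Show, Eq)

-- Directory entry -&amp;gt; i-number -&amp;gt; i-list -&amp;gt; i-node.
lookupFile :: Directory -&amp;gt; IList -&amp;gt; String -&amp;gt; Maybe INode
lookupFile dir ilist name = Map.lookup name dir &amp;gt;&amp;gt;= \i -&amp;gt; Map.lookup i ilist

-- Change the link count of one i-node; the file body is left untouched.
addLinks :: Int -&amp;gt; INode -&amp;gt; INode
addLinks k node = node { linkCount = linkCount node + k }

-- Give an existing file a second name: one new directory entry, one more link.
link :: String -&amp;gt; String -&amp;gt; (Directory, IList) -&amp;gt; Maybe (Directory, IList)
link old new (dir, ilist) = do
  i &amp;lt;- Map.lookup old dir
  return (Map.insert new i dir, Map.adjust (addLinks 1) i ilist)

-- Remove a name: drop the directory entry and one link. The i-node (and with
-- it the file body) is discarded only when no links remain.
unlink :: String -&amp;gt; (Directory, IList) -&amp;gt; Maybe (Directory, IList)
unlink name (dir, ilist) = do
  i    &amp;lt;- Map.lookup name dir
  node &amp;lt;- Map.lookup i ilist
  let ilist' = if linkCount node &amp;lt;= 1
                 then Map.delete i ilist
                 else Map.adjust (addLinks (-1)) i ilist
  return (Map.delete name dir, ilist')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Starting from one file named &amp;ldquo;a&amp;rdquo;, calling &lt;code&gt;link "a" "b"&lt;/code&gt; and then &lt;code&gt;unlink "a"&lt;/code&gt; leaves the body reachable through &amp;ldquo;b&amp;rdquo; with a link count of 1, and the body was never copied along the way.&lt;/p&gt;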
&lt;h2 id="processes"&gt;Processes&lt;/h2&gt;
&lt;p&gt;A process is the execution of an image. An image is a computer execution environment. It includes a core image, general register values, status of open files, current directory, and the like. An image is the current state of a pseudo computer. You can imagine the image as a motionless snapshot of the current state of the processor, or as the content saved to main memory when a currently executing process is preempted by another one.&lt;/p&gt;
&lt;p&gt;The user-core part of an image has three logical segments. The program text segment starts at location 0 in the virtual address space. At the first 8K byte boundary above the text segment begins a non-shared, writable data segment. The stack segment starts at the highest address in the virtual address space and grows downward.&lt;/p&gt;
&lt;p&gt;One key feature of UNIX is that a new process can come into existence only by use of the &lt;em&gt;fork&lt;/em&gt; system call. Another major system primitive is &lt;em&gt;execute&lt;/em&gt;, which replaces the code and data of the calling process with those of a named file. This call resembles a &amp;ldquo;jump&amp;rdquo; machine instruction rather than a subroutine call.&lt;/p&gt;
&lt;h2 id="shell"&gt;Shell&lt;/h2&gt;
&lt;p&gt;The Shell is a command line interpreter. Programs executed by the Shell start off with two open files, which have file descriptors 0 and 1, for reading and writing respectively. The symbols &amp;ldquo;&amp;lt;&amp;rdquo; and &amp;ldquo;&amp;gt;&amp;rdquo; specify which files file descriptors 0 and 1 will refer to for the duration of the command passed to the Shell.&lt;/p&gt;
&lt;p&gt;A filter is a program that copies its standard input to its standard output (with processing). Filters are chained with &amp;ldquo;|&amp;rdquo;, which connects the standard output of one command to the standard input of the next.&lt;/p&gt;
&lt;p&gt;The command separator, &amp;ldquo;;&amp;rdquo;, is used to separate multiple commands. A related feature is &amp;ldquo;&amp;amp;&amp;rdquo;, which executes the command in the background. When the shell doesn&amp;rsquo;t wait for the completion of a command, the identification of the process running that command is printed. In addition, parentheses can be used to group commands and enforce the order of execution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It&amp;rsquo;s worth noting the shell is itself a command, and may be called recursively.&lt;br&gt;
Since it&amp;rsquo;s a command, it also enjoys the luxury of having standard I/O file descriptors. Thus, a command such as:&lt;br&gt;
&lt;strong&gt;sh &amp;lt; file_containing_shell_commands&lt;/strong&gt; would work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The last step in the initialization of UNIX is the creation of a single process and the invocation of a program called &lt;em&gt;init&lt;/em&gt;. &lt;em&gt;init&lt;/em&gt; creates one instance of itself per terminal, each of which prompts for user login information. If the login succeeds, that instance of &lt;em&gt;init&lt;/em&gt; performs an &lt;em&gt;execute&lt;/em&gt; of the Shell. Essentially, &lt;em&gt;init&lt;/em&gt; is the parent process of the Shell.&lt;/p&gt;</description></item><item><title>Monads in Haskell</title><link>https://www.bodunhu.com/blog/posts/monads-in-haskell/</link><pubDate>Sun, 01 Mar 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/monads-in-haskell/</guid><description>&lt;p&gt;I&amp;rsquo;ve scratched my head for quite a while trying to understand the concept of monads in Haskell. This is a brief summary of monads. I take William Cook&amp;rsquo;s &lt;a href="http://www.cs.utexas.edu/~wcook/anatomy/anatomy.htm"&gt;Anatomy of Programming Languages&lt;/a&gt; as my reference.&lt;/p&gt;
&lt;h2 id="definitions-of-monads"&gt;Definitions of Monads&lt;/h2&gt;
&lt;p&gt;A monad is defined as a computational structure that involves three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A generic data type \(m\)&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;return&lt;/em&gt; function \(return_m\) :: \(t\rightarrow mt\)&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;bind&lt;/em&gt; function \(\triangleright_m\) :: \(mt\rightarrow (t\rightarrow ms)\rightarrow ms\)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here the symbol \(m\) gives the name of the monad as well as the shape of the computation. We call a program that uses the monad \(m\) an \(m\)-computation. The instantiation of the generic type \(mt\) at a particular type \(t\) represents an \(m\)-computation that produces a value of type \(t\). The \(m\)-computation indicates that, in addition to producing a value of type \(t\), some additional requirements or effects will take place. This is the essence of monads.&lt;/p&gt;
&lt;p&gt;The definition of the &lt;code&gt;return&lt;/code&gt; function specifies how values are converted into \(m\)-computations. &lt;code&gt;return&lt;/code&gt; simply wraps the value of type \(t\) without adding any effect of its own. For example, in a monad that carries stateful memory, &lt;code&gt;return&lt;/code&gt; shouldn&amp;rsquo;t modify the memory; it only provides a context in which the value lies. The reason we convert a value into an \(m\)-computation is that it can then be combined with other \(m\)-computations by bind, which takes care of errors for us without additional error-checking code.&lt;/p&gt;
&lt;p&gt;The bind function \(\triangleright_m\) specifies how computations are combined together. The general idea is that in \(A\triangleright_m F\) the \(m\)-computation \(A\) is performed first, and the value it produces is passed to the function \(F\) to create a second \(m\)-computation. Because \(A\) is an \(m\)-computation, if an error happens, the computation will stop and \(F\) will not be performed.&lt;/p&gt;
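&lt;p&gt;To make the two operations concrete, here is a minimal sketch of them for the error-handling case, written by hand for &lt;code&gt;Maybe&lt;/code&gt;. The names &lt;code&gt;returnMaybe&lt;/code&gt;, &lt;code&gt;bindMaybe&lt;/code&gt; and &lt;code&gt;safeDiv&lt;/code&gt; are made up for this example; the Prelude already provides the same behavior through &lt;code&gt;return&lt;/code&gt; and &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;-- return for Maybe: wrap a plain value, add no effect.
returnMaybe :: t -&amp;gt; Maybe t
returnMaybe x = Just x

-- bind for Maybe: run A first, then feed its value to F.
bindMaybe :: Maybe t -&amp;gt; (t -&amp;gt; Maybe s) -&amp;gt; Maybe s
bindMaybe Nothing  _ = Nothing  -- A failed, so F is never performed
bindMaybe (Just x) f = f x      -- A produced x, pass it on to F

-- A computation that can fail: dividing by zero yields Nothing.
safeDiv :: Int -&amp;gt; Int -&amp;gt; Maybe Int
safeDiv _ 0 = Nothing
safeDiv n d = Just (n `div` d)

example1, example2 :: Maybe Int
example1 = returnMaybe 100 `bindMaybe` (\n -&amp;gt; safeDiv n 5)  -- Just 20
example2 = safeDiv 1 0 `bindMaybe` (\n -&amp;gt; safeDiv n 5)      -- Nothing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that all the error handling lives in the first equation of &lt;code&gt;bindMaybe&lt;/code&gt;; neither example contains an explicit check.&lt;/p&gt;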
&lt;h2 id="monads-in-haskell"&gt;Monads in Haskell&lt;/h2&gt;
&lt;p&gt;In Haskell, monads are expressed using a type class. The &lt;code&gt;Monad&lt;/code&gt; type class is defined (in simplified form) as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;class&lt;/span&gt; &lt;span class="kt"&gt;Monad&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="kr"&gt;where&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;return&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For a generic type \(m\) to be a Monad, it must have those two functions defined. A type class allows us to overload functions according to their type.&lt;/p&gt;
&lt;p&gt;So why do we need Monads in the first place? Suppose we are given a function \(func1\) which takes an Int and an (Int, Int) tuple and produces an (Int, Int) tuple. We could link applications of this function together to form a chain of computation. If we define an operator like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;func1&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;we can use the output of one application of the function as the input to the next to produce another value. This process can be repeated to form a chain of operations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;However, the function \(func1\) could potentially return Nothing if the given input doesn&amp;rsquo;t meet certain requirements (e.g. division by 0). Therefore, the type of \(func1\) can be modified to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;func1&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Maybe&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The previous definition of \(func1\) takes a raw (Int, Int) tuple as one of its inputs, but the new \(func1\) returns an (Int, Int) wrapped in a Maybe context. If we feed the output of \(func1\) directly to the next \(func1\) in the chain, a type error occurs, because the &amp;amp; operator is not able to pass an argument with a context to the next func1. Fortunately, we have the bind operator.&lt;/p&gt;
&lt;p&gt;If we look at the type of &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt; in the Monad definition, we see:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This means &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt; is able to take a value within a certain context and apply to it a function that takes the raw value as input. We can simply switch the &lt;code&gt;&amp;amp;&lt;/code&gt; operator to &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt; such that the chaining still works:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;func1&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If an error occurs in one part of the chain (let&amp;rsquo;s assume one computation yields Nothing), the Nothing value is propagated through the rest of the chain: each subsequent &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt; skips its function and returns Nothing. Without bind, we would have to write error-checking code after every single computation to check its output.&lt;/p&gt;
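&lt;p&gt;Here is a runnable sketch of such a chain. The post never says what \(func1\) actually computes, so the body below is purely an assumption for illustration: it adds &lt;code&gt;100 `div` d&lt;/code&gt; to the first component, counts the step in the second, and fails when the divisor is 0.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;-- Hypothetical body for func1: accumulate 100 `div` d and count the steps.
-- A zero divisor makes the whole step fail with Nothing.
func1 :: Int -&amp;gt; (Int, Int) -&amp;gt; Maybe (Int, Int)
func1 0 _              = Nothing
func1 d (total, count) = Just (total + 100 `div` d, count + 1)

-- Every step succeeds: 100 + 50 + 25 = 175 after 3 steps.
good :: Maybe (Int, Int)
good = return (0, 0) &amp;gt;&amp;gt;= func1 1 &amp;gt;&amp;gt;= func1 2 &amp;gt;&amp;gt;= func1 4

-- The second step divides by zero, so the Nothing propagates to the end
-- and the last func1 is never applied.
bad :: Maybe (Int, Int)
bad = return (0, 0) &amp;gt;&amp;gt;= func1 1 &amp;gt;&amp;gt;= func1 0 &amp;gt;&amp;gt;= func1 4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Evaluating &lt;code&gt;good&lt;/code&gt; gives &lt;code&gt;Just (175,3)&lt;/code&gt; and &lt;code&gt;bad&lt;/code&gt; gives &lt;code&gt;Nothing&lt;/code&gt;, with no error check written anywhere in the chain.&lt;/p&gt;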
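&lt;p&gt;In short, &lt;code&gt;&amp;gt;&amp;gt;=&lt;/code&gt; is just a way to chain together functions whose results are wrapped in a monadic context, and thanks to parametric polymorphism the same operator works for any value type.&lt;/p&gt;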
&lt;h2 id="haskell-do-notation"&gt;Haskell &lt;em&gt;do&lt;/em&gt; Notation&lt;/h2&gt;
&lt;p&gt;Using the do notation can simplify the use of the bind operator. The basic pattern of do notation is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;do&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;e1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;e2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;which is equivalent to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;e1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;&amp;lt;-&lt;/code&gt; notation simply indicates that \(x\) is bound to the value the computation generates. In other words, \(x\) doesn&amp;rsquo;t lie in a context. If \(e1\) returns Nothing, \(x\) is not bound to anything and \(e2\) is never evaluated. It&amp;rsquo;s important to remember that do expressions are just different syntax for chaining monadic values.&lt;/p&gt;
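&lt;p&gt;As a small illustration of that equivalence, the two functions below compute the same thing, once with do notation and once with explicit binds. &lt;code&gt;safeDiv&lt;/code&gt;, &lt;code&gt;withDo&lt;/code&gt; and &lt;code&gt;withBind&lt;/code&gt; are names invented for this sketch.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-haskell" data-lang="haskell"&gt;-- A computation that can fail: dividing by zero yields Nothing.
safeDiv :: Int -&amp;gt; Int -&amp;gt; Maybe Int
safeDiv _ 0 = Nothing
safeDiv n d = Just (n `div` d)

-- do notation: x and y are plain Ints, not Maybe Ints.
withDo :: Int -&amp;gt; Int -&amp;gt; Int -&amp;gt; Maybe Int
withDo a b c = do
  x &amp;lt;- safeDiv a b
  y &amp;lt;- safeDiv x c
  return (x + y)

-- The same function with each "x &amp;lt;- e" rewritten as "e &amp;gt;&amp;gt;= (\x -&amp;gt; ...)".
withBind :: Int -&amp;gt; Int -&amp;gt; Int -&amp;gt; Maybe Int
withBind a b c =
  safeDiv a b &amp;gt;&amp;gt;= (\x -&amp;gt;
  safeDiv x c &amp;gt;&amp;gt;= (\y -&amp;gt;
  return (x + y)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For example, &lt;code&gt;withDo 100 5 2&lt;/code&gt; and &lt;code&gt;withBind 100 5 2&lt;/code&gt; both give &lt;code&gt;Just 30&lt;/code&gt;, while passing 0 as the second argument gives &lt;code&gt;Nothing&lt;/code&gt; from both.&lt;/p&gt;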
&lt;p&gt;For a more detailed explanation of Monads, I found &lt;a href="http://learnyouahaskell.com/a-fistful-of-monads"&gt;A Fistful of Monads&lt;/a&gt; to be extremely helpful in terms of clarifying the concept.&lt;/p&gt;</description></item><item><title>Singular Value Decomposition</title><link>https://www.bodunhu.com/blog/posts/singular-value-decomposition/</link><pubDate>Mon, 10 Feb 2020 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/singular-value-decomposition/</guid><description>&lt;p&gt;Unitary matrices and the Singular Value Decomposition (SVD) are two important concepts in linear algebra. In order to fully understand these concepts, we will need to first discuss orthogonality. Most of the material is covered in Advanced Linear Algebra: Foundations to Frontiers taught by professor &lt;a href="https://www.cs.utexas.edu/~rvdg/"&gt;Robert van de Geijn&lt;/a&gt;. This is a brief summary of the important concepts covered in Chapter 2.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="components-in-the-direction-of-a-vector"&gt;Components in the direction of a vector&lt;/h2&gt;
&lt;p&gt;Given two vectors \(a\) and \(b\), we can write \(b = \chi a + c\), where \(\chi\) is a scalar and \(c\) is orthogonal to \(a\). Then we have&lt;/p&gt;
&lt;p&gt;\[a^T (b-\chi a) = 0\]&lt;/p&gt;
&lt;p&gt;Solving it gives us \(\chi = \frac{a^T b}{a^T a}\). We have \(\frac{a^T b}{a^T a}a = \frac{a a^T}{a^T a}b\), so the matrix \(\frac{a a^T}{a^T a}\) maps the vector \(b\) to its component in the direction of \(a\). The component of \(b\) orthogonal to \(a\) is then obtained by applying \(I-\frac{a a^T}{a^T a}\) to \(b\).&lt;/p&gt;
&lt;p&gt;These linear transformations simplify if we let \(\left\lVert a\right\rVert_{2}=1\), because then \(a^T a = 1\).&lt;/p&gt;
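&lt;p&gt;As a quick sanity check with made-up numbers, take \(a = (1, 1)^T\) and \(b = (3, 1)^T\). Then \(a^T a = 2\) and \(a^T b = 4\), so&lt;/p&gt;
&lt;p&gt;\[\chi = \frac{a^T b}{a^T a} = 2, \qquad \chi a = (2, 2)^T, \qquad c = b - \chi a = (1, -1)^T\]&lt;/p&gt;
&lt;p&gt;and indeed \(a^T c = 1 - 1 = 0\), so \(c\) is orthogonal to \(a\).&lt;/p&gt;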
&lt;h2 id="unitary-matrix"&gt;Unitary Matrix&lt;/h2&gt;
&lt;p&gt;A matrix \(U\) is said to be unitary if \(U\) is a square matrix and satisfies \(U^H U= I\).&lt;/p&gt;
&lt;p&gt;In addition, unitary matrices have some nice properties. First, the product of a sequence of unitary matrices is also a unitary matrix. This can be proven by first showing \((U_0 U_1)^H (U_0 U_1)= I\), which means \(U_0 U_1\) is a unitary matrix, and then performing induction.&lt;/p&gt;
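&lt;p&gt;The base case of that induction is a one-line computation. Using \((AB)^H = B^H A^H\) and the fact that \(U_0\) and \(U_1\) are each unitary:&lt;/p&gt;
&lt;p&gt;\[(U_0 U_1)^H (U_0 U_1) = U_1^H U_0^H U_0 U_1 = U_1^H I U_1 = U_1^H U_1 = I\]&lt;/p&gt;
&lt;p&gt;Since the product of two square matrices of the same size is again square, \(U_0 U_1\) meets both conditions of the definition.&lt;/p&gt;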
&lt;p&gt;A unitary matrix also preserves &lt;strong&gt;length&lt;/strong&gt;. This is shown by \(\left\lVert Ux \right\rVert_2^2 = (Ux)^H (Ux) = x^H U^H U x = x^H x= \left\lVert x \right\rVert _2^2\).&lt;/p&gt;
&lt;h2 id="change-of-orthonormal-basis"&gt;Change of orthonormal basis&lt;/h2&gt;
&lt;p&gt;We mentioned that we can map a vector \(x\) to its component in the direction of a vector \(a\). Now we extend this to express a vector \(x\) in terms of an orthonormal basis, given by the columns \(u_0, \ldots, u_{m-1}\) of a unitary matrix \(U\).&lt;/p&gt;
&lt;p&gt;We know that \(x = Ix= UU^Hx=U(U^Hx)=(u_0^Hx)u_0+\cdots+(u_{m-1}^Hx)u_{m-1}\). Each \(u_i^Hx\) is a scalar, so writing \(a_i = u_i^Hx\) the equation becomes \(U(U^Hx)=a_0u_0+\cdots+a_{m-1}u_{m-1}\). We have successfully expressed the vector \(x\) in terms of the orthonormal basis.&lt;/p&gt;
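&lt;p&gt;A small worked example, with a basis and a vector chosen only for illustration: let \(u_0 = \frac{1}{\sqrt{2}}(1, 1)^T\), \(u_1 = \frac{1}{\sqrt{2}}(1, -1)^T\) and \(x = (3, 1)^T\). The coefficients are&lt;/p&gt;
&lt;p&gt;\[u_0^H x = \frac{3+1}{\sqrt{2}} = 2\sqrt{2}, \qquad u_1^H x = \frac{3-1}{\sqrt{2}} = \sqrt{2}\]&lt;/p&gt;
&lt;p&gt;and adding the scaled basis vectors recovers \(x\):&lt;/p&gt;
&lt;p&gt;\[2\sqrt{2}\,u_0 + \sqrt{2}\,u_1 = (2, 2)^T + (1, -1)^T = (3, 1)^T\]&lt;/p&gt;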
&lt;h2 id="todo"&gt;TODO&lt;/h2&gt;</description></item><item><title>Understanding Probabilistic Clock Synchronization</title><link>https://www.bodunhu.com/blog/posts/understanding-probabilistic-clock-synchronization/</link><pubDate>Tue, 17 Sep 2019 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/understanding-probabilistic-clock-synchronization/</guid><description>&lt;p&gt;This post discusses the probabilistic clock synchronization technique. The main goal of this technique is to put an upper bound on the difference between the clocks of systems across a network. Formally, we want \(|P(t)-Q(t)|\leq \varepsilon\), where \(P(t)\) and \(Q(t)\) are the readings of two clocks at real time \(t\). We will go over the technical details and discuss what these symbols represent in later sections. Most of these materials are from Prof. &lt;a href="https://www.cs.utexas.edu/people/faculty-researchers/aloysius-k-mok"&gt;Mok&lt;/a&gt;&amp;rsquo;s slides on dependable systems classes.&lt;/p&gt;
&lt;h2 id="perfect-synchronization"&gt;Perfect Synchronization&lt;/h2&gt;
&lt;p&gt;The motivation behind this technique is that synchronization always involves overhead. In a perfect environment where network delay and request processing time are both 0, the clocks can be synchronized with ease. A slave P sends &amp;lsquo;&amp;lsquo;Time = ?&amp;rsquo;&amp;rsquo; at global time \(t\) to master Q, and master Q replies &amp;lsquo;&amp;lsquo;Time = Q(t)&amp;rsquo;&amp;rsquo; instantaneously at global time \(t\). Then P adjusts its clock P(t) according to Q(t). However, such a case exists only in theory.&lt;/p&gt;
&lt;h2 id="amortization"&gt;Amortization&lt;/h2&gt;
&lt;p&gt;Suppose the difference between the clocks of P and Q is \(\Delta\) at synchronization. Our goal is to adjust P&amp;rsquo;s logical clock C(t) to eliminate the difference. The adjustment is simple:&lt;/p&gt;
&lt;p&gt;\[C(t)=H(t)+A(t)\]&lt;/p&gt;
&lt;p&gt;Here C(t) is P&amp;rsquo;s logical clock, H(t) is P&amp;rsquo;s hardware clock, and A(t) is the adjustment function (which can also be written as A(H(t))).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/clock.png?token=ACKPLVNGE4DFY4GQF55PU7C5QFOGW#center" alt="probabilistic-sync"&gt;&lt;/p&gt;
&lt;p&gt;A naive method would be to simply subtract \(\Delta\) from or add it to C(t) to eliminate the difference. However, this creates a discontinuity in P&amp;rsquo;s clock, which may disrupt system services. For example, if \(\Delta = 2\) seconds, the logical clock will instantly jump ahead 2 seconds and a stopwatch will skip two seconds.&lt;/p&gt;
&lt;p&gt;So instead the adjustment is spread out over time, using an adjustment function that is linear in the hardware clock:&lt;/p&gt;
&lt;p&gt;\[A(t)=m\cdot H(t)+N\]&lt;/p&gt;
&lt;p&gt;Now the logical clock can be derived as follows:&lt;/p&gt;
&lt;p&gt;\[C(t)=(1+m)\cdot H(t)+N\]&lt;/p&gt;
&lt;p&gt;This process is called amortization.&lt;/p&gt;
&lt;p&gt;However, how do we know the values of m and N? Let&amp;rsquo;s look at the moment the amortization process starts. If P&amp;rsquo;s hardware clock reads H and its logical clock reads L at this moment, then:&lt;/p&gt;
&lt;p&gt;\[L=(1+m)\cdot H+N \qquad (1)\]&lt;/p&gt;
&lt;p&gt;The amortization lasts for a time period \(\alpha\), so at its end P&amp;rsquo;s hardware clock reads \(H+\alpha\). Here M is the master&amp;rsquo;s logical clock, sent by master Q, at the start of amortization. By the end of the amortization, the slave P should have caught up with its master&amp;rsquo;s logical clock, which by then reads \(M+\alpha\). Therefore, we have:&lt;/p&gt;
&lt;p&gt;\[M+\alpha = (1+m)(H+\alpha)+N \qquad (2)\]&lt;/p&gt;
&lt;p&gt;Solving (1) and (2) together, we now get:&lt;/p&gt;
&lt;p&gt;\[m = \frac{M-L}{\alpha}\]&lt;/p&gt;
&lt;p&gt;\[N = L - (1+m)H\]&lt;/p&gt;
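&lt;p&gt;For completeness, here is the elimination step spelled out. Subtracting (1) from (2) removes both \(N\) and \(H\):&lt;/p&gt;
&lt;p&gt;\[(M+\alpha) - L = (1+m)(H+\alpha) - (1+m)H = (1+m)\alpha\]&lt;/p&gt;
&lt;p&gt;\[m = \frac{M+\alpha-L}{\alpha} - 1 = \frac{M-L}{\alpha}\]&lt;/p&gt;
&lt;p&gt;and \(N\) then follows directly from (1). As a check with made-up numbers, take \(H=50\), \(L=100\), \(M=102\) and \(\alpha=10\). Then \(m = 0.2\) and \(N = 100 - 1.2\cdot 50 = 40\). At the start the logical clock reads \(1.2\cdot 50+40 = 100 = L\), and at the end it reads \(1.2\cdot 60+40 = 112 = M+\alpha\), so P has caught up without any jump.&lt;/p&gt;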
&lt;p&gt;Thus, once amortization has ended, i.e. at any time \(t\) where \(H(t) &amp;gt; H+\alpha\), the logical clock runs at the hardware clock&amp;rsquo;s rate again, and we want the following to be true:&lt;/p&gt;
&lt;p&gt;\[C(t)=C(H+\alpha)+(H(t)-H(H+\alpha))=H(t)+M-H\]&lt;/p&gt;
&lt;p&gt;One question remains: why is N required here? Couldn&amp;rsquo;t we simply use m to amortize the time difference? Here&amp;rsquo;s my interpretation (feel free to ping me if you have something else in mind): if N is set to 0, then at the beginning of amortization, we would have:&lt;/p&gt;
&lt;p&gt;\[L=(1+m)H\]&lt;/p&gt;
&lt;p&gt;Therefore, \(m = \frac{L-H}{H}\). Now m is fixed by L and H. Compared to \(m=\frac{M-L}{\alpha}\), m is now a constant that no longer depends on \(\alpha\). We have lost control of the amortization rate \(m\), which is not desirable.&lt;/p&gt;
&lt;h2 id="general-case"&gt;General Case&lt;/h2&gt;
&lt;p&gt;We now return to the general case where network delay and processing time are both present. The situation is represented below:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/general_case.png#center" alt="probabilistic-sync-general-case"&gt;&lt;/p&gt;
&lt;p&gt;Looking at this graph, we can see that a round trip takes slave P 2d of real time. Let&amp;rsquo;s also assume that 2D is the round-trip delay measured by P&amp;rsquo;s clock between sending and receiving. Then we can bound the clock time 2D based on the drift rate \(\rho\) of the clock:&lt;/p&gt;
&lt;p&gt;\[2d(1-\rho)\leq 2D \leq2d(1+\rho)\]&lt;/p&gt;
&lt;p&gt;Ignoring higher order terms of \(\rho\), we now have \(2d\leq(1+\rho)2D\).&lt;/p&gt;
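&lt;p&gt;The intermediate step is short. Dividing the left-hand inequality by \(1-\rho\) and expanding \(\frac{1}{1-\rho}\) as a geometric series (valid for a small drift rate \(\rho\)) gives:&lt;/p&gt;
&lt;p&gt;\[2d \leq \frac{2D}{1-\rho} = 2D(1+\rho+\rho^2+\cdots) \approx (1+\rho)2D\]&lt;/p&gt;
&lt;p&gt;For a drift rate of, say, \(\rho = 10^{-5}\) (a value picked only for illustration), the dropped terms are on the order of \(10^{-10}\).&lt;/p&gt;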
&lt;p&gt;Looking at the graph above, one thing to notice is that we do not know the times \(\alpha\) and \(\beta\) (this \(\alpha\) is unrelated to the amortization period in the previous section). However, if we had to pick one, \(\beta\) is more important than \(\alpha\), because once we know the value of \(\beta\), we know the lower bound of the round-trip delay. Here we assume min is the minimum amount of time required for a network transfer, and \(\beta\) is the time master Q spends between processing the request and sending the result back to P.&lt;/p&gt;
&lt;p&gt;Now we&amp;rsquo;ve narrowed down our focus to \(min+\beta\). The time interval between \(Q(t)=T\) and the arrival of &amp;lsquo;&amp;lsquo;Time=T&amp;rsquo;&amp;rsquo; at P will be at least \(min(1-\rho)\). This lower bound corresponds to \(\beta=0\), adjusted for the clock drift rate.&lt;/p&gt;
&lt;p&gt;The upper bound of the interval will be \((min+\beta)(1+\rho)\), assuming \(Q\) wastes no time waiting before it starts processing the request from P. The time required is \(min+\beta\), and we need to take Q&amp;rsquo;s drift rate \(\rho\) into account. We can also see that the total round-trip real time is \(2d=2min+\alpha+\beta\). Thus we get:&lt;/p&gt;
&lt;p&gt;\[\beta=2d-2min-\alpha \leq 2d-2min\]&lt;/p&gt;
&lt;p&gt;With this inequality, the upper bound on the interval measured from Q(t)=T can itself be bounded using only quantities that P can measure. Thus, we have:&lt;/p&gt;
&lt;p&gt;\[
\begin{eqnarray}
(min+\beta)(1+\rho) &amp;amp;\leq&amp;amp; (min+2d-2min)(1+\rho) \nonumber \newline
&amp;amp;=&amp;amp; (2d-min)(1+\rho) \nonumber \newline
&amp;amp;=&amp;amp;(1+\rho)2d-min(1+\rho) \nonumber \newline
&amp;amp;\leq&amp;amp;(1+\rho)2D(1+\rho)-min(1+\rho) \nonumber \newline
&amp;amp;=&amp;amp;(1+2\rho +\rho^2)2D-min(1+\rho) \nonumber \newline
&amp;amp;\approx&amp;amp;(1+2\rho )2D-min(1+\rho) \nonumber
\end{eqnarray}
\]&lt;/p&gt;
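&lt;p&gt;To get a feel for the sizes involved, here is the arithmetic for some made-up values: \(min = 1\) ms, a round trip of \(2D = 5\) ms as measured by P, and \(\rho = 10^{-5}\). The lower bound \(min(1-\rho)\) and the final upper bound above evaluate to:&lt;/p&gt;
&lt;p&gt;\[min(1-\rho) = 0.99999\ \text{ms}, \qquad (1+2\rho)2D - min(1+\rho) = 5.0001 - 1.00001 = 4.00009\ \text{ms}\]&lt;/p&gt;
&lt;p&gt;So in this example the drift terms barely matter; the uncertainty is dominated by the gap between \(min\) and \(2D - min\).&lt;/p&gt;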
&lt;p&gt;Now we can see that master Q&amp;rsquo;s clock time when P receives the response is bounded in the interval \([T+min(1-\rho), T+2D(1+2\rho )-min(1+\rho)]\). The takeaway here is that we can&amp;rsquo;t use real time \(t\) in a distributed system: it is merely an abstract concept, since every system in a network ultimately relies on its own clock time. We need to find the relationship between T and the master&amp;rsquo;s clock time because P will rely on T, not real time \(t\).&lt;/p&gt;</description></item><item><title>How to Put Papers on ArXiv</title><link>https://www.bodunhu.com/blog/posts/how-to-put-papers-on-arxiv/</link><pubDate>Tue, 25 Jun 2019 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/how-to-put-papers-on-arxiv/</guid><description>&lt;p&gt;Recently, I was trying to put my research paper draft on ArXiv. I thought it would be as simple as submitting the pdf file, which should take less than ten minutes. I was wrong. It took several hours to figure out what was going on. I included some tips here to prevent the mistakes I made from happening again.&lt;/p&gt;
&lt;p&gt;The first mistake I made was assuming that submitting a single pdf file would be sufficient. ArXiv apparently has mechanisms to detect whether the submitted pdf file was generated using TeX/LaTeX. According to ArXiv:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;a PDF file created from a TeX/LaTeX file will be rejected. There are &lt;a href="https://arxiv.org/help/faq/whytex"&gt;good reasons&lt;/a&gt; why arXiv insists on TeX/LaTeX source if it is available. arXiv produces PDF automatically from all TeX submitted source. For information on viewing the PDF provided by arXiv, see our PDF browsing help.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, the first thing I came up with was to somehow make the pdf appear &amp;ldquo;anonymous&amp;rdquo; to ArXiv. There were several methods but none of them appeared to be practical. If you are interested, here is a &lt;a href="https://tex.stackexchange.com/questions/95080/making-an-anonymous-pdf-file-using-pdflatex/95109"&gt;link&lt;/a&gt; to some methods that might be useful. &lt;a href="https://ctan.org/pkg/pdfprivacy"&gt;pdfprivacy&lt;/a&gt; is a package used to remove or suppress pdf meta-data; it sounds promising but I haven&amp;rsquo;t tried it yet.&lt;/p&gt;
&lt;p&gt;So the only option left was to follow the restriction described. It was confusing at the beginning because everything worked like a charm on Overleaf but completely fell apart when I tried to compile the sources locally. I was under the impression that if it worked on Overleaf it should work everywhere else, which caused many hours of searching for potential problems related to my local environment.&lt;/p&gt;
&lt;p&gt;After hours of frustration, it started to appear that there was nothing wrong with my local environment. The pdf produced by Overleaf only &amp;ldquo;appeared&amp;rdquo; correct. There were several syntax issues in my .bib file, mostly caused by careless copy-and-paste and duplicate records. Overleaf simply suppressed some of those errors, which led me to think everything was fine.&lt;/p&gt;
&lt;p&gt;There were also warning messages popping up during compilation. Most of them were related to incomplete bibliography entries. Something like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Warning--empty journal in article&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The problem was that bibliographic information obtained from Google Scholar might include serious mistakes. The warning message says that entries of type @article require a non-empty journal field. For example, the entry could look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;@article{article,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; title={Something Cool},
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; author={Somebody},
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; year={2019},
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; publisher={IET}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The four required fields for entries of type @article are author, title, journal, and year. The entry above has no journal field, which is why the warning message showed up. It doesn&amp;rsquo;t really affect the compilation on ArXiv, though.&lt;/p&gt;
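&lt;p&gt;A sketch of what a repaired entry might look like. The journal name below is a placeholder I made up, not a real venue; the point is only that a non-empty journal field is present alongside author, title, and year.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;@article{article,
  title={Something Cool},
  author={Somebody},
  journal={Journal of Placeholder Studies},
  year={2019},
  publisher={IET}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;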
&lt;p&gt;When I finally compiled all sources locally with success, I immediately uploaded all the sources to ArXiv hoping it would finally work. It didn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;! LaTeX Error: File 'shoc.pdf' not found.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;I had no idea why this occurred. All the sources I used to compile were uploaded to ArXiv, so there was no reason for it to fail. More surprisingly, the references only failed for my .eps files but not for my .png files. According to ArXiv, there are several &lt;a href="https://arxiv.org/help/faq/psbad"&gt;reasons&lt;/a&gt; why PostScript (PS/EPS) figures might fail on ArXiv. Judging from the error message, it appears the system was trying to find a file called &lt;em&gt;shoc.pdf&lt;/em&gt; to insert into the main pdf but somehow couldn&amp;rsquo;t locate it.&lt;/p&gt;
&lt;p&gt;The solution was to upload the pdf files produced locally to ArXiv. However, the locally generated files have slightly different names: each one is named &amp;ldquo;name-eps-converted-to.pdf&amp;rdquo;. What a hassle!&lt;/p&gt;
&lt;p&gt;Overall, uploading to ArXiv was not the most pleasant experience. LaTeX&amp;rsquo;s compilation system is the one to blame.&lt;/p&gt;</description></item><item><title>A Little Review on Barrelfish Memory Managements</title><link>https://www.bodunhu.com/blog/posts/a-little-review-on-barrelfish-memory-managements/</link><pubDate>Mon, 18 Feb 2019 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/a-little-review-on-barrelfish-memory-managements/</guid><description>&lt;p&gt;Memory management has been discussed numerous times and still remains a huge topic. Virtual vs. physical memory, physical frame allocation, MMUs, page faults, address space layout, and demand paging and swapping are familiar terms for every undergrad in college. In monolithic kernels such as Linux, much of this functionality is handled in the kernel. However, there are OSes, such as Barrelfish, that take a different approach by pushing this functionality to user space. Many concepts here will thus be borrowed from the &lt;a href="http://www.barrelfish.org/"&gt;Barrelfish OS&lt;/a&gt;. I also borrow some material from the main pdf of the Barrelfish course materials provided by Professor &lt;a href="https://www.cs.utexas.edu/~simon/"&gt;Simon Peter&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="memory-management-in-general"&gt;Memory Management in General&lt;/h2&gt;
&lt;p&gt;Microkernels like &lt;a href="https://en.wikipedia.org/wiki/L4_microkernel_family"&gt;L4&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Mach_(kernel)"&gt;Mach&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/ChorusOS"&gt;Chorus&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Spring_(operating_system)"&gt;Spring&lt;/a&gt; trapped page faults in the kernel but then reflected them up to other processes, which carried out the
actual page fault handling. This was done on a per-region basis, so each area of
virtual memory was associated with some paging server. Memory objects could be shared between different processes and mapped differently in different address spaces.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Barrelfish/os.png#center" alt="operating-system-kernel"&gt;&lt;/p&gt;
&lt;p&gt;With this abstraction, what happens on a page fault depends entirely on the code in the user-level pager. This design is highly extensible since it&amp;rsquo;s all user code, and it is also isolated: if a user-level pager crashes, there&amp;rsquo;s a good chance the rest of the OS can continue quite happily, since much of the functionality has been moved out of the kernel.&lt;/p&gt;
&lt;p&gt;However, moving functionality out of the kernel raises an important question: if user-space processes can manipulate virtual address spaces, how can
we make sure that one user&amp;rsquo;s program can&amp;rsquo;t manipulate another program&amp;rsquo;s address space and memory? Here we will introduce the concept of capabilities.&lt;/p&gt;
&lt;h2 id="capabilities"&gt;Capabilities&lt;/h2&gt;
&lt;p&gt;Capabilities are introduced to solve the access control problem in operating systems. Access control is the problem of specifying, and enforcing, which subjects (or principals) can perform particular actions on particular objects in an operating system.&lt;/p&gt;
&lt;p&gt;The Barrelfish documentation does a good job illustrating capabilities: abstractly, access control can be thought of as a matrix, which represents all possible combinations of operations in the system. Each row of the matrix represents a
different subject, and each column represents a different object. Each entry in the
matrix contains a list of permissible actions.&lt;/p&gt;
&lt;p&gt;Thus, we have two things to emphasize: the subject and the object. An ACL (access control list) focuses on the object being operated on.&lt;/p&gt;
&lt;p&gt;A good example: whenever you enter &lt;em&gt;ls -l&lt;/em&gt; in a Linux terminal, you get a list of entries specifying the attributes of each file. Here the attributes describe how an object (in this case, a file) may be accessed.&lt;/p&gt;
&lt;p&gt;On the other hand, a capability can be thought of as a &amp;ldquo;key&amp;rdquo; or &amp;ldquo;license&amp;rdquo;. It is an unforgeable token which grants authority. Possession of a capability for an object gives the holder the right to perform certain operations on the object.&lt;/p&gt;
&lt;p&gt;A good example will be the file descriptor in Linux. A file is accessed through its file descriptor. Here the file descriptor serves as the &amp;ldquo;key&amp;rdquo; to gain access to the file itself. Capabilities provide fine-grained access control: it is easy to provide access to specific subjects, and it is easy to delegate permissions to others in a controlled manner.&lt;/p&gt;
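&lt;p&gt;The two views can be sketched in a few lines of Python. This is only an illustration: the subjects, objects and helper names are made up, and a plain tuple is of course not unforgeable the way a real capability must be.&lt;/p&gt;

```python
# Access control matrix: rows are subjects, columns are objects, and each
# entry is the set of permitted actions. All names here are made up.
matrix = {
    "alice": {"report.txt": {"read", "write"}, "printer": {"use"}},
    "bob": {"report.txt": {"read"}},
}

def acl_for(obj):
    """Column view (ACL): stored with the object, lists every subject's rights."""
    return {subj: row[obj] for subj, row in matrix.items() if obj in row}

def capabilities_of(subject):
    """Row view (capability list): the (object, rights) tokens a subject holds."""
    return [(obj, frozenset(rights)) for obj, rights in matrix[subject].items()]

def allowed(capability, obj, action):
    """A capability-style check looks only at the token being presented."""
    cap_obj, rights = capability
    return cap_obj == obj and action in rights
```

&lt;p&gt;An ACL check starts from the object and asks who the caller is, while the capability check never needs to know the caller&amp;rsquo;s identity: holding the token is enough, which is what makes delegation easy.&lt;/p&gt;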
&lt;p&gt;Note that to be correct, any capability representation must protect capabilities
against forgery. Capabilities can be implemented in various ways, such as tagged capabilities, sparse capabilities, or partitioned capabilities. Barrelfish uses partitioned capabilities.&lt;/p&gt;
&lt;p&gt;In partitioned capabilities, the kernel ensures that memory used to store capabilities is always separated from that used by user processes to store data and code, for example by using the MMU or ensuring that capability memory is only accessible in kernel mode. The OS maintains the list of capabilities each user principal holds (the clist), and explicitly validates access when performing any privileged operation. Thus, whenever the user accesses memory, the operation can only be done through the resource&amp;rsquo;s corresponding capability. For example, one can map a page frame into a page table through a function call that takes only capabilities:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Caprefa&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Caprefb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="capabilities-in-barrelfish"&gt;Capabilities in Barrelfish&lt;/h2&gt;
&lt;p&gt;According to the Barrelfish documentation, all memory in Barrelfish (and some other system resources which do not occupy memory) is described using capabilities. Capabilities are typed, and capabilities can be retyped by users holding them according to certain rules and restrictions. The official documentation has a very good explanation of capability management in Barrelfish. Here are the permissible types for the retype invocation:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Barrelfish/cap.png#center" alt="barrelfish-os-capability"&gt;&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="http://www.barrelfish.org/publications/TN-013-CapabilityManagement.pdf"&gt;Image source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Capabilities can refer to memory regions. Such capabilities can also be split, resulting in two new capabilities of the same type, one for each half of the region. Some of the more important capability types in Barrelfish are shown in the figure below. The picture is from the Barrelfish manual provided in the CS378 Multi-core class by Simon Peter:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Barrelfish/cap_aos.png#center" alt="barrelfish-os-capability-aos"&gt;&lt;/p&gt;
&lt;p&gt;Allocation and management of physical memory is achieved by retyping and splitting operations on capabilities. Most kernels constantly allocate and deallocate memory for a wide variety of purposes, much as any large C program relies heavily on &lt;code&gt;malloc&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The problem is what the kernel should do when it runs out of memory. The current solution in Linux is little more than &amp;ldquo;kill a random process and reclaim its memory&amp;rdquo;, which can be a problem for system stability. In Barrelfish, all kernel objects are actually allocated by user programs. If a user process wants to create another process (or dispatcher in Barrelfish parlance), it has to get a capability to a DRAM area of the right size, retype this capability to type Dispatcher, and hand it to the kernel. This will be covered in later posts. To access different types of memory resources, the corresponding capability has to be retyped to the right type.&lt;/p&gt;
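&lt;p&gt;To make splitting and retyping concrete, here is a toy model in Python. It is a sketch under loud assumptions: the &lt;code&gt;Cap&lt;/code&gt; class, the function names and the tiny &lt;code&gt;ALLOWED_RETYPES&lt;/code&gt; table are invented for illustration and are not the Barrelfish API; the real retype rules are the ones in the figure above.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cap:
    """Toy capability: a typed reference to the region [base, base + size)."""
    kind: str
    base: int
    size: int

def split(cap):
    """Split a region capability into two same-typed halves."""
    half = cap.size // 2
    return (Cap(cap.kind, cap.base, half),
            Cap(cap.kind, cap.base + half, cap.size - half))

# Hypothetical, heavily reduced retype rules; the real table is the one in
# the Barrelfish capability management documentation.
ALLOWED_RETYPES = {"RAM": {"Frame", "Dispatcher", "CNode"}}

def retype(cap, new_kind):
    """Return a capability of new_kind over the same region, if permitted."""
    if new_kind not in ALLOWED_RETYPES.get(cap.kind, set()):
        raise ValueError(f"cannot retype {cap.kind} to {new_kind}")
    return Cap(new_kind, cap.base, cap.size)
```

&lt;p&gt;Creating a dispatcher then amounts to splitting a RAM capability until a piece of the right size is available, retyping that piece to Dispatcher, and handing the result to the kernel.&lt;/p&gt;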
&lt;h2 id="more-on-implementation"&gt;More On Implementation&lt;/h2&gt;
&lt;p&gt;In Barrelfish, every capability resides in a slot in a CNode, so a pair (CNode, slot) would identify a capability. It is important to point out that the CNode is another capability itself. Each process in Barrelfish has a CSpace which is structured as a two-level table. So there are actually two different CNode capability types - one for the first level of the table, and one for the second. Every process has, within its &amp;ldquo;dispatcher control block&amp;rdquo;, a pointer to the top-level or root CNode which the kernel can traverse.&lt;/p&gt;
&lt;p&gt;A capability reference in Barrelfish is very similar to a virtual address (VA): the first few bits represent an index into the first-level L1CNode, while the next few bits refer to a slot in the CNode referred to by the capability in that L1CNode slot. Here is a picture from the main pdf showing how the CSpace is represented in Barrelfish:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/Barrelfish/Cspace.png#center" alt="barrelfish-os-cspace"&gt;&lt;/p&gt;
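&lt;p&gt;The lookup works much like a two-level page table walk. The sketch below shows the idea in Python; the 8-bit slot width, the dictionary-based CNodes and the function names are assumptions chosen for illustration, not the actual Barrelfish layout.&lt;/p&gt;

```python
# Hypothetical layout: the low 8 bits select a slot in a second-level CNode,
# the bits above them select a slot in the root (L1) CNode.
L2_BITS = 8
L2_SLOTS = 2 ** L2_BITS

def decode(capaddr):
    """Split a capability address into (L1 index, L2 slot)."""
    return divmod(capaddr, L2_SLOTS)

def lookup(root_cnode, capaddr):
    """Walk the two-level table: root CNode -> second-level CNode -> capability."""
    l1_index, l2_slot = decode(capaddr)
    l2_cnode = root_cnode[l1_index]  # itself a capability in the real system
    return l2_cnode[l2_slot]
```

&lt;p&gt;With this layout, the address &lt;code&gt;0x0305&lt;/code&gt; refers to slot 5 of the CNode stored in slot 3 of the root CNode.&lt;/p&gt;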
&lt;h2 id="thoughts-on-design-decisions"&gt;Thoughts on Design Decisions&lt;/h2&gt;
&lt;p&gt;Even though the CSpace structure is pretty straightforward to understand, the actual implementation is a lot more complicated. Since the CSpace is not directly accessible by user-space programs, additional data structures are used to keep track of available memory resources.&lt;/p&gt;
&lt;p&gt;In our implementation, the user process keeps a doubly linked list of &lt;code&gt;struct mmnode&lt;/code&gt; to indicate the memory available for allocation. Each element in the free list tracks the information corresponding to one capability. However, there is a big problem with this seemingly simple implementation. Every time we allocate a piece of memory from a memory region, a new capability is created while the old capability still remains, pointing to the memory range as it was before the allocation. Therefore, the old capability covers memory that is already allocated and managed by other capabilities.&lt;/p&gt;
&lt;p&gt;To solve this problem, we update the allocation information in the &lt;code&gt;struct mmnode&lt;/code&gt; each time an allocation occurs. If 20 units of memory are allocated from a capability covering physical addresses 0 to 100, then the memory available for the next allocation is 20 to 100, even though the capability itself still covers 0 to 100. By restricting subsequent accesses to the new memory range only, the old capability can still be kept around and used later for retyping.&lt;/p&gt;
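&lt;p&gt;The bookkeeping can be sketched as follows. This is a simplified Python model rather than our C code: the field names are hypothetical, the list linkage is left out, and the actual retype of the allocated range is only mentioned in a comment.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class MMNode:
    """Free-list node: the capability's full range plus what is still free."""
    cap_base: int   # start of the region the capability covers (never changes)
    cap_size: int   # size of that region (never changes)
    free_base: int  # start of the part still available for allocation
    free_size: int  # size of that part

def allocate(node, size):
    """Carve `size` units off the front of the node's free range.

    Returns the (base, size) of the allocated range, which would then be
    retyped out of the original capability, or None if it does not fit.
    """
    if size > node.free_size:
        return None
    base = node.free_base
    node.free_base += size
    node.free_size -= size
    return base, size
```

&lt;p&gt;Allocating 20 units from a node covering 0 to 100 returns the range starting at 0 and leaves 20 to 100 free, while &lt;code&gt;cap_base&lt;/code&gt; and &lt;code&gt;cap_size&lt;/code&gt; still describe the whole capability.&lt;/p&gt;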
&lt;p&gt;Another problem emerges when we try to free memory. Since everything is managed by capabilities, freeing a piece of memory also involves managing the capability responsible for it. An intuitive idea is that whenever a piece of memory is freed, the corresponding capability is merged with the capability managing an adjacent piece of memory.&lt;/p&gt;
&lt;p&gt;However, capabilities cannot be merged, so an alternative would be to simply destroy the capability during free. This turns out to be an even bigger problem in Barrelfish.&lt;/p&gt;
&lt;p&gt;Imagine the scenario where capability A covers 0 to 100 and has been partially allocated, so that only 20 to 100 is still free. Later on, another piece of memory is freed; it is managed by a capability with base 100 and size 20, so its memory range covers 100 to 120, which suggests that the two capabilities could be &amp;ldquo;merged&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;In this case, if the first capability is destroyed, all children of the first capability will also be destroyed, so the already allocated memory from 0 to 20 will be thrown away, which is not desired. If the second capability is destroyed, the first one will also have to be destroyed to create a new capability covering 20 to 120, which still results in the destruction of capability A.&lt;/p&gt;
&lt;p&gt;Our assumption here is that the parent or root capability is never destroyed once added to the free list. Whenever a capability needs to be freed, the memory manager is responsible for making sure the capability is only merged with another capability from the same parent capability.&lt;/p&gt;
&lt;p&gt;This is done by creating another list of nodes that tracks all parent capabilities. A node is only added to this list when the memory manager adds new capabilities to the free list. After the user initiates a free, the memory manager first creates a new free &lt;code&gt;struct mmnode&lt;/code&gt;, then finds the node&amp;rsquo;s parent node and copies the parent&amp;rsquo;s capability and attributes to the newly created node, with an updated offset to indicate that the memory hasn&amp;rsquo;t been freed yet.&lt;/p&gt;
&lt;p&gt;After that, the memory manager inserts the node into the free list. If the memory manager finds that there are capabilities adjacent to the just-added node, it simply updates the attributes of the corresponding mmnode to indicate that merging succeeded. The old mmnode is simply thrown away.&lt;/p&gt;
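&lt;p&gt;The merge rule can be sketched like this. It is a reduced Python model of the idea, not our implementation: a plain list stands in for the doubly linked list, the parent is a hypothetical string identifier, and copying the parent&amp;rsquo;s capability into the new node is left out.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class FreeNode:
    parent: str  # hypothetical identifier of the parent capability
    base: int
    size: int

def free(free_list, parent, base, size):
    """Insert a freed range and merge it with adjacent ranges of the same parent."""
    node = FreeNode(parent, base, size)
    kept = []
    for other in free_list:
        same_parent = other.parent == node.parent
        if same_parent and other.base + other.size == node.base:
            # `other` ends where the freed range starts: grow downwards.
            node = FreeNode(parent, other.base, other.size + node.size)
        elif same_parent and node.base + node.size == other.base:
            # `other` starts where the freed range ends: grow upwards.
            node = FreeNode(parent, node.base, node.size + other.size)
        else:
            kept.append(other)
    kept.append(node)
    kept.sort(key=lambda n: n.base)
    return kept
```

&lt;p&gt;Note that two ranges that touch but belong to different parents stay separate nodes, which is exactly the restriction described above.&lt;/p&gt;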
&lt;p&gt;The advantage of this implementation is that root or parent capabilities are kept around and the next retype will be fairly simple. The implementation is also very straightforward.&lt;/p&gt;
&lt;p&gt;There are of course more efficient solutions than a linked list. For example, Linux uses both linked lists and red-black trees to store thread information. The redundant data structures can be used in different scenarios when appropriate. However, we only use this simplified version to prove our concepts. Optimizations vary but the general concept still works pretty well.&lt;/p&gt;</description></item><item><title>Pascal GPU memory and cache hierarchy</title><link>https://www.bodunhu.com/blog/posts/pascal-gpu-memory-and-cache-hierarchy/</link><pubDate>Tue, 15 Jan 2019 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/pascal-gpu-memory-and-cache-hierarchy/</guid><description>&lt;p&gt;Efficient memory access is what allows a program to fully utilize the computational power of graphics processing units (GPUs). However, many GPU vendors like NVIDIA keep the details of the GPU memory hierarchy secret. It therefore becomes hard to measure GPU performance and to understand memory access patterns, which are key to improving a program&amp;rsquo;s performance.&lt;/p&gt;
&lt;p&gt;We introduce a fine-grained micro-benchmark approach and apply it to the Pascal generation. The Turing architecture might give different results, but the method we use here can be applied to it as well with slight modification. The method in this guide is inspired by the research paper &lt;a href="https://ieeexplore.ieee.org/document/7445236"&gt;Dissecting GPU Memory Hierarchy through Microbenchmarking&lt;/a&gt;. Here we will explain how P-Chase works and walk through a small example.&lt;/p&gt;
&lt;h2 id="memory-hierarchy-overview"&gt;Memory Hierarchy Overview&lt;/h2&gt;
&lt;p&gt;The GPU memory hierarchy differs from the CPU memory hierarchy. In CUDA terminology, the GPU memory space can be categorized into these groups: registers, constant memory, shared memory, texture memory, local memory, and global memory. Each memory space has its own properties. Since we are interested in the cache systems, here is a picture demonstrating the memory hierarchy of an NVIDIA GPU:&lt;/p&gt;
&lt;p align="center"&gt;
&lt;img src="https://gistbok.ucgis.org/sites/default/files/1000px-Memory.svg_.png" width="450"&gt;
&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="https://gistbok.ucgis.org/bok-topics/graphics-processing-units-gpus"&gt;Image source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The characteristics of each memory space can be found in the &lt;a href="https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf"&gt;NVIDIA CUDA C Programming Guide&lt;/a&gt;. Here we will focus on the memory spaces we are interested in. The paper lists some properties of these target memory spaces:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th style="text-align: center"&gt;Type&lt;/th&gt;
&lt;th style="text-align: right"&gt;Cached&lt;/th&gt;
&lt;th style="text-align: right"&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Global&lt;/td&gt;
&lt;td style="text-align: center"&gt;R/W&lt;/td&gt;
&lt;td style="text-align: right"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: right"&gt;All Threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared&lt;/td&gt;
&lt;td style="text-align: center"&gt;R/W&lt;/td&gt;
&lt;td style="text-align: right"&gt;N/A&lt;/td&gt;
&lt;td style="text-align: right"&gt;Thread Blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Texture&lt;/td&gt;
&lt;td style="text-align: center"&gt;R&lt;/td&gt;
&lt;td style="text-align: right"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: right"&gt;All Threads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even though the paper targets the Fermi, Kepler and Maxwell generations of GPUs, the properties in the table still hold for Pascal GPUs and possibly Turing as well. The cached global/texture memory uses a two-level caching system. The L1 cache is located in each streaming multiprocessor (SM), while the L2 cache sits outside the SMs and is shared among all of them. The L2 cache is unified for instruction, data and page table accesses. According to the CUDA documentation, Pascal, like Maxwell, combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function was previously served by the separate L1 cache in Fermi and Kepler.&lt;/p&gt;
&lt;p&gt;The page table is used by the GPU to map virtual addresses to physical addresses, and is usually stored in global memory. Page table entries are cached in the TLB to reduce memory access latency. When a thread cannot find the page entry in the TLB, it has to access global memory to search the page table, which introduces significant memory access latency.&lt;/p&gt;
&lt;p&gt;The GPU-specific shared memory is located in the SMs. On Fermi and Kepler devices, it shares memory space with the L1 data cache. On Maxwell and Pascal devices, it has dedicated space, since the functionality of the L1 and texture caches has been merged. One thing to note here is that shared memory is accessed by thread blocks. Thread blocks remain limited to 48 KB of shared memory in Pascal, while GP100 offers 64 KB of shared memory per SM and GP104 offers 96 KB per SM. NVIDIA therefore recommends that applications use at most 32 KB of shared memory in any one thread block. This would, for example, allow at least two thread blocks to fit per GP100 SM, or three thread blocks per GP104 SM.&lt;/p&gt;
&lt;p&gt;However, we should be careful: by default, GP100 caches global loads in the L1/Texture cache. In contrast, GP104 follows Kepler and Maxwell in caching global loads in L2 only, unless the LDG read-only data cache mechanism introduced in Kepler is used. As with previous architectures, GP104 allows the developer to opt in to caching all global loads in the unified L1/Texture cache by passing the &lt;code&gt;-Xptxas -dlcm=ca&lt;/code&gt; flag to &lt;code&gt;nvcc&lt;/code&gt; at compile time. Even though both GP100 and GP104 belong to the Pascal family, we only focus on GP100 here because that&amp;rsquo;s the GPU we use. Another thing to notice is that, unlike Maxwell but similar to Kepler, Pascal caches thread-local memory in the L1 cache. This can mitigate the cost of register spills compared to Maxwell. To illustrate our point, we checked both &lt;code&gt;cudaDevAttrGlobalL1CacheSupported&lt;/code&gt; and &lt;code&gt;cudaDevAttrLocalL1CacheSupported&lt;/code&gt; on a Tesla P100 and a GTX 1080 and found both attributes to be 1.&lt;/p&gt;
&lt;p&gt;In addition to the L2 data cache, global memory data that is read-only for the entire lifetime of a kernel can be cached in the read-only data cache on devices of compute capability 3.5 or above. We will also explore the size of this read-only cache using the &lt;code&gt;__ldg()&lt;/code&gt; intrinsic.&lt;/p&gt;
&lt;h2 id="p-chase"&gt;P-Chase&lt;/h2&gt;
&lt;p&gt;Most existing GPU microbenchmark studies on cache architecture assume a classical set-associative cache model with the least recently used (LRU) replacement policy, the same as the conventional CPU cache. So here we will use this assumption and proceed with our experiments. Here are some notations we will use throughout this post.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Notation&lt;/th&gt;
&lt;th style="text-align: center"&gt;Description&lt;/th&gt;
&lt;th style="text-align: right"&gt;Notation&lt;/th&gt;
&lt;th style="text-align: right"&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td style="text-align: center"&gt;Cache Size&lt;/td&gt;
&lt;td style="text-align: right"&gt;N&lt;/td&gt;
&lt;td style="text-align: right"&gt;array size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;b&lt;/td&gt;
&lt;td style="text-align: center"&gt;cache line size&lt;/td&gt;
&lt;td style="text-align: right"&gt;s&lt;/td&gt;
&lt;td style="text-align: right"&gt;stride size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;a&lt;/td&gt;
&lt;td style="text-align: center"&gt;cache associativity&lt;/td&gt;
&lt;td style="text-align: right"&gt;k&lt;/td&gt;
&lt;td style="text-align: right"&gt;iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;T&lt;/td&gt;
&lt;td style="text-align: center"&gt;number of cache sets&lt;/td&gt;
&lt;td style="text-align: right"&gt;r&lt;/td&gt;
&lt;td style="text-align: right"&gt;cache miss rate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Under our assumptions, data is loaded from main memory into the cache in units of a cache line. The number of words in a cache line is referred to as the line size (\(b\)). For the LRU set-associative cache, the cache memory is divided into \(T\) cache sets, each of which consists of \(a\) cache lines. Three assumptions are essential when using this kind of cache model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Assumption 1&lt;/strong&gt; All cache sets have the same size. The cache parameter should satisfy \(T \cdot a \cdot b = C\).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Assumption 2&lt;/strong&gt; In the memory address, the bits representing the cache set are immediately followed by the bits representing the offset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Assumption 3&lt;/strong&gt; Cache replacement policy should be LRU.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
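&lt;p&gt;Assumptions 1 and 2 can be written down in a few lines. The Python sketch below uses word addresses and made-up example numbers; the function names are ours and do not come from the paper.&lt;/p&gt;

```python
def cache_geometry(C, a, b):
    """Return the number of sets T for a cache satisfying T * a * b == C."""
    T, rest = divmod(C, a * b)
    if rest != 0:
        raise ValueError("C must be a multiple of a * b (Assumption 1)")
    return T

def locate(addr, T, b):
    """Map a word address to (tag, set index, offset) under Assumption 2."""
    offset = addr % b            # lowest bits: position inside the line
    set_index = (addr // b) % T  # next bits: which set the line maps to
    tag = addr // (b * T)        # remaining high bits
    return tag, set_index, offset
```

&lt;p&gt;For a hypothetical cache with C = 1024 words, a = 4 and b = 8, this gives T = 32 sets, and word address 267 lands in set 1 at offset 3 with tag 1.&lt;/p&gt;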
&lt;p&gt;We will see why these assumptions are essential as we proceed with the experiment. We won&amp;rsquo;t go through exactly how P-Chase works; this &lt;a href="https://arxiv.org/pdf/1509.02308.pdf"&gt;paper&lt;/a&gt; does a good job of illustrating it. The takeaway is that we traverse an array with one more element than the cache can hold, so that cache misses start to occur periodically. An array with no more elements than the cache capacity, in contrast, always results in cache hits, so no access overhead is introduced once all data is loaded into the cache. This is the algorithm the paper proposes, and we will use it for the experiment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;__global__&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;KernelFunction&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;//declare shared memory space
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s_tvalue&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;__shared__&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;s_index&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// preheat the data (implementation varies)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;//store the array index
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// The following store is essential: due to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// instruction-level parallelism (ILP), clock() may
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// overlap with the previous instruction and even return before
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// that instruction finishes. For example,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// end_time=clock() can return before j = my_array[j] completes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// Storing s_index[it] = j adds a data dependency on the
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// previous line, so the memory access is finished before
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// end_time=clock() starts.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s_index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;//store the access latency
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s_tvalue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The steps are the same as the paper proposes, so here we show the paper&amp;rsquo;s method:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Determine cache size C. We set s to 1. We then initialize N with a small value and increase it gradually until the first cache miss appears. C equals the maximum N for which all memory accesses are cache hits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Determine cache line size b. We set s to 1. We begin with N = C + 1 and increase N gradually again. When N &amp;lt; C + b + 1, the numbers of cache misses are close. When N is increased to C + b + 1, there is a sudden increase in the number of cache misses, even though we only increase N by 1. Accordingly, we can find b. Based on the memory access patterns, we can also get a general idea of the cache replacement policy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Determine number of cache sets T. We set s to b. We then start with N = C and increase N at the granularity of b. Every increment causes cache misses in a new cache set. When N &amp;gt; C + (T − 1)b, all cache sets are missed. We can then deduce T from the cache miss patterns accordingly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Determine cache replacement policy. As mentioned before, if the cache replacement policy is LRU, then the memory access process should be periodic and all the cache ways in the cache set are missed. If the memory access process is aperiodic, then the replacement policy cannot be LRU. Under this circumstance, we set N = C + b, s = b with a considerably large k (k &amp;gt;&amp;gt; N/s) so that we can traverse the array multiple times. All cache misses are from one cache set. Every cache miss is caused by its former cache replacement because we overflow the cache by only one cache line. We have the accessed data indices, thus we can reproduce the full memory access process and find how the cache lines are updated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
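&lt;p&gt;To make step 1 concrete, here is a minimal host-side sketch that runs the same procedure against a simulated cache instead of a real GPU. Everything in it is an assumption made for illustration: the 64-element capacity, the fully associative LRU policy, a line size of one element, and all function names. It only shows how growing N until the first miss reveals C.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

#define CACHE_WORDS 64   /* assumed capacity of the simulated cache, in elements */

static int tags[CACHE_WORDS];       /* array index held by each slot, -1 if empty */
static int last_use[CACHE_WORDS];   /* logical time of the most recent access */
static int now;

static void cache_reset(void) {
    for (int i = 0; i &amp;lt; CACHE_WORDS; i++) { tags[i] = -1; last_use[i] = 0; }
    now = 0;
}

/* Returns 1 on a hit; on a miss, replaces the least recently used slot and returns 0. */
static int cache_access(int index) {
    int victim = 0;
    now++;
    for (int i = 0; i &amp;lt; CACHE_WORDS; i++) {
        if (tags[i] == index) { last_use[i] = now; return 1; }
        if (last_use[i] &amp;lt; last_use[victim]) victim = i;
    }
    tags[victim] = index;
    last_use[victim] = now;
    return 0;
}

/* Traverse an n-element array with stride 1: one warm-up pass, then count
   the misses seen during a second pass. */
static int misses_after_warmup(int n) {
    int misses = 0;
    cache_reset();
    for (int i = 0; i &amp;lt; n; i++) cache_access(i);
    for (int i = 0; i &amp;lt; n; i++) misses += !cache_access(i);
    return misses;
}

/* Step 1: grow N until the first miss appears; C is the last N without misses. */
static int find_cache_size(int max_n) {
    for (int n = 1; n &amp;lt;= max_n; n++)
        if (misses_after_warmup(n) &amp;gt; 0) return n - 1;
    return max_n;
}

int main(void) {
    printf(&amp;#34;detected capacity: %d elements\n&amp;#34;, find_cache_size(1024));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With these assumptions the program reports a capacity of 64 elements: at N = 65 the LRU policy evicts each element just before it is needed again, so every access in the second pass misses.&lt;/p&gt;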
&lt;h2 id="texture-l1-cache-and-read-only-data-cache"&gt;Texture L1 Cache and Read-only Data Cache&lt;/h2&gt;
&lt;p&gt;We use the &lt;a href="http://www.comp.hkbu.edu.hk/~chxw/Code/fine_grain_Maxwell_texture_L1.cu"&gt;code&lt;/a&gt; with our own data preheat implementation added, because the texture L1 cache can potentially be larger than the shared memory. The original code uses the first iteration of the loop in the algorithm as a way to preheat data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6144&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// texture L1 may hold more elements,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// so the first iteration may not touch
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// every element; some cold misses can
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// then show up in the second iteration,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;// causing confusion
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;tex1Dfetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tex_ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s_value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s_tvalue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;However, if the texture L1 cache is larger than the shared memory allowed for each thread block, then some reads in the second loop will trigger cache misses. Such misses are in fact cold misses, not misses that occur after the texture L1 cache has been completely filled. One solution is to increase the iteration count to a much larger number, so that the first iteration always fills up the texture L1 cache. Note what happens if you move the data
preheat out into a loop of its own, such as&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;tex1Dfetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tex_ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The compiler can optimize this whole loop out, since its result is never used, and thus nothing actually gets executed.&lt;/p&gt;
&lt;p&gt;After we run the modified code, the result shows that cache misses start when we set the array size to 6145, indicating that the texture L1 cache can hold 6144 ints, which is equivalent to 24 KB. We also notice that each miss is followed by 7 consecutive hits. This means the cache line size is 8 words (b = 32 bytes). The structure of the texture L1 cache is shown below; notice there are 192 lines in each set:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Set1&lt;/th&gt;
&lt;th style="text-align: center"&gt;Set2&lt;/th&gt;
&lt;th style="text-align: right"&gt;Set3&lt;/th&gt;
&lt;th style="text-align: right"&gt;Set4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-8&lt;/td&gt;
&lt;td style="text-align: center"&gt;33-40&lt;/td&gt;
&lt;td style="text-align: right"&gt;65-72&lt;/td&gt;
&lt;td style="text-align: right"&gt;97-104&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9-16&lt;/td&gt;
&lt;td style="text-align: center"&gt;41-48&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17-24&lt;/td&gt;
&lt;td style="text-align: center"&gt;49-56&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-32&lt;/td&gt;
&lt;td style="text-align: center"&gt;57-64&lt;/td&gt;
&lt;td style="text-align: right"&gt;89-96&lt;/td&gt;
&lt;td style="text-align: right"&gt;121-128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;129-136&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;hellip;&lt;/td&gt;
&lt;td style="text-align: center"&gt;&amp;hellip;&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;td style="text-align: right"&gt;&amp;hellip;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6041-6048&lt;/td&gt;
&lt;td style="text-align: center"&gt;6073-6080&lt;/td&gt;
&lt;td style="text-align: right"&gt;6105-6112&lt;/td&gt;
&lt;td style="text-align: right"&gt;6137-6144&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
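&lt;p&gt;The numbers above can be double-checked with a few lines of host code. This is just a sketch of the arithmetic: the constant and function names are made up here, and the set mapping simply encodes the layout in the table (runs of 32 consecutive words share a set), not anything taken from NVIDIA documentation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

enum {
    WORD_BYTES     = 4,     /* size of one int */
    CAPACITY_WORDS = 6144,  /* largest array that produced no misses */
    LINE_WORDS     = 8,     /* one miss followed by 7 hits */
    NUM_SETS       = 4,
    RUN_WORDS      = 32     /* consecutive words mapped to the same set */
};

static int capacity_bytes(void) { return CAPACITY_WORDS * WORD_BYTES; }
static int line_bytes(void)     { return LINE_WORDS * WORD_BYTES; }
static int lines_per_set(void)  { return capacity_bytes() / line_bytes() / NUM_SETS; }

/* 0-based set index for a 1-based word index, following the table. */
static int set_of_word(int word) { return ((word - 1) / RUN_WORDS) % NUM_SETS; }

int main(void) {
    printf(&amp;#34;capacity: %d bytes, line: %d bytes, lines per set: %d\n&amp;#34;,
           capacity_bytes(), line_bytes(), lines_per_set());
    /* Print 1-based set numbers to match the table headers. */
    printf(&amp;#34;word 1 in Set%d, word 33 in Set%d, word 129 in Set%d\n&amp;#34;,
           set_of_word(1) + 1, set_of_word(33) + 1, set_of_word(129) + 1);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The arithmetic gives 24576 bytes of capacity, 32-byte lines, and 768 lines in total, which is 192 lines for each of the 4 sets.&lt;/p&gt;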
&lt;p&gt;According to the CUDA documentation, GK110 adds the ability for read-only data in global memory to be loaded through the same cache used by the texture pipeline via a standard pointer, without the need to bind a texture beforehand and without the sizing limitations of standard textures. Data is loaded through the read-only data cache by calling __ldg(const T* address). We modified the code used to test the texture L1 cache; the basic logic remains the same. When the array size is set to 6144 integers, no cache misses occur with the stride set to 32 bytes (s = 32 bytes). As soon as we add one more element to the array, cache misses start occurring. This shows the read-only cache is 24 KB. We then noticed that the misses occur in groups of either 4 or 8. We infer the cache line to be 32 bytes and the replacement policy to be LRU, the same as Maxwell. When we increase the array to 6248 elements, no cache hits occur. The count breaks down as 6144 + 32*3 + 8: 6144 is the maximum capacity of the cache, 32 consecutive numbers fall in a set, so 32*3 elements cause cache misses in set1, set2, and set3, and only 8 more are needed to cause cache misses in set4 since s = 32 bytes. Therefore, we infer that the number of cache sets is 4, each cache line is 32 bytes, and each set contains 192 cache lines, the same as the texture L1 cache. The memory mapping seems arbitrary, because the hit and miss patterns didn&amp;rsquo;t follow those of the texture L1 cache.&lt;/p&gt;
&lt;!-- ## L1 TLB cache
Many studies show that NVIDIA GPUs use a fully-associative L1 TLB cache. However, my test result have many evidences indicating that Pascal uses a set-associative L1 TLB cache. --&gt;
&lt;!-- At the second stage, --&gt;</description></item><item><title>Map Reduce</title><link>https://www.bodunhu.com/blog/posts/map-reduce/</link><pubDate>Sun, 01 Apr 2018 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/map-reduce/</guid><description>&lt;p&gt;I have been intrigued by the name &amp;ldquo;map reduce&amp;rdquo; ever since I first heard the term two years ago. But I never put any effort into learning the concept until Chris mentioned it in class: it will be on the next exam, so I figured I&amp;rsquo;d better work out what is going on before it was too late. Just kidding :) But map reduce does borrow a lot of characteristics from traditional relational databases, even though many useful and important RDBMS features are missing from the map reduce system. You can check this long list of roasts on map reduce &lt;a href="http://www.cs.utexas.edu/~rossbach/cs378/papers/dewitt08blog-mapreduce-backwards.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But the intention of this post is not to roast map reduce, so if you absolutely resent how map reduce is such a disgrace to RDBMS, you are in the wrong place. Essentially, MapReduce is a programming model. Users define a &lt;em&gt;map&lt;/em&gt; function that processes a key/value pair and produces a set of intermediate key/value pairs, and a &lt;em&gt;reduce&lt;/em&gt; function that reads these intermediate pairs and merges the pairs that share the same intermediate key. It is important to realize that MapReduce is a programming model, because it lets programmers follow the model without having to worry about the technical details of coordinating work across the machines of a cluster. In fact, the programming model is very easy to understand. Everything you need is already summarized in the name &lt;em&gt;MapReduce&lt;/em&gt;.&lt;/p&gt;
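&lt;p&gt;The classic illustration is counting words. The sketch below is a single-process toy in C, not the distributed library: map_line, reduce_key, and the fixed-size buffers are names and limits invented for this example, and the grouping step that the MapReduce library would perform is reduced to a simple loop in main.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define MAX_PAIRS 64
#define MAX_WORD  16

/* One intermediate key/value pair emitted by map. */
struct pair { char key[MAX_WORD]; int value; };

/* map: split a line into words and emit (word, 1) for each one.
   Returns the number of pairs written to out. */
static int map_line(const char *line, struct pair *out, int max_out) {
    char buf[256];
    int n = 0;
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = &amp;#39;\0&amp;#39;;
    for (char *w = strtok(buf, &amp;#34; &amp;#34;); w != NULL &amp;amp;&amp;amp; n &amp;lt; max_out; w = strtok(NULL, &amp;#34; &amp;#34;)) {
        strncpy(out[n].key, w, MAX_WORD - 1);
        out[n].key[MAX_WORD - 1] = &amp;#39;\0&amp;#39;;
        out[n].value = 1;
        n++;
    }
    return n;
}

/* reduce: merge all values that share one intermediate key into a single sum. */
static int reduce_key(const char *key, const struct pair *pairs, int n) {
    int sum = 0;
    for (int i = 0; i &amp;lt; n; i++)
        if (strcmp(pairs[i].key, key) == 0) sum += pairs[i].value;
    return sum;
}

int main(void) {
    struct pair pairs[MAX_PAIRS];
    int n = map_line(&amp;#34;the quick fox jumps over the lazy dog the end&amp;#34;, pairs, MAX_PAIRS);
    /* Stand-in for the library: visit each distinct key once and pass it to reduce. */
    for (int i = 0; i &amp;lt; n; i++) {
        int seen = 0;
        for (int j = 0; j &amp;lt; i; j++)
            if (strcmp(pairs[j].key, pairs[i].key) == 0) seen = 1;
        if (!seen) printf(&amp;#34;%s %d\n&amp;#34;, pairs[i].key, reduce_key(pairs[i].key, pairs, n));
    }
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In a real deployment the map calls run on many machines and the library shuffles the intermediate pairs so that every value for a given key reaches the same reduce call; the user-written functions stay this small.&lt;/p&gt;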
&lt;p&gt;Basically, the computation takes a set of key/value pairs as input and outputs a set of key/value pairs. The user writes the map function, which takes an input pair and produces a set of intermediate key/value pairs (we will see why the output is called &lt;em&gt;intermediate&lt;/em&gt;). The MapReduce library takes all intermediate pairs, groups the ones with the same key, and passes them to the reduce function. The reduce function is also written by the user. It takes an intermediate key with the set of values corresponding to that key and merges those values, hopefully forming a smaller set of values. In practice this means the reduce function usually produces zero or just one output value. The intermediate values are supplied to the reduce function via an iterator. There might be occasions where memory doesn&amp;rsquo;t have enough space for all intermediate values, and thus some values need to be pushed to permanent storage.&lt;/p&gt;</description></item><item><title>Networks</title><link>https://www.bodunhu.com/blog/posts/networks/</link><pubDate>Mon, 13 Nov 2017 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/networks/</guid><description>&lt;p&gt;The concept of a worldwide network of information was introduced long before the technology used to build the internet. The first workable prototype came in the late 1960s with the creation of ARPANET (the Advanced Research Projects Agency Network). The famous TCP/IP, or Transmission Control Protocol and Internet Protocol, was developed by Robert Kahn and Vinton Cerf in the 1970s. In the 1980s, research by Tim Berners-Lee gave birth to the World Wide Web, linking hypertext documents into an information system and making them accessible from any node on the network (&lt;a href="https://en.wikipedia.org/wiki/History_of_the_Internet"&gt;History of Internet&lt;/a&gt;). The implementation of the internet has kept evolving ever since.
Today, for most users, the internet feels like smoke and mirrors, since requiring everyone to understand the technical implementation would be way too harsh. Software developers, however, are much more likely to deal with networks at some point. This article is meant to unveil the technical details of networks mainly from a programmer&amp;rsquo;s perspective, so the focus will be on the software side.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="the-os-view-of-networks"&gt;The OS View of Networks&lt;/h2&gt;
&lt;p&gt;For the operating system, the network is perceived as an extra device. The Network Interface Controller (NIC), a hardware device that connects the computer to a computer network, is attached to the bus. Data can be transferred between memory and the NIC through two methods: DMA or memory-mapped I/O. DMA stands for direct memory access. As the name suggests, the hardware is able to read or write memory without the involvement of the CPU. With memory-mapped I/O, on the other hand, the CPU controls the hardware by reading or writing specific memory addresses, which means the CPU itself does the job of moving the data. DMA is usually used for high-bandwidth operations such as disk I/O, while memory-mapped I/O is used for low-bandwidth operations such as changing control bits.&lt;/p&gt;
&lt;h2 id="layers-of-network"&gt;Layers of Network&lt;/h2&gt;
&lt;p&gt;The OSI (Open Systems Interconnection) 7 Layer Model divides network communication into seven layers. Here I will discuss each layer and its corresponding function, starting from the lowest one:&lt;/p&gt;
&lt;p&gt;Layer 1: This is the physical layer, which is concerned with the transmission of an unstructured raw bit stream over the physical medium. Thus the protocol data unit (PDU) for this layer is the bit.&lt;/p&gt;
&lt;p&gt;Layer 2: This is the data link layer. It is in charge of reliably transferring data frames between two nodes connected by a physical layer. The PDU of this layer is the frame.&lt;/p&gt;
&lt;p&gt;Layer 3: This is the network layer. This layer is in charge of structuring and managing a multi-node network. Examples include addressing, routing, and load control. The PDU for this layer is the packet.&lt;/p&gt;
&lt;p&gt;Layer 4: This is the transport layer. It is used to deliver messages error-free, in sequence, and without duplications or losses. The PDU of this layer is the segment or datagram (segment for TCP, datagram for UDP).&lt;/p&gt;
&lt;p&gt;Layer 5: This is the session layer. It allows session establishment between processes running on different stations. Layer 5 is often implemented in the OS or a library. From here on, the PDU is generalized to data.&lt;/p&gt;
&lt;p&gt;Layer 6: This is the presentation layer. As the name suggests, it formats the data and presents it to the application layer. It can be viewed as the translator for the network. It is usually implemented in the OS or a library.&lt;/p&gt;
&lt;p&gt;Layer 7: The final layer is called the application layer. It serves as the window for the users and application processes to access the network.&lt;/p&gt;
&lt;p&gt;Note the Department of Defense Four-Layer Model has only four categories: the Network Access Layer (layers 1-2), the Internet Layer (layer 3), the Host-to-Host Layer (layer 4), and the Process Layer (layers 5-7).&lt;/p&gt;
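&lt;p&gt;As a quick reference, the mapping between the two models can be written down as a lookup table. The sketch below is only an illustration: the struct and function names are made up for this post, and the PDU for layers 5 to 7 is simply recorded as data, as described above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

/* One row per OSI layer: its name, its PDU, and the DoD layer it falls under. */
struct osi_layer {
    int number;
    const char *name;
    const char *pdu;
    const char *dod_layer;
};

static const struct osi_layer LAYERS[] = {
    {1, &amp;#34;Physical&amp;#34;,     &amp;#34;bit&amp;#34;,              &amp;#34;Network Access&amp;#34;},
    {2, &amp;#34;Data Link&amp;#34;,    &amp;#34;frame&amp;#34;,            &amp;#34;Network Access&amp;#34;},
    {3, &amp;#34;Network&amp;#34;,      &amp;#34;packet&amp;#34;,           &amp;#34;Internet&amp;#34;},
    {4, &amp;#34;Transport&amp;#34;,    &amp;#34;segment/datagram&amp;#34;, &amp;#34;Host-to-Host&amp;#34;},
    {5, &amp;#34;Session&amp;#34;,      &amp;#34;data&amp;#34;,             &amp;#34;Process&amp;#34;},
    {6, &amp;#34;Presentation&amp;#34;, &amp;#34;data&amp;#34;,             &amp;#34;Process&amp;#34;},
    {7, &amp;#34;Application&amp;#34;,  &amp;#34;data&amp;#34;,             &amp;#34;Process&amp;#34;},
};

/* Look up a layer by its OSI number (1-7); returns NULL if out of range. */
static const struct osi_layer *layer_by_number(int number) {
    if (number &amp;lt; 1 || number &amp;gt; 7) return NULL;
    return &amp;amp;LAYERS[number - 1];
}

int main(void) {
    for (int n = 1; n &amp;lt;= 7; n++) {
        const struct osi_layer *l = layer_by_number(n);
        printf(&amp;#34;Layer %d %-12s PDU=%-16s DoD=%s\n&amp;#34;,
               l-&amp;gt;number, l-&amp;gt;name, l-&amp;gt;pdu, l-&amp;gt;dod_layer);
    }
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;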
&lt;h2 id="more-on-layer-2-network"&gt;More on Layer 2 Network&lt;/h2&gt;
&lt;p&gt;There are three types of layer 2 networks: System Area Network (SAN), Local Area Network (LAN), and Wide Area Network (WAN).&lt;/p&gt;</description></item><item><title>File System Design</title><link>https://www.bodunhu.com/blog/posts/file-system-design/</link><pubDate>Mon, 30 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/file-system-design/</guid><description>&lt;p&gt;What exactly is a file system? The general concept is that the file system provides naming organization, such as directories. It manages the physical disk layout: picking the blocks that constitute a file, balancing locality with expandability, and managing free space. It translates a file name and offset to the actual data block. In a nutshell, it is a servant that manages all the dirty details of moving data between the system and the hardware in an optimal way, details you aren&amp;rsquo;t required to understand, so you can go on and do other things with your life.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="file"&gt;File&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s describe the file system bottom-up, starting with the most fundamental unit: the file itself. A file is composed of two parts: the metadata and the actual data. The metadata is a &lt;em&gt;file header&lt;/em&gt; that holds information about where the data is stored and attributes of the file such as permissions, access time, owner id, size, and so on. One thing to note is that metadata blocks are stored at a location known to the OS, so they can be accessed without having to check another data structure. The actual data is the part users care about. There are two kinds of data blocks (there can be more, but we will only discuss two here): the directory data block, which maps file names to file headers, and the file data block, which contains the information we care about.&lt;/p&gt;
&lt;h2 id="design-file-layout"&gt;Design File Layout&lt;/h2&gt;
&lt;p&gt;There are several factors we need to take into consideration when designing file layout:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for sequential and random access. Almost every file operation is either sequential or random.&lt;/li&gt;
&lt;li&gt;Lay out the files on the physical disk.&lt;/li&gt;
&lt;li&gt;Maintain file location information. This makes sense since we need an agent to keep track of all files, because we users are too lazy to do that.&lt;/li&gt;
&lt;li&gt;In Unix most files are small in size so we need to support small files, which means block size can&amp;rsquo;t be too large due to internal fragmentation.&lt;/li&gt;
&lt;li&gt;Most disk space is consumed by large files so we also need to support large files and accessing them should be efficient as well.&lt;/li&gt;
&lt;li&gt;I/O operations target both types of files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="block-vs-sector"&gt;Block VS Sector&lt;/h2&gt;
&lt;p&gt;Before we dig deeper into file system design, it&amp;rsquo;s important to note that the block size of the file system is different from the disk&amp;rsquo;s block size. According to &lt;a href="http://www.nobius.org/~dbg/practical-file-system-design.pdf"&gt;Practical File System Design&lt;/a&gt;,
a block is the smallest unit writable by a disk or file system. Everything a file system does is composed of operations done on blocks. A file system block is always the same size as or larger (in integer multiples) than the disk block size. Each block also consists of consecutive sectors, so that sequential access becomes possible. A larger block size increases transfer efficiency, again because of sequential access: you don&amp;rsquo;t have to move the head as many times. It may also be convenient for the block size to match the machine&amp;rsquo;s page size, so that a block maps cleanly onto a page. Many systems allow the transfer of many sectors between interrupts.&lt;/p&gt;
&lt;h2 id="allocation-methods"&gt;Allocation methods&lt;/h2&gt;
&lt;h3 id="contiguous-allocation"&gt;Contiguous Allocation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;OS maintains an ordered list of free disk blocks.&lt;/li&gt;
&lt;li&gt;OS allocates a contiguous chunk of free blocks when it creates a file.&lt;/li&gt;
&lt;li&gt;Placement/allocation policy can be first-fit, best-fit, or worst-fit.&lt;/li&gt;
&lt;li&gt;File header specifies starting block and length.&lt;/li&gt;
&lt;li&gt;Pros:
&lt;ul&gt;
&lt;li&gt;All file data stored contiguously on disk.&lt;/li&gt;
&lt;li&gt;Simple to implement; a bump pointer is a common implementation.&lt;/li&gt;
&lt;li&gt;Best performance for the initial write of a file, due to the locality resulting from contiguous allocation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cons:
&lt;ul&gt;
&lt;li&gt;External fragmentation: free space gets split into holes, so some allocations for large files are simply impossible even though enough unallocated space exists in total, and it is hard to grow a file in size.&lt;/li&gt;
&lt;li&gt;Later writes may cause the file to grow which would require it to be copied and moved.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
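&lt;p&gt;As a rough sketch of contiguous allocation with a first-fit policy, the toy allocator below tracks a 16-block disk with a simple in-use array. The disk size and the function names are assumptions for illustration, and a real file system would keep an ordered free list rather than scanning an array. The example also shows the external fragmentation problem listed under the cons.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

#define DISK_BLOCKS 16   /* assumed size of the toy disk */

static int used[DISK_BLOCKS];   /* 1 = block in use, 0 = free */

static void disk_reset(void) {
    for (int i = 0; i &amp;lt; DISK_BLOCKS; i++) used[i] = 0;
}

/* First-fit: claim the first run of `length` (at least 1) free blocks.
   Returns the starting block, or -1 if no hole is large enough. */
static int alloc_contiguous(int length) {
    int run = 0;
    for (int b = 0; b &amp;lt; DISK_BLOCKS; b++) {
        run = used[b] ? 0 : run + 1;
        if (run == length) {
            int start = b - length + 1;
            for (int i = start; i &amp;lt;= b; i++) used[i] = 1;
            return start;   /* the file header would store (start, length) */
        }
    }
    return -1;
}

static void free_contiguous(int start, int length) {
    for (int i = start; i &amp;lt; start + length; i++) used[i] = 0;
}

int main(void) {
    disk_reset();
    int a = alloc_contiguous(6);   /* blocks 0-5   */
    int b = alloc_contiguous(4);   /* blocks 6-9   */
    int c = alloc_contiguous(6);   /* blocks 10-15 */
    free_contiguous(a, 6);
    free_contiguous(c, 6);
    /* 12 blocks are free, but in two holes of 6, so an 8-block file cannot be placed. */
    int d = alloc_contiguous(8);
    printf(&amp;#34;a=%d b=%d c=%d d=%d\n&amp;#34;, a, b, c, d);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;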
&lt;h3 id="linked-allocation"&gt;Linked Allocation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Files are stored as a linked list of blocks: in each sector, there&amp;rsquo;s a pointer to the next sector. (This is the hardware implementation; we still use blocks for the later discussion.)&lt;/li&gt;
&lt;li&gt;The file header keeps a pointer to the first and last sector/block allocated to that file.&lt;/li&gt;
&lt;li&gt;There are two types of implementations for Linked allocation:
&lt;ul&gt;
&lt;li&gt;Linked list of disk blocks, data blocks point to other blocks&lt;/li&gt;
&lt;li&gt;Linked list in a table (FAT file system)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Pros:
&lt;ul&gt;
&lt;li&gt;Reduce or eliminate external fragmentation since blocks can fit in if there are free blocks available.&lt;/li&gt;
&lt;li&gt;Easy to grow file just like adding elements into a linked list.&lt;/li&gt;
&lt;li&gt;Sequential access is somewhat efficient (it&amp;rsquo;s a linked list, what do you expect? O(1)?).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cons:
&lt;ul&gt;
&lt;li&gt;Random access takes linear time, since reaching the n-th block means following n pointers through the list.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
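&lt;p&gt;The table-based variant can be sketched in a few lines. This is a simplified illustration rather than the real on-disk FAT format: the 8-entry table, the END_OF_CHAIN marker, and the nth_block helper are all made up for the example, and unused entries are marked the same way as the end of a chain to keep it short.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

#define NUM_BLOCKS   8
#define END_OF_CHAIN (-1)

/* table[b] holds the number of the block that follows block b in its file. */
static const int table[NUM_BLOCKS] = {
    /* 0 */ END_OF_CHAIN,
    /* 1 */ END_OF_CHAIN,
    /* 2 */ 5,              /* the file starts at block 2 ... */
    /* 3 */ END_OF_CHAIN,
    /* 4 */ END_OF_CHAIN,   /* ... and ends at block 4 */
    /* 5 */ 7,
    /* 6 */ END_OF_CHAIN,
    /* 7 */ 4,
};

/* Return the disk block holding the n-th block (0-based) of the file, or
   END_OF_CHAIN if the file is shorter. Reaching block n costs n lookups,
   which is why random access is linear. */
static int nth_block(int first_block, int n) {
    int block = first_block;
    while (n &amp;gt; 0 &amp;amp;&amp;amp; block != END_OF_CHAIN) {
        block = table[block];
        n--;
    }
    return block;
}

int main(void) {
    for (int n = 0; n &amp;lt; 4; n++)
        printf(&amp;#34;file block %d is disk block %d\n&amp;#34;, n, nth_block(2, n));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The file header only needs to remember the first block (2 here); the chain 2, 5, 7, 4 lives entirely in the table, so following it does not require reading the data blocks themselves.&lt;/p&gt;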
&lt;h2 id="fat-file-system-file-allocation-table"&gt;FAT File System (File Allocation Table)&lt;/h2&gt;
&lt;p&gt;The FAT32 file system is a very important file system created by Microsoft. It was introduced to solve the volume size problem posed by FAT16. Although it is named FAT32, only 28 of the 32 bits are actually used; the remaining 4 bits are &amp;ldquo;reserved for future use&amp;rdquo;. As a result, a FAT32 partition has a maximum cluster count of 2^28-1 (268,435,455). I found this description of FAT32 on &lt;a href="https://superuser.com/questions/983139/why-is-fat32-limited-to-just-under-228-clusters"&gt;StackExchange&lt;/a&gt; useful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Although VFAT was a clever system, it did not address the limitations of With the appearance of the FAT32 file system, the maximum number of clusters per partition went increased from 65535 to 268,435,455 (2^28-1). FAT32 thus allows much bigger partitions (up to 8 terabytes). Although the maximum theoretical size of a FAT32 partition is 8 TB, Microsoft has voluntarily limited it to 32 GB on Windows 9x systems to promote NTFS&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;FAT32 is implemented in a completely different way. Unlike FFS in UNIX, each entry in the file allocation table merely represents a block of data. Each block is able to point to another block, so a file that spans multiple blocks is represented by multiple entries in the table. Each file&amp;rsquo;s file number is given by the index of its first entry in the table. Thus, in order to locate a specific block of a file, we need to search the table sequentially.&lt;/p&gt;</description></item><item><title>Disk Introduction</title><link>https://www.bodunhu.com/blog/posts/disk-introduction/</link><pubDate>Wed, 25 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/disk-introduction/</guid><description>&lt;p&gt;This chapter is all about disks. Before we start: we won&amp;rsquo;t go deep into the mechanical part of disk operation; rather, we will focus on general concepts related to disks and on algorithms to improve disk performance.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/images/Chapter10/10_01_DiskMechanism.jpg" alt="disk-structure"&gt;&lt;/p&gt;
&lt;h2 id="the-evaluation-criteria"&gt;The Evaluation Criteria&lt;/h2&gt;
&lt;p&gt;Here we are introducing the basic components used to evaluate the performance of disk operation.&lt;/p&gt;
&lt;h3 id="seek-time"&gt;Seek Time&lt;/h3&gt;
&lt;p&gt;This is the time to position the head over the track. The maximum is a move from the innermost track to the outermost track, which usually ranges from 10 ms to over 20 ms. The average seek, however, usually covers about 1/3 of the way across the disk.&lt;/p&gt;
&lt;h3 id="head-switch-time"&gt;Head Switch Time&lt;/h3&gt;
&lt;p&gt;This is the time spent moving from one track on one surface to another track on a different surface. The range is similar to that of seek time.&lt;/p&gt;
&lt;h4 id="rotation-delay"&gt;Rotation Delay&lt;/h4&gt;
&lt;p&gt;This is the time spent waiting for the sector to spin underneath the head. It varies depending on how fast the disk rotates.&lt;/p&gt;
&lt;h4 id="transfer-time"&gt;Transfer Time&lt;/h4&gt;
&lt;p&gt;The time spent reading or writing a sector as it spins by.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transfer time: time to move the bytes from disk to memory&lt;/li&gt;
&lt;li&gt;Surface transfer time: time to transfer one or more
sequential sectors to/from surface after head reads/writes
first sector&lt;/li&gt;
&lt;li&gt;Host transfer time: time to transfer data between host
memory and disk buffer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="disk-head-scheduling-mainly-focusing-on-hdd"&gt;Disk Head Scheduling (Mainly focusing on HDD)&lt;/h2&gt;
&lt;p&gt;Now that we&amp;rsquo;ve looked at the basic performance evaluation criteria for HDDs, it&amp;rsquo;s reasonable to discuss how to reduce head movement, so that less time is spent moving the head from one track to another. This is possible because disk I/O requests can be stored in a queue. (Note that seek time takes the largest share of the time, so it makes sense to reduce it.)&lt;/p&gt;
&lt;h3 id="fifo"&gt;FIFO&lt;/h3&gt;
&lt;p&gt;This technique is easy to understand: the head moves to each requested track following the order of the request queue. Since the requested data can be read or written on random tracks, performance can suffer heavily.&lt;/p&gt;
&lt;h3 id="sstf-shortest-seek-time-first"&gt;SSTF (Shortest Seek Time first)&lt;/h3&gt;
&lt;p&gt;The queue of requests is reordered so that the head always moves to the closest requested track next, ignoring the global layout of all pending requests. This is a greedy algorithm and thus can get trapped in a local optimum.&lt;/p&gt;
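&lt;p&gt;To compare the two policies so far, the sketch below adds up the total head movement for one request queue under FIFO and under SSTF. The track numbers and the starting head position are made up for illustration, the function names are hypothetical, and the SSTF helper is limited to 64 requests to keep it short.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;

/* Total head movement when requests are served in arrival order. */
static int fifo_distance(int head, const int *req, int n) {
    int total = 0;
    for (int i = 0; i &amp;lt; n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* Total head movement when the closest pending request is always served next. */
static int sstf_distance(int head, const int *req, int n) {
    int done[64] = {0};   /* sketch limit: at most 64 requests */
    int total = 0;
    for (int served = 0; served &amp;lt; n; served++) {
        int best = -1;
        for (int i = 0; i &amp;lt; n; i++)
            if (!done[i] &amp;amp;&amp;amp; (best &amp;lt; 0 || abs(req[i] - head) &amp;lt; abs(req[best] - head)))
                best = i;
        total += abs(req[best] - head);
        head = req[best];
        done[best] = 1;
    }
    return total;
}

int main(void) {
    const int queue[] = {98, 183, 37, 122, 14, 124, 65, 67};   /* made-up track numbers */
    int n = (int)(sizeof queue / sizeof queue[0]);
    printf(&amp;#34;FIFO: %d tracks\n&amp;#34;, fifo_distance(53, queue, n));
    printf(&amp;#34;SSTF: %d tracks\n&amp;#34;, sstf_distance(53, queue, n));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For this queue, with the head starting at track 53, FIFO moves the head across 640 tracks while SSTF needs only 236.&lt;/p&gt;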
&lt;h3 id="scanelevatorlook"&gt;SCAN/Elevator/LOOK&lt;/h3&gt;
&lt;p&gt;Simply move the head in one direction until the request closest to that end of the disk is reached, then reverse the direction of the head and serve the rest of the requests.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Optimization: the head reverses as soon as no more requests exist between the current head position and the approaching edge of the disk (LOOK scheduling)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="c-scanc-look-circular-scan-scheduling"&gt;C-SCAN/C-LOOK (&amp;ldquo;Circular Scan&amp;rdquo; scheduling)&lt;/h3&gt;
&lt;p&gt;Move the head in one direction until an edge of the disk is reached and then reset to the opposite edge. Optimization: the head is reset when no more requests exist between the current head position and the approaching edge of the disk (called C-LOOK scheduling).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note the only difference between SCAN and C-SCAN is that in C-SCAN, after the head reaches one edge, an optimized jump implemented by the hardware moves the head directly to the opposite edge instead of reversing the direction of movement.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="partitioning"&gt;Partitioning&lt;/h2&gt;
&lt;p&gt;Disks are partitioned in order to minimize the largest possible seek time, since each partition is a logically separate disk. (It&amp;rsquo;s merely a collection of cylinders.) Partitioning will be covered in more detail with file systems.&lt;/p&gt;
&lt;h2 id="other-techniques-to-reduce-overhead"&gt;Other Techniques to Reduce Overhead&lt;/h2&gt;
&lt;p&gt;To minimize rotational latency and seek time, we can also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make disks smaller (less movement distance)&lt;/li&gt;
&lt;li&gt;Spin disks faster&lt;/li&gt;
&lt;li&gt;Schedule disk operations to minimize head movement(we&amp;rsquo;ve just discussed)&lt;/li&gt;
&lt;li&gt;Lay out data on disk so that related data is on nearby tracks(locality and also less movement)&lt;/li&gt;
&lt;li&gt;Place commonly used files on disk&lt;/li&gt;
&lt;li&gt;Block size: (note disk is presented with sector address using logical block address converted by the controller)
&lt;ul&gt;
&lt;li&gt;Too small: low transfer rate because we need to perform more seeks for same amount of data.&lt;/li&gt;
&lt;li&gt;Too big: internal fragmentation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="ssd"&gt;SSD&lt;/h2&gt;
&lt;p&gt;The basic advantage of an SSD is that it has no moving parts, and thus random access is blazingly fast. It&amp;rsquo;s implemented using NAND flash and is non-volatile.&lt;/p&gt;
&lt;h3 id="basic-nand-flash-units"&gt;Basic NAND Flash Units&lt;/h3&gt;
&lt;p&gt;&lt;img src="https://miro.medium.com/max/480/1*0T4__A7XygT1xQUIrEKQXw.png#center" alt="structure-of-flash-die"&gt;&lt;/p&gt;
&lt;p&gt;The fundamental unit is a page, which is 4KB. 128 pages are organized together to form a block of 512KB. Blocks in turn form a plane: there are 1024 blocks on one plane, so the size of a plane is 512MB.&lt;/p&gt;
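&lt;p&gt;The sizes follow directly from multiplying the counts, and the same constants tell us which block a given page lives in. The snippet below is just a sketch of that arithmetic; the constant and function names are made up for this post, and real devices vary in their page and block sizes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;#include &amp;lt;stdio.h&amp;gt;

enum {
    PAGE_BYTES       = 4 * 1024,  /* 4KB page */
    PAGES_PER_BLOCK  = 128,
    BLOCKS_PER_PLANE = 1024
};

static long block_bytes(void) { return (long)PAGE_BYTES * PAGES_PER_BLOCK; }
static long plane_bytes(void) { return block_bytes() * BLOCKS_PER_PLANE; }

/* Split a page number within a plane into (block, page within that block). */
static int block_of_page(int page)   { return page / PAGES_PER_BLOCK; }
static int offset_in_block(int page) { return page % PAGES_PER_BLOCK; }

int main(void) {
    printf(&amp;#34;block: %ld KB, plane: %ld MB\n&amp;#34;,
           block_bytes() / 1024, plane_bytes() / (1024 * 1024));
    printf(&amp;#34;page 130 is page %d of block %d\n&amp;#34;,
           offset_in_block(130), block_of_page(130));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;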
&lt;h3 id="operations"&gt;Operations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Read page: fast, on the order of tens of microseconds, compared to milliseconds for a spinning disk.&lt;/li&gt;
&lt;li&gt;Write page: can only write to an empty (erased) page; also on the order of microseconds.&lt;/li&gt;
&lt;li&gt;Erase block: slow, on the order of milliseconds. Before a page can be written, all bits in the page need to be set to 1, and the only way to set the bits in a page to 1 is to erase the whole block.&lt;/li&gt;
&lt;li&gt;Reads and writes occur in units of pages.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Virtual Memory Mechanisms</title><link>https://www.bodunhu.com/blog/posts/virtual-memory-mechanisms/</link><pubDate>Thu, 19 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/virtual-memory-mechanisms/</guid><description>&lt;p&gt;As we can see in the &lt;a href="https://www.bodunhu.com/blog/posts/virtual-memory-overview/"&gt;previous post&lt;/a&gt;,
all allocation algorithms we discussed lead to external fragmentation. As time goes by, external fragmentation gets worse and we need solutions for the problem. We can use swap areas to swap memory out onto the disk, or move allocated memory together (a process called memory compaction), leaving the empty spaces together. Even though these approaches can reduce external fragmentation and allow a higher degree of multiprogramming, they are not perfect. In addition, it is possible to have a single process that is just too big to fit into memory. We are going to discuss methods used to completely eliminate external fragmentation. Beyond that, we will discuss how to make memory sharing possible and how to allow more processes to execute at once.&lt;/p&gt;
&lt;!--description--&gt;
&lt;h2 id="too-big-to-fit"&gt;Too big to fit&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s easy for programmers to assume that available memory is almost infinite, so we rarely worry about our code occupying all of it. But consider the scenario where we write a program that later creates a process that is just too big to fit into memory. What should we do?&lt;/p&gt;
&lt;p&gt;The natural response would be: just cut it into pieces! This is a technique called overlaying: programmers manually cut the program into pieces, or &lt;em&gt;overlays&lt;/em&gt;. When the program executes, an overlay manager swaps pieces in and out, keeping only the necessary pieces in memory at a given time. But tell me, when was the last time you saw a user-level application manually cut into &amp;ldquo;pieces&amp;rdquo; by the programmer? Doing things manually is not a desirable trait in a programmer. Programmers should always be lazy and automate things, or just leave it to someone else!&lt;/p&gt;
&lt;h2 id="paging"&gt;Paging&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m pretty sure you don&amp;rsquo;t like the idea of overlaying, as it requires you to do things manually. That&amp;rsquo;s where paging comes into play. Instead of having the programmer divide the program, why don&amp;rsquo;t we let the system do the dirty work? Before we start, I&amp;rsquo;m going to throw two questions at you: why can a virtual address space be bigger than physical memory? How is each piece of a process brought into memory?&lt;/p&gt;
&lt;p&gt;The technique of dividing an address space into fixed-size &lt;em&gt;pages&lt;/em&gt; is called paging. A copy of the address space is stored on disk. Physical memory is viewed as a series of equal-sized &lt;em&gt;page frames&lt;/em&gt;. We will discuss later how the system chooses which page to load into a frame and how it manages pages that are currently in memory.&lt;/p&gt;
&lt;p&gt;So how do we use virtual addresses with our newly introduced pages to find a location in memory? As we have seen, a virtual address space is divided into pages, each with a fixed number of entries. To identify a page and an entry within it, we need two variables:&lt;/p&gt;
&lt;p&gt;p - page number (there are p&lt;sub&gt;MAX&lt;/sub&gt; pages)&lt;br&gt;
o - page offset (the difference between the byte address being accessed and the start of its page; o&lt;sub&gt;MAX&lt;/sub&gt; is the total number of entries in a page)&lt;br&gt;
Virtual address calculation: o&lt;sub&gt;MAX&lt;/sub&gt; x p + o (here o is the offset within page p)&lt;/p&gt;
&lt;p&gt;The frame size is equal to the page size. This is easy to understand: since everything stored in a page must fit exactly into a frame, the two need to be the same size. Note that since the virtual address space can be bigger than physical memory, the number of frames can be smaller than the number of pages, which means the number of bits reserved for the frame number can be smaller than the number of bits used for the page number.&lt;/p&gt;
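&lt;p&gt;As a small illustration of the formula, the Python sketch below splits a virtual address into p and o and puts it back together. The 4096-byte page size is an assumption chosen for the example, and &lt;code&gt;split&lt;/code&gt; and &lt;code&gt;join&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # assumed o_MAX: number of bytes in one page


def split(virtual_address):
    """Return (p, o): the page number and the offset within that page."""
    return divmod(virtual_address, PAGE_SIZE)


def join(p, o):
    """Rebuild the virtual address as o_MAX * p + o."""
    return PAGE_SIZE * p + o


p, o = split(0x12345)
print(p, hex(o))           # 18 0x345
print(hex(join(p, o)))     # 0x12345
```

&lt;p&gt;Because the page size is a power of two, the division is really just cutting the address bits into a high part (the page number) and a low part (the offset).&lt;/p&gt;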
&lt;p&gt;&lt;img src="https://www.bottomupcs.com/chapter05/figures/virtaddress.png#center" alt="virtual-address"&gt;&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="https://www.bottomupcs.com/virtual_addresses.xhtml"&gt;source&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="from-virtual-to-physical-allocation-policy"&gt;From Virtual to Physical: Allocation Policy&lt;/h2&gt;
&lt;p&gt;We&amp;rsquo;ve discussed how a process&amp;rsquo;s virtual address space can be divided into pages and mapped to frames in physical memory. Here we are going to discuss some policies used to implement the mapping. I&amp;rsquo;m going to leave three questions to think about here as well: why are pages arbitrarily located in physical memory? How do we find them if they are arbitrarily located? Why aren&amp;rsquo;t all pages mapped to frames? These questions will become clearer as we progress.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the solution: a page table. Each process has one page table that contains the mapping information for every possible virtual address belonging to that process. Even though we call it a table, it&amp;rsquo;s merely a data structure used by the virtual memory system to store the mapping between virtual addresses and physical addresses. The mapping is invisible to the process. The protection mechanism is the same as in dynamic relocation, which we&amp;rsquo;ve discussed before.&lt;/p&gt;
&lt;h2 id="virtual-address-translation"&gt;Virtual Address Translation&lt;/h2&gt;
&lt;p&gt;Now let&amp;rsquo;s go through a step-by-step description of how to translate a virtual address into a physical address.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First, the process will give the CPU a virtual address to translate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then the MMU will split the address into two parts, the page number and the offset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Since a page and a frame are the same size, the offset of the virtual address is passed along to physical memory without modification.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use page number to find the corresponding entry in the page table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check if the page exists in physical memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the page does exist in physical memory, the frame number is sent along. If the requested page is on disk, the page is first moved into memory and its frame number is recorded in the page table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Offset is appended to the end of the frame number to get the physical address.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p align="center"&gt;
&lt;img src="https://upload.wikimedia.org/wikipedia/commons/8/8d/Memory_paging.jpg"&gt;
&lt;/p&gt;
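&lt;p&gt;The steps above can be sketched in Python as follows. This is a simplified model, not how an MMU is actually built: the page table is a plain dictionary from page number to frame number, a value of &lt;code&gt;None&lt;/code&gt; stands for a page that currently lives on disk, and the 4096-byte page size, the &lt;code&gt;PageFault&lt;/code&gt; exception and the list of free frames are all assumptions made for the example.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # assumed page (and frame) size in bytes


class PageFault(Exception):
    """Raised when the requested page is not resident in physical memory."""


def translate(page_table, virtual_address):
    # Split the address into page number and offset.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # Find the entry; None or a missing key models a non-resident page.
    frame_number = page_table.get(page_number)
    if frame_number is None:
        raise PageFault(page_number)
    # The offset is passed through unchanged and appended to the frame number.
    return frame_number * PAGE_SIZE + offset


def access(page_table, free_frames, virtual_address):
    """Translate, loading the page into a free frame on a page fault."""
    try:
        return translate(page_table, virtual_address)
    except PageFault as fault:
        page_table[fault.args[0]] = free_frames.pop()  # stand-in for disk I/O
        return translate(page_table, virtual_address)


table = {0: 5, 1: None}
print(hex(access(table, [9], 0x0042)))  # 0x5042
print(hex(access(table, [9], 0x1010)))  # 0x9010
```

&lt;p&gt;The second access faults first, gets frame 9 assigned, and then succeeds on the retry, which mirrors the last two steps of the list.&lt;/p&gt;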
&lt;p&gt;So, we&amp;rsquo;ve achieved several goals by using paging:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reduce or even eliminate external fragmentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Easy to grow processes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Allow a process that is too big to fit into memory to execute.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Easy to allocate and deallocate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sharing memory between processes is easy, since memory used by different processes no longer has to be contiguous. Even though pages may sit at different positions in each virtual address space, they can be mapped to the same physical address in memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="more-about-page-table"&gt;More about Page Table&lt;/h3&gt;
&lt;p&gt;One thing to notice is that there&amp;rsquo;s only one page table for each process. The page table is part of the process&amp;rsquo;s state and serves as a protection mechanism that prevents processes from accessing each other&amp;rsquo;s memory. Each page table entry contains several elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Flags: such as the dirty bit and the resident bit (we will talk about them later). The flags are stored at the beginning of each entry.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Frame number: stored in the remaining bits. It tells where the page lives in physical memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
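&lt;p&gt;To picture the layout, here is a hedged Python sketch that packs a resident flag, a dirty flag and a frame number into a single integer and unpacks them again. The 20-bit frame number field and the order of the flags are assumptions for illustration; real architectures define their own layouts.&lt;/p&gt;

```python
FRAME_BITS = 20                 # assumed width of the frame number field
FRAME_RANGE = 2 ** FRAME_BITS


def pack_pte(resident, dirty, frame_number):
    """Place two flag bits above the frame number in one integer."""
    flags = 2 * int(resident) + int(dirty)
    return flags * FRAME_RANGE + frame_number


def unpack_pte(entry):
    """Recover (resident, dirty, frame_number) from a packed entry."""
    flags, frame_number = divmod(entry, FRAME_RANGE)
    resident, dirty = divmod(flags, 2)
    return bool(resident), bool(dirty), frame_number


entry = pack_pte(resident=True, dirty=False, frame_number=0x1A2B)
print(unpack_pte(entry))  # (True, False, 6699)
```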
&lt;p&gt;However, the page table still has its disadvantages. The most important one is that each virtual memory reference now needs two memory accesses: the first fetches the page table entry, and the second fetches the actual data from memory, if it&amp;rsquo;s present. As we know, memory access is slow and expensive, so we need something faster.&lt;/p&gt;
&lt;h2 id="translation-look-aside-buffer-tlb"&gt;Translation Look-aside Buffer (TLB)&lt;/h2&gt;
&lt;p&gt;Since it&amp;rsquo;s hard to improve the speed on the algorithm side, let&amp;rsquo;s set the algorithm aside for a minute and switch our focus to the hardware. Here we will discuss how to speed up memory references by adding a hardware component called the TLB. Here are several basic characteristics of the TLB:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The TLB holds recently used frame/page pairings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It has high hit ratios due to locality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For a TLB hit, translation can be finished in one cycle.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So how does the TLB help with efficiency? It&amp;rsquo;s actually really simple. The system simultaneously sends the page number to both the page table and the TLB. If there&amp;rsquo;s a TLB hit, the TLB sends the frame number to memory without having to consult the page table, which avoids the first memory reference. If there&amp;rsquo;s a TLB miss, everything stays the same: look up the entry in the page table in memory and update the TLB.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Page_table_actions.svg/500px-Page_table_actions.svg.png#center" alt="page-table-TLB"&gt;&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="https://commons.wikimedia.org/wiki/File:Page_table_actions.svg"&gt;source&lt;/a&gt;&lt;/p&gt;
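&lt;p&gt;The hit and miss paths can be modeled with a small Python sketch. The text does not say how a full TLB picks an entry to replace, so this example assumes least-recently-used replacement; it also checks the TLB and the page table one after the other rather than simultaneously, and the &lt;code&gt;TLB&lt;/code&gt; class is purely illustrative.&lt;/p&gt;

```python
from collections import OrderedDict


class TLB:
    """A tiny fully associative TLB with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number -> frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)
            return self.entries[page_number]
        # TLB miss: fall back to the page table in memory, then cache the pair.
        self.misses += 1
        frame_number = page_table[page_number]
        self.entries[page_number] = frame_number
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return frame_number


page_table = {0: 7, 1: 3, 2: 8}
tlb = TLB(capacity=2)
for page in [0, 0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)  # 2 4
```

&lt;p&gt;Repeated references to page 0 hit in the TLB, which is the locality effect mentioned above; the last reference to page 1 misses because page 1 was evicted to make room for page 2.&lt;/p&gt;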
&lt;h2 id="problems-with-page-table"&gt;Problems with Page Table&lt;/h2&gt;
&lt;p&gt;Now we&amp;rsquo;ve solved the problem of external fragmentation. It seems paging works like a charm and makes things a lot easier. However, it&amp;rsquo;s still not perfect in terms of space usage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Data structure overhead (The page table can be huge!)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An inappropriate page size can lead to internal fragmentation and to fewer processes fitting in memory at the same time (when pages are too big)!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thus we need more complex methods to solve the above issues.&lt;/p&gt;
&lt;h2 id="multi-level-page-tables"&gt;Multi-level Page Tables&lt;/h2&gt;
&lt;p&gt;The basic concept of a multi-level page table is to subdivide the page number into n parts (n is the number of page table levels). n is decided by the architecture of the system. Each entry in a page table points to the corresponding page table in the next level. The entries in the last-level page tables hold the flag bits and frame number for the page.&lt;/p&gt;
&lt;p&gt;So how does it work exactly? First, we have only one first-level page table. We extract the first subdivision of the virtual address and add it to the PTBR (page table base register) to get the corresponding entry in the first-level page table. Then we extract the second subdivision of the virtual address and add it to the start address of the second-level page table, which we got from that first-level entry. This process continues until we reach the last-level page table, whose entry gives us the frame number. The offset is preserved, so we just need to append the offset to the frame number and we&amp;rsquo;re done! One reminder is that a multi-level page table requires several lookups to eventually find the frame number, so the TLB becomes extremely important here for performance.&lt;/p&gt;
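&lt;p&gt;Here is a minimal Python sketch of a two-level walk. It assumes a 32-bit address split into 10, 10 and 12 bits, and it models each page table as a dictionary, with the first-level dictionary playing the role of the table the PTBR points to. Both choices are assumptions for the example rather than a description of any particular architecture.&lt;/p&gt;

```python
OFFSET_BITS = 12   # assumed 4KB pages
INDEX_BITS = 10    # assumed 10 bits per level


def walk(first_level, virtual_address):
    """Translate an address with a two-level page table of nested dicts."""
    page_number, offset = divmod(virtual_address, 2 ** OFFSET_BITS)
    first_index, second_index = divmod(page_number, 2 ** INDEX_BITS)
    second_level = first_level[first_index]    # first lookup
    frame_number = second_level[second_index]  # second lookup
    return frame_number * 2 ** OFFSET_BITS + offset


# Only one second-level table exists; the other first-level slots are absent.
first_level = {1: {2: 0x5}}
print(hex(walk(first_level, 0x00402ABC)))  # 0x5abc
```

&lt;p&gt;Each level costs one lookup, which is why a walk without a TLB hit is more expensive than with a single flat table.&lt;/p&gt;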
&lt;h3 id="how-does-multi-level-page-table-save-space"&gt;How does multi-level page table save space?&lt;/h3&gt;
&lt;p&gt;You&amp;rsquo;re probably still confused about why a multi-level page table saves space by adding &lt;strong&gt;more tables&lt;/strong&gt;. Don&amp;rsquo;t worry, I will walk you through an example to illustrate the magic behind the scenes :)&lt;/p&gt;
&lt;p&gt;Assume a process has \(2^{20}\) pages, with each PTE occupying 4 bytes (32-bit system). Without a multi-level page table, we need \(2^{20} \times 4\text{B} = 4\text{MB}\) for one page table stored in memory. Even if we only need a portion of the pages, the whole page table must be present in memory to find the corresponding frames. Now, if we divide the virtual address into 3 sections, with the last one being the offset, we have a two-level page table. The first 10 bits index the first-level page table and the next 10 bits index a second-level page table. If we only use virtual addresses that differ in the second 10 bits and share the same first 10 bits, then we only need one entry in the first-level page table. Since the first-level page table always has to be present in memory, it consumes \(2^{10} \times 4\text{B} = 4\text{KB}\) of memory. We also need every entry of the one second-level page table pointed to by that first-level entry, which requires another \(2^{10} \times 4\text{B} = 4\text{KB}\). So we only need 4 + 4 = 8KB in total, instead of 4MB without multi-level page tables.&lt;/p&gt;
&lt;p&gt;Another interesting fact is that if we need to use all pages of a process, a multi-level page table actually increases the space needed. Let&amp;rsquo;s take the above example and assume we need every single page of the process. We need to store the first-level page table, which takes \(2^{10} \times 4\text{B} = 4\text{KB}\). Then, for each entry in the first-level page table, there&amp;rsquo;s a corresponding second-level page table of size \(2^{10} \times 4\text{B} = 4\text{KB}\). Since the first-level table has \(2^{10}\) entries, there are \(2^{10}\) second-level page tables, so the total amount of space is \(2^{10} \times 4\text{KB} + 4\text{KB} = 4\text{MB} + 4\text{KB}\). The bottom line is: if we need to map every page to its frame, then the total number of entries in the last level equals the number of pages regardless of how many levels we use, since each page has to have a mapping. In that case, the memory used by the last-level page tables is the same as the memory used by one huge page table. The additional space comes from the upper levels, each of which needs only as many entries as there are tables in the next level.&lt;/p&gt;
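&lt;p&gt;The two cases can be checked with a short Python calculation. It uses the numbers from the example (4-byte entries, \(2^{10}\) entries per table, \(2^{20}\) pages in total); &lt;code&gt;two_level_bytes&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
PTE_SIZE = 4                 # bytes per page table entry
ENTRIES_PER_TABLE = 2 ** 10  # entries in each first- or second-level table
TOTAL_PAGES = 2 ** 20        # pages in the whole virtual address space


def two_level_bytes(second_level_tables_in_use):
    """One first-level table plus the second-level tables actually needed."""
    table_bytes = ENTRIES_PER_TABLE * PTE_SIZE
    return table_bytes + second_level_tables_in_use * table_bytes


flat = TOTAL_PAGES * PTE_SIZE
print(flat // 1024)                                # 4096 KB, i.e. 4MB
print(two_level_bytes(1) // 1024)                  # 8 KB
print(two_level_bytes(ENTRIES_PER_TABLE) // 1024)  # 4100 KB, i.e. 4MB + 4KB
```

&lt;p&gt;The savings come entirely from second-level tables that never have to exist; once every one of them is needed, the first-level table is pure overhead.&lt;/p&gt;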
&lt;p&gt;&lt;img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/X86_Paging_PAE_4K.svg/440px-X86_Paging_PAE_4K.svg.png#center" alt="page-table-address-translation"&gt;&lt;/p&gt;
&lt;p align="center"&gt;
&lt;a href="https://en.wikipedia.org/wiki/Page_table"&gt;source&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Virtual Memory Overview</title><link>https://www.bodunhu.com/blog/posts/virtual-memory-overview/</link><pubDate>Sun, 08 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.bodunhu.com/blog/posts/virtual-memory-overview/</guid><description>&lt;p&gt;I love pointers. Pointers are a very useful feature in programming languages like C/C++. I can pass weird hexadecimal numbers to a function and then it will magically locate where the program is in memory. However, all those values we see are merely virtual addresses, a running program&amp;rsquo;s view of memory in the system. Any address we can see while writing user-level programs is a virtual address. It is no more than an illusion of where the data is actually laid out in memory. Only the almighty OS knows where the data is actually located in physical memory.&lt;/p&gt;
&lt;h2 id="three-goals-of-vm"&gt;Three Goals of VM&lt;/h2&gt;
&lt;p&gt;When the OS provides this layer of abstraction to translate virtual address to physical address, we say that the OS is &lt;em&gt;virtualizing memory&lt;/em&gt;. In order to achieve virtualization, there&amp;rsquo;re three goals to keep in mind.&lt;/p&gt;
&lt;p&gt;Transparency: The OS should implement virtual memory in a way that is invisible to the running program. Usually, transparency suggests a clear understanding of how things work. However, when we talk about VM, it means the running program is unaware that its memory accesses are translated by the OS (and hardware) behind the scenes.&lt;/p&gt;
&lt;p&gt;Efficiency: The OS should do its best to ensure the efficiency of virtualization, in both time and space. Some of the methods used to improve efficiency, including hardware components like the translation look-aside buffer (TLB), will be discussed in the following chapter.&lt;/p&gt;
&lt;p&gt;Protection: Being able to protect processes from interfering with each other, or with the OS, is important. When one process performs actions like reads and writes, it should be isolated so that it&amp;rsquo;s unable to modify the data of other processes or behave maliciously.&lt;/p&gt;
&lt;h2 id="basic-concept-address-space"&gt;Basic Concept: Address Space&lt;/h2&gt;
&lt;p&gt;Before we start, there are several terms I will use throughout the discussion so it&amp;rsquo;s better to get familiar with them.&lt;/p&gt;
&lt;p&gt;Physical address space: a collection of physical addresses used by the hardware. Addresses range from 0 to MAX&lt;sub&gt;sys&lt;/sub&gt;. These are the addresses physical memory actually uses to fetch its contents.&lt;/p&gt;
&lt;p&gt;Logical/Virtual address space: a collection of addresses a process is able to access (the process is not aware of the actual physical location). It can be bigger than the physical address space thanks to a technique called &lt;strong&gt;paging&lt;/strong&gt;, which will be discussed later.&lt;/p&gt;
&lt;p&gt;Segment: A chunk of memory assigned to the process to use.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Memory_management_unit"&gt;&lt;img src="https://upload.wikimedia.org/wikipedia/commons/d/dc/MMU_principle_updated.png" alt="memory-management"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="how-does-an-address-get-generated"&gt;How does an address get generated&lt;/h2&gt;
&lt;p&gt;Uniprogramming: This is a very simple scenario. There&amp;rsquo;s only one process executing at a time, its addresses always start at 0, and the OS is loaded in a fixed part of memory. The process executes in a contiguous section of memory and can use all of the available memory as long as it doesn&amp;rsquo;t touch the OS&amp;rsquo;s region.&lt;/p&gt;
&lt;h3 id="relocation"&gt;Relocation&lt;/h3&gt;
&lt;p&gt;As in uniprogramming, the OS sits at the highest location in memory. The OS allocates a contiguous segment of memory for each process to use. If there isn&amp;rsquo;t enough space, it simply waits for a process to terminate. The relocation address (also known as the base address) is the first physical address a process can use. The limit address is the largest physical address the process can use.&lt;/p&gt;
&lt;p&gt;Also note there&amp;rsquo;re two types of relocation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;static: The address is produced at load time and the process is loaded into the given location to execute. The OS can&amp;rsquo;t move the process once it is loaded. Note that static addresses can be adjusted in both the linking and loading stages: linking might involve library routines, and loading might increment the base address by the amount of memory used by processes already present in memory. Note that an address printed by the program here is the actual physical address, not a virtual address.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;dynamic: The physical address is obtained by adding the base register to the virtual address. The result has to be less than the bound register, or the processor raises an exception.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
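&lt;p&gt;Dynamic relocation is simple enough to sketch in a few lines of Python. The sketch follows the description above (add the base, then compare the result against the bound); the register values and the &lt;code&gt;ProtectionFault&lt;/code&gt; exception are made up for illustration.&lt;/p&gt;

```python
class ProtectionFault(Exception):
    """Raised when an address falls outside the process's segment."""


def relocate(virtual_address, base, bound):
    """Dynamic relocation: add the base register, then check the bound."""
    physical_address = base + virtual_address
    if physical_address >= bound:
        raise ProtectionFault(hex(virtual_address))
    return physical_address


print(hex(relocate(0x0100, base=0x4000, bound=0x6000)))  # 0x4100
```

&lt;p&gt;Moving a process only requires copying its memory and changing &lt;code&gt;base&lt;/code&gt; and &lt;code&gt;bound&lt;/code&gt;; the virtual addresses the process uses stay the same.&lt;/p&gt;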
&lt;p&gt;Problem with relocation:&lt;/p&gt;
&lt;p&gt;Even though the concept of relocation is easy to understand, it can easily lead to unused memory space. The biggest problem is &lt;strong&gt;external fragmentation&lt;/strong&gt;. When a process finishes executing and its memory is deallocated, it leaves a &amp;ldquo;hole&amp;rdquo; behind that becomes available for other processes to use. The problem is that, since the memory size of the finished process can be arbitrary, the &amp;ldquo;hole&amp;rdquo; it leaves behind may be too small for other processes to fit in. Even if it is big enough, reusing it may leave behind smaller fragments that are too small for any process to use, which is external fragmentation. There&amp;rsquo;s also another problem called &lt;strong&gt;internal fragmentation&lt;/strong&gt;, but it won&amp;rsquo;t be discussed here.&lt;/p&gt;
&lt;h2 id="how-to-minimize-external-fragmentation"&gt;How to minimize external fragmentation?&lt;/h2&gt;
&lt;p&gt;As we can see, relocation leaves room for external fragmentation, which is a problem since unused space is wasted. Are there any methods we can use to minimize external fragmentation and better utilize the free blocks? We will discuss three policies that can help. A reminder: they can&amp;rsquo;t completely eliminate external fragmentation, only reduce it to a certain level.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.differencebetween.net/technology/difference-between-internal-fragmentation-and-external-fragmentation/"&gt;&lt;img src="https://cdn.jsdelivr.net/gh/BDHU/Page_Pics/posts/virtual-memory-overview/extermnal-frag.png#center" alt="external-fragmentation"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="policies"&gt;Policies&lt;/h2&gt;
&lt;h3 id="first-fit-policy"&gt;First-Fit policy&lt;/h3&gt;
&lt;p&gt;Finding the first free block that can hold the requested amount of memory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All free blocks are tracked in a list and sorted by address.&lt;/li&gt;
&lt;li&gt;Allocation requires a search through the list to find the first free block large enough for the request.&lt;/li&gt;
&lt;li&gt;After deallocation, the freed block might need to be merged with other free ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Very easy to implement.&lt;/li&gt;
&lt;li&gt;Tends to produce large free blocks towards the end of the address space.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Allocation is slow.&lt;/li&gt;
&lt;li&gt;It will eventually lead to fragmentation.&lt;/li&gt;
&lt;/ul&gt;
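&lt;p&gt;A minimal Python sketch of first-fit, assuming the free list is a Python list of &lt;code&gt;(start, length)&lt;/code&gt; tuples kept sorted by address. The function names and the sample holes are hypothetical; a real allocator would use a linked list embedded in the free memory itself.&lt;/p&gt;

```python
def first_fit(free_list, size):
    """Allocate from the first block that fits; free_list holds (start, length)
    tuples sorted by address. Returns the start address, or None."""
    for i, (start, length) in enumerate(free_list):
        if length >= size:
            if length == size:
                free_list.pop(i)
            else:
                free_list[i] = (start + size, length - size)
            return start
    return None


def release(free_list, start, size):
    """Return a block to the list and merge it with adjacent free blocks."""
    free_list.append((start, size))
    free_list.sort()
    merged = [free_list[0]]
    for block_start, block_length in free_list[1:]:
        last_start, last_length = merged[-1]
        if last_start + last_length == block_start:
            merged[-1] = (last_start, last_length + block_length)
        else:
            merged.append((block_start, block_length))
    free_list[:] = merged


holes = [(0, 100), (300, 50), (500, 400)]
print(first_fit(holes, 120))  # 500
print(holes)                  # [(0, 100), (300, 50), (620, 280)]
release(holes, 500, 120)
print(holes)                  # [(0, 100), (300, 50), (500, 400)]
```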
&lt;h3 id="best-fit-policy"&gt;Best-Fit policy&lt;/h3&gt;
&lt;p&gt;Finding the smallest free block that can hold the requested amount of memory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Avoid fragmenting huge free blocks.&lt;/li&gt;
&lt;li&gt;Minimize the size of external fragmentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All free blocks are tracked in a list and sorted by size.&lt;/li&gt;
&lt;li&gt;Allocation still needs a search in the list to find a suitable block to fit in.&lt;/li&gt;
&lt;li&gt;After deallocation, the freed block may need to be merged with other free blocks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Kinda simple to implement&lt;/li&gt;
&lt;li&gt;Reduces external fragmentation; works well when allocation sizes are small.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Still leaves room for external fragmentation.&lt;/li&gt;
&lt;li&gt;After a block is deallocated and merged back, the free list may need to be re-sorted.&lt;/li&gt;
&lt;/ul&gt;
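&lt;p&gt;For comparison, here is a hedged Python sketch of best-fit over the same kind of &lt;code&gt;(start, length)&lt;/code&gt; free list. To keep it short it scans every block instead of relying on a sorted list, and it leaves out merging on deallocation; the names and sample holes are made up for the example.&lt;/p&gt;

```python
def best_fit(free_list, size):
    """Allocate from the smallest block that is still large enough.
    free_list holds (start, length) tuples. Returns the start, or None."""
    candidates = [block for block in free_list if block[1] >= size]
    if not candidates:
        return None
    start, length = min(candidates, key=lambda block: block[1])
    free_list.remove((start, length))
    if length > size:
        free_list.append((start + size, length - size))
    return start


holes = [(0, 100), (300, 50), (500, 400)]
print(best_fit(holes, 40))  # 300
print(sorted(holes))        # [(0, 100), (340, 10), (500, 400)]
```

&lt;p&gt;The large block is left untouched, but the request leaves a 10-unit sliver behind, which is the kind of small fragment that best-fit tends to accumulate.&lt;/p&gt;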
&lt;h3 id="worst-fit-policy"&gt;Worst-Fit policy&lt;/h3&gt;
&lt;p&gt;Finding the largest free block to allocate the requested number of bytes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To avoid having too many small free segments (reduce external fragmentation).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Requirement&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Free blocks sorted by size&lt;/li&gt;
&lt;li&gt;Allocation is fast since the first one is always the largest one.&lt;/li&gt;
&lt;li&gt;After deallocation, we need to check whether the new free block should be merged with its neighbors, and then re-sort the list.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works great if all allocations are of medium size.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Still external fragmentation&lt;/li&gt;
&lt;li&gt;Deallocation is slow (need to re-sort and merge)&lt;/li&gt;
&lt;li&gt;Tends to break down large free blocks, which can lead to failure to allocate large blocks.&lt;/li&gt;
&lt;/ul&gt;
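&lt;p&gt;Worst-fit can be sketched in Python with a heap standing in for the size-sorted list, so the largest block is always on top. This is an illustrative simplification that omits merging on deallocation; &lt;code&gt;make_heap&lt;/code&gt; and &lt;code&gt;worst_fit&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import heapq


def make_heap(blocks):
    """Build a max-heap keyed on length from (start, length) tuples."""
    heap = [(-length, start) for start, length in blocks]
    heapq.heapify(heap)
    return heap


def worst_fit(heap, size):
    """Allocate from the largest free block, which sits at the top of the heap."""
    if not heap or size > -heap[0][0]:
        return None
    negative_length, start = heapq.heappop(heap)
    remainder = -negative_length - size
    if remainder > 0:
        heapq.heappush(heap, (-remainder, start + size))
    return start


heap = make_heap([(0, 100), (300, 50), (500, 400)])
print(worst_fit(heap, 120))  # 500
print(worst_fit(heap, 120))  # 620
print(worst_fit(heap, 200))  # None
```

&lt;p&gt;After two medium requests the largest remaining block is only 160 units, so the 200-unit request fails even though 310 units are free in total. This is the last drawback in the list above.&lt;/p&gt;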
&lt;p&gt;Techniques to further reduce, and even eliminate, external fragmentation will be discussed later.&lt;/p&gt;
&lt;h2 id="dynamic-relocation"&gt;Dynamic Relocation&lt;/h2&gt;
&lt;h3 id="advantages"&gt;Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Processes can move during execution.&lt;/li&gt;
&lt;li&gt;Processes can grow over time.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s easy to provide protection since we only need two registers.&lt;/li&gt;
&lt;li&gt;Fast and simple&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="disadvantages"&gt;Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Allocation is contiguous.&lt;/li&gt;
&lt;li&gt;Sharing memory is hard since there&amp;rsquo;s no way to set the base and bound registers to be the same for more than one process.&lt;/li&gt;
&lt;li&gt;Multiprogramming is limited since all active processes have to be loaded into memory, so physical memory becomes the limit on how many processes can be loaded. (Swapping might help, but all active processes still need to be in memory.)&lt;/li&gt;
&lt;li&gt;Requires an addition on every memory reference.&lt;/li&gt;
&lt;li&gt;Memory management is a mess.&lt;/li&gt;
&lt;li&gt;Everyone has the same permission throughout the address space, which can potentially create problems.&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>