This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think there’s something worth mentioning. I will try to update this page frequently to include the most recent work in MLSys.
- Facebook’s external large-scale work
- NGC Container Doc: great for development, without having to manually install CUDA, PyTorch, and other dependencies.
- Awesome-System-for-Machine-Learning: A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
- Deep Learning Systems: Algorithms and Implementation
- Machine Learning Compilation: offered by Tianqi Chen, intro to ML compiler. Open to all.
- 15-884: Machine Learning Systems: offered by Tianqi Chen at CMU.
- CSE 291F: Advanced Data Analytics and ML Systems: offered by Arun Kumar at UCSD.
- Tinyflow: tutorial code on how to build your own Deep Learning System in 2k Lines.
- CSE 599W: Systems for ML: offered by Tianqi Chen at UW.
- CS8803-SMR: Special Topics: Systems for Machine Learning: offered by Alexey Tumanov at Georgia Tech. [schedule]
- CS 744: Big Data Systems: offered by Aditya Akella back at UW-Madison.
- CS 329S: Machine Learning Systems Design: offered by Stanford.
- EECS 598: Systems for AI: offered by Mosharaf Chowdhury at UMich.
- CS 378: Big Data Systems: offered by Aditya Akella at UT.
- Machine Learning Systems (Fall 2019): from UCB
- ECE 382V SysML: Computer Systems and Machine Learning Interplay: taught by Neeraja Yadwadkar at UT.
Labs & Faculty
- CMU Catalyst
- Berkeley RISE Lab
- MIT DASIL Lab
- UW SAMPL
- Shivaram Venkataraman at UW-Madison
- UTNS Lab
- Neeraja Yadwadkar at UT Austin
- SAIL: Systems for Artificial Intelligence Lab @ GT
- Amar Phanishayee @ MSR
- Hong Zhang @ University of Waterloo
- Manya Ghobadi @ MIT
- Dive into Deep Learning: interactive deep learning book with code, math, and discussions.
- CS231n: Convolutional Neural Networks for Visual Recognition
- Pytorch-Internals: must-read for PyTorch basics.
- Differential Programming with JAX
- Getting started with JAX (MLPs, CNNs & RNNs)
- Physics-based Deep Learning
- Dive into Deep Learning Compiler
- Autodidax: JAX core from scratch: a very good resource for learning JAX internals.
- Extending JAX with custom C++ and CUDA code
- Vector, Matrix, and Tensor Derivatives
- ML Memory Optimization: slides from UW. Visualizing the dataflow graph helps in understanding how to optimize memory.
- Semianalysis: many good posts
- How to Load PyTorch Models 340 Times Faster with Ray
- Jin Xuefeng (金雪锋): technical lead of MindSpore
- Why are ML Compilers so Hard?
- Alpa: Automated Model-Parallel Deep Learning
- Data Transfer Speed Comparison: Ray Plasma Store vs. S3
- Understanding How TorchDynamo Works in One Article (一文搞懂 TorchDynamo 原理)
This section could potentially be extremely long.
- Refer to Awesome-LLM
- Large Transformer Model Inference Optimization
- Efficient Memory Management for Large Language Model Serving with PagedAttention: vLLM
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: load as sparse, compute as dense
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- Learning to Optimize Tensor Programs: facilitate efficient ML kernel search using ML.
- The OoO VLIW JIT Compiler for GPU Inference: JIT Compiler to enable better GPU multiplexing.
- LazyTensor: combining eager execution with domain-specific compilers: combines dynamic graphs with JIT compilation. [summary]
- Automatic Generation of High-Performance Quantized Machine Learning Kernels
- DietCode: Automatic Optimization for Dynamic Tensor Programs
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
- The Deep Learning Compiler: A Comprehensive Survey
- Graphiler: A Compiler for Graph Neural Networks: ML compiler specifically designed for GNNs.
- Accelerating Multi-Model Inference by Merging DNNs of Different Weights: combine multiple instances of a model into one computational graph
- INFaaS: Automated Model-less Inference Serving: developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: transformer-specific inference optimization done by DeepSpeed.
- Clipper: A Low-Latency Online Prediction Serving System: a nice overview of an inference system. Not SOTA but a good starter.
- Orloj: Predictably Serving Unpredictable DNNs: similar to Clipper, but targets models whose performance may be unpredictable.
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: how to multiplex devices to serve multiple models while meeting latency constraints.
- MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- A Survey of Multi-Tenant Deep Learning Inference on GPU: efficient resource management for multi-tenant inference.
- Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction: overlap DNN operators of different models in an online fashion
- THEMIS: Fair and Efficient GPU Cluster Scheduling: minimize the maximum finish-time fairness across all ML apps while efficiently utilizing cluster GPUs
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: use an LLM as the controller to coordinate with external models for complicated tasks.
Dynamic Neural Network
- Dynamic Neural Networks: A Survey
- Dynamic Multimodal Fusion
- Hash Layers For Large Sparse Models: Using hashing for MoE gating.
- An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems: use an evolutionary algorithm to update the model structure during the training phase.
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Beyond Data and Model Parallelism for Deep Neural Networks: called FlexFlow.
Switch & ML
- Do Switches Dream of Machine Learning? Toward In-Network Classification
- Taurus: A Data Plane Architecture for Per-Packet ML
- Pathways: Asynchronous Distributed Dataflow for ML: Google’s new DL system, specifically designed for TPU.
- Ray: A Distributed Framework for Emerging AI Applications: RISELab’s new distributed system. Uses shared memory for data communication.
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch: shares many similarities with Google’s Pathways.
- Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training: find trade-offs between DNN training performance optimization and energy consumption, by configuring batch size and GPU power limit.