Machine Learning System Resources

May 5, 2025

This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think something is worth adding. I will try to update this page frequently to include the most recent work in MLSys.

Resources

Facebook’s external large-scale work
NGC Container Doc: great for development, since you do not have to manually install CUDA, PyTorch, and other dependencies.
Awesome-System-for-Machine-Learning: A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
Mastering LLM Techniques: Inference Optimization: summary of techniques used for LLM deployment
LLM Visualization

Courses

Deep Learning Systems: Algorithms and Implementation
Machine Learning Compilation: offered by Tianqi Chen, intro to ML compiler. Open to all.
15-884: Machine Learning Systems: offered by Tianqi Chen at CMU.
CSE 291F: Advanced Data Analytics and ML Systems: offered by Arun Kumar at UCSD.
Tinyflow: tutorial code on how to build your own Deep Learning System in 2k Lines.
CSE 599W: Systems for ML: offered by Tianqi Chen at UW.
CS8803-SMR: Special Topics: Systems for Machine Learning: offered by Alexey Tumanov at Georgia Tech. [schedule]
CS 744: Big Data Systems: offered by Aditya Akella back at UW-Madison.
CS 329S: Machine Learning Systems Design: offered by Stanford.
EECS 598: Systems for AI: offered by Mosharaf Chowdhury at UMich.
CS 378: Systems for Machine Learning: offered by Aditya Akella at UT.
Machine Learning Systems (Fall 2019): from UCB
ECE 382V SysML: Computer Systems and Machine Learning Interplay: taught by Neeraja Yadwadkar at UT.

Labs & Faculty

Tutorials

The Illustrated Transformer: the best introduction to Transformers.
Dive into Deep Learning: interactive deep learning book with code, math, and discussions.
CS231n: Convolutional Neural Networks for Visual Recognition
Pytorch-Internals: must-read for PyTorch basics.
Differential Programming with JAX
Getting started with JAX (MLPs, CNNs & RNNs)
Physics-based Deep Learning
Machine Learning Systems: Design and Implementation
Dive into Deep Learning Compiler
Autodidax: JAX core from scratch: an excellent resource for learning JAX internals.
Extending JAX with custom C++ and CUDA code
Vector, Matrix, and Tensor Derivatives
ML Memory Optimization: slides from UW. The dataflow graph visualizations help in understanding how to optimize memory.
PyTorch Model Acceleration Series (1): The New Torch-TensorRT and TorchScript/FX/dynamo
Semianalysis: many good posts
How to Load PyTorch Models 340 Times Faster with Ray
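Distributed Deep Learning Systems
A Survey of MLSys Research Directions
Jin Xuefeng (金雪锋): technical lead of MindSpore
Understanding Google's Pathways Architecture (1): Single-controller vs. Multi-controller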
Why are ML Compilers so Hard?
Alpa: Automated Model-Parallel Deep Learning
Data Transfer Speed Comparison: Ray Plasma Store vs. S3
Learning Deep Learning Compilers from Scratch
Understanding How TorchDynamo Works in One Article

Communication

PyTorch SymmetricMemory: Harnessing NVLink Programmability with Ease

Seminars

Stanford MLSys Seminar

Papers

This section could potentially be extremely long.

Training

Really broad topic…

The Llama 3 Herd of Models: a great paper explaining how SOTA models are trained in the real world.

LLM

You can also refer to Awesome-LLM.

Large Transformer Model Inference Optimization
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: uses an LLM as the controller to coordinate external models for complicated tasks
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: load as sparse, compute as dense
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism: Overlapping LLM forward with KV caching computation.
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache: managing KV cache in distributed settings.
MatFormer: Nested Transformer for Elastic Inference: adaptive blocks during inference.

NAS

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs: Applying block-wise local distillation to every alternative subblock replacement in parallel and scoring its quality and inference cost to build a “library” of blocks. Then, using Mixed-Integer-Programming to assemble a heterogeneous architecture that optimizes quality under constraints such as throughput, latency and memory usage.
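The assembly step described for Puzzle above (pick one scored block per layer so that total quality is maximized under a cost budget) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the block names, the quality and latency numbers, the additive objective, and the exhaustive search, which merely stands in for the Mixed-Integer-Programming solver mentioned in the paper.

```python
from itertools import product

# Hypothetical block library: for each layer, candidate blocks given as
# (name, quality_score, latency_cost). All numbers are made up.
LIBRARY = [
    [("full", 1.00, 4.0), ("narrow_ffn", 0.97, 2.5), ("no_op", 0.80, 0.0)],
    [("full", 1.00, 4.0), ("narrow_ffn", 0.95, 2.5), ("no_op", 0.90, 0.0)],
    [("full", 1.00, 4.0), ("narrow_ffn", 0.99, 2.5), ("no_op", 0.70, 0.0)],
]


def assemble(library, latency_budget):
    """Pick one block per layer, maximizing summed quality under a latency budget.

    Brute force over all combinations; only feasible for tiny libraries.
    Returns (list of block names, total quality), or (None, None) if no
    combination fits the budget.
    """
    best_choice, best_quality = None, None
    for choice in product(*library):
        latency = sum(block[2] for block in choice)
        quality = sum(block[1] for block in choice)
        if latency > latency_budget:
            continue  # violates the constraint
        if best_quality is None or quality > best_quality:
            best_choice, best_quality = choice, quality
    if best_choice is None:
        return None, None
    return [block[0] for block in best_choice], best_quality


names, quality = assemble(LIBRARY, latency_budget=7.5)
print(names, quality)
```

With a real library the search space grows exponentially in the number of layers, which is why a solver-based formulation is used instead of enumeration.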

Diffusion

Inference

INFaaS: Automated Model-less Inference Serving: developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: transformer-specific inference optimization done by DeepSpeed.
Clipper: A Low-Latency Online Prediction Serving System: a nice overview of an inference serving system. Not SOTA, but a good starting point.
Orloj: Predictably Serving Unpredictable DNNs: similar to Clipper, but targets models whose performance may be unpredictable.
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: how to multiplex devices to serve multiple models while meeting latency constraints.
MOSEL: Inference Serving Using Dynamic Modality Selection: dynamic modality selections for accuracy and SLO tradeoff
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models: finer-grained LLM serving
Orca: A Distributed Serving System for Transformer-Based Generative Models: iteration-based LLM inference
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: inference on heterogeneous accelerators using max-flow
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: automatically configures inter- and intra-node parallelism for both the prefill and decoding phases.
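The iteration-based scheduling noted for Orca above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are mine, not the paper's: all requests arrive at time 0, each iteration generates exactly one token per running request, and the function and variable names are hypothetical.

```python
from collections import deque


def iteration_level_schedule(requests, max_batch_size):
    """Simulate iteration-level scheduling of generation requests.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns a dict mapping request_id to the iteration at which it finished.
    The batch is re-formed at every iteration, so a waiting request can join
    as soon as any running request completes.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    finished_at = {}
    iteration = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch_size:
            request_id, tokens = waiting.popleft()
            running[request_id] = tokens
        # Run one model iteration: every running request emits one token.
        iteration += 1
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]
                finished_at[request_id] = iteration
    return finished_at


print(iteration_level_schedule([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2))
```

In this toy run, request "c" joins the batch right after "a" finishes and completes at iteration 3. If the batch were only re-formed once all of its requests had finished, "c" would start after iteration 3 and complete at iteration 5.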

Multitenancy

MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
A Survey of Multi-Tenant Deep Learning Inference on GPU: efficient resource management for multi-tenant inference.
Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction: overlap DNN operators of different models in an online fashion
THEMIS: Fair and Efficient GPU Cluster Scheduling: minimizes the maximum finish-time fairness metric across all ML apps while efficiently utilizing cluster GPUs

Dynamic Neural Network

Dynamic Neural Networks: A Survey
Dynamic Multimodal Fusion
Hash Layers For Large Sparse Models: Using hashing for MoE gating.
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems: uses an evolutionary algorithm to update the model structure during training.
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping: dynamically skips FFN layers in Transformers using cosine similarity.
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: monotonically decreases the number of LLM layers executed to ease KV cache management
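The cosine-similarity idea noted for FFN-SkipLLM above can be sketched in a few lines. This is a heavily simplified illustration, not the paper's algorithm: the threshold value, the "skip all remaining FFNs once one is saturated" policy, and the toy FFN functions are assumptions, and hidden states are plain Python lists assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def run_ffn_layers(hidden, ffns, threshold=0.99):
    """Apply residual FFN layers, skipping the rest once one barely changes the state.

    Returns (final hidden state, number of FFN layers skipped).
    """
    skipped = 0
    skipping = False
    for ffn in ffns:
        if skipping:
            skipped += 1  # the residual path passes the state through unchanged
            continue
        new_hidden = [h + delta for h, delta in zip(hidden, ffn(hidden))]
        # If input and output point in almost the same direction, assume
        # later FFN layers contribute little as well.
        if cosine_similarity(hidden, new_hidden) >= threshold:
            skipping = True
        hidden = new_hidden
    return hidden, skipped


# Toy FFNs (hypothetical): the first rotates the state, the second barely
# changes its direction, the third would change it a lot but gets skipped.
toy_ffns = [
    lambda h: [0.0, 1.0],
    lambda h: [0.01 * x for x in h],
    lambda h: [5.0, 0.0],
]
final_hidden, num_skipped = run_ffn_layers([1.0, 0.0], toy_ffns)
print(final_hidden, num_skipped)
```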

Auto Placement

Reasoning LLM

Federated Learning

A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Switch & ML

Memory Management

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

System Design

Pathways: Asynchronous Distributed Dataflow for ML: Google’s new DL system, designed specifically for TPUs.
Ray: A Distributed Framework for Emerging AI Applications: RISELab’s distributed system, which uses shared memory for data communication.
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch: shares many similarities with Google’s Pathways.

Trade-off

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training: finds trade-offs between DNN training performance and energy consumption by configuring the batch size and the GPU power limit.
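The trade-off described for Zeus above can be illustrated by scoring each (batch size, power limit) configuration with a single weighted cost. This is a sketch under assumptions: the measurements are invented, and the cost form (a weight `eta` blending energy against time scaled by the maximum power) is written in the spirit of the paper rather than copied from it.

```python
# Hypothetical measurements:
# (batch_size, power_limit_watts) -> (time_to_accuracy_s, energy_to_accuracy_j)
MEASUREMENTS = {
    (32, 300): (1000.0, 280_000.0),
    (32, 200): (1200.0, 230_000.0),
    (64, 300): (900.0, 265_000.0),
    (64, 200): (1100.0, 215_000.0),
}
MAX_POWER_W = 300.0  # assumed maximum GPU power limit


def cost(time_s, energy_j, eta):
    """Blend energy and time into one number; eta=1 is energy-only, eta=0 is time-only."""
    return eta * energy_j + (1.0 - eta) * MAX_POWER_W * time_s


def best_config(measurements, eta):
    """Return the (batch_size, power_limit) pair with the lowest weighted cost."""
    return min(
        measurements,
        key=lambda cfg: cost(measurements[cfg][0], measurements[cfg][1], eta),
    )


print(best_config(MEASUREMENTS, eta=0.0))  # prioritize training time
print(best_config(MEASUREMENTS, eta=1.0))  # prioritize energy
```

With these made-up numbers, a time-only weighting prefers the higher power limit, while an energy-only weighting prefers the lower one at the same batch size.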

Structured LLM Generation

Async Training

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models: training individual LMs before merging
Large Scale Distributed Deep Networks: parameter server

Self-play

O1 Replication Journey: A Strategic Progress Report

Costs

RouteLLM: Learning to Route LLMs with Preference Data

RAG

From Local to Global: A Graph RAG Approach to Query-Focused Summarization: GraphRAG, compressing information into graphs so retrieval may contain more information
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: uses a smaller model with subsets of the retrieved documents to generate drafts quickly, and a larger LM for verification.
Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: summarizes retrieved documents into notes instead of merely using them as is.
Seven Failure Points When Engineering a Retrieval Augmented Generation System: a study showing how different configurations in different retrieval stages may affect generation quality.
Active Retrieval Augmented Generation: determines when to perform retrieval and how to form the retrieval query
MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery