Machine Learning System Resources
This is my personal list of resources related to machine learning systems. Feel free to drop me an email if you think there’s something worth mentioning. I will try to update this page frequently to include the most recent work in MLSys.
Resources
- Facebook’s external large-scale work
- NGC Container Doc: great for development, without having to manually install CUDA, PyTorch, and other dependencies.
- Awesome-System-for-Machine-Learning: A curated list of research in machine learning systems (MLSys). Paper notes are also provided.
- Mastering LLM Techniques: Inference Optimization: summary of techniques used for LLM deployment
- LLM Visualization
Courses
- Deep Learning Systems: Algorithms and Implementation
- Machine Learning Compilation: offered by Tianqi Chen; an intro to ML compilers, open to all.
- 15-884: Machine Learning Systems: offered by Tianqi Chen at CMU.
- CSE 291F: Advanced Data Analytics and ML Systems: offered by Arun Kumar at UCSD.
- Tinyflow: tutorial code on how to build your own Deep Learning System in 2k Lines.
- CSE 599W: Systems for ML: offered by Tianqi Chen at UW.
- CS8803-SMR: Special Topics: Systems for Machine Learning: offered by Alexey Tumanov at Georgia Tech. [schedule]
- CS 744: Big Data Systems: offered by Aditya Akella back at UW-Madison.
- CS 329S: Machine Learning Systems Design: offered by Stanford.
- EECS 598: Systems for AI: offered by Mosharaf Chowdhury at UMich.
- CS 378: Systems for Machine Learning: offered by Aditya Akella at UT.
- Machine Learning Systems (Fall 2019): from UCB
- ECE 382V SysML: Computer Systems and Machine Learning Interplay: taught by Neeraja Yadwadkar at UT.
Labs & Faculty
- CMU Catalyst
- Berkeley RISE Lab
- MIT DASIL Lab
- UW SAMPL
- SymbioticLab
- Shivaram Venkataraman at UW-Madison
- UTNS Lab
- Neeraja Yadwadkar at UT Austin
- SAIL: Systems for Artificial Intelligence Lab @ GT
- Amar Phanishayee @ MSR
- Hong Zhang @ University of Waterloo
- Manya Ghobadi @ MIT
- WukLab @ UCSD
- Hao AI Lab @ UCSD
Tutorials
- The Illustrated Transformer: the best introduction to the Transformer
- Dive into Deep Learning: interactive deep learning book with code, math, and discussions.
- CS231n: Convolutional Neural Networks for Visual Recognition
- PyTorch Internals: a must-read for PyTorch basics.
- Differential Programming with JAX
- Getting started with JAX (MLPs, CNNs & RNNs)
- Physics-based Deep Learning
- Machine Learning Systems: Design and Implementation (机器学习系统:设计和实现)
- Dive into Deep Learning Compiler
- Autodidax: JAX core from scratch: an excellent resource for learning JAX internals.
- Extending JAX with custom C++ and CUDA code
- Vector, Matrix, and Tensor Derivatives
- ML Memory Optimization: slides from UW. Visualization of dataflow graph helps understand how to optimize memory.
- PyTorch Model Acceleration, Part 1: the new Torch-TensorRT and TorchScript/FX/dynamo (Pytorch模型加速系列(一)——新的Torch-TensorRT以及TorchScript/FX/dynamo)
- SemiAnalysis: many good posts
- How to Load PyTorch Models 340 Times Faster with Ray
- Distributed Deep Learning Systems (分布式深度学习系统)
- Overviews of MLSys Research Directions (MLsys各方向综述)
- 金雪锋 (Jin Xuefeng): technical lead of MindSpore
- Decoding Google's Pathways Architecture, Part 1: Single-controller vs. Multi-controller (解读谷歌Pathways架构(一):Single-controller与Multi-controller)
- Why are ML Compilers so Hard?
- Alpa: Automated Model-Parallel Deep Learning
- Data Transfer Speed Comparison: Ray Plasma Store vs. S3
- Learning Deep Learning Compilers from Scratch (从零开始学深度学习编译器)
- Understanding How TorchDynamo Works (一文搞懂 TorchDynamo 原理)
Seminars
Papers
This section could potentially be extremely long…
Training
Really broad topic…
- The Llama 3 Herd of Models: a great paper explaining how SOTA models are trained in the real world!
LLM
You can also refer to Awesome-LLM.
- Large Transformer Model Inference Optimization
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face: uses an LLM as the controller to coordinate external models for complicated tasks
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: load as sparse, compute as dense (see the sketch after this list)
- EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism: overlaps the LLM forward pass with KV-cache computation.
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache: managing KV cache in distributed settings.
- MatFormer: Nested Transformer for Elastic Inference: adaptive blocks during inference.
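Flash-LLM's "load as sparse, compute as dense" idea can be illustrated with a tiny NumPy/SciPy sketch. This is my own toy version, not the paper's GPU kernel: the only point is that the weights live in a compact sparse format and get densified right before a regular dense matmul.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W[rng.random(W.shape) < 0.8] = 0.0   # 80% unstructured sparsity
W_sparse = csr_matrix(W)             # compact storage: "load as sparse"

x = rng.standard_normal((256, 8))

# On a GPU this densification would happen per tile in shared memory;
# here we densify the whole matrix for simplicity: "compute as dense".
W_dense = W_sparse.toarray()
y = W_dense @ x

assert np.allclose(y, W @ x)
```

In the real system the densification happens tile by tile in fast memory, so the bandwidth savings of sparse loads combine with the efficiency of dense GEMM hardware.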
KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
- Efficient Memory Management for Large Language Model Serving with PagedAttention: (vLLM) allows a non-contiguous KV cache (see the sketch after this list)
- Efficiently Programming Large Language Models using SGLang: RadixAttention, which allows KV cache sharing across prompts
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs: dynamically drops cached KV tokens for long sequences.
- Efficient Streaming Language Models with Attention Sinks: (Stream-LLM)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashDecoding++: Faster Large Language Model Inference on GPUs
- FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
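To make the PagedAttention entry above concrete, here is a minimal paged KV cache in plain Python. This is my own simplification, not vLLM's code: each sequence owns a block table mapping logical token positions to fixed-size physical blocks allocated on demand from a shared pool, so the cache need not be contiguous and memory is not over-reserved.

```python
import numpy as np

BLOCK_SIZE = 16      # tokens per physical KV block
NUM_BLOCKS = 64      # size of the shared physical block pool
HEAD_DIM = 8         # toy head dimension

# Physical pool of KV blocks, shared by all sequences.
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""
    def __init__(self):
        self.block_table = []
        self.num_tokens = 0

    def append_kv(self, kv_vector):
        # Allocate a new physical block only when the last one is full,
        # so memory is committed on demand rather than pre-reserved.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.num_tokens // BLOCK_SIZE]
        kv_pool[block, self.num_tokens % BLOCK_SIZE] = kv_vector
        self.num_tokens += 1

    def gather_kv(self):
        # Attention kernels read through the block table, so the
        # sequence's cache need not be physically contiguous.
        blocks = kv_pool[self.block_table].reshape(-1, HEAD_DIM)
        return blocks[: self.num_tokens]

seq = Sequence()
for t in range(40):                           # 40 tokens -> 3 blocks
    seq.append_kv(np.full(HEAD_DIM, t))
print(seq.block_table, seq.gather_kv().shape)  # e.g. [63, 62, 61] (40, 8)
```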
Datasets
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset: real-world conversation
ML Compilers
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- Learning to Optimize Tensor Programs: facilitates efficient ML kernel search using ML (a toy version of the search loop follows this list).
- The OoO VLIW JIT Compiler for GPU Inference: JIT Compiler to enable better GPU multiplexing.
- LazyTensor: combining eager execution with domain-specific compilers: combines dynamic graphs with JIT compilation. [summary]
- Automatic Generation of High-Performance Quantized Machine Learning Kernels
- DietCode: Automatic Optimization for Dynamic Tensor Programs
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
- The Deep Learning Compiler: A Comprehensive Survey
- Graphiler: A Compiler for Graph Neural Networks: ML compiler specifically designed for GNN.
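The learned-search idea in "Learning to Optimize Tensor Programs" boils down to a loop: propose candidate schedules, rank them with a cost model trained on past measurements, measure only the promising ones on hardware, and feed the results back. Everything in this sketch is a toy stand-in; the "schedule" space, cost model, and measurement function are all made up.

```python
import random

# Toy schedule space: tile sizes for a matmul kernel.
SPACE = [(tx, ty) for tx in (4, 8, 16, 32) for ty in (4, 8, 16, 32)]

def measure_on_hardware(cfg):
    # Placeholder for a real on-device benchmark run; the synthetic
    # optimum at (16, 8) is arbitrary.
    tx, ty = cfg
    return abs(tx - 16) + abs(ty - 8) + random.random()

history = []  # (config, measured cost) pairs

def cost_model(cfg):
    # Trivial "learned" model: average measured cost of nearby configs.
    # Real systems fit e.g. gradient-boosted trees on schedule features.
    near = [c for seen, c in history
            if abs(seen[0] - cfg[0]) + abs(seen[1] - cfg[1]) <= 8]
    return sum(near) / len(near) if near else 0.0

for _ in range(5):
    # Rank random candidates with the cost model, measure only the most
    # promising ones on "hardware", then feed results back into the model.
    candidates = random.sample(SPACE, 6)
    candidates.sort(key=cost_model)
    for cfg in candidates[:2]:
        history.append((cfg, measure_on_hardware(cfg)))

print("best tiling:", min(history, key=lambda kv: kv[1]))
```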
Graph Optimization
- Accelerating Multi-Model Inference by Merging DNNs of Different Weights: combines multiple instances of a model into one computational graph (a toy sketch follows below)
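A toy PyTorch version of the merging idea: two model instances with different weights are stacked into one batched contraction, so a single kernel launch serves both. The sizes and the einsum formulation are my own choices for illustration, not the paper's method.

```python
import torch

# Two instances of the same architecture with different weights.
m1 = torch.nn.Linear(64, 32, bias=False)
m2 = torch.nn.Linear(64, 32, bias=False)

# Merge: stack the weights so one batched matmul serves both models.
W = torch.stack([m1.weight, m2.weight])         # (2, 32, 64)

x = torch.randn(8, 64)                          # shared input batch
merged_out = torch.einsum("moi,bi->mbo", W, x)  # (2, 8, 32): one kernel

assert torch.allclose(merged_out[0], m1(x), atol=1e-6)
assert torch.allclose(merged_out[1], m2(x), atol=1e-6)
```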
Inference
- INFaaS: Automated Model-less Inference Serving: developers specify performance and accuracy requirements for their applications, without needing to pick a specific model variant for each query
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: transformer-specific inference optimization done by DeepSpeed.
- Clipper: A Low-Latency Online Prediction Serving System: a nice overview of inference system. Not SOTA but a good starter.
- Orloj: Predictably Serving Unpredictable DNNs: shares similarity to Clipper, but targeting models that may yield unpredictable performance.
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: how to multiplex devices to serve multiple models while meeting latency constraints.
- MOSEL: Inference Serving Using Dynamic Modality Selection: dynamic modality selections for accuracy and SLO tradeoff
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models: finer-grained LLM serving
- Orca: A Distributed Serving System for Transformer-Based Generative Models: iteration-based LLM inference (see the sketch after this list)
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: inference on heterogeneous accelerators using max-flow
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: automatically configures inter- and intra-node parallelism separately for the prefill and decoding phases.
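Orca's iteration-level scheduling is worth sketching: instead of waiting for an entire batch to finish, the server re-forms the batch at every decoding step, admitting queued requests and retiring completed ones immediately. The sketch below is schematic; the model step and output lengths are fake.

```python
import random
from collections import deque

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(8))   # queued requests
running = {}                                   # request -> tokens left

def model_step(batch):
    # Placeholder for one fused forward pass over all running requests.
    for req in batch:
        running[req] -= 1

step = 0
while waiting or running:
    # Iteration-level scheduling: admit new requests at *every* step,
    # rather than only when the whole batch has finished.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(1, 5)  # fake length
    model_step(list(running))
    # Retire requests the moment they emit their last token.
    done = [r for r, left in running.items() if left == 0]
    for r in done:
        del running[r]
    step += 1
    if done:
        print(f"step {step}: finished {done}, batch now {list(running)}")
```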
Multitenancy
- MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- A Survey of Multi-Tenant Deep Learning Inference on GPU: efficient resource management for multi-tenant inference.
- Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction: overlap DNN operators of different models in an online fashion
- Themis: Fair and Efficient GPU Cluster Scheduling: minimizes the maximum finish-time fairness across all ML apps while efficiently utilizing cluster GPUs (see the sketch after this list)
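Themis's metric, as I read the paper, is finish-time fairness rho = T_shared / T_independent: the ratio of an app's finish time in the shared cluster to its finish time on a dedicated 1/N slice; freed resources are offered to the apps with the worst (largest) rho. A toy version with made-up numbers:

```python
# Toy finish-time fairness, loosely after Themis: rho = T_shared / T_ind.
# All numbers are made up; a real scheduler estimates them from progress.
apps = {
    # app: (estimated finish time with its current shared allocation,
    #       estimated finish time on a dedicated 1/N cluster slice)
    "jobA": (120.0, 100.0),
    "jobB": (300.0, 100.0),
    "jobC": (110.0, 100.0),
}

def rho(app):
    t_shared, t_independent = apps[app]
    return t_shared / t_independent

# Offer freed GPUs to the apps furthest behind their fair share.
order = sorted(apps, key=rho, reverse=True)
print([(a, round(rho(a), 2)) for a in order])   # jobB first (rho = 3.0)
```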
Dynamic Neural Network
- Dynamic Neural Networks: A Survey
- Dynamic Multimodal Fusion
- Hash Layers For Large Sparse Models: uses hashing for MoE gating (see the sketch after this list).
- An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems: uses an evolutionary algorithm to update the model structure during training.
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping: dynamically skips FFN layers in Transformers using cosine similarity.
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: monotonically decreases the number of layers used per token to ease KV cache management
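The Hash Layers trick is about the simplest possible MoE router: assign each token to an expert by hashing its token id, so gating needs no learned parameters. A minimal PyTorch sketch; the modulo hash and dimensions are stand-ins (the paper fixes a random hash over the vocabulary):

```python
import torch

NUM_EXPERTS = 4
experts = torch.nn.ModuleList(
    torch.nn.Linear(16, 16) for _ in range(NUM_EXPERTS))

def hash_route(token_ids):
    # Parameter-free gating: expert = hash(token id) mod num_experts.
    # Modulo is the simplest stand-in for a fixed random hash.
    return token_ids % NUM_EXPERTS

token_ids = torch.tensor([3, 7, 42, 9])
h = torch.randn(4, 16)                      # token hidden states

out = torch.empty_like(h)
for e in range(NUM_EXPERTS):
    mask = hash_route(token_ids) == e       # tokens assigned to expert e
    if mask.any():
        out[mask] = experts[e](h[mask])
```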
Auto Placement
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Beyond Data and Model Parallelism for Deep Neural Networks: called FlexFlow.
Federated Learning
Switch & ML
- Do Switches Dream of Machine Learning? Toward In-Network Classification
- Taurus: A Data Plane Architecture for Per-Packet ML
Memory Management
System Design
- Pathways: Asynchronous Distributed Dataflow for ML: Google’s new DL system, designed specifically for TPUs.
- Ray: A Distributed Framework for Emerging AI Applications: RISELab’s distributed framework; uses shared memory for data communication (see the example after this list).
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch: shares many similarities with Google’s Pathways.
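The shared-memory point in the Ray entry shows up directly in its API: `ray.put` places a large array in the node-local object store once, and tasks on the same node read it zero-copy rather than receiving per-task copies. A minimal example (assumes `pip install ray`):

```python
import numpy as np
import ray

ray.init()

# Put the array into the object store once; tasks receive a reference,
# and workers on the same node read it zero-copy via shared memory.
big = np.zeros((1000, 1000))
big_ref = ray.put(big)

@ray.remote
def row_sum(arr, i):
    # `arr` is a read-only view backed by shared memory, not a copy.
    return float(arr[i].sum())

results = ray.get([row_sum.remote(big_ref, i) for i in range(4)])
print(results)
```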
Trade-off
- Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training: finds trade-offs between DNN training performance and energy consumption by configuring the batch size and GPU power limit (see the sketch below).
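Zeus's two knobs are the batch size and the GPU power limit; a naive offline sweep over both, minimizing a weighted energy-time cost, conveys the idea. The measurement function below is a placeholder and the cost expression is only in the spirit of Zeus's objective; the real system explores these knobs online across recurring training jobs.

```python
import itertools

MAX_POWER = 300.0  # card's maximum power limit, in watts
ETA = 0.5          # tradeoff knob: 1.0 = pure energy, 0.0 = pure time

def profile(batch_size, power_limit):
    # Placeholder for measured time/energy of one training epoch; real
    # numbers would come from NVML power readings during training.
    time_s = 1000.0 / batch_size + 5000.0 / power_limit
    energy_j = 0.8 * power_limit * time_s
    return time_s, energy_j

best = None
for bs, pl in itertools.product((32, 64, 128, 256), (150, 200, 250, 300)):
    time_s, energy_j = profile(bs, pl)
    # Weighted cost mixing energy and (power-normalized) time via eta.
    cost = ETA * energy_j + (1 - ETA) * MAX_POWER * time_s
    if best is None or cost < best[0]:
        best = (cost, bs, pl)

print("picked batch size %d, power limit %dW" % (best[1], best[2]))
```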
Structured LLM Generation
- Efficient Guided Generation for Large Language Models: Outlines (a minimal sketch of logit masking follows this list)
- Prompting Is Programming: A Query Language for Large Language Models
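Both papers in this section steer decoding by constraining which tokens are legal next, which mechanically reduces to masking the logits before sampling. The vocabulary, constraint function, and "model" below are all made-up stand-ins for a compiled grammar or query program:

```python
import numpy as np

VOCAB = ["0", "1", "2", "true", "false", "{", "}"]

def allowed_next(prefix):
    # Stand-in for a grammar/FSM: after "{" only booleans are legal.
    # Guided-generation systems compile a regex or grammar into such a
    # per-state token mask.
    return {"true", "false"} if prefix.endswith("{") else set(VOCAB)

def constrained_sample(logits, prefix):
    # Add -inf to the logits of every disallowed token, then sample.
    mask = np.full(len(VOCAB), -np.inf)
    for i, tok in enumerate(VOCAB):
        if tok in allowed_next(prefix):
            mask[i] = 0.0
    shifted = logits + mask - np.max(logits + mask)
    probs = np.exp(shifted)
    probs /= probs.sum()
    return VOCAB[np.random.choice(len(VOCAB), p=probs)]

logits = np.random.randn(len(VOCAB))             # fake model output
print(constrained_sample(logits, "answer = {"))  # "true" or "false"
```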
Async Training
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models: training individual LMs before merging
- Large Scale Distributed Deep Networks: parameter server (see the sketch after this list)
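The parameter-server pattern from the DistBelief paper above fits in a few lines: workers pull the current parameters, compute gradients on their data shard, and push updates back, possibly against stale parameters. A single-process sketch with no networking; linear regression stands in for the model:

```python
import numpy as np

class ParameterServer:
    """Holds the global parameters; workers pull and push asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, grad):
        # Async SGD: apply whatever gradient arrives, even if it was
        # computed against stale parameters.
        self.w -= self.lr * grad

def worker_grad(w, x, y):
    # Gradient of mean squared error for a linear model on one shard.
    return 2 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0])
xs = [rng.standard_normal((32, 3)) for _ in range(4)]
shards = [(x, x @ true_w) for x in xs]   # noiseless targets for clarity

ps = ParameterServer(dim=3)
for _ in range(50):
    for x, y in shards:           # pretend each iteration is a worker
        w_local = ps.pull()       # may be stale by the time we push
        ps.push(worker_grad(w_local, x, y))

print(np.round(ps.w, 2))          # converges toward [ 1. -2.  3.]
```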
Self-play
Costs
RAG
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization: GraphRAG; compresses information into graphs so retrieved context can carry more information
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: SpeculativeRAG; uses smaller models on subsets of the retrieved documents for fast draft generation, and the larger LM for verification (see the sketch after this list).
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: summarizes retrieved documents instead of using them as-is.
- Seven Failure Points When Engineering a Retrieval Augmented Generation System: a study showing how different configurations in different retrieval stages may affect generation quality.
- Active Retrieval Augmented Generation: determines when to perform retrieval and how to formulate the retrieval query
- MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
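The Speculative RAG recipe is a draft-then-verify pipeline: a small LM drafts answers from subsets of the retrieved documents, and the large LM only scores the drafts instead of reading everything. The models below are placeholders; real scoring uses the verifier's likelihood of each draft.

```python
import random

def small_model_draft(query, docs):
    # Placeholder for a cheap drafter LM conditioned on a document subset.
    return f"draft({query!r}, {sorted(docs)})"

def large_model_score(query, draft):
    # Placeholder for the expensive verifier LM.
    return random.random()

def speculative_rag(query, retrieved_docs, num_drafts=3, subset_size=2):
    drafts = []
    for _ in range(num_drafts):
        subset = random.sample(retrieved_docs, subset_size)
        drafts.append(small_model_draft(query, subset))
    # Only num_drafts verifier calls; the big LM never reads all documents.
    return max(drafts, key=lambda d: large_model_score(query, d))

docs = ["doc1", "doc2", "doc3", "doc4"]
print(speculative_rag("why is the sky blue?", docs))
```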