Things I’m Reading
ECE 8803 - Spring 2025 Reading List
- In-Datacenter Performance Analysis of a Tensor Processing Unit
- A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
- Self-Adaptive Reconfigurable Arrays (SARA)
- Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for CNNs
- MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators
- PyTorch 2: Faster ML Through Bytecode Transformation
- Attention Is All You Need
- ShiftAddViT: Efficient Vision Transformer
- FlashAttention: IO-Aware Efficient Attention
- VEGETA: Sparse/Dense GEMM Tile Acceleration on CPUs
- Gamma: Accelerating Sparse Matrix Multiplication
- HighLight: DNN Acceleration with Structured Sparsity
- Abstracting Sparse DNN Acceleration via Structured Tensor Decomposition
- Timeloop: Systematic DNN Accelerator Evaluation
- Fully Sharded Data Parallel (FSDP)
- Megatron-LM: Training Multi-Billion Parameter Models
- LIBRA: Optimizing Network Topology for Distributed Training
- NCCL Collectives
- TopoOpt: Network + Parallelization Strategy Co-Design
- TACOS: Topology-Aware Collective Algorithm Synthesizer
- Alpa: Automating Parallelism in Deep Learning
- XRBench: XR Benchmarking for the Metaverse
- Benchmarking AI (MLSysBook Chapter)
- AIrchitect: ML-based Hardware Mapping Optimization
- TPU v4: Optically Reconfigurable ML Supercomputer
- Data-Driven Offline Optimization for Hardware Accelerators
- Merged Logic and Memory Fabrics for ML
- FPGA/DNN Co-Design for Edge AI
- Digital Row-Pipelined Compute-in-Memory NN Accelerator
- CogSys: Neurosymbolic Cognition System Co-Design
CS 8803 - Datacenter Networks & Systems (Spring 2022)
- Berkeley View on Cloud
- Datacenter as a Computer
- Berkeley View on Serverless
- Fifth Epoch of Distributed Computing (YouTube)
- Spark
- Ray
- PipeDream
- μTune
- SVE
- Data Analytics
- Inside Facebook’s Datacenter
- Facebook’s Microbursts
- Nature of Microsoft’s Datacenter
- WAN Traffic
- FatTree
- FatClique
- PortLand
- High Performance Datacenter Networks
- SmartNICs
- TPUs
- nanoPU
- Pigasus
- Network for Disaggregation
- LegoOS
- AIFM
- LeapIO
- Demikernel
- Snap
- Video Streaming
- Firecracker
- eRPC
- R2P2
- Breakwater
- Cerebros
- RDMA Design Guidelines
- Rethinking RDMA
- RDMA over Ethernet at Scale
- FaSST
- DCTCP
- NDP
- HPCC
- pFabric
- Sonata
- SIMON
- OmniMon
- Pingmesh
- P4
- RMT
- SwitchML
- SilkRoad
- ATP
- Borg
- Tetris
- Themis
- DRF
- Decima
- FairCloud
- VMware Network Virtualization
- Fastpass
- PicNIC
18.330: Introduction to Numerical Analysis (Spring 2012) Lecture Notes
- Lecture 1: Series and Sequences
- Lecture 2: Integrals as Sums and Derivatives as Differences
- Lecture 3: Interpolation
- Lecture 4: Nonlinear Equations
- Lecture 5: Methods for Ordinary Differential Equations
- Lecture 6: Fourier Analysis
- Lecture 7: Spectral Interpolation, Differentiation, Quadrature
6.172: Performance Engineering of Software Systems (Fall 2018) Supplemental Readings
- GraphIt: A High-Performance Graph DSL
- OpenTuner: An Extensible Framework for Program Autotuning
- The Cilk++ Concurrency Platform
- How to Survive the Multicore Software Revolution (or at Least Survive the Hype)
- The Implementation of the Cilk-5 Multithreaded Language
- The Cilkview Scalability Analyzer
- Producing Wrong Data Without Doing Anything Obviously Wrong
- Hoard: A Scalable Memory Allocator for Multithreaded Applications
- SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines
- Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
- Cache-Oblivious Algorithms
- Cache-Oblivious Algorithms and Data Structures
- A Simple Deterministic Algorithm for Guaranteeing the Forward Progress of Transactions
Misc Papers
- Multi-Level Instruction Caching for Deep Learning Workloads
- XLA: Optimizing Compiler for Machine Learning
- Understanding Performance of Tensor Computations
- The Conflict-Driven Clause Learning SAT Solvers
- The Case for Learned SAT Solvers
- Model Checking TinyOS Applications with T2
- IEEE 754: An Interview with Prof. William Kahan
- Rethinking Floating Point for Deep Learning
- BFloat16: The Secret to High-Performance AI
- Cube-and-Conquer: A Hybrid SAT Solver for Hard Problems
- Finite Model Finding for Quantified Formulas
- Efficient Implementation of Watch Literals
- The Next 700 SAT Solvers
- High-Level Synthesis of Digital Systems
- Hash Table on FPGA
Misc Courses
- Parallel Programming Course – Aalto University
- Performance Engineering
- Digital Systems Design
- Linux Kernel Development
Misc Textbooks
- Computer Networks: An Open Source Approach
- Computer Network Systems
- Algorithmica HPC Book Index
- Computer Networks: A Systems Approach
- Systems Performance: Enterprise and the Cloud (2020)
- The TCP/IP Guide
- Handbook of Satisfiability
- Computer Networks
- TCP/IP Illustrated
- How to Scale Your Model
- Programming Massively Parallel Processors