Gordon Lichtstein
A website for my code, projects, and bio
Things I’m Reading
ECE 8803 - Spring 2025 Reading List
In-Datacenter Performance Analysis of a Tensor Processing Unit
A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
Self-Adaptive Reconfigurable Arrays (SARA)
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for CNNs
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators
PyTorch 2: Faster ML Through Bytecode Transformation
Attention Is All You Need
ShiftAddViT: Efficient Vision Transformer
FlashAttention: IO-Aware Efficient Attention
VEGETA: Sparse/Dense GEMM Tile Acceleration on CPUs
Gamma: Accelerating Sparse Matrix Multiplication
HighLight: DNN Acceleration with Structured Sparsity
Abstracting Sparse DNN Acceleration via Structured Tensor Decomposition
Timeloop: Systematic DNN Accelerator Evaluation
Fully Sharded Data Parallel (FSDP)
Megatron-LM: Training Multi-Billion Param Models
LIBRA: Optimizing Network Topology for Distributed Training
NCCL Collectives
TopoOpt: Network + Parallelization Strategy Co-Design
TACOS: Topology-Aware Collective Algorithm Synthesizer
Alpa: Automating Parallelism in Deep Learning
XRBench: XR Benchmarking for the Metaverse
Benchmarking AI (MLSysBook Chapter)
AIrchitect: ML-based Hardware Mapping Optimization
TPU v4: Optically Reconfigurable ML Supercomputer
Data-Driven Offline Optimization for Hardware Accelerators
Merged Logic and Memory Fabrics for ML
FPGA/DNN Co-Design for Edge AI
Digital Row-Pipelined Compute-in-Memory NN Accelerator
CogSys: Neurosymbolic Cognition System Co-Design
CS 8803 - Datacenter Networks & Systems (Spring 2022)
Berkeley View on Cloud
Datacenter as a Computer
Berkeley View on Serverless
Fifth Epoch of Distributed Computing (YouTube)
Spark
Ray
PipeDream
μTune
SVE
Data Analytics
Inside Facebook’s Datacenter
Facebook’s Microbursts
Nature of Microsoft’s Datacenter
WAN Traffic
FatTree
FatClique
PortLand
High Performance Datacenter Networks
SmartNICs
TPUs
nanoPU
Pigasus
Network for Disaggregation
LegoOS
AIFM
LeapIO
Demikernel
SNAP
Video Streaming
Firecracker
eRPC
R2P2
Breakwater
Cerebros
RDMA Design Guidelines
Rethinking RDMA
RDMA over Ethernet at Scale
FaSST
DCTCP
NDP
HPCC
pFabric
Sonata
SIMON
Omnimon
Pingmesh
P4
RMT
SwitchML
SilkRoad
ATP
Borg
Tetris
Themis
DRF
Decima
FairCloud
VMware Network Virtualization
FastPass
PicNIC
18.330: Introduction to Numerical Analysis (Spring 2012) Lecture Notes
Lecture 1: Series and Sequences
Lecture 2: Integrals as Sums and Derivatives as Differences
Lecture 3: Interpolation
Lecture 4: Nonlinear Equations
Lecture 5: Methods for Ordinary Differential Equations
Lecture 6: Fourier Analysis
Lecture 7: Spectral Interpolation, Differentiation, Quadrature
6.172: Performance Engineering of Software Systems (Fall 2018) Supplemental Readings
GraphIt: A High-Performance Graph DSL
OpenTuner: An Extensible Framework for Program Autotuning
The Cilk++ Concurrency Platform
How to Survive the Multicore Software Revolution (or at Least Survive the Hype)
The Implementation of the Cilk-5 Multithreaded Language
The Cilkview Scalability Analyzer
Producing Wrong Data Without Doing Anything Obviously Wrong
Hoard: A Scalable Memory Allocator for Multithreaded Applications
SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines
Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Cache-Oblivious Algorithms
Cache-Oblivious Algorithms and Data Structures
A Simple Deterministic Algorithm for Guaranteeing the Forward Progress of Transactions
Misc Papers
Multi-Level Instruction Caching for Deep Learning Workloads
XLA: Optimizing Compiler for Machine Learning
Understanding Performance of Tensor Computations
The Conflict-Driven Clause Learning SAT Solvers
The Case for Learned SAT Solvers
Model Checking TinyOS Applications with T2
IEEE 754: An Interview with Prof. William Kahan
Rethinking Floating Point for Deep Learning
BFloat16: The Secret to High-Performance AI
Cube-and-Conquer: A Hybrid SAT Solver for Hard Problems
Finite Model Finding for Quantified Formulas
Efficient Implementation of Watch Literals
The Next 700 SAT Solvers
High-Level Synthesis of Digital Systems
Hash Table on FPGA
Misc Courses
Parallel Programming Course – Aalto University
Performance Engineering
Digital Systems Design
Linux Kernel Development
Misc Textbooks
Computer Networks: An Open Source Approach
Computer Network Systems
Algorithmica HPC Book Index
Computer Networks: A Systems Approach
Systems Performance: Enterprise and the Cloud (2020)
The TCP/IP Guide
Handbook of Satisfiability
Computer Networks
TCP/IP Illustrated
How to Scale Your Model
Programming Massively Parallel Processors