Abstract:
Remote Direct Memory Access (RDMA) is now a default building block for datacenter services, from request-response and storage workloads to dependency-heavy AI collectives. Proper scheduling can reduce communication time, yet datacenters typically leave RDMA traffic to simple fair sharing. In this talk, we introduce STORM, a NIC-level scheduler that uses only NIC-visible information: the known size of each RDMA request and the backlog of each queue pair. STORM maps these signals to a small number of wire priorities, prioritizing requests that are near completion or that block queued dependent work. It requires no application hints, supports both in-order RoCEv2 and reordering-tolerant RDMA stacks, and improves performance for cloud and LLM training workloads.
Bio:
Jichun Wu is a PhD student at the University of Cambridge. His research focuses on RDMA, congestion control, and load balancing for low-latency datacenter networking.
