Department of Computer Science and Technology

Thursday, 23 July, 2020 - 15:00 to 16:00
Manya Ghobadi

The emergence of optical I/O chiplets enables compute and memory chips to communicate at multi-Tbps bandwidth. Many technology trends point to the arrival of optical I/O chiplets as a key industry inflection point for realizing fully disaggregated systems. In this talk, I will focus on the potential of optical I/O-enabled accelerators for building high-bandwidth interconnects tailored for distributed machine learning training. Our goal is to scale state-of-the-art ML training platforms, such as NVIDIA's DGX, from a few tightly connected GPUs in one package to hundreds of GPUs while maintaining Tbps communication bandwidth across the chips. Our design accelerates the training time of popular ML models using a device placement algorithm that partitions the training job across nodes with data, model, and pipeline parallelism, while ensuring a sparse, local communication pattern that the interconnect can support efficiently.

Bio: Manya Ghobadi is an assistant professor in the EECS department at MIT. Before MIT, she was a researcher at Microsoft Research and a software engineer at Google Platforms. Manya is a computer systems researcher with a networking focus and has worked on a broad set of topics, including data center networking, optical networks, transport protocols, and network measurement. Her work has won the best dataset award and best paper award at the ACM Internet Measurement Conference (IMC), as well as a Google research excellent paper award.

Seminar series: Systems Research Group Seminar