Transformers have recently sparked significant interest in AI, driving advances in accuracy and enabling a wide range of applications, from multi-modal intelligent assistants to autonomous systems. While their scaling laws promise even greater capabilities, the associated hardware and data demands present substantial challenges. In response, there is growing interest in compressing these models into smaller, more efficient forms that are feasible to deploy with lower resource requirements. As edge and mobile devices integrate increasingly powerful Systems-on-Chip (SoCs), deploying these models locally becomes viable, enabling new use cases while enhancing privacy, sustainability, and task-specific customization.
In this talk, I will touch upon two areas: first, measuring the execution efficiency and deployability of Large Language Models (LLMs) on mobile and edge devices; and second, optimising DNN workloads for efficiency through low-rank decompositions. I will introduce MELT (MobiCom'24), a benchmarking framework designed to assess the computational, memory, energy, and thermal characteristics of LLMs running on-device and to identify the associated bottlenecks. Following this, I will present Maestro (ICML'24), a novel approach that leverages trainable low-rank decompositions for more efficient training and deployment of DNNs, achieved via data-informed progressive shrinking of networks.
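The abstract does not detail MELT's interface, but the kind of on-device measurement it automates can be sketched generically. The sketch below is a minimal, hypothetical illustration (the `run_generation` stand-in and the `benchmark` helper are assumptions, not MELT's API): it times repeated generation calls and records Python-level peak memory, whereas a real harness would additionally sample process memory, power rails, and SoC temperature.

```python
import time
import tracemalloc

def run_generation(prompt: str) -> str:
    """Stand-in for an on-device LLM call; in practice this would
    invoke a local inference backend (e.g. a llama.cpp binding)."""
    return prompt[::-1]  # placeholder work

def benchmark(prompt: str, runs: int = 10) -> dict:
    # Track Python-level peak memory; a real on-device harness would
    # also sample process RSS, energy counters, and thermal sensors.
    tracemalloc.start()
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_generation(prompt)
        latencies.append(time.perf_counter() - t0)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_latency_s": sum(latencies) / runs,
        "peak_mem_bytes": peak,
    }

print(benchmark("Hello, on-device LLM!"))
```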
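Similarly, the core idea behind a low-rank decomposition of a layer can be illustrated independently of Maestro's actual algorithm. In the minimal PyTorch sketch below (the `LowRankLinear` class, the dimensions, and the rank are illustrative assumptions), a dense weight matrix is replaced by two trainable factors with a much smaller parameter count; Maestro's data-informed progressive shrinking would, in addition, adapt the retained rank during training.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A linear layer factorised as W ~ U @ V, so a d_in x d_out weight
    (d_in * d_out parameters) costs only rank * (d_in + d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)  # project to rank-r subspace
        self.U = nn.Linear(rank, d_out)             # expand back to d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.U(self.V(x))

# Illustrative usage: swap a dense layer for its low-rank counterpart.
dense = nn.Linear(4096, 4096)             # ~16.8M weights
low_rank = LowRankLinear(4096, 4096, 64)  # ~0.5M weights at rank 64
x = torch.randn(8, 4096)
print(low_rank(x).shape)  # torch.Size([8, 4096])
```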
"You can also join us on Zoom":https://cam-ac-uk.zoom.us/j/83400335522?pwd=LkjYvMOvVpMbabOV1MVTm8QU6DrGN7.1