Abstract:
Current large language models (LLMs) rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for languages and domains underrepresented in the original vocabulary. In this talk, I will show how we can efficiently distill high-quality input embeddings for novel tokens from existing pretrained LLMs, allowing for flexible adaptation of an LLM’s vocabulary to new languages and domains. Crucially, even though these distilled embeddings were not part of the LLM’s token embedding matrix during training, they can be used directly without modifying the main Transformer parameters. This suggests that LLMs already implicitly learn to operate over single-vector representations of semantic units, even when those units were not seen as dedicated tokens during training. Building on this perspective, I will close by discussing how we might rethink the use of token embeddings in future LLMs.
Bio:
Konstantin Dobler is a PhD student at the Hasso Plattner Institute and the ELLIS Unit Potsdam, and is currently a Visiting Scholar at the University of Copenhagen. His research investigates the representation spaces and input/output units of language models — usually called “tokens” — aiming to improve multilingual modeling, tokenizer adaptation, and computational efficiency.
