Department of Computer Science and Technology

Date: 
Friday, 6 June, 2025 - 15:00 to 16:00
Speaker: 
Catherine Arnett and Tyler Chang (EleutherAI and UC San Diego)
Venue: 
ONLINE ONLY. Here is the Zoom link: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09

NOTE THE UNUSUAL TIME FOR THIS SEMINAR

Language models work well for only a small number of languages. For most other languages, the best available model is likely multilingual, with the vast majority of its training data still coming from English and a few "priority" languages. We show that in many cases, multilinguality leads to worse performance across many languages due to limited model capacity. We then train a suite of over 1,000 monolingual models for 350 languages, finding that these models can outperform multilingual models over ten times their size. However, multilinguality can also be a blessing: we train a small number of controlled bilingual models in order to study how crosslingual transfer happens. We aim to better understand transfer learning so that we can leverage multilinguality to improve language model performance for all languages.

Seminar series: 
NLIP Seminar Series