
Department of Computer Science and Technology

Date: Friday, 30 May 2025, 12:00 to 13:00
Speaker: Pietro Lesci (University of Cambridge)
Venue: Room FW26 (hybrid format). Zoom link for those who wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09

While model design gets much of the spotlight, subtle data choices, such as which documents are seen and how they’re represented, can profoundly shape the behaviour of language models. Nowadays, training data is the secret sauce behind a language model’s success, yet it remains relatively understudied. In this talk, I will discuss how training data influences a model’s behaviour via two key phenomena: **memorisation** and **tokenisation bias**.
First, I’ll present our work on **memorisation**, asking: *To what extent does a model remember specific documents it was trained on?* Directly answering this question is computationally expensive. Instead, we frame memorisation as a causal question and introduce an efficient method to estimate it without re-training. This reveals how memorisation depends on factors such as data order and model size.
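One rough way to write the counterfactual behind this question (illustrative notation, not necessarily the exact estimand from the talk) is as the causal effect of including a document in the training set on the model’s log-likelihood of that document:

```latex
% Illustrative counterfactual contrast (notation assumed, not taken from the talk):
% mem(x) compares the model's log-likelihood of document x when x is included in
% the training set D against when it is held out; the expectation is over training
% randomness such as initialisation and data order.
\[
  \mathrm{mem}(x) \;=\;
  \mathbb{E}\!\left[\log p_{\theta(D)}(x)\right]
  \;-\;
  \mathbb{E}\!\left[\log p_{\theta(D \setminus \{x\})}(x)\right]
\]
```

Estimating this contrast directly would mean re-training with and without each document, which is exactly the cost the efficient estimator described in the talk is designed to avoid.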
Next, I’ll discuss how **subword tokenisation**, often seen as a preprocessing detail, systematically biases model predictions. We ask: *How would a model’s output change if a piece of text were tokenised as one subword instead of two?* Using tools from econometrics, we answer this counterfactual question without re-training the model with a different vocabulary. We show that when a piece of text is tokenised into fewer subwords, it consistently receives a higher probability.
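As a concrete illustration of the comparison being asked about (a toy sketch with an off-the-shelf HuggingFace causal LM, not the econometric estimator from the talk; the model choice and example string are assumptions), one can score the same characters under a shorter and a longer tokenisation and compare the probabilities the model assigns:

```python
# Toy sketch (not the estimator from the talk): score the same string under two
# tokenisations of an off-the-shelf causal LM and compare total log-probabilities.
# The model ("gpt2") and the example word are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def total_log_prob(ids):
    """Sum of log P(token_t | tokens_<t) over a list of token ids."""
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # The token at position t is predicted by the logits at position t - 1.
    return sum(log_probs[0, t - 1, ids[t]].item() for t in range(1, len(ids)))

context = tok.encode("The weather today is")
fewer_pieces = tok.encode(" sunny")                   # often a single subword
more_pieces = tok.encode(" sun") + tok.encode("ny")   # same characters, split further

print("fewer subwords:", total_log_prob(context + fewer_pieces))
print("more subwords :", total_log_prob(context + more_pieces))
```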
Together, these results show that training data profoundly shapes a model’s behaviour. Causal methods let us efficiently estimate and understand these phenomena, offering insight into how to better train language models.

Bio: Pietro Lesci is a final-year PhD student in Computer Science at the University of Cambridge, working with Prof Andreas Vlachos. His research explores how training data shapes a model’s behaviour, focusing on memorisation, tokenisation, and generalisation. To study these questions, he draws on causal methods from econometrics. His work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. He has received the Best Paper Award at ACL 2024, the Paper of the Year Award from Cambridge’s Department of Computer Science and Technology, and funding from Translated’s Imminent Research Grant. Pietro’s experience spans academia and industry, including more than three years in research labs, consulting firms, and international institutions. He holds an MSc in Economic and Social Sciences from Bocconi University.

Seminar series: NLIP Seminar Series
