
Department of Computer Science and Technology

Date: Friday, 30 May 2025, 12:00 to 13:00
Speaker: Pietro Lesci (University of Cambridge)
Venue: Room FW26 (hybrid format). Zoom link for those who wish to join online: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09

While model design gets much of the spotlight, subtle data choices, such as which documents are seen and how they’re represented, can profoundly shape the behaviour of language models. Nowadays, training data is the secret sauce behind a language model’s success, yet it remains relatively understudied. In this talk, I will discuss how training data influences a model’s behaviour via two key phenomena: **memorisation** and **tokenisation bias**.
First, I’ll present our work on **memorisation**, asking: *To what extent does a model remember specific documents it was trained on?* Directly answering this question is computationally expensive. Instead, we frame memorisation as a causal question and introduce an efficient method to estimate it without re-training. This reveals how memorisation depends on factors such as data order and model size.
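One rough way to write the counterfactual behind this question (illustrative notation, not necessarily the exact estimand from the talk) is as the causal effect of including a document in the training set on the model’s log-likelihood of that document:

```latex
% Illustrative counterfactual contrast (notation assumed, not taken from the talk):
% mem(x) compares the model's log-likelihood of document x when x is included in
% the training set D against when it is held out; the expectation is over training
% randomness such as initialisation and data order.
\[
  \mathrm{mem}(x) \;=\;
  \mathbb{E}\!\left[\log p_{\theta(D)}(x)\right]
  \;-\;
  \mathbb{E}\!\left[\log p_{\theta(D \setminus \{x\})}(x)\right]
\]
```

Estimating this contrast directly would mean re-training with and without each document, which is exactly the cost the efficient estimator described in the talk is designed to avoid.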
Next, I’ll discuss how **subword tokenisation**, often seen as a preprocessing detail, systematically biases model predictions. We ask: *How would a model’s output change if a piece of text were tokenised as one subword instead of two?* Using tools from econometrics, we answer this counterfactual question without re-training the model with a different vocabulary. We show that when a piece of text is tokenised into fewer subwords, it consistently receives a higher probability.
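As a concrete illustration of the comparison being asked about (a toy sketch with an off-the-shelf HuggingFace causal LM, not the econometric estimator from the talk; the model choice and example string are assumptions), one can score the same characters under a shorter and a longer tokenisation and compare the probabilities the model assigns:

```python
# Toy sketch (not the estimator from the talk): score the same string under two
# tokenisations of an off-the-shelf causal LM and compare total log-probabilities.
# The model ("gpt2") and the example word are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def total_log_prob(ids):
    """Sum of log P(token_t | tokens_<t) over a list of token ids."""
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # The token at position t is predicted by the logits at position t - 1.
    return sum(log_probs[0, t - 1, ids[t]].item() for t in range(1, len(ids)))

context = tok.encode("The weather today is")
fewer_pieces = tok.encode(" sunny")                   # often a single subword
more_pieces = tok.encode(" sun") + tok.encode("ny")   # same characters, split further

print("fewer subwords:", total_log_prob(context + fewer_pieces))
print("more subwords :", total_log_prob(context + more_pieces))
```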
Together, these results show that training data profoundly shapes a model’s behaviour. Causal methods let us efficiently estimate and understand these phenomena, offering insight into how to better train language models.

Bio: Pietro Lesci is a final-year PhD student in Computer Science at the University of Cambridge, working with Prof Andreas Vlachos. His research explores how training data shapes a model’s behaviour, focusing on memorisation, tokenisation, and generalisation. To study these questions, he draws on causal methods from econometrics. His work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. He has received the Best Paper Award at ACL 2024, the Paper of the Year Award from Cambridge’s Department of Computer Science and Technology, and funding from Translated’s Imminent Research Grant. Pietro’s experience spans academia and industry, including more than three years in research labs, consulting firms, and international institutions. He holds an MSc in Economic and Social Sciences from Bocconi University.

Seminar series: NLIP Seminar Series
