Despite massive investments in training large language models, tokenizers remain a critical but often neglected component, with weaknesses that can trigger severe hallucinations, undermine safety guardrails, and break downstream applications. This talk will cover:
- Our recent research on automatically detecting problematic 'glitch' tokens in any model (see the first sketch after this list)
- Fundamental issues with pretokenizers and their design (see the second sketch after this list)
- Novel approaches to encodings and pretokenization that address some of these problems
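As a rough illustration of the first topic, here is a minimal sketch of one commonly used signal: tokens that rarely or never appeared in training tend to have anomalous embedding norms. This is a simplified heuristic, not the full detection method from the research above; the model choice and the two-sigma cutoff are illustrative assumptions.

```python
# Minimal sketch: flag candidate glitch tokens via unusually small
# input-embedding norms, one indicator of under-trained tokens.
# Assumes a Hugging Face causal LM; "gpt2" is an illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# L2 norm of each row of the input embedding matrix.
embeddings = model.get_input_embeddings().weight.detach()
norms = embeddings.norm(dim=1)

# Tokens whose norm sits far below the vocabulary mean were likely
# seen rarely (or never) during training; the 2-sigma cutoff is arbitrary.
threshold = norms.mean() - 2 * norms.std()
candidate_ids = (norms < threshold).nonzero().flatten().tolist()

for token_id in candidate_ids[:20]:
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(token_id, repr(token), f"norm={norms[token_id].item():.3f}")
```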
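The second and third topics concern the pretokenizer: the regex-based splitter that carves raw text into chunks before BPE merges are applied. The sketch below uses the published GPT-2 pretokenization pattern to make that step concrete; the `pretokenize` helper is our own wrapper.

```python
# Minimal sketch of regex-based pretokenization. The pattern is the
# published GPT-2 one; it decides how contractions, letter runs, digit
# runs, punctuation, and whitespace may be grouped before BPE runs.
import regex  # third-party 'regex' module, needed for \p{...} classes

GPT2_PATTERN = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

def pretokenize(text: str) -> list[str]:
    return regex.findall(GPT2_PATTERN, text)

print(pretokenize("Tokenizers aren't solved in 2024!"))
# ['Tokenizers', ' aren', "'t", ' solved', ' in', ' 2024', '!']
# Note the unbounded digit run ' 2024': some later tokenizers instead
# cap number chunks at three digits, one example of a pretokenizer
# design decision with downstream consequences.
```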
**Speaker Bio**
Sander Land is a researcher at Writer and previously worked at Cohere. He completed his PhD in the Department of Computer Science at the University of Oxford, followed by a postdoc in Biomedical Engineering at King's College London, University of London.