
Department of Computer Science and Technology

Date: Friday, 29 November 2024, 12:00 to 13:00
Speaker: Benjamin Minixhofer (Language Technology Lab, University of Cambridge)
Venue: Zoom link: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09

Abstract:

Current large language models (LLMs) predominantly use subword tokenization. They see text as chunks (called "tokens") made up of individual words or parts of words. This has a number of consequences. For example, LLMs often struggle with seemingly simple tasks involving character-level knowledge, like counting the number of letters in a word or comparing two numbers. Subword tokenization can also lead to discrepancies across languages: processing English text with an LLM is often cheaper than processing text in other languages. We will talk about how these issues came to be, as well as how to potentially improve tokenization by moving away from subwords (e.g., to models directly ingesting bytes) and/or towards more adaptive, modular tokenization. Finally, we will conclude by discussing the far reach of tokenization into seemingly unrelated fields (model merging and multimodality).
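The two effects mentioned in the abstract are easy to observe for yourself. The sketch below (not part of the talk) uses the Hugging Face transformers library with the GPT-2 vocabulary as an assumed example tokenizer; any subword tokenizer would show similar behaviour.

```python
# Minimal sketch: inspecting subword tokenization with an off-the-shelf
# tokenizer (GPT-2 chosen only as a common example).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# A single word may be split into several subword pieces. The model only
# sees opaque token IDs, never individual characters, which is one reason
# character-level tasks (e.g. counting letters) are hard for LLMs.
print(tok.tokenize("antidisestablishmentarianism"))

# Token counts for a sentence and its translation can differ noticeably,
# so non-English text is often more expensive to process with an
# English-centric vocabulary.
english = "The weather is nice today."
german = "Das Wetter ist heute schön."
print(len(tok(english).input_ids), len(tok(german).input_ids))
```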

Speaker Biography: Benjamin Minixhofer is a PhD student in the Language Technology Lab, interested in multilinguality, tokenization and language emergence.

Seminar series: NLIP Seminar Series
