- PhD Candidate
Suchir Salhan is a PhD candidate at the University of Cambridge (Gonville & Caius College), working on Language Models and Cognitively-Inspired AI. With an interdisciplinary background in Computer Science, Cognitive Science, and Linguistics, his research focuses on Small Language Models, leveraging insights from human cognition to develop multi-agent AI systems that are interpretable, fair, and equitable.
Biography
I’ve had a long fascination with the intersection of language and computation: how humans acquire natural language to communicate, learn, and reason despite the diversity of linguistic systems, and how we might build machines that can do the same. I arrived in Cambridge in 2020 to pursue a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, earning a “starred First” and a Distinction. During my undergraduate studies, I worked on multimodality in the Language Technology Lab (with Prof Nigel Collier and Dr Fangyu Liu, Google DeepMind), on code-switching with Dr Li Nguyen, and with Prof Paula Buttery and Prof Andrew Caines on several research projects that formed the foundation of my PhD work on Small Language Models and Cognitively-Inspired AI. Outside of academia, I lead Per Capita Media, Cambridge University's newest independent publication, supported by a team of students and academics from Cambridge and other institutions, including the University of Oxford and the University of the Arts London. I am also involved in student policy think tanks as Head of Policy at The Wilberforce Society, the UK's oldest student think tank, based at the University of Cambridge, and I organise several speaker events throughout the University. See my extended CV or my academic website for further information.
Research
Small Language Models: Our group released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework, in March 2025. Check out the YouTube video put together by Zeb Goriely: Introducing PicoLM | YouTube. We use Pico to investigate questions related to multilingual pretraining, learning dynamics, and language model interpretability. I manage and supervise students working on the Pico framework.
Cognitively-Inspired AI: My Master's thesis focused on the BabyLM Shared Task, training Small Language Models using acquisition-inspired strategies on “cognitively plausible” corpora for several languages. My PhD work now connects the BabyLM paradigm with the fast-moving language modelling ecosystem. While Large Language Models (LLMs) are increasingly used in high-stakes applications, such as assessing human performance, they often lack steerability, alignment, and interpretability. I address this by developing Cognitively-Inspired Small Language Models (SLMs), which can guide and calibrate LLM behaviour in multi-agent environments, aligning AI outputs with user preferences and domain-specific tasks.
See my Cambridge Language Sciences page for more information on my research interests in Cognitive Science and Linguistics.
Teaching
Guest Lecturer and Teaching Assistant
- L95 (ACS/Part III) Introduction to Natural Language Syntax and Parsing. Led significant refactoring and updating of the L95 course material.
- Guest Lecturer for Li18 Computational Linguistics, 2025-26 (Part II Linguistics Tripos) (convened by Dr Guy Emerson).
- Machine Learning & Real World Data (Part IA, Computer Science Tripos). Teaching Assistant (2024-25), lectured by Dr Fermin Moscoso del Prado Martin and Dr Luca Benedetto.
Research Supervision
- MPhil ACS Theses: Bianca Ganescu (2024-25), Yeji Heo (2025-26)
- Undergraduate Research Opportunity Programme (UROP) Supervisor: Shivan Arora and Ellie Polyakova Reed (Summer 2025).
- PicoLM Research Mentor for Google DeepMind Research Ready Programme: Ali Kheirkhah (Summer 2025)
- Visiting Research Student: Andrzej Szablewski (with Dr Clara Meister and Dr Tiago Pimentel)
- ALTA Institute Research Assistant: Working with Laura Barbanel, Lily Goulder & Aoife O'Driscoll, line-managed by Prof Paula Buttery and Dr Andrew Caines.
Supervisions
Machine Learning and Bayesian Inference (Part II, Computer Science Tripos)
Data Science (Part IB, Computer Science Tripos)
Formal Models of Language (Part IB, Computer Science Tripos)
Artificial Intelligence (Part IB, Computer Science Tripos)
Probability (Part IA, Computer Science Tripos)
Li18 Computational Linguistics (Part IIA/IIB Linguistics Tripos)
College Supervisor for Linguistics Tripos (Gonville & Caius College) – Linguistic Theory (Part IIB, Linguistics Tripos), Part I Linguistics Tripos.
College Examiner for Computer Science Tripos Mock Examinations (Gonville & Caius College)
Professional Activities
Departmental Activity
Organiser and Host of the Natural Language & Information Processing Seminars (2024–present), Natural Language & Information Processing Group (CST). Organised 30+ departmental seminars with leading academics and industry researchers on Language Models, Computational Linguistics, and Natural Language Processing. List of Organised Seminars.
University-Wide & Interdisciplinary Initiatives
Language Sciences Annual Symposium 2025: Ambitions for language science in 2050. Poster Session Organiser, with Sammy Weiss (MRC Cognition and Brain Sciences Unit) and Shanshan Hu (TAL). CLS 2025 Website.
23rd Old-World Conference in Phonology (OCP23). Member of Organising Committee. Gonville & Caius College (January 2026). OCP23 Website (Phonetic Laboratory, Department of Theoretical & Applied Linguistics).
Reviewing & Service
Reviewer for BabyLM 2024. ACL 2025 Emergency Reviewer. Reviewer for The First Workshop on Large Language Model Memorization – L2M2 Proceedings @ ACL 2025. Reviewer for the NeurIPS CogInterp Workshop. Reviewer for the NeurIPS What Can't Transformers Do (WCTD) Workshop.
Publications
Key Themes: (i) Cognitively-Inspired Design, Interpretability and Evaluation (♣), (ii) Language Model Pretraining (✰), (iii) Multilinguality (✦), (iv) Tokenization (✿), (v) Alignment and Interaction (✒︎) and (vi) Cognitive Science and Linguistics (♦️).
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. 2024. BabyLM Shared Task (Paper Track), Conference on Computational Natural Language Learning (CoNLL). Poster Presentation at EMNLP (Miami, FL, USA, November 2024). ♣ ✰
The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2025. Fermin Moscoso del Prado Martin, Suchir Salhan. 13th Conference on the Mental Lexicon. Invited Keynote delivered by Fermin Moscoso del Prado Martin at McGill University, Montreal (June 2025). Slides. ✦♦️
ByteSpan: Information-Driven Subword Tokenisation. Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery. ICML 2025 Tokenization Workshop (TokShop). Delivered ByteSpan Poster Presentation in Vancouver, Canada (August 2025). ✿
Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance. Fermin Moscoso del Prado Martin, Suchir Salhan. ACL Main Conference (Poster). Presented in Vienna, Austria (August 2025). Poster | Slides. ♦️
Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research. Richard Diehl-Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery. EMNLP Systems Demonstration 2025. Presentation in Suzhou, China. Pico Website | Demo Video (YouTube) | HuggingFace. ✰
Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction. Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ✒︎ ♣
What's the Best Sequence Length for BabyLM?. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰
BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models. Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✦
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling. Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery (Supervised MPhil Advanced Computer Science Thesis). BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰
Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages. David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦ ✰
Extended Abstract for "Linguistic Universals": Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders. Ej Zhou, Suchir Salhan. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦♣
Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies. 2025. Suchir Salhan, Andrew Caines, Paula Buttery. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣
Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability. 2025. Suchir Salhan, Konstantinos Voudouris. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data. Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, François Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, Leshem Choshen. ✦ Preprint | BabyBabelLM (Multilingual BabyLM) Website
The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2026. Fermin Moscoso del Prado Martin, Suchir Salhan. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). ♦️
Convergent Equilibria in Cross-Lingual Phoneme Surprisal Distributions: Statistical and Simulation-Based Analysis. 2026. Suchir Salhan, Fermin Moscoso del Prado Martin. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). Abstract♦️
Other publications:
On the Potential for Maximising Minimal Means in Transformer Language Models: A Dynamical Systems Perspective. Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2023. Paper | Slides (Undergraduate Dissertation, Presentation at SyntaxLab, February 2023, St John's College, Cambridge, organised by Dr Theresa Biberauer)
Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory? Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2025. Paper.
Invited Talks, Presentations and Posters:
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan. Presentations at the Cambridge Language Sciences Symposium (November 2024); poster at the HumanCLAIM Workshop organised by Prof Lisa Beinborn in Göttingen, Germany (March 2025); accepted poster and demonstration at the Cambridge CHIA (Centre for Human-Inspired AI) Annual Conference (June 2025).
Human-Validated Grammar Profiles for Language Models. Tübingen, Germany (March 2025), in a workshop organised by Prof Detmar Meurers.
LLMs “off-the-shelf” or Pretrain-from-Scratch? Recalibrating Biases and Improving Transparency using Small-Scale Language Models. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2024.
Bilingual Small Language Models as Cognitive Proxies for LLM Interaction and Calibration. Suchir Salhan. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2025.
Engineering Small Language Models as Learner Models for LLM Interaction and Calibration. Suchir Salhan. ALTA Annual Review 2025.

