Suchir Salhan

PhD Candidate

I am a PhD candidate at the University of Cambridge (Gonville & Caius College), working on Language Models and Cognitively-Inspired AI. My interdisciplinary background in Computer Science, Cognitive Science, and Linguistics drives my interest in leveraging insights from human cognition to develop AI systems that are interpretable, fair, and equitable.

My research spans Machine Learning and Cognitive Science, focusing on multilinguality, interpretability, and multi-agent alignment. I approach these questions through Cognitively-Inspired AI, an emerging paradigm in NLP aimed at enhancing the cognitive capabilities of language models, evaluated in cognitively plausible environments.

While Large Language Models (LLMs) are increasingly used in high-stakes applications, such as assessing human performance, they often lack steerability, alignment, and interpretability. My work addresses this by developing Cognitively-Inspired Small Language Models (SLMs), which can calibrate LLM behaviour in multi-agent settings and better align AI outputs with user preferences and domain-specific tasks. By explicitly modelling underrepresented populations of speakers and learners, these SLMs contribute to more equitable and robust AI systems.

Biography

I’ve had a long fascination with the intersection of language and computation: how humans have developed the capability to acquire natural language to communicate, learn, and reason, despite the diversity of linguistic systems, and how we might build machines that can do the same. I arrived in Cambridge in 2020 to pursue a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, where I earned a “starred First” and a Distinction. During my time as an undergraduate, I explored code-switching with Dr Li Nguyen, worked on multimodal vision-language models with Prof Nigel Collier and Fangyu Liu (now at Google DeepMind), and participated in a funded internship at the ALTA Institute. I probed models like CLIP to understand their semantic representations, experimented with Nearest Neighbour Algorithms for Offline Imitation Learning, and investigated Explainable AI, Argumentation Mining, and Shortcut Learning in NLP. At the same time, my linguistic interests, mainly in typology and theoretical linguistics (syntactic theory, morphology, and phonology), taught me the deep diversity and structure of human language, and inspired me to think about how AI might better reflect this complexity.

These experiences have shaped my current PhD work, where I aim to build AI systems that are both powerful and cognitively inspired, bridging insights from human language and computation. My Master's thesis focused on the BabyLM Shared Task to train Small Language Models using acquisition-inspired strategies on “cognitively-plausible” corpora (e.g., child-directed speech) for several languages. My PhD work now connects the BabyLM paradigm with the fast-moving Language Modelling ecosystem. While Large Language Models (LLMs) are increasingly used in high-stakes applications, such as assessing human performance, they often lack steerability, alignment, and interpretability. I work to address this by developing Cognitively-Inspired Small Language Models (SLMs). These SLMs can guide and calibrate LLM behaviour in multi-agent environments, aligning AI outputs with user preferences and domain-specific tasks. By explicitly modelling underrepresented populations of speakers and learners, these models help make AI systems more equitable, robust, and human-aligned.

Outside of academia, I lead Per Capita Media, Cambridge University's newest independent publication, supported by a team of students and academics from Cambridge and other academic institutions nationwide, including the University of Oxford and the University of the Arts London. I founded the publication in 2024, with the generous support of Lady Stothard, Dr Ruth Scurr FRSL. My journalistic work has seen me work with The One Show and liaise with journalists from The Sunday Times and BBC Radio 5Live. I am also involved in student policy think tanks as the Head of Policy at The Wilberforce Society, the UK's oldest student think tank, based at the University of Cambridge, and organise several speaker events throughout the University. In the past, I have helped organise policy events with the Editor of the BBC Russian Service and the Foreign Minister of Sri Lanka.

Research

Small Language Models: The viability of 'Small LMs' as a coherent research programme relies on a successful consideration of efficiency, acceleration and architectural questions in pretraining.

Our group released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework, in March 2025 to investigate these research questions. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.
I have worked on dynamic tokenization and supported similar projects in the NLIP group and the L65 (Geometric Deep Learning) course on the MPhil ACS.

Cognitively-Inspired AI: The emergent capabilities of Transformers are the subject of a great deal of interpretability work. However, there is a clear mismatch between human language acquisition (which is data-efficient in many regards) and the data-hungriness of Transformers. I am personally very invested in research questions that draw on insights from language acquisition in the context of the BabyLM Shared Task, leading and contributing to teams on the Multimodal, Multilingual and Interaction Tracks of the Shared Task.

See my Cambridge Language Sciences page for more information on my research interests in Cognitive Science and Linguistics.

Themes

Teaching

Guest Lecturer and Teaching Assistant

L95 (ACS/Part III) Introduction to Natural Language Syntax and Parsing.
- Delivered a lecture on Language Model Evaluation and Mechanistic Interpretability (Nov 2024).
- Michaelmas 2025. Lecture I on Language Model Evaluation and Mechanistic Interpretability (corresponding presentation session supported by Dr David Strohmaier). Lecture II and Presentation Session on Tokenization. Led significant refactoring and updating of L95 course material and the introduction of weekly paper presentations with Dr Moscoso del Prado Martin and Prof Buttery. Annotated Bibliography – Linguistic Structure & Language Models (L95 2025-26).
Guest Lecturer for Li18 Computational Linguistics, 2025-26 (Part II Linguistics Tripos). Delivering two Guest Lectures for Li18 (convened by Dr Guy Emerson).
Machine Learning & Real World Data (Part IA, Computer Science Tripos). Teaching Assistant (2024-25), lectured by Dr Fermin Moscoso del Prado Martin and Dr Luca Benedetto.

Research Supervision

MPhil ACS Project Supervisor for Bianca Ganescu with Dr Andrew Caines and Prof Paula Buttery.

Advising & Mentoring ALTA Institute Research Assistants (RAs) for Academic Year 2025-26 – Laura Barbanel, Lily Goulder & Aoife O'Driscoll. The three RAs are also line-managed by Prof Paula Buttery and Dr Andrew Caines.

Undergraduate Research Opportunity Programme (UROP) Supervisor. Shivan Arora and Ellie Polyakova Reed (Summer 2025).

PicoLM Research Mentor for Google DeepMind Research Ready Programme, Summer 2025. Ali Kheirkhah.

Co-advised two MPhil module projects for Geometric Deep Learning (L65) on (1) Dynamic Tokenisation with Dr Dobrik Georgiev, Dr Petar Veličković & Prof Pietro Liò and (2) Attention Graph Interpretability with Chaitanya Joshi, Dr Petar Veličković & Prof Pietro Liò.

Co-Advising and Mentoring several independent Cambridge Research Projects (Jacy To, Andrzej Szablewski).

Supervisions

Machine Learning and Bayesian Inference (Part II, Computer Science Tripos)

Formal Models of Language (Part IB, Computer Science Tripos)

Artificial Intelligence (Part IB, Computer Science Tripos)

Probability (Part IA, Computer Science Tripos)

Li18 Computational Linguistics (Part IIA/IIB Linguistics Tripos)

College Supervisor for Linguistics Tripos (Gonville & Caius College) – Linguistic Theory (Part IIB, Linguistics Tripos), Part I Linguistics Tripos.

College Examiner for Computer Science Tripos Mock Examinations (Gonville & Caius College)

Professional Activities

Departmental Activity

Organiser and Host of the Natural Language & Information Processing Seminars, 2024–present. Natural Language & Information Processing Group (CST). Organising 30+ departmental seminars with leading academics and industry researchers on Language Models, Computational Linguistics and Natural Language Processing. List of Organised Seminars.

University-Wide & Interdisciplinary Initiatives

Language Sciences Annual Symposium 2025: Ambitions for language science in 2050. Poster Session Organiser for 2025 Cambridge Language Sciences Symposium with Sammy Weiss (MRC Cognition and Brain Sciences Unit) and Shanshan Hu (TAL). CLS 2025 Website.

23rd Old-World Conference in Phonology (OCP23). Member of Organising Committee. Gonville & Caius College (January 2026). OCP23 Website (Phonetic Laboratory, Department of Theoretical & Applied Linguistics).

Reviewing & Service

Reviewer for BabyLM 2024. ACL 2025 Emergency Reviewer. Reviewer for The First Workshop on Large Language Model Memorization – L2M2 Proceedings @ ACL 2025. Reviewer for NeurIPS CogInterp Workshop. Reviewer for NeurIPS What Can't Transformers Do (WCTD) Workshop.

Publications

Key Themes: (i) Cognitively-Inspired Design, Interpretability and Evaluation (♣), (ii) Language Model Pretraining (✰), (iii) Multilinguality (✦), (iv) Tokenization (✿), (v) Alignment and Interaction (✒︎) and (vi) Cognitive Science and Linguistics (♦️).

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. 2024. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Shared Task (Paper Track), Conference on Computational Natural Language Learning (CoNLL). Poster Presentation at EMNLP (Miami, FL, USA, November 2024). ♣ ✰

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2025. Fermin Moscoso del Prado Martin, Suchir Salhan. 13th Conference on the Mental Lexicon. Invited Keynote delivered by Fermin Moscoso del Prado Martin at McGill University, Montreal (June 2025). Slides. ✦♦️

ByteSpan: Information-Driven Subword Tokenisation. Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery. ICML 2025 Tokenization Workshop (TokShop). Delivered ByteSpan Poster Presentation in Vancouver, Canada (August 2025). ✿

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance. Fermin Moscoso del Prado Martin, Suchir Salhan. ACL Main Conference (Poster) – I presented this in Vienna, Austria (August 2025). Poster | Slides. ♦️

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research. Richard Diehl-Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery. EMNLP Systems Demonstration 2025. Presentation in Suzhou, China. Pico Website | Demo Video (YouTube) | HuggingFace. ✰

Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction. Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ✒︎ ♣

What's the Best Sequence Length for BabyLM? Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models. Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✦

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling. Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery (Supervised MPhil Advanced Computer Science Thesis). BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages. David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦ ✰

Extended Abstract for "Linguistic Universals": Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders. Ej Zhou, Suchir Salhan. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦♣

Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies. 2025. Suchir Salhan, Andrew Caines, Paula Buttery. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability. 2025. Suchir Salhan, Konstantinos Voudouris. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2026. Fermin Moscoso del Prado Martin, Suchir Salhan. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). ♦️

Convergent Equilibria in Cross-Lingual Phoneme Surprisal Distributions: Statistical and Simulation-Based Analysis. 2026. Suchir Salhan, Fermin Moscoso del Prado Martin. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). Abstract. ♦️

Other publications:

On the Potential for Maximising Minimal Means in Transformer Language Models: A Dynamical Systems Perspective. Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2023. Paper | Slides (Undergraduate Dissertation, Presentation at SyntaxLab, February 2023, St John's College, Cambridge, organised by Dr Theresa Biberauer)

Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory? Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2025. Paper.

Invited Talks, Presentations and Posters:

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan. Presentations at the Cambridge Language Sciences Symposium (November 2024); Poster at the HumanCLAIM Workshop organised by Prof Lisa Beinborn in Göttingen, Germany (March 2025); Accepted Poster and Demonstration at the Cambridge CHIA (Centre for Human-Inspired AI) Annual Conference (June 2025).

Human-Validated Grammar Profiles for Language Models. Workshop organised by Prof Detmar Meurers, Tübingen, Germany (March 2025).

LLMs “off-the-shelf” or Pretrain-from-Scratch? Recalibrating Biases and Improving Transparency using Small-Scale Language Models.
Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery
Learning & Human Intelligence Group, Department of Computer Science & Technology, 2024

Bilingual Small Language Models as Cognitive Proxies for LLM Interaction and Calibration. Suchir Salhan. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2025.

Engineering Small Language Models as Learner Models for LLM Interaction and Calibration. Suchir Salhan. ALTA Annual Review 2025.
