Department of Computer Science and Technology

  • PhD Candidate

Suchir Salhan is a PhD candidate at the University of Cambridge (Gonville & Caius College), working on Language Models and Cognitively-Inspired AI. With an interdisciplinary background in Computer Science, Cognitive Science, and Linguistics, his research focuses on Small Language Models, leveraging insights from human cognition to develop multi-agent AI systems that are interpretable, fair, and equitable.

Biography

I have long been fascinated by the intersection of language and computation: how humans acquire natural language to communicate, learn, and reason despite the diversity of linguistic systems, and how we might build machines that can do the same. I arrived in Cambridge in 2020 to pursue a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, where I earned a “starred First” and a Distinction. During my undergraduate studies, I worked on multimodality in the Language Technology Lab (with Prof Nigel Collier and Dr Fangyu Liu, Google DeepMind), on code-switching with Dr Li Nguyen, and with Prof Paula Buttery and Prof Andrew Caines on several research projects that formed the foundation of my PhD work on Small Language Models and Cognitively-Inspired AI.

Outside of academia, I lead Per Capita Media, Cambridge University's newest independent publication, supported by a team of students and academics from Cambridge and other institutions, including the University of Oxford and the University of the Arts London. I am also Head of Policy at The Wilberforce Society, the UK's oldest student think tank, based at the University of Cambridge, and I organise several speaker events throughout the University. See my extended CV or my academic website for further information.

Research

Small Language Models: Our group released PicoLM, the Cambridge Small Language Model and Learning Dynamics Framework, in March 2025. Check out the YouTube video put together by Zeb Goriely: Introducing PicoLM | YouTube. We use Pico to investigate questions related to multilingual pretraining, learning dynamics, and language model interpretability. I manage and supervise students working on the Pico framework.

Cognitively-Inspired AI: My Master's thesis focused on the BabyLM Shared Task, training Small Language Models with acquisition-inspired strategies on “cognitively plausible” corpora for several languages. My PhD work now connects the BabyLM paradigm with the fast-moving language modelling ecosystem. While Large Language Models (LLMs) are increasingly used in high-stakes applications, such as assessing human performance, they often lack steerability, alignment, and interpretability. I address this by developing Cognitively-Inspired Small Language Models (SLMs), which can guide and calibrate LLM behaviour in multi-agent environments, aligning AI outputs with user preferences and domain-specific tasks.

See my Cambridge Language Sciences page for more information on my research interests in Cognitive Science and Linguistics. 

Teaching

Guest Lecturer and Teaching Assistant

Research Supervision 

  • MPhil ACS Thesis: Bianca Ganescu (2024-25), Yeji Heo (2025-26)
  • Undergraduate Research Opportunity Programme (UROP) Supervisor: Shivan Arora and Ellie Polyakova Reed (Summer 2025). 
  • PicoLM Research Mentor for Google DeepMind Research Ready Programme: Ali Kheirkhah (Summer 2025)
  • Visiting Research Student:  Andrzej Szablewski (with Dr Clara Meister and Dr Tiago Pimentel)
  • ALTA Institute Research Assistant: Working with Laura Barbanel, Lily Goulder & Aoife O'Driscoll, line-managed by Prof Paula Buttery and Dr Andrew Caines. 

Supervisions

Machine Learning and Bayesian Inference (Part II, Computer Science Tripos)

Data Science (Part IB, Computer Science Tripos)

Formal Models of Language (Part IB, Computer Science Tripos)

Artificial Intelligence (Part IB, Computer Science Tripos)

Probability (Part IA, Computer Science Tripos)

Li18 Computational Linguistics (Part IIA/IIB Linguistics Tripos)

College Supervisor for the Linguistics Tripos (Gonville & Caius College): Linguistic Theory (Part IIB, Linguistics Tripos) and Part I Linguistics Tripos.

College Examiner for Computer Science Tripos Mock Examinations (Gonville & Caius College)

Professional Activities

Departmental Activity

Organiser and Host of the Natural Language & Information Processing Seminars (Natural Language & Information Processing Group, CST), 2024–present. I have organised 30+ departmental seminars with leading academics and industry researchers on Language Models, Computational Linguistics, and Natural Language Processing. List of Organised Seminars.

University-Wide & Interdisciplinary Initiatives 

Language Sciences Annual Symposium 2025: Ambitions for language science in 2050. Poster Session Organiser for the 2025 Cambridge Language Sciences Symposium, with Sammy Weiss (MRC Cognition and Brain Sciences Unit) and Shanshan Hu (TAL). CLS 2025 Website.

23rd Old-World Conference in Phonology (OCP23). Member of the Organising Committee. Gonville & Caius College (January 2026). OCP23 Website (Phonetics Laboratory, Department of Theoretical & Applied Linguistics).

Reviewing & Service 

Reviewer for BabyLM 2024. Emergency Reviewer for ACL 2025. Reviewer for The First Workshop on Large Language Model Memorization (L2M2) @ ACL 2025. Reviewer for the NeurIPS CogInterp Workshop. Reviewer for the NeurIPS What Can't Transformers Do (WCTD) Workshop.

Publications

Key Themes: (i) Cognitively-Inspired Design, Interpretability and Evaluation (♣), (ii) Language Model Pretraining (✰), (iii) Multilinguality (✦), (iv) Tokenization (✿), (v) Alignment and Interaction (✒︎) and (vi) Cognitive Science and Linguistics (♦️).


Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. 2024. BabyLM Shared Task (Paper Track), Conference on Computational Natural Language Learning (CoNLL). Poster Presentation at EMNLP (Miami, FL, USA, November 2024). ♣ ✰

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. Fermin Moscoso del Prado Martin, Suchir Salhan. 2025. 13th Conference on the Mental Lexicon. Invited keynote delivered by Fermin Moscoso del Prado Martin at McGill University, Montreal (June 2025). Slides. ✦ ♦️

ByteSpan: Information-Driven Subword Tokenisation. Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery. ICML 2025 Tokenization Workshop (TokShop). Poster presentation in Vancouver, Canada (August 2025).

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance. Fermin Moscoso del Prado Martin, Suchir Salhan. ACL Main Conference (Poster). Presented in Vienna, Austria (August 2025). Poster | Slides. ♦️

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research. Richard Diehl-Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery. EMNLP 2025 Systems Demonstration. Presentation in Suzhou, China. Pico Website | Demo Video (YouTube) | HuggingFace. ✰

Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction. Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ✒︎ ♣

What's the Best Sequence Length for BabyLM? Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models. Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✦

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling. Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery (supervised MPhil Advanced Computer Science thesis). BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages. David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦ ✰

Extended Abstract for "Linguistic Universals": Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders. Ej Zhou, Suchir Salhan. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦ ♣

Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies. Suchir Salhan, Andrew Caines, Paula Buttery. 2025. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability. Suchir Salhan, Konstantinos Voudouris. 2025. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data. Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, François Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, Leshem Choshen. ✦  Preprint | BabyBabelLM (Multilingual BabyLM) Website

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. Fermin Moscoso del Prado Martin, Suchir Salhan. 2026. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). ♦️

Convergent Equilibria in Cross-Lingual Phoneme Surprisal Distributions: Statistical and Simulation-Based Analysis. Suchir Salhan, Fermin Moscoso del Prado Martin. 2026. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). Abstract. ♦️


Other publications: 

On the Potential for Maximising Minimal Means in Transformer Language Models: A Dynamical Systems Perspective. Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2023. Paper | Slides (Undergraduate Dissertation, Presentation at SyntaxLab, February 2023, St John's College, Cambridge, organised by Dr Theresa Biberauer)

Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory?* Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2025. Paper.


Invited Talks, Presentations and Posters: 

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan. Presentation at the Cambridge Language Sciences Symposium (November 2024); poster at the HumanCLAIM Workshop organised by Prof Lisa Beinborn in Göttingen, Germany (March 2025); accepted poster and demonstration at the Cambridge CHIA (Centre for Human-Inspired AI) Annual Conference (June 2025).

Human-Validated Grammar Profiles for Language Models. Presented in Tübingen, Germany (March 2025) at a workshop organised by Prof Detmar Meurers.

LLMs “off-the-shelf” or Pretrain-from-Scratch? Recalibrating Biases and Improving Transparency using Small-Scale Language Models.
Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery.
Learning & Human Intelligence Group, Department of Computer Science & Technology, 2024

Bilingual Small Language Models as Cognitive Proxies for LLM Interaction and Calibration. Suchir Salhan. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2025.

Engineering Small Language Models as Learner Models for LLM Interaction and Calibration. Suchir Salhan. ALTA Annual Review 2025. 

Contact Details

Room: 
GS08
Office address: 
Gonville & Caius College, Trinity St, Cambridge CB2 1TA
Email: 

sas245@cam.ac.uk