Department of Computer Science and Technology

  • PhD Candidate
  • PicoLM Head of Research

Suchir Salhan is a PhD candidate at the University of Cambridge (Gonville & Caius College), working on Language Models and Cognitively-Inspired AI. With an interdisciplinary background in Computer Science, Cognitive Science, and Linguistics, his research focuses on Human-Scale and Data-Efficient Multilingual Language Models, leveraging insights from human cognition to develop multi-agent AI systems that are interpretable, fair, and equitable.

Biography

I’ve had a long fascination with the intersection of language and computation: how humans have developed the capability to acquire natural language to communicate, learn, and reason, despite the diversity of linguistic systems; and how we might build machines that can do the same. I arrived in Cambridge in 2020 to pursue a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, where I earned a “starred First” and a Distinction. Throughout my undergraduate studies, I worked on multimodality in the Language Technology Lab (with Prof Nigel Collier and Dr Fangyu Liu, Google DeepMind), on code-switching with Dr Li Nguyen, and with Prof. Paula Buttery and Prof. Andrew Caines on multiple research projects over several years that formed the foundation of my PhD work on Small Language Models and Cognitively Inspired AI. Outside of academia, I lead Per Capita Media, Cambridge University's newest independent publication, supported by a team of students and academics from Cambridge and other academic institutions, including the University of Oxford and the University of the Arts London. I have also been involved in student policy think tanks, as the Head of Policy at The Wilberforce Society, the UK's oldest student think tank, based at the University of Cambridge, and I organise several speaker events throughout the University. See my extended CV or my academic website for further information.

Research

Small Language Models:  I am the Head of Research of PicoLM, Cambridge University's Open-Source Small Language Model & Learning Dynamics Framework, working together with Prof Paula Buttery and Richard Diehl Martinez. In March 2025, our group released PicoLM to investigate language model learning dynamics, interpretability and safety. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube. We use Pico to investigate questions related to multilingual pretraining, learning dynamics and language model interpretability. I line-manage and supervise students working on the Pico Framework. 

Cognitively-Inspired AI: I am an organiser of the 2026 BabyLM Workshop & Shared Task (EMNLP 2026), co-leading the Multilingual Human-Scale Language Modelling Shared Task. My Master's thesis focused on the BabyLM Shared Task to train Small Language Models using acquisition-inspired strategies on “cognitively-plausible” corpora for several languages. My PhD research now connects the BabyLM paradigm with the fast-moving Language Modelling ecosystem in an interdisciplinary manner. I work with multilingual NLP researchers in academia and industry to investigate human-scale language models and collaborate with cognitive scientists, linguists and educational machine learning researchers to develop cognitively-inspired models that can guide and calibrate LLM behaviour in multi-agent environments, aligning AI outputs with user preferences and domain-specific tasks. To date, my research has been awarded two outstanding paper awards – see a research feature on the Gonville & Caius College Website.

See my Cambridge Language Sciences page for more information on my research interests in Cognitive Science and Linguistics. 

Teaching

L95 (ACS/Part III) Introduction to Natural Language Syntax and Parsing (2024-). Guest Lecturer (aged 22), leading significant refactoring and updating of L95 course material. Annotated Bibliography – Linguistic Structure & Language Models (L95 2025-26) | L95 Course Plan | Probing & Linguistic Interpretability Lecture Handout and Slides. Guest Lecturer for Li18 Computational Linguistics, 2025-26 (Part II Linguistics Tripos), convened by Dr Guy Emerson. Machine Learning & Real World Data (Part IA, Computer Science Tripos): Teaching Assistant (2024-25) for the course lectured by Dr Fermin Moscoso del Prado Martin and Dr Luca Benedetto.

Research Supervision. MPhil ACS Thesis: Bianca Ganescu (2024-25), Yeji Heo (2025-26). MPhil Machine Learning & Machine Intelligence (MLMI): Adam El Kholy. Undergraduate Research Opportunity Programme (UROP) Supervisor: Shivan Arora and Ellie Polyakova Reed (Summer 2025). PicoLM Research Mentor for Google DeepMind Research Ready Programme: Ali Kheirkhah (Summer 2025). Visiting Research Student:  Andrzej Szablewski (with Dr Clara Meister and Dr Tiago Pimentel). ALTA Institute Research Assistants: Working with Laura Barbenel, Lily Goulder & Aoife O'Driscoll.

Computer Science Tripos Supervision: Machine Learning and Bayesian Inference (Part II, Computer Science Tripos), Data Science (Part IB, Computer Science Tripos), Formal Models of Language (Part IB, Computer Science Tripos), Artificial Intelligence (Part IB, Computer Science Tripos), Probability (Part IA, Computer Science Tripos). Linguistics Tripos Supervision: Li18 Computational Linguistics (Part IIA/IIB Linguistics Tripos). College Supervisor and Examiner for Linguistics Tripos (Gonville & Caius College) for Linguistic Theory (Part IIB, Linguistics Tripos), Part I Linguistics Tripos, and Part IA and IB Computer Science Tripos.

Professional Activities

xBLiMPs –  Awarded £5750 from Cambridge Language Sciences. See Cambridge Language Sciences website for a project description. 

UROP Manager in ALTA Institute @ CambridgeNLIP.  I manage the Undergraduate Research Opportunity Programme (UROP) run in the CambridgeNLIP group with Prof Andrew Caines. I develop undergraduate research projects, interview students, and mentor undergraduate Computer Science and non-CS students working in Natural Language Processing. 

Department: Organiser and Host of the Computer Science Department Natural Language & Information Processing Seminars, 2024 -. Natural Language & Information Processing Group (CST).  Organising 30+ departmental seminars with leading academics and industry researchers on Language Models, Computational Linguistics and Natural Language Processing. List of Organised Seminars.

University-Wide Interdisciplinary Initiatives:  Language Sciences Annual Symposium 2025: Ambitions for language science in 2050. Poster Session Organiser for 2025 Cambridge Language Sciences Symposium with Sammy Weiss (MRC Cognition and Brain Sciences Unit) and Shanshan Hu (TAL). CLS 2025 Website.  23rd Old-World Conference in Phonology (OCP23). Member of Organising Committee. Gonville & Caius College (January 2026). OCP23 Website (Phonetic Laboratory, Department of Theoretical & Applied Linguistics). 

Reviewing & Service: BabyLM 2024, ACL 2025 (Emergency), ACL L2M2 (Language Model Memorization) @ ACL 2025, NeurIPS 2025 CogInterp & What Can't Transformers Do (WCTD) Workshop.

Publications

 

Key Themes: (i) Cognitively-Inspired Design, Interpretability and Evaluation (♣), (ii) Language Model Pretraining (✰), (iii) Multilinguality (✦), (iv) Tokenization (✿), (v) Alignment and Interaction (✒︎) and (vi) Cognitive Science and Linguistics (♦️).


Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. 2024. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Shared Task (Paper Track), Conference on Computational Natural Language Learning (CoNLL). Poster Presentation at EMNLP (Miami, FL, USA, November 2024). ♣ ✰

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2025. Fermin Moscoso del Prado Martin, Suchir Salhan. 13th Conference on the Mental Lexicon. Invited Keynote delivered by Fermin Moscoso del Prado Martin at McGill University, Montreal (June 2025). Slides. ✦♦️

ByteSpan: Information-Driven Subword Tokenisation. Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery. ICML 2025 Tokenization Workshop (TokShop). Delivered ByteSpan Poster Presentation in Vancouver, Canada (August 2025).

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance. Fermin Moscoso del Prado Martin, Suchir Salhan. ACL Main Conference (Poster) – I presented this in Vienna, Austria (August 2025). Poster | Slides. ♦️

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research. Richard Diehl-Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery. EMNLP Systems Demonstration 2025. Presentation in Suzhou, China. Pico Website | Demo Video (YouTube) | HuggingFace. ✰

Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction. Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ✒︎ ♣ Outstanding Paper Award.

What's the Best Sequence Length for BabyLM? Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models. Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✦

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling. Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery (Supervised MPhil Advanced Computer Science Thesis). BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰ Outstanding Paper Award.

Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages. David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦ ✰

Extended Abstract for "Linguistic Universals": Emergent Shared Features in Independent Monolingual Language Models via Sparse Autoencoders. EJ Zhou, Suchir Salhan. 5th Workshop on Multilingual Representation Learning (MRL), EMNLP 2025. Presentation in Suzhou, China. ✦♣

Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies. 2025. Suchir Salhan, Andrew Caines, Paula Buttery. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability. 2025. Suchir Salhan, Konstantinos Voudouris. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data. Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, François Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, Leshem Choshen. ✦ Preprint | BabyBabelLM (Multilingual BabyLM) Website. EACL 2026 Main Conference.

Glints of Gold or Troubling Waters? Can a School of Merged Monolingual Goldfish Models Swim in Bilingual Seas? Suchir Salhan, EJ Zhou, Laura Barbenel, Aoife O’Driscoll, Lily Goulder, Lucas Resck, Catherine Arnett & Paula Buttery. EACL Workshop on Multilingual Multicultural Evaluation (MME), Non-Archival Full Paper, 2026. Presentation in Rabat, Morocco.

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2026. Fermin Moscoso del Prado Martin, Suchir Salhan. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). ♦️

Convergent Equilibria in Cross-Lingual Phoneme Surprisal Distributions: Statistical and Simulation-Based Analysis. 2026. Suchir Salhan, Fermin Moscoso del Prado Martin. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). Abstract. ♦️

Modelling the Diachronic Emergence of Phoneme Frequency Distributions. 2026. Fermin Moscoso del Prado Martin,  Suchir Salhan.  In Proceedings of the Society for Computation in Linguistics (SCiL), Presentation at ACL 2026 (San Diego, USA).

Do Monolingual Language Models Learn Cross-Lingual Universal Conceptual Representations? 2026. Suchir Salhan, EJ Zhou & Paula Buttery.  Unifying Concept Representation Learning  and Workshop on Representational Alignment (Re-Align) @ ICLR 2026.

A Computational Operationalisation of Competing Maturational Theories of Syntactic Development via Statistical Grammar Induction. Mila Marcheva, Suchir Salhan & Weiwei Sun. CogSci 2026 Main Conference. In Proceedings of the Annual Meeting of the Cognitive Science Society. Rio de Janeiro, Brazil.


Other publications: 

On the Potential for Maximising Minimal Means in Transformer Language Models: A Dynamical Systems Perspective. Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2023. Paper | Slides (Undergraduate Dissertation, Presentation at SyntaxLab, February 2023, St John's College, Cambridge, organised by Dr Theresa Biberauer)

Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory?* Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2025. Paper.


Invited Seminars & Talks: 

Understanding the Human-Scale AI Frontier: Sample-Efficient and Human-Scale Language Modelling. Invited Seminar @ SheffieldNLP (May 2026)
Bilingual language models as computational models of human bilingualism and AI-based solutions to support second-language learning.
Cambridge Language Sciences Workshop: Toward a more ecological investigation of bilingualism (March 2026). Workshop on Computational Linguistic Methods for Language Learning Technology: Writing, Reading, Interaction, Content Creation, Evaluation (ALTA CST, Computer Science & Technology, Cambridge University, March 2026)

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan. Presented at the Cambridge Language Sciences Symposium (November 2024); poster at the HumanCLAIM Workshop organised by Prof Lisa Beinborn in Göttingen, Germany (March 2025); accepted poster and demonstration at the Cambridge CHIA (Centre for Human-Inspired AI) Annual Conference (June 2025) | Human-Validated Grammar Profiles for Language Models. Tübingen, Germany, March 2025, in a workshop organised by Prof Detmar Meurers.

Local presentations in the Learning & Human Intelligence Group, Department of Computer Science & Technology, 2024-. Bilingual Small Language Models as Cognitive Proxies for LLM Interaction and Calibration. Suchir Salhan. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2025. Engineering Small Language Models as Learner Models for LLM Interaction and Calibration. Suchir Salhan. ALTA Annual Review 2025.

Contact Details

Room: 
GS08
Office address: 
Gonville & Caius College, Trinity St, Cambridge CB2 1TA
Email: 

sas245@cam.ac.uk