skip to content

Department of Computer Science and Technology

Tuesday, 5 March, 2024 - 14:00 to 15:00
Lorenzo Pacchiardi, University of Cambridge
Webinar & FW11, Computer Laboratory, William Gates Building.

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

Meeting ID: 880 5365 2228
Passcode: 081966

RECORDING : Please note, this event will be recorded and will be available after the event for an indeterminate period under a CC BY -NC-ND license. Audience members should bear this in mind before joining the webinar or asking questions.

NOTE : Please do not post URLs for the talk, and especially Zoom links to Twitter because automated systems will pick them up and disrupt our meeting.

Seminar series: 
Security Seminar

Upcoming seminars