skip to content

Department of Computer Science and Technology

AI makes our world more useful and convenient, saving time – and possibly even our lives. But simultaneously, it creates new levels of risk to things we completely depend on like hospitals, banks, and energy and food supplies. How can we reconcile these tensions between the usefulness and the danger of AI? That’s a question that researcher Hanna Foerster will be addressing this Saturday, 21 March, when she gives a series of talks here on 'Can We Trust AI to Use Our Computers?' 

Hanna is a PhD student here. Excited by AI, yet distrustful of it, she’s exploring how the newest AI agents that are coming on stream are much more active than previously, and capable of carrying out many tasks independently of us, with far greater potential impact for both good and harm. In her talks for the public here this weekend as part of the Cambridge Festival, she'll discuss what we can do to make such AI tools more trustworthy – and offer us tips on what we should be aware of every time we use AI.

That combination in AI agents of enormous capability, always-on access, and essentially no security architecture, is exactly the kind of problem we need to be working on urgently.

Hanna Foerster

 

When did you first start thinking, and worrying, about AI?

I was already concerned about how flaws in computer security can damage critical infrastructure that we rely on today – hospitals, banks, energy infrastructure and food supply chains. That concern increased the more I learned about the AI systems that make our world more convenient: by helping cars detect lanes and recognise objects or by reading medical scans in hospital and assisting disease diagnosis, or the AI fraud detection models in banks that are making real-time decisions on financial transactions. These systems are very useful, and possibly even save lives. Yet they also create new vulnerabilities in critical infrastructure.

Does the black box nature of AI make things worse?

Yes. In traditional software, a bug has a cause that you can trace. But an AI model can work in ways its users can't see and understand. This means such models can fail in ways that are invisible and hard to attribute. This tension between high utility and high risk is what really piqued my interest.

Why are you working in AI when you don’t actually trust it?

I'm excited about many new AI technologies and want to use them myself. But to do that, I've got to be able to trust them. So, I try to understand how they work, where their limits lie, and where they can break. Because many things in AI are a 'black box' and not really understandable to users, I spent the first year in my PhD breaking things, to understand how easy it is to do so. I looked at how the AI art protection tools that should be protecting digital works of art can be defeated, explored how AI models can be stolen, and how just by observing tiny differences in how a system responds, you can work out what it is running on and exploit known weaknesses in it.

So now you’re working on building defences around AI systems?

Yes. There are AI systems that 'serve the brain' – i.e. not just answering your questions but acting as an independent agent for you, taking actions in the world on your behalf. For example, 'computer use agents' are autonomous AI systems that can 'see' your computer screen and can click, type, scroll, and run tasks on your computer for you. I'm interested in designing a security architecture around such AI models to restrict what these agents can see and do. The idea is that even if the underlying AI model is compromised, or insecure, the system as a whole will behave safely. I hope that creating such defenses will help other people to have more trust in the technology.

Many of us use AI many times a day without even thinking about it – when asking Google maps for a route, say or setting up face recognition on our smart phones. So why does the next generation of AI agents scare you more?

Because they are active rather than passive. So, they have far more potential impact, both for better and for worse. We're already seeing this shift happen. AI agents are spreading fast across several different forms. Coding agents like Claude Code or Cursor already have access to your terminal and your entire codebase, and they can write, execute and delete code autonomously. Browser agents can now be activated inside your browser to fill in forms, click buttons and read your emails on your behalf.

And there are more open-ended computer use agents that can 'see' your whole screen and take almost any action on your computer. With all the major AI companies now evaluating their models on computer use, these may be coming very soon.

Have any of the new AI agents struck you in particular?

There's OpenClaw, which connects directly into apps that enable it – like your email, calendar, files and messaging apps – and runs quietly in the background 24/7, taking actions without you asking it to. Openclaw's popularity spread incredibly fast and people love it. But as with other AI agents, it can't reliably distinguish between your instructions and malicious instructions embedded in content it encounters, so a bad actor could sneak these in through a forwarded email or a webpage.

And what damage could an instruction from a bad actor do?

It could lead to the AI agent making purchases on your behalf, leaking your passwords, or silently forwarding all your emails to a third party. Or wiping your files entirely.

It's this combination of enormous capability, always-on access, and essentially no security architecture that is exactly the kind of problem I think we need to be working on urgently.

So perhaps we shouldn't be using these new AI agents at all?

Not at all. These tools are genuinely exciting, and I already use some myself. Coding agents have helped me write and debug code much faster than I could on my own. I also regularly use chatbots with web browsing capabilities to brainstorm new research ideas quickly.

And these agents can also open doors for people who might otherwise struggle with certain tasks. For example, if someone with a disability, or someone who doesn’t have much technical knowledge or experience, is suddenly able to do tasks on a computer without needing to know how to navigate complex software, that’s a big deal.

That said, not everything is ready yet, and I think it is important for users to understand some of the risks. Something like OpenClaw is incredible as a concept – but I wouldn't recommend most people to use it today. The security simply isn't there yet.

So if we do want to keep using these AI agents, how can we build ones that are safer and more trustworthy?

I say that using these agents is a bit like hiring a personal assistant – but every letter they open might contain a hypnotic command that they'll obey, whether it's a good or not. We need to prevent this. Some argue we should train these AI models to recognise malicious manipulation. But that can't be the whole answer because new attacks are being developed all the time.

It’s like the 'Swiss cheese model' in safety engineering: you need many layers of defence, each covering the holes of the last. But the most important outer layer, the one I focus on, is security by design. Rather than trying to make the AI smarter about security, we should design the system around it so that even if the AI gets tricked, the damage is contained.

How do we do that?

By splitting the agent into two parts – a planning part that plans all the steps needed to carry out a task (like booking a flight) without ever seeing the environment and a second, separate information-retrieving AI. This one can't change the plan itself, it can only answer very contained questions like 'where is the search bar?' or 'is the booking page open? So even if a bad actor manipulates what this second AI sees, they cannot change what the agent was originally asked to do.

This sounds like a great idea. So why isn't it being used?

It's been around as a concept for about a year but not used because industry considered that online environments are too dynamic and unpredictable for a pre-made plan to hold up. What if the search returns the wrong results? What if a button doesn't exist on a webpage? The assumption was that the agent would constantly need to deviate from the plan, making the whole security model fall apart.

In my research, I've shown that this assumption was wrong. We actually built a working system and tested it on a standard set of real-world tasks that are used to evaluate these agents. We beat the odds and showed that security by design is not just a nice idea, it works. And I believe it can extend to many other types of agents too.

Where does this research take us?

It's part of solving a much larger puzzle: how do we build AI systems we can actually trust? I'll be honest that our solution doesn't solve absolutely everything. While we prevent attackers from hijacking the agent entirely, a residual vulnerability remains.

Nonetheless, the argument still stands that the best defences will come from engineering better systems around AI, not just from making the AI itself smarter. And if we can make these defences systematic, we can actually start to trust these tools. That's what motivates me.


Published by Rachel Gardner on Monday 16th March 2026