Mathematical theory of AI reasoning

Name of applicant

Frederik Hytting Jørgensen

Title

PhD student

Institution

Stanford University

Amount

DKK 2,678,330

Year

2026

Type of grant

Internationalisation Fellowships

What?

This project develops a mathematical theory for verifying and controlling AI reasoning. We provide provable guarantees on when an interpretation of a model's internal reasoning is valid in the sense that it helps us predict how the system will behave in novel situations. We validate our methods on large language models.

Why?

Just as we may not be able to tell whether someone avoids stealing out of ethical commitment or out of fear of getting caught, we cannot always tell whether an AI system is truly safe or merely appears safe in testing. Knowing what reasoning an AI system relies on would increase our confidence that it remains safe and reliable beyond the specific situations it was tested on.

How?

Current methods can make a neural network appear to follow essentially any reasoning if we allow overly complex interpretations. We address this by combining causal abstraction – a method for testing whether a neural network follows a particular reasoning process – with ideas from statistical learning theory that constrain interpretation complexity to provide generalization guarantees.
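
As a rough illustration of the two ingredients named above, the sketch below runs an interchange intervention (the basic test used in causal abstraction) on a toy model and evaluates a textbook finite-class generalization bound. Everything in it is an assumption made for illustration: the toy "network", the candidate high-level variable S = x + y, and all function names are hypothetical, and the bound is the standard Hoeffding plus union-bound result, not the guarantee this project aims to prove.

```python
import itertools
import math


def low_level_hidden(x, y, z):
    # Toy stand-in for a network's hidden layer: two units (hypothetical).
    return [x + y, z]


def low_level_output(hidden):
    # Toy stand-in for the rest of the network.
    return hidden[0] * hidden[1]


def high_level(x, y, z, s_override=None):
    # Candidate interpretation: the model first forms S = x + y, then S * z.
    s = x + y if s_override is None else s_override
    return s * z


def interchange_accuracy(unit, inputs):
    # For each (base, source) pair: copy one hidden unit from the source run
    # into the base run, and check whether the output matches the high-level
    # model with S set to the source's value.
    hits = 0
    total = 0
    for base, source in itertools.product(inputs, repeat=2):
        hidden = low_level_hidden(*base)
        hidden[unit] = low_level_hidden(*source)[unit]
        low = low_level_output(hidden)
        high = high_level(*base, s_override=source[0] + source[1])
        hits += int(low == high)
        total += 1
    return hits / total


def finite_class_gap(n, num_interpretations, delta):
    # Textbook bound: with probability at least 1 - delta, every one of
    # num_interpretations candidates has |true error - empirical error|
    # at most this value on n i.i.d. test cases.
    return math.sqrt((math.log(num_interpretations) + math.log(2 / delta)) / (2 * n))


inputs = list(itertools.product(range(3), repeat=3))
acc_aligned = interchange_accuracy(0, inputs)     # unit 0 really encodes S
acc_misaligned = interchange_accuracy(1, inputs)  # unit 1 does not

gap_small_class = finite_class_gap(1000, 10, 0.05)
gap_large_class = finite_class_gap(1000, 10**6, 0.05)
```

In this toy setting the interpretation passes every intervention when S is aligned with the correct hidden unit and fails some when it is aligned with the wrong one. The bound grows with the number of candidate interpretations, which is the sense in which restricting interpretation complexity can support a generalization guarantee.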
