Tokenization-free Language Models: Theory and Application

Name of applicant

Desmond Elliott

Title

Associate Professor

Institution

University of Copenhagen

Amount

DKK 6,995,913

Year

2025

Type of grant

Semper Ardens: Accelerate

What?

We will develop a new type of language model that combines visual pixel representations of text with byte-level generation to support thousands of languages. The model encodes rendered images of text into vectors and generates byte sequences as output. Our goal is to match the performance of existing open models while reducing compute costs by up to 50% for low-resource languages.

Why?

Current language models serve the billions of speakers of low-resource languages poorly. Traditional tokenization makes using AI systems two to three times more expensive for speakers of languages like Hindi or Faroese than for English speakers. Our tokenization-free approach will democratize access to high-quality, affordable language technology for everyone, regardless of their language.

How?

A team of three PhD students and one postdoc will investigate how to build hybrid models that read text as images and output readable byte sequences, ensuring accurate translation between the visual and textual formats. We will also test whether these models can scale to billions of parameters while remaining efficient and learning effectively from large amounts of multilingual data.
