Tokenization-free Language Models: Theory and Application

Name of grant holder

Desmond Elliott

Title

Associate Professor

Institution

University of Copenhagen

Amount

DKK 6,995,913

Year

2025

Grant type

Semper Ardens: Accelerate

What?

We will develop a new type of language model that combines visual pixel representations of text with raw byte output to support thousands of languages. The model encodes rendered images of text into vectors and generates byte sequences as output. Our goal is to match the performance of existing open models while reducing compute costs by up to 50% for low-resource languages.
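
The hybrid design described above pairs a pixel encoder with a byte decoder. As a minimal illustrative sketch, assuming a standard Transformer encoder-decoder in PyTorch (the class name, patch size, and dimensions below are placeholder assumptions, not the project's actual architecture), the idea can be expressed as follows:

import torch
import torch.nn as nn

class PixelToByteSketch(nn.Module):
    """Encode flattened image patches of rendered text; decode UTF-8 bytes."""

    def __init__(self, patch_size=16, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # One grayscale patch (patch_size x patch_size pixels) -> one input vector.
        self.patch_embed = nn.Linear(patch_size * patch_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # 256 possible byte values plus one BOS symbol for the decoder input.
        self.byte_embed = nn.Embedding(257, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.byte_head = nn.Linear(d_model, 256)  # logits over the next byte

    def forward(self, patches, byte_ids):
        # patches: (batch, n_patches, patch_size * patch_size) flattened pixels
        # byte_ids: (batch, seq_len) previously seen bytes (teacher forcing)
        memory = self.encoder(self.patch_embed(patches))
        seq_len = byte_ids.size(1)
        # Causal mask so each position only attends to earlier bytes.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        hidden = self.decoder(self.byte_embed(byte_ids), memory, tgt_mask=causal)
        return self.byte_head(hidden)

# Toy usage: 8 patches of 16x16 pixels, logits for a 5-byte output prefix.
model = PixelToByteSketch()
patches = torch.rand(1, 8, 16 * 16)
byte_ids = torch.randint(0, 257, (1, 5))
logits = model(patches, byte_ids)  # shape: (1, 5, 256)

In this sketch, rendered text arrives as flattened grayscale patches, each patch becomes one encoder vector, and the decoder predicts one UTF-8 byte at a time. This is what keeps the output vocabulary at 256 symbols regardless of language, in contrast to a subword tokenizer's large, language-skewed vocabulary.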

Why?

Current language models serve billions of low-resource language users poorly. Traditional tokenization makes it two to three times more expensive for speakers of languages such as Hindi or Faroese to use AI systems than it is for English speakers. Our tokenization-free approach will democratize access to high-quality, affordable language technology for everyone, regardless of their language.

How?

A team of three PhD students and one postdoc will investigate how to build hybrid models that understand text as images and output readable byte sequences, ensuring accurate translation between the visual and textual formats. We will also test whether these models can scale to billions of parameters while remaining efficient and learning effectively from large amounts of multilingual data.
