Tokenization-free Language Models: Theory and Application
Name of grant holder
Desmond Elliott
Title
Associate Professor
Institution
University of Copenhagen
Amount
DKK 6,995,913
Year
2025
Grant type
Semper Ardens: Accelerate
What?
We will develop a new type of language model that combines pixel-based visual representations of text with byte-level output to support thousands of languages. The model encodes images of rendered text into vectors and generates byte sequences as output. Our goal is to match the performance of existing open models while reducing compute costs by up to 50% for low-resource languages.
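As a rough illustration of the output side only, the sketch below treats text as a sequence of UTF-8 byte IDs, so that a fixed set of 256 symbols can represent any script. The function names are hypothetical and this is not the project's implementation; it only shows why byte output needs no language-specific tokenizer.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Decode a sequence of byte IDs back into text."""
    return bytes(ids).decode("utf-8")


# Illustrative sample words; any Unicode text works the same way.
samples = {"English": "hello", "Hindi": "नमस्ते", "Faroese": "Føroyar"}
for language, text in samples.items():
    ids = text_to_byte_ids(text)
    assert byte_ids_to_text(ids) == text  # lossless round trip
    print(language, len(text), "characters ->", len(ids), "bytes")
```

The same two functions handle every language without a learned vocabulary, although the number of bytes per character differs between scripts.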
Why?
Current language models poorly serve the billions of people who speak low-resource languages. Because of traditional tokenization, using AI systems is two to three times more expensive for speakers of languages such as Hindi or Faroese than for English speakers. Our tokenization-free approach will democratize access to high-quality, affordable language technology for everyone, regardless of their language.
How?
A team of three PhD students and one postdoc will investigate how to build hybrid models that read text as images and output readable byte sequences, with accurate conversion between the visual and textual formats. We will also test whether these models can scale to billions of parameters while remaining efficient and learning effectively from large amounts of multilingual data.
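To make the input side more concrete, here is a minimal sketch of how an image of rendered text might be cut into fixed-width patches, each flattened into a vector that an encoder could consume. Everything in it is an assumption for illustration: the function name is hypothetical, the patch width is arbitrary, and a placeholder pixel grid stands in for a real text rendering.

```python
def image_to_patches(image: list[list[float]], patch_width: int) -> list[list[float]]:
    """Cut an image (rows of pixel values) into vertical strips that are
    patch_width columns wide, and flatten each strip row by row."""
    height = len(image)
    width = len(image[0])
    if width % patch_width != 0:
        raise ValueError("image width must be a multiple of patch_width")
    patches = []
    for start in range(0, width, patch_width):
        patch = [
            image[row][col]
            for row in range(height)
            for col in range(start, start + patch_width)
        ]
        patches.append(patch)
    return patches


# Placeholder for a rendered line of text: 4 pixels high, 12 pixels wide.
fake_render = [[float((row + col) % 2) for col in range(12)] for row in range(4)]
patches = image_to_patches(fake_render, patch_width=4)
print(len(patches), "patches of", len(patches[0]), "values each")
```

In a full model, each flattened patch would be passed through a learned encoder before byte sequences are generated; that part is outside the scope of this sketch.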