Lost in Generation: Understanding the Loss of Linguistic Diversity in the Age of LLMs
Name of applicant
Esther Ploeger
Title
PhD student
Institution
KU Leuven, UCLouvain
Amount
DKK 2,410,221
Year
2025
Type of grant
Internationalisation Fellowships
What?
Despite being trained on vast and diverse datasets, large language models (LLMs) often produce text that lacks linguistic diversity. This project investigates this linguistic homogenization in LLM-generated text. It will focus on measuring, understanding and modelling the lack of diversity, particularly for smaller language communities such as Denmark's.
Why?
As LLMs become increasingly integrated into various aspects of daily life, their influence on language use grows. Yet these effects are currently investigated only monolingually, and only for English. The project therefore addresses the under-researched topic of LLM-induced loss of linguistic diversity in smaller language communities, taking Danish as an example.
How?
The project will develop cross-lingual metrics for measuring diversity, compare monolingual and multilingual LLM outputs, identify causes of reduced diversity, and explicitly model diversity. Collaboration with leading institutions in Denmark and Belgium will support the project's interdisciplinary approach at the intersection of linguistics and computer science.