Lost in Generation: Understanding the Loss of Linguistic Diversity in the Age of LLMs
Name of grant holder
Esther Ploeger
Title
PhD student
Institution
KU Leuven, UCLouvain
Amount
DKK 2,410,221
Year
2025
Grant type
Internationalisation Fellowships
What?
Despite being trained on vast and diverse datasets, large language models (LLMs) often produce text that lacks linguistic diversity. This project investigates this linguistic homogenization in LLM-generated text. It will focus on measuring, understanding and modelling the lack of diversity, particularly for smaller language communities such as Denmark's.
Why?
As LLMs become increasingly integrated into daily life, their influence on language use grows. Yet these effects are currently investigated only monolingually, for English. The project therefore addresses the under-researched topic of LLM-induced loss of linguistic diversity in smaller language communities, taking Danish as an example.
How?
The project will develop cross-lingual metrics for measuring diversity, compare monolingual and multilingual LLM outputs, identify causes of reduced diversity, and explicitly model diversity. Collaboration with leading institutions in Denmark and Belgium will support the project's interdisciplinary approach at the intersection of linguistics and computer science.