Lost in Generation: Understanding the Loss of Linguistic Diversity in the Age of LLMs

Name of grant recipient

Esther Ploeger

Title

PhD student

Institution

KU Leuven, UCLouvain

Amount

DKK 2,410,221

Year

2025

Grant type

Internationalisation Fellowships

What?

Despite being trained on vast and diverse datasets, large language models (LLMs) often produce text that lacks linguistic diversity. This project investigates this linguistic homogenization in LLM-generated text. It focuses on measuring, understanding and modelling the lack of diversity, particularly for smaller language communities such as Danish speakers.

Why?

As LLMs become increasingly integrated into daily life, their influence on language use grows. Yet these effects are currently investigated only monolingually, for English. The project therefore addresses the under-researched topic of LLM-induced loss of linguistic diversity in smaller language communities, with Danish as an example.

How?

The project will develop cross-lingual metrics for measuring diversity, compare monolingual and multilingual LLM outputs, identify causes of reduced diversity, and explicitly model diversity. Collaboration with leading institutions in Denmark and Belgium will support the interdisciplinary approach at the intersection of linguistics and computer science.
