SRPP: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages

Language and Communication Institute, UC Louvain (Belgium)
08 December 2023, 14h00-15h30

Less-resourced languages are usually left out of comparative phonetic studies based on large corpora. We contribute to recent efforts to fill this gap by assessing how open-access, crowd-sourced audio data from Lingua Libre can be used for phonetic research. Lingua Libre is a participative linguistic library that Wikimedia France has been developing since 2015. This presentation has two main goals. First, I present the data that Lingua Libre has to offer and how to use it; second, I use this data in a typological study as a proof of concept. For this second part, I consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate 10 languages with various numbers of vowel categories: German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre, then aligned and segmented using the Munich Automatic Segmentation System (WebMAUS). Information on the formants of the cardinal vowels /a, i, u/ is then extracted to measure how far vowels spread in the acoustic space and whether this spread correlates with the number of oral vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
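
As an illustration only, the measurement step described above could be sketched as follows in Python. The abstract does not say which dispersion metric or correlation statistic the study uses, so everything here is an assumption: within-category variation is taken as the mean Euclidean distance of tokens from their category centroid in the F1-F2 plane, vowel-space size as the area of the /a, i, u/ triangle, and the relation to inventory size as a Pearson correlation. All function names are hypothetical, and no values from the study are reproduced.

```python
import math
from statistics import mean


def centroid(tokens):
    """Mean (F1, F2) in Hz of a list of (F1, F2) vowel tokens."""
    return (mean(t[0] for t in tokens), mean(t[1] for t in tokens))


def within_category_dispersion(tokens):
    """Assumed variation metric: mean distance of tokens from their centroid."""
    c = centroid(tokens)
    return mean(math.dist(t, c) for t in tokens)


def triangle_area(a, i, u):
    """Area of the /a, i, u/ triangle in the F1-F2 plane (shoelace formula)."""
    return abs(
        a[0] * (i[1] - u[1]) + i[0] * (u[1] - a[1]) + u[0] * (a[1] - i[1])
    ) / 2


def pearson(xs, ys):
    """Pearson correlation, e.g. inventory size vs. per-language dispersion."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Under the Inventory Size Hypothesis, calling `pearson` on the per-language vowel-category counts and the per-language values of `within_category_dispersion` would be expected to give a negative coefficient. The formant tokens themselves would come from the aligned and segmented recordings.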