In current AI paradigms, speech is typically represented as a series of vector embeddings. In this talk, we address speech variation between native and non-native (Russian L1) speakers of French with a method based on a frame-wise comparison of wav2vec2 acoustic embeddings. The first part will focus on wav2vec2 parameterization (layer choice and z-normalization), and the second part will address French L2-acquisition questions by comparing phonologically similar recordings with Dynamic Time Warping (DTW), as sketched below.
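As a rough illustration of what a frame-wise DTW comparison of embedding sequences can look like, here is a minimal Python sketch (numpy only). The random arrays stand in for wav2vec2 frame sequences of two recordings, and the Euclidean frame cost and length normalization are illustrative assumptions, not necessarily the choices made in the study.

import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW distance between two embedding sequences
    of shapes (n, d) and (m, d), with Euclidean frame-wise cost."""
    # Pairwise frame-to-frame cost matrix, shape (n, m).
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    n, m = cost.shape
    # Standard DTW accumulation over the cost matrix.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]
            )
    return D[n, m] / (n + m)

# Two stand-in frame sequences of different lengths, e.g. two tokens
# of the same word produced by different speakers.
x = np.random.randn(50, 1024)
y = np.random.randn(62, 1024)
print(dtw_distance(x, y))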
The wav2vec2 parameterization is carried out by assessing the response of the wav2vec2/XLSR-53 model to intra-speaker versus inter-speaker variability in a controlled read-speech experiment. Then, using wav2vec2 embeddings without any supervision, we investigate the model's ability to tell whether or not native speech is more stable than non-native speech.
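For concreteness, the extraction step might look like the following sketch using the Hugging Face transformers library; the layer index and the per-dimension z-normalization across frames are illustrative assumptions about the parameterization being assessed, not the study's settled configuration.

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def frame_embeddings(waveform: np.ndarray, layer: int = 12,
                     sr: int = 16000) -> np.ndarray:
    """Frame-wise embeddings, shape (n_frames, 1024), taken from one
    transformer layer and z-normalized per dimension across frames
    (an illustrative normalization choice)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[layer] has shape (1, n_frames, 1024).
    emb = out.hidden_states[layer].squeeze(0).numpy()
    return (emb - emb.mean(axis=0)) / (emb.std(axis=0) + 1e-8)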
Results indicate that frame-by-frame wav2vec2 embeddings support phonetically meaningful correlational analyses. By showing that time-dependent phonetic questions can be addressed directly on wav2vec2 embeddings, this study opens an innovative research avenue combining the speech sciences and neural approaches. Our analyses, conducted at the word level, should yield even greater benefits at the phoneme level. We conclude by outlining the expected benefits for future developments of the research.
As the method is still largely under construction, Maxime's talk will mainly address the methodological aspects of the approach; but, giving credit where credit is due, Daria Dashkevich (LMSU) and Ekaterina Biteeva (LPP) will step in to introduce the corpus used and discuss the study's perspectives.


