SRPP: Towards inclusive automatic speech recognition

Automatic speech recognition (ASR) is increasingly used, e.g., in emergency response centers, domestic voice assistants, and search engines. Because of the central role spoken language plays in our lives, it is critical that ASR systems are able to deal with the variability in the way people speak (e.g., due to speaker differences, demographics, different speaking styles, and differently abled users). ASR systems promise to deliver an objective interpretation of human speech. Practice and recent evidence, however, suggest that state-of-the-art (SotA) ASR systems struggle with the large variation in speech due to, e.g., gender, age, speech impairment, race, and accent. The overarching goal of our project is to uncover bias in ASR systems and to work towards proactive bias mitigation in ASR. In this talk, I will present systematic experiments aimed at quantifying, identifying the origin of, and mitigating the bias of state-of-the-art ASR systems on speech from different, typically low-resource, groups of speakers, with a focus on bias related to gender, age, regional accents and non-native accents.
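Bias analyses of this kind typically start from a simple quantity: the recognizer's word error rate (WER), compared across speaker groups. The R sketch below illustrates such a per-group comparison with invented transcripts and group labels; the WER function is a generic word-level edit distance, not the project's actual tooling.

wer <- function(ref, hyp) {
  # Word-level Levenshtein distance divided by reference length
  r <- strsplit(tolower(ref), "\\s+")[[1]]
  h <- strsplit(tolower(hyp), "\\s+")[[1]]
  d <- matrix(0, length(r) + 1, length(h) + 1)
  d[, 1] <- 0:length(r)
  d[1, ] <- 0:length(h)
  for (i in seq_along(r)) for (j in seq_along(h)) {
    d[i + 1, j + 1] <- min(d[i, j + 1] + 1,            # deletion
                           d[i + 1, j] + 1,            # insertion
                           d[i, j] + (r[i] != h[j]))   # substitution
  }
  d[length(r) + 1, length(h) + 1] / length(r)
}

# Invented reference transcripts, ASR hypotheses and speaker-group labels
results <- data.frame(
  ref   = c("turn on the lights", "call my sister"),
  hyp   = c("turn on the light",  "call my system"),
  group = c("native accent", "non-native accent")
)
results$wer <- mapply(wer, results$ref, results$hyp)
aggregate(wer ~ group, data = results, FUN = mean)   # mean WER per speaker group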

SRPP: Somatosensory feedback enables categorical perception of vowels

This talk will focus mainly on experimental results that were at the heart of Jean-François Patri's PhD thesis, defended in 2020, and that we published in PNAS in 2020 (Patri, J. F., Ostry, D. J., Diard, J., Schwartz, J. L., Trudeau-Fisette, P., Savariaux, C., & Perrier, P. (2020). Speakers are able to categorize vowels based on tongue somatosensation. Proceedings of the National Academy of Sciences, 117(11), 6255-6263). These results demonstrated participants' ability to categorize French vowels in the absence of auditory feedback, on the basis of somatosensory feedback alone.
To this end, we designed an original tongue-positioning task in which participants had to reach and hold different tongue postures in the region of the vowels /e, ɛ, a/, following a guidance procedure that made no reference to speech production: targets were displayed on a screen in a deformed representation of tongue space, in which the actual lingual shape was not recognizable. Once the tongue was in position, participants had to identify the vowel associated with the tongue posture they had reached, in the absence of any auditory feedback. Our results indicate that vowel categorization is possible on the basis of somatosensory feedback alone, with an accuracy similar to that of the auditory perception of whispered speech.
We will discuss the implications of these results for a model of speech production control.

SRPP: An experimental investigation of Sevillian Spanish metathesis

Sevillian Spanish is undergoing a metathesis change in /s/-voiceless stop sequences, whereby coda /s/ debuccalizes to [h] and then metathesizes with the following stop (e.g. /pasta/: [pahta] → [patha]). While the phonetic reality of this change is well established (e.g. Ruch & Peters 2016), the phonological behavior of the resulting [Ch] sequences has not been investigated. In most languages, [Ch] sequences are aspirated stops. It has been proposed that Sevillian [Ch] sequences may be coalescing into single segments (O’Neill 2009), which would represent an oddity in the realm of sound change. In this talk, I present results from a series of experiments testing the underlying representation and possible causes of this metathesis change. Behavioral evidence from two perception tasks suggests that Sevillian listeners still treat [Ch] sequences as underlying /s/-voiceless stop sequences: they map [h] in [Ch] sequences to an /s/ on a preceding word, and they treat syllables preceding [Ch] as if they were still closed by [h] for the purposes of stress assignment. Finally, results from a cross-linguistic ABX task show that the cause of laryngeal metathesis in Sevillian (or in other languages) is not likely to be perceptual. The results have implications for our understanding of segments, clusters, laryngeal metathesis, and the relative rarity of preaspirated segments cross-linguistically.

SRPP: A whole tongue approach to gutturals in Levantine Arabic using Generalized Additive Mixed Modelling of tongue surfaces

Guttural consonants (i.e., uvular, pharyngealized and pharyngeal) in Arabic are argued to form a natural class due to their phonological patterning and their use of a common oro-sensory zone in the pharynx (McCarthy, 1994; Sylak-Glassman, 2014a, 2014b). Yet phonetic studies have failed to identify a single phonetic exponent that explains this patterning, and many studies have tried to quantify it by looking only at changes within the tongue root. In this study, I use Generalized Additive Mixed Modelling to quantify whole-tongue changes obtained from Ultrasound Tongue Imaging. Using various quantification methods (2D and 3D difference splines), I show that gutturals use a common area in the vocal tract, which is indeed located at the tongue root, but also at the tongue dorsum and body. The observed patterns point towards a gradient rather than a categorical change. The phonological feature that can explain these patterns is predominantly [+retracted], which is a subcomponent of the feature [+constricted epilaryngeal tube] (following the predictions of the Laryngeal Articulator Model (LAM); Esling, 2005; Esling et al., 2019). However, tongue root, dorsum and body changes cannot simply be quantified by the feature [+retracted]. I discuss implications for an alternative formal account.
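As an illustration of the kind of difference-spline analysis mentioned above, the R sketch below fits a 2D difference smooth with mgcv on simulated tongue-spline data. All variable names, values and effect sizes are invented; the abstract does not give the study's actual model specification.

library(mgcv)

# Simulated stand-in for ultrasound tongue-spline data: radial distance (rho)
# sampled along a polar angle running along the tongue (theta), for guttural
# vs. plain consonants, from several speakers.
set.seed(1)
splines <- expand.grid(theta   = seq(0, pi, length.out = 40),
                       class   = c("plain", "guttural"),
                       speaker = factor(paste0("S", 1:6)))
splines$class <- factor(splines$class, levels = c("plain", "guttural"), ordered = TRUE)
contrasts(splines$class) <- "contr.treatment"   # so the by-smooth is a difference spline
splines$rho <- 50 + 5 * sin(splines$theta) +
  ifelse(splines$class == "guttural", 3 * cos(splines$theta), 0) +  # retraction-like difference
  rnorm(nrow(splines), sd = 1)

fit <- bam(rho ~ class +
             s(theta, bs = "cr", k = 10) +               # reference tongue shape
             s(theta, by = class, bs = "cr", k = 10) +   # difference spline: guttural vs. plain
             s(theta, speaker, bs = "fs", m = 1),        # per-speaker random smooths
           data = splines, method = "fREML", discrete = TRUE)
summary(fit)   # shape and significance of the guttural difference along the tongue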

References
Esling, J. (2005). There Are No Back Vowels: The Laryngeal Articulator Model. The Canadian Journal of Linguistics, 50(1), 13–44. https://doi.org/10.1353/cjl.2007.0007
Esling, J., Moisik, S., Benner, A., & Crevier-Buchman, L. (2019). Voice Quality: The Laryngeal Articulator Model. Cambridge University Press. https://doi.org/10.1017/9781108696555
McCarthy, J. J. (1994). The phonetics and phonology of Semitic pharyngeals. In P. A. Keating (Ed.), Phonological Structure and Phonetic Form (pp. 191–233). Cambridge University Press. https://doi.org/10.1017/CBO9780511659461.012
Sylak-Glassman, J. (2014a). An Emergent Approach to the Guttural Natural Class. Proceedings of the Annual Meetings on Phonology, 1–12. https://doi.org/10.3765/amp.v1i1.44
Sylak-Glassman, J. (2014b). Deriving Natural Classes: The Phonology and Typology of Post-Velar Consonants. University of California, Berkeley.

SRPP: Are geminate consonants always long consonants? Acoustic-phonetic properties of Polish geminates

Polish is a language with true lexical geminates that form minimal pairs with their singleton counterparts. Moreover, Polish, unlike many other geminating languages, allows both singly articulated and rearticulated geminate realisations. This appears to be a unique feature, since geminates are traditionally considered to be long counterparts of corresponding singletons, and rearticulation is not discussed in any comprehensive account of gemination in the world’s languages. In this talk, I will present the acoustic-phonetic characteristics of Polish geminates in order to open a discussion of how rearticulation may be incorporated into phonetic and phonological theories of gemination.

SRPP: Perceptually-motivated influences on nasal coarticulatory variation in French

The current study investigates whether the imitability of nasal coarticulation is affected by the phonological status of vowel nasality in French. First, does the contrastive status of vowel nasality in French lead speakers to different patterns of coarticulatory imitation for words that have a nasal-vowel minimal pair relative to words that do not? Furthermore, imitation has also been viewed as a process that facilitates intelligibility. Prior work has found that American English speakers imitate an increased degree of nasal coarticulation for lexical items that pose particular challenges for perception (Zellou et al., 2016). Thus, we also ask: do intelligibility factors influence patterns of coarticulatory nasality in French? Specifically, we compare phonetic imitation in trials where there is pressure to be intelligible, i.e., when an interlocutor needs clarification in order to identify the target word, with trials where there is less such pressure, i.e., after a correct response.

SRPP: The nature and origin of phonological features: a developmental perspective

Certain fundamental questions have run through theoretical debates on the phonological feature since its earliest conceptions, in the early 1900s, within the work of the Prague Linguistic Circle (e.g., Jakobson 1941; see also Dresher 2016 for a historical overview). These questions have also evolved alongside the advent of the theory of Universal Grammar (Chomsky 1975; Chomsky & Halle 1968), which posits a set of phonological features innate to every human being. Other approaches to phonology radically oppose this theory and reject the very existence of the phonological feature as a psychologically real unit for humans (Vihman & Croft, 2007).

In this presentation, we will adopt an emergentist approach to phonological representations (Pierrehumbert 2003; Mielke 2008), including the phonological feature, starting from two interrelated hypotheses: features are indeed real, but they are not innate; they must be acquired by the first-language learner. Drawing on phonetic (perceptual, articulatory) and phonological (distributional, prosodic) considerations, as well as on other levels of representation (e.g., lexical), we will discuss a set of phonological developmental facts in child speech. These observations highlight, on the one hand, the importance of the phonological feature for an understanding of phonological data. On the other hand, phonological features cannot be innate; they represent language-specific knowledge, and the emergence of this knowledge is clearly reflected in child data.

SRPP: R Three Ways: Capturing the dynamics of Scottish word-final /r/, using DCT and GAMMs

Sounds can be represented in terms of ‘static’ acoustic measures, e.g. from a single timepoint or a summary mean, or through ‘dynamic’ trajectories taken across the course of a segment. Sóskuthy [1] outlines an effective continuum from static measures, through less dynamic methods such as Discrete Cosine Transformation (DCT), which forces trajectories onto fixed reference shapes and whose coefficients can be hard to interpret, to the more intuitive outputs of Generalized Additive Mixed Models (GAMMs), whose flexible reference points permit closer approximation and visualization of trajectories. As we might expect, dynamic analyses reveal further insights over static measures into social-phonological contrasts (e.g. vowels, sibilants [2,3]), though the inherently dynamic nature of rhotics means that dynamic analysis of /r/ has long been used to characterise these sounds [4]. However, comparison of different dynamic techniques for interpreting the same feature is less usual [5].
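For readers unfamiliar with DCT-based trajectory analysis, the short R sketch below computes the first three DCT coefficients of a hypothetical F3 trajectory; these coefficients index the trajectory's overall mean, slope and curvature. The F3 values are invented for illustration only.

# First three DCT coefficients of a (hypothetical) F3 trajectory; coefficient 0
# tracks the overall mean, coefficient 1 the linear slope, coefficient 2 the curvature.
dct_coefs <- function(y, k = 0:2) {
  n <- length(y)
  sapply(k, function(j) sum(y * cos(pi * j * (2 * seq_len(n) - 1) / (2 * n))) / n)
}

f3 <- c(2600, 2550, 2480, 2400, 2350, 2330, 2340, 2380, 2430, 2500)   # invented F3 values (Hz)
dct_coefs(f3)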

This paper considers the relative contribution of static, less dynamic and more dynamic acoustic representations, specifically mean, DCT and GAMM, in specifying the role of linguistic, social and regional factors for Scottish word-final /r/ over the 20th century. Largely auditory analyses of Scottish /r/ report changes from apical trills/taps to postalveolar, retroflex and now bunched approximants favoured by middle-class females; long-term coda /r/ weakening has also been observed for urban Central Belt vernaculars [6]. The acoustic signature of a lowered third formant is found for approximant /r/; taps, trills, and weakened /r/ show high and/or rising F3 [7].

21-point F3 formant tracks (>49 ms) were taken from all instances of pre-segmented Scottish word-final /r/, extracted from 711 speakers covering geographical, social and ethnic diversity across an apparent-/real-time span of 100+ years; likely erroneous measures were removed by comparison against existing hand-measurements (36,845 tokens, 275 words). The first three DCT coefficients, capturing each trajectory’s mean, slope and curvature, were modelled for following context and lexical stress, and gender, dialect, ethnicity and decade of birth, using LME in R, controlling for speech rate, (log) /r/ duration, (log) lexical frequency, and speaker/word. GAMMs were fitted separately to male and female speaker subsets, with smooths by (log) duration, stress, following context, and dialect, ethnicity, and decade of birth, and random smooths for speaker/word.
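A schematic version of the first modelling step (LME over DCT coefficients) might look like the R sketch below; the data are simulated and the predictors are reduced to a gender-by-decade term plus duration, so this illustrates the model structure rather than the study's actual specification. The GAMM step could be set up along the lines of the mgcv sketch given earlier for the tongue-surface study.

library(lme4)

# Simulated per-token data: the second DCT coefficient (trajectory slope) modelled
# with a linear mixed-effects model. Column names and effect sizes are invented.
set.seed(2)
n <- 400
rdata <- data.frame(
  gender  = sample(c("F", "M"), n, replace = TRUE),
  decade  = sample(seq(1890, 1990, 10), n, replace = TRUE),
  speaker = factor(sample(paste0("spk", 1:40), n, replace = TRUE)),
  word    = factor(sample(paste0("w", 1:60), n, replace = TRUE)),
  log_dur = rnorm(n, log(0.08), 0.3)
)
rdata$dct1 <- -2 +
  0.02 * (rdata$decade - 1940) * (rdata$gender == "F") +    # toy gender-by-decade effect
  rnorm(40, sd = 0.5)[as.integer(rdata$speaker)] +          # speaker-level variation
  rnorm(60, sd = 0.3)[as.integer(rdata$word)] +             # word-level variation
  rnorm(n, sd = 1)

m_dct1 <- lmer(dct1 ~ gender * scale(decade) + log_dur +
                 (1 | speaker) + (1 | word), data = rdata)
summary(m_dct1)   # does the F3 slope change with decade of birth, and differently by gender?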

All measures show that Scottish word-final /r/ is influenced by linguistic, regional, ethnic and social factors. DCT analysis provides robust identification of key differences and interactions for the whole dataset; GAMMs permit more refined examination of contrasts of interest. For example, DCT shows how gender interacts with decade of birth: those born most recently show lowered F3 trajectories, especially female speakers, likely reflecting a gendered shift from taps to (more bunched) approximants. GAMMs show a similar pattern, but enable better inspection of differences between groups in trajectory shapes and variability over time.

References
1. Sóskuthy, M. Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. J. Phon. 84 (2021).
2. Watson, C. I. & Harrington, J. Acoustic evidence for dynamic formant trajectories in Australian English vowels. JASA 106, 458–468 (1999).
3. Reidy, P. F. Spectral dynamics of sibilant fricatives are contrastive and language specific. JASA 140, 2518–2529 (2016).
4. Plug, L. & Ogden, R. A parametric approach to the phonetics of postvocalic /r/ in Dutch. Phonetica 60, 159–186 (2003).
5. Tanner, J. Structured phonetic variation across dialects and speakers of English and Japanese. (McGill University, 2020).
6. Stuart-Smith, J. & Lawson, E. Scotland: Glasgow/the Central Belt. In Listening to the Past (ed. Hickey, R.) 171–98 (CUP, 2017).
7. Lawson, E., Stuart-Smith, J. & Scobbie, J. M. The role of gesture delay in coda /r/ weakening: An articulatory, auditory and acoustic study. JASA 143 (2018).

SRPP: Russian assimilatory palatalization as incomplete neutralization

Incomplete neutralization refers to small but significant phonetic traces of underlying contrasts in phonologically neutralizing contexts. The present study examines whether Russian assimilatory palatalization in C+j sequences also results in incomplete neutralization with respect to underlyingly palatalized consonants. Russian contrasts plain and palatalized consonants, e.g., /p/ vs. /pʲ/, with the “plain” stops possibly having a secondary articulation involving retraction of the tongue dorsum (velarization/uvularization). However, Russian also has stop-glide sequences that form near-minimal pairs with palatalized stops: e.g., /pjot/ ‘drink (3ps pres)’ vs. /pʲok/ ‘bake (3ps past).’ In the environment preceding palatal glides, the contrast between palatalized and plain consonants is neutralized, due to the palatalization of the plain stop: /pjot/ → [pʲjot] (assimilatory palatalization). The purpose of the study is to explore whether this neutralization is complete. To do so, we conducted an electromagnetic articulography (EMA) experiment examining the temporal coordination and the spatial position of the tongue body in derived and underlyingly palatalized consonants. Articulatory results from four native speakers of Russian (one male) revealed that gestures in both conditions are coordinated as complex segments; however, there are differences across conditions consistent with the residual presence of a tongue dorsum retraction gesture in the “plain” obstruents. We conclude that neutralization of the plain-palatalized contrast in Russian is incomplete: consonants in the assimilatory palatalization condition exhibit inter-gestural coordination characteristic of palatalized consonants along with residual evidence of an underlying tongue dorsum retraction (velarization/uvularization) gesture.

SRPP by Boram Lee and Jinyu Li

A one-year longitudinal study of development of L2 Korean for French learners – The role of cue weighting in L2

Boram Lee (Laboratoire de Phonétique et Phonologie)

Acquiring a second language requires learning which cues are relevant for the contrasts of the L2, as well as the relative weight of these cues, or “cue weighting”. In French, VOT (voice onset time) is the main cue distinguishing voiced from voiceless consonants, although there are secondary cues such as the intensity of the release burst or the fundamental frequency (f0) of the following vowel (e.g., Cho & Ladefoged, 1999). In Korean, by contrast, VOT and f0 are equally important for distinguishing the three stop categories, namely lenis, fortis and aspirated (e.g., Kim, 2004). The central research question is the following: do French-speaking learners of L2 Korean modify, over the course of their learning, the cue weighting that allows them to distinguish this three-way contrast? More precisely, how do French-speaking learners of L2 Korean adapt the cue weighting of their L1 French in production and perception over time? To examine these questions, 21 female French-speaking students in their first year of Korean studies took part in two production tasks and two perception tasks longitudinally (every month), for a total of 8 sessions. We will focus on the perception results. We ran two identification tasks, first with natural stimuli and then with synthesized stimuli. The natural stimuli consisted of CV syllables (C: lenis /t/, /tɕ/, fortis /t*/, /tɕ*/ and aspirated /tʰ/, /tɕʰ/; V: /a/, /i/, /o/). The synthesized stimuli were created by resynthesis in Praat (7 VOT levels × 5 f0 levels). The longitudinal perception results show greater difficulty in identifying the lenis category compared with the aspirated and fortis categories, and lenis identification does not improve over the year, whereas identification of the other two categories does. Turning to cue weighting, the learners show different patterns across categories: VOT serves to distinguish aspirated from fortis/lenis, and f0 serves to distinguish fortis from lenis, suggesting an organization into two categories instead of three. We also observed that the learners mainly use the VOT cue, the primary cue in French, and not f0, to produce the three categories. In sum, this study shows the influence of the L1 on L2 perception and production in terms of cue weighting.
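As an illustration of how perceptual cue weights of the kind discussed above can be estimated, the R sketch below simulates identification responses over a 7-step VOT × 5-step f0 continuum and fits a logistic regression whose standardized coefficients index the relative weight of each cue. The data and effect sizes are invented and do not reproduce the study's results.

# Simulated identification responses over a 7-step VOT x 5-step f0 continuum; the
# standardized logistic-regression coefficients index the relative weight of each cue.
set.seed(3)
grid <- expand.grid(vot = 1:7, f0 = 1:5, rep = 1:10)
p_aspirated <- plogis(-4 + 1.0 * grid$vot + 0.3 * grid$f0)   # toy listener: VOT-dominant
grid$resp_aspirated <- rbinom(nrow(grid), 1, p_aspirated)

m_cues <- glm(resp_aspirated ~ scale(vot) + scale(f0), family = binomial, data = grid)
coef(m_cues)   # larger |coefficient| = heavier perceptual weight for that cue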


Speech temporal control modulated by prosodic factors and sense of agency

Jinyu Li (Laboratoire de Phonétique et Phonologie)

The flexibility of speech motor control in the temporal dimension, observed in the durations of speech gestures, is especially evident in studies of delayed auditory feedback (DAF), in which speakers hear their own speech with a certain delay and respond to this temporal mismatch by decreasing their speech rate (i.e., lengthening syllables). However, given the complexity of the speech motor control system, we may expect the temporal flexibility of speech production to be modulated by various factors, including prosodic factors (e.g., the syllable’s position in the prosodic structure) and psycholinguistic factors (e.g., the sense of agency during speech production, which is a determining factor in the control of our own speech). To provide evidence for these hypotheses, we conducted an experiment based on real-time perturbations of auditory feedback, in which 30 French speakers heard their speech with a certain delay and/or with a shift in F0. More precisely, we tested whether, with increasing delay in the auditory feedback, accented syllabic nuclei were lengthened more than non-accented nuclei. Moreover, we tested whether a constant F0 shift in the auditory feedback could alter the speakers’ sense of agency during speech production, and whether this effect could modulate their responses to DAF. The results show that speakers’ responses to DAF depend on the syllable’s status in the prosodic hierarchy: DAF lengthens accented syllables more, thus increasing their salience and leading to a reorganization of speech rhythm, as demonstrated by a strengthening of the coordination between syllabic and supra-syllabic amplitude modulations. The results also show that the constant F0 shift may affect the speakers’ sense of agency, thereby reducing the effects of DAF; however, this reduction interacts with the effect of the prosodic structure.
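One way the accent-by-delay interaction described above could be tested is with a mixed-effects model of nucleus duration; the R sketch below uses simulated data with invented variable names and effect sizes, purely to illustrate the analysis logic, not the study's actual pipeline.

library(lme4)

# Simulated data: does feedback delay lengthen accented syllabic nuclei more than
# unaccented ones? Variable names and effect sizes are invented for illustration.
set.seed(4)
n <- 600
daf <- data.frame(
  delay_ms = sample(c(0, 60, 120, 180), n, replace = TRUE),
  accented = sample(c(0, 1), n, replace = TRUE),
  speaker  = factor(sample(paste0("sp", 1:30), n, replace = TRUE))
)
daf$nucleus_dur <- 100 + 0.15 * daf$delay_ms +
  0.10 * daf$delay_ms * daf$accented +                  # extra lengthening when accented
  rnorm(30, sd = 8)[as.integer(daf$speaker)] +          # speaker-level variation
  rnorm(n, sd = 15)

m_daf <- lmer(nucleus_dur ~ delay_ms * accented + (1 | speaker), data = daf)
summary(m_daf)   # a positive delay-by-accent interaction = accented nuclei lengthen more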