SRPP: A Lexical Access Model for Italian: The LaMIT project – Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features

Modelling the process by which a listener derives the words intended by a speaker requires a hypothesis about how lexical items are stored in memory. This work aims at developing a system that imitates humans in identifying words in running speech and, in this way, at providing a framework to better understand human speech processing. We build a speech recognizer for Italian based on the principles of Stevens’ model of lexical access, in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). “Toward a model for lexical access based on acoustic landmarks and distinctive features,” J. Acoust. Soc. Am., 111(4):1872–1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach. Italian is the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as predicted by the theory, and to reveal which of the underlying mechanisms may be language-independent. Future developments will test the hypothesis that specific acoustic discontinuities, called landmarks, that serve as cues to features are language-independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech. A new lexical access corpus, the LaMIT database, created and labeled specifically for this work, will be described; it is provided freely to the speech research community. Furthermore, as will be presented, a legacy software tool named xkl, with superior capabilities for detailed acoustic analysis of speech, developed in the 1980s by the late Dennis Klatt at MIT, was revamped and adapted to modern computing platforms. Finally, we will address a peculiar property of Italian, lexical vs. syntactic consonant gemination, as an exemplar case of the adopted research method.

SRPP: The assessment of speech disorders from the point of view of automatic speech processing

This presentation will deal with a specific case of communication disorder, namely speech and voice disorders. After defining this specific context, we will focus on the assessment of this type of disorder, which is necessary in the clinical field, and on how automatic approaches can overcome the limitations of perceptual assessment, particularly in terms of subjectivity and reproducibility. We will briefly review the classical machine learning approaches used since the 1990s and, more recently, the application of deep learning. We will then look at the concept of interpretability in deep learning (as we define it) and how it can be used to provide useful information to clinicians.

SRPP: Speech factors over a ten-year span: A longitudinal perspective on young and middle-aged adult speakers’ speech

Longitudinal studies of adult speakers often investigate large time intervals or elderly speakers; however, variance and change in young and middle-aged adults’ speech are understudied, although this question is relevant in several applied fields, e.g., forensic phonetics, where the speech materials to be compared are often recorded years apart.
Our research deals with the question of change or variance in young and middle-aged speakers’ speech over a mid-term, ten-year interval. Speakers of a Hungarian database (Neuberger et al. 2014) were invited to participate in follow-up recordings after ten years (Gráczi et al. 2020). The protocol includes spontaneous, semi-spontaneous, and read speech tasks, which are studied both with forensic speaker verification tools and with acoustic phonetic methods. The studies introduced in the talk have been carried out simultaneously with the recordings; therefore, the number of subjects differs across the specific analyses. The f0, the first four formants of vowels, the four spectral moments of obstruents, speech tempo, and pauses (filled, silent, and combined) have been studied so far. F0 has been analysed in various speech types, while the spectral features have so far been studied only in read speech, to control for the effect of the context.
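The four spectral moments mentioned above treat the obstruent’s magnitude spectrum as a probability distribution over frequency. As a minimal sketch (the function name and the toy spectrum are illustrative, not from the study), the moments can be computed like this:

```python
import math

def spectral_moments(freqs, amps):
    """Four spectral moments of a magnitude spectrum: centroid (mean),
    standard deviation, skewness, and excess kurtosis, with normalized
    amplitudes acting as probability weights over frequency."""
    total = sum(amps)
    p = [a / total for a in amps]                       # normalize to sum to 1
    centroid = sum(f * w for f, w in zip(freqs, p))
    var = sum(w * (f - centroid) ** 2 for f, w in zip(freqs, p))
    sd = math.sqrt(var)
    skew = sum(w * (f - centroid) ** 3 for f, w in zip(freqs, p)) / sd ** 3
    kurt = sum(w * (f - centroid) ** 4 for f, w in zip(freqs, p)) / sd ** 4 - 3.0
    return centroid, sd, skew, kurt

# Toy spectrum with energy concentrated around 4-5 kHz, as for a sibilant
freqs = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
amps = [0.05, 0.10, 0.20, 0.80, 1.00, 0.40, 0.10]
m1, m2, m3, m4 = spectral_moments(freqs, amps)
```

In practice the spectrum would come from an FFT over a windowed portion of the fricative or burst; the sketch only shows how the four summary statistics relate to it.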
The results show that, although the speech samples of the same speaker recorded ten years apart are detected as more similar to each other than speech samples from two different speakers, the scores of the verification test do not reach the standard similarity threshold for most speakers, while the acoustic measures show a diverse picture. Most of these vary without showing group-level tendencies, except for the average f0 of young female speakers.
In the talk, we will address the specific results, their possible explanations, and the next steps for synthesizing the results and drawing the main conclusions.

Gráczi, Tekla Etelka, Huszár, Anna, Krepsz, Valéria, Száraz, Bettina, Damásdi, Nóra, & Markó, Alexandra (2020). Longitudinális korpusz magyar felnőtt adatközlőkről [Longitudinal speech corpora of Hungarian adult speakers]. In Proceedings of the Magyar Számítógépes Nyelvészeti Konferencia, Szeged, Hungary.
Neuberger, Tilda, Gyarmathy, Dorottya, Gráczi, Tekla Etelka, Horváth, Viktória, Gósy, Mária, & Beke, András (2014). Development of a large spontaneous speech database of agglutinative Hungarian language. In Proceedings of the 17th International Conference on Text, Speech and Dialogue (TSD 2014), Brno, September 8–12, 2014.

SRPP: Using Crowd-Sourced Data and Automatic Alignment to Investigate the Phonetics and Phonology of Less-Resourced Languages

Less-resourced languages are usually left out of comparative phonetic studies based on large corpora. We contribute to the recent efforts to fill this gap by assessing how to use open-access, crowd-sourced audio data from Lingua Libre for phonetic research. Lingua Libre is a participative linguistic library developed by Wikimedia France since 2015. This presentation has two main goals: first, to present the data that Lingua Libre has to offer and how to use it; second, to use these data in a typological study as a proof of concept. For this second part, I consider the Inventory Size Hypothesis, which predicts that, in a given system, variation in the realization of each vowel will be inversely related to the number of vowel categories. We investigate ten languages with varying numbers of vowel categories, namely German, Afrikaans, French, Catalan, Italian, Romanian, Polish, Russian, Spanish, and Basque. Audio files are extracted from Lingua Libre to be aligned and segmented using the Munich Automatic Segmentation System (WebMAUS). Information on the formants of the cardinal vowels /a, i, u/ is then extracted to measure how vowels expand in the acoustic space and whether this correlates with the number of oral vowel categories in the language. The results provide valuable insight into the question of vowel dispersion and demonstrate the wealth of information that crowd-sourced data has to offer.
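One simple way to quantify how far the cardinal vowels expand in the acoustic space is the mean Euclidean distance of their mean formant values from the centroid of the F1–F2 plane. The sketch below uses hypothetical formant values and a hypothetical function name; the study’s actual dispersion measure is not specified in the abstract:

```python
import math

def vowel_dispersion(tokens):
    """Mean Euclidean distance of (F1, F2) points from their centroid:
    a simple proxy for acoustic vowel-space expansion."""
    n = len(tokens)
    f1c = sum(f1 for f1, _ in tokens) / n
    f2c = sum(f2 for _, f2 in tokens) / n
    return sum(math.hypot(f1 - f1c, f2 - f2c) for f1, f2 in tokens) / n

# Hypothetical mean formants (Hz) for /a, i, u/ in two languages
lang_a = [(850, 1450), (300, 2300), (320, 800)]
lang_b = [(750, 1400), (350, 2100), (360, 950)]

d_a = vowel_dispersion(lang_a)
d_b = vowel_dispersion(lang_b)
```

Comparing such dispersion values across languages with different inventory sizes is one way to operationalize the correlation the study tests; raw Hz distances would normally be normalized (e.g., per speaker, or in Bark) before cross-language comparison.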

SRPP: Handy prosody: how prosody in the voice, lips, and hands shapes the words you hear

Speech conveys both segmental information about vowels and consonants and suprasegmental information about, for instance, intonation, speech rate, and lexical stress, together known as the prosody of speech. In this talk, I will demonstrate that listeners are keenly sensitive to spoken prosody conveyed through both the auditory and the visual modality. Work from my group showcases the vast variability in how different talkers produce spoken prosody, while also unveiling the remarkable flexibility with which listeners can learn to strategically adapt to this between-talker variability. It also emphasizes that prosody is a multimodal linguistic phenomenon, with the voice, lips, and even hands conveying prosody in concert. For instance, evidence for a ‘manual McGurk effect’ provides a proof of concept of how even relatively simple ‘flicks of the hands’ can influence the perception of lexical stress. Moreover, human listeners are shown to actively weigh various multisensory cues to prosody depending on the listening conditions at hand. Thus, prosody – in all its multisensory forms – is a potent factor in speech perception, determining which words we hear.

SRPP: Recent improvements of the articulatory speech synthesizer VocalTractLab

Unlike mainstream neural speech synthesizers, articulatory speech synthesizers directly simulate the process of speech production at the articulatory and aero-acoustic levels. This talk presents the articulatory synthesizer VocalTractLab 2.3, its main model components, and its multiple control levels. We also present recent improvements in the synthesis of German diphthongs, the modelling of energy losses in the vocal tract, and new applications of the synthesizer. The synthesizer is available for download at www.vocaltractlab.de.

SRPP: Lexical-semantic organization in the developing brain

Until recently, there has been little evidence regarding how and when infants begin to integrate words into an inter-connected lexical-semantic system. Recent electrophysiological studies show that the lexical-semantic system (activation of related words such as cat–horse) develops together with vocabulary during the second year of life in monolingual infants (Rämä et al., 2013; Rämä et al., 2018). Some evidence shows that lexical-semantic organization develops later in bilingual than in monolingual infants. There is also mixed evidence as to whether lexical-semantic activation occurs similarly in the dominant and non-dominant languages of bilingual language learners (e.g., Sirri & Rämä, 2019). In my talk, I will present results regarding the neurophysiological mechanisms underlying lexical-semantic development in monolingual and bilingual infants. I will also describe our recent findings on the effect of speaker familiarity on the processing of word meanings.

References

Rämä, P., Sirri, L., & Serres, J. (2013). Development of lexical–semantic language system: N400 priming effect for spoken words in 18- and 24-month-old children. Brain and Language, 125(1), 1–10.

Rämä, P., Sirri, L., & Goyet, L. (2018). Event-related potentials associated with cognitive mechanisms underlying lexical-semantic processing in monolingual and bilingual 18-month-old children. Journal of Neurolinguistics, 47, 123–130.

Sirri, L., & Rämä, P. (2019). Similar and distinct neural mechanisms underlying semantic priming in the languages of the French–Spanish bilingual children. Bilingualism: Language and Cognition, 22(1), 93–102.

SRPP: Diversity of voices in our heads: phonology in the light of auditory verbal aphantasia

« [tbdrtnt], she mentioned them, the voices within. »

As you silently read this line, can you hear the voice of Rachid Ridouane explaining the phonetics of Tashlhiyt?

Endophasia, or inner speech, can take various formats depending on the individual or the situation. It is sometimes considered to be expanded and accompanied by auditory, somatosensory or visual sensations, while other descriptions highlight instead its condensed amodal nature. It can occur as a monologue or a dialogue. Finally, it can feel intentional, when we rehearse material in memory, or unintentional, during mind wandering or rumination. To account for variations along the three dimensions of condensation, dialogality and intentionality, we introduced ConDialInt, a neurocognitive model rooted in a predictive control framework. The inner voice phenomenon is seen as an exaptation of the sensory predictions involved in the control of overt speech. Speech production is considered to be hierarchically controlled, from conceptualisation to articulation, via formulation, motor planning and programming stages. At each stage, control is based on the comparison between initial input and prediction. Endophasia is viewed as an interruption in the speech production process. Condensed forms emerge when the interruption occurs early, before the formulation stage. Expanded forms, inner voices, recruit the full production process, interrupted only prior to articulation. Dialogal forms are taken to include indexical and perspective properties. The degree of intentionality is associated with the degree of control applied to the predictions. The ConDialInt model is compatible with neuroanatomical data obtained for a variety of inner speech situations. It also accounts for atypical forms of endophasia. In particular, auditory verbal aphantasia (lack of inner voice feeling) can be construed as an extreme on the condensation dimension. These propositions have implications for the nature of phonological representation postulated in theories of language processing.

SRPP by Philipp M. Buech and Clémence Guieu-Grandsire

(1) Pharyngealization and Labialization in Tashlhiyt: Articulation and acoustics
Philipp M. Buech (LPP)

Secondary articulations occur in the phoneme inventories of approximately a quarter of the world’s languages (Buech et al., 2022). In terms of articulation, secondary articulations are produced by a gesture of a lesser degree in addition to a primary gesture (Trask, 1996). Acoustically, secondary articulations are signaled in the formant structure of adjacent vowels rather than on the consonants themselves (Ladefoged & Maddieson, 1996). Tashlhiyt is an Amazigh language and one of the rare languages that have two secondary articulations in their phonological system: pharyngealization and labialization, a co-occurrence present in only 0.3% of the world’s languages. In general, pharyngealization is well investigated, especially on data from Arabic varieties, while labialization remains under-investigated. In Tashlhiyt, pharyngealization is a feature of coronals (e.g., [izi] ‘fly’ vs. [izˤi] ‘bile’), while labialization is present in the set of dorsals (e.g., [ngi] ‘flow!’ vs. [ngʷi] ‘delouse!’). Another peculiarity of Tashlhiyt is its extensive use of consonant sequences. Since the information on secondary articulations is carried mainly by adjacent vowels, the question arises as to how pharyngealization and labialization are realized when the positions adjacent to the consonantal targets are partly or entirely occupied by other consonants.
This talk addresses this question and presents articulatory and acoustic data of pharyngealization and labialization and their realization in different contexts: V_V, VC_V, V_CV, and VC_CV.

(2) Perception and production of rhotics /ɾ/ and /ʁ/ by French-Greek bilinguals and monolinguals
Clémence Guieu-Grandsire (LPP)

Does bilingual phonological development resemble that of monolinguals? To address this question, we explored bilingual and monolingual phonological acquisition with a specific focus on the production and perception of French and Greek rhotics. Our longitudinal study involved 27 children aged between 2;8 and 6;0, divided into four groups: eight French monolinguals, twelve simultaneous French-Greek bilinguals (six living in France and five in Greece), and seven Greek monolinguals. Our investigation involved two different tasks using the same words with rhotics as syllable onsets: a picture naming task and a mispronunciation detection task.
Results indicate that bilinguals and monolinguals exhibit similar behaviours when facing the structural complexity inherent to the rhotic type, particularly in production. However, for bilinguals, accurate realization and identification of rhotics appear to be strongly influenced by language dominance. Error patterns of bilinguals also indicate a potential unidirectional transfer, attesting to interference between the two phonological systems.
In the context of bilingual phonological acquisition, the influence of structural complexity inherent to the learned language(s), as observed in monolingual acquisition, seems to be counterbalanced by external factors such as the quantity of language exposure.

SRPP: Kinematic and acoustic contributors to formant perturbation responses in individuals with and without Parkinson’s disease

Auditory perturbation tasks provide insight into the use of auditory feedback during speech, which is useful for understanding the nature of speech disruptions in individuals with Parkinson’s disease (IwPD). Prior studies1,2 have investigated acoustic responses to formant perturbations in IwPD; however, no study has examined the articulatory kinematic correlates of these responses, raising the question of whether articulatory kinematics may reveal motor changes that are not captured acoustically. In this study, we assessed 33 IwPD and 25 control speakers (CS) on their acoustic and kinematic responses to a gradual perturbation of the first and second formants (F1 and F2). In the talk I will present the first results from the study, including how group (IwPD vs. CS) and other variables of interest (e.g., disease severity) impact adaptation as seen in acoustics (F1 and F2; captured with a microphone) and kinematics (tongue and jaw height and frontness; captured with electromagnetic articulography sensors).
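In such paradigms, adaptation is commonly summarized as the change in produced formant frequency from a baseline phase to the end of the perturbation (hold) phase, relative to the applied shift; speakers who oppose an upward feedback shift lower their produced formant. This is only an illustrative sketch with hypothetical values, not the study’s actual analysis:

```python
def adaptation_response(baseline_f1, hold_f1, shift_hz):
    """Change in produced F1 from baseline to the end of the hold phase,
    expressed as a fraction of the applied feedback shift. A negative
    value for a positive shift indicates an opposing (compensatory)
    response; -1.0 would mean full compensation."""
    return (hold_f1 - baseline_f1) / shift_hz

# Hypothetical values: feedback F1 shifted up by +100 Hz; the speaker
# partially opposes it by lowering produced F1 from 650 Hz to 620 Hz.
ratio = adaptation_response(650.0, 620.0, 100.0)
```

The same summary can be applied to kinematic signals (e.g., tongue height from articulography) by substituting the articulatory measure for F1, which is one way the acoustic and kinematic responses described above can be compared on a common scale.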

1Abur, D., Subaciute, A., Daliri, A., Lester-Smith, R. A., Ashling, L., Cilento, L., Enos, N. M., Weerathunge, H. R., Tardif, M., & Stepp, C. E. (2021). Feedback and Feedforward Auditory-Motor Processes for Voice and Articulation in Parkinson’s Disease. Journal of Speech, Language, and Hearing Research, 64(12). https://doi.org/10.1044/2021_JSLHR-21-00153

2Mollaei, F., Shiller, D. M., & Gracco, V. L. (2013). Sensorimotor adaptation of speech in Parkinson’s disease. Movement Disorders, 28(12). https://doi.org/10.1002/mds.25588