SRPP: Talker identity from acoustic voice variability

What makes your voice yours? Human voices, our “auditory faces,” are inherently social, involving a speaker, a signal, a listener, and their interaction. Neither voice perception nor production can be understood without consideration of the dynamic, variable signals that shape utterances. Our team has identified a suite of measures that constitute a psycho-acoustic model of voice quality, paving the way for a long overdue refinement in characterizing talker voice variation. In this talk, I introduce a series of interdisciplinary studies of voice quality that tackle the challenge of identifying which of the model’s indices account for perceptually relevant acoustic variance within and among speakers. These studies investigate vocal and perceptual behaviors of many individuals from various backgrounds, employing computational tools to analyze large arrays of high-dimensional data to characterize voice variation within and across speakers and across voice qualities, speaking styles, emotions, and dialects or languages. The overarching hypothesis is that the same small set of acoustic variables characterizes acoustic variability across voices but that much of what characterizes individual speakers is idiosyncratic. Our investigation incorporates a broad range of language and communicative contexts, including identifying acoustic spaces for speech produced in different speaking styles and under different emotions, languages with and/or without tone and/or phonation contrasts, severely pathologic voices, and their perceptual consequences. Our findings serve as a basis for research on voice production and recognition, and for clinical diagnosis and/or treatment of deviation in voice quality.

SRPP: Paradigm uniformity effects in French liaison

In French, some words ending in a vowel use a consonant-final variant before vowel-initial words (e.g. grand [ɡʁɑ̃] ∼ [ɡʁɑ̃t] ‘big-MASC’). The consonants occurring at the end of consonant-final variants are called liaison consonants. Liaison consonants are challenging for phonological theory because of evidence that they pattern ambiguously between stable word-final consonants and word-initial consonants. Some researchers have proposed specific phonological representations to account for this ambiguous behavior, including floating consonants and gradient underlying representations.
In this talk, I will propose an alternative account where the ambiguous patterning of liaison consonants is analyzed as a paradigm uniformity effect: in a word1–word2 sequence, the liaison consonant ends up being ambiguous between a stable word-final consonant and a word-initial consonant because of a pressure to make contextual variants of word1 and word2 similar to their citation forms (i.e. words as pronounced in isolation).
I will use two case studies to support this analysis: (i) a study of liaison enchaînée in Swiss French using acceptability judgments, and (ii) a phonetic study of liaison consonants in affrication contexts (t#i) in Quebec French. I will show that the data of Study 1 and Study 2 can be modeled using a probabilistic grammar including independently motivated paradigm-uniformity constraints, without any need for special phonological representations.

SRPP: Assessing breathing individuality in the interaction between speech and limb movement

In the late 1980s, several physiological studies observed a within-speaker consistency in breathing characteristics at rest across days (Shea, 1987) or even years (Benchetrit, 1989). This within-speaker consistency is characterized by breathing cycle patterns such as the duration, the volume of air inspired or the cycle shape, and was also found within activities involving breathing, such as physical effort, but not between contexts such as breathing at rest and breathing during physical activity. Indeed, ventilation can be highly modulated by physical exertion: as the muscles’ demand for oxygen increases, breathing becomes increasingly stressed, the duration of cycles becomes shorter and their amplitude increases. Breathing is also intrinsic to speech: speech lengthens the breathing cycles via the control of exhalation, and shortens the inhalation duration. Speech breathing is a specific way to control ventilation while supporting speech planning and phonation constraints. It is highly variable between speakers but also within the same speaker, depending on utterance properties. Can we still observe consistency over time in speakers’ breathing profiles despite these variations? Is this potential speech breathing individuality modulated by limb motion? We addressed these questions by analyzing the breathing profiles of 25 native speakers of German performing a narrative task on two days under different limb movement conditions. The individuality of breathing profiles over conditions and days was assessed by adopting methods used in physiological studies that investigated a ‘ventilatory personality’. Our results suggest that speaker-specific breathing profiles in a narrative task are maintained over days and that they stay consistent despite light physical activity. These results are discussed with a focus on better understanding what speech breathing individuality is, how it can be assessed, and the research perspectives that this concept opens up.

SRPP: Effects of conventions and social context on tune interpretation

Traditionally, a clear distinction has been drawn between the phonological and phonetic levels of intonation analysis, with the former conveying linguistic (e.g., illocutionary) and the latter paralinguistic (e.g., affective) meanings. However, a growing body of evidence reveals that tune meaning is multidimensional and flexible, with the choice of a tune depending on both linguistic and paralinguistic purposes. In this talk, I will present collaborative work on the effects of tune choice on listeners’ interpretation of affective meanings. By means of two behavioral experiments, I will show that (1) listeners exploit their knowledge of the conventional association between illocutionary acts (requests, offers) and intonation (rising, falling) to infer certain kinds of affects (concerning speaker authority, mood or sincerity), and (2) this ‘inferential process’ is partially modulated by listeners’ knowledge about the speaker–addressee social relationship. Taken together, these results reinforce findings that the phonological contour is a fundamental cue for perlocutionary/affective meanings, and that such meanings are partly context-dependent.

SRPP: Interpretable comparison between auditory brainstem response and intermediate convolutional layers in deep neural networks

Can we build models of language acquisition from raw acoustic data in an unsupervised manner? Can deep convolutional neural networks learn to generate speech using linguistically meaningful representations? In this talk, I propose that language acquisition can be modeled with Generative Adversarial Networks (GANs) and that such modeling has implications both for the understanding of language acquisition and for the understanding of how deep neural networks learn internal representations. I propose a technique that allows us to wug-test neural networks trained on raw speech. I further propose an extension of the GAN architecture in which learning of meaningful linguistic units emerges from a requirement that the networks output informative data. With this model, we can test what the networks can and cannot learn, how their biases match human learning biases (by comparing both behavioral and neural data with networks’ outputs), how they represent linguistic structure internally, and what GANs’ innovative outputs can teach us about productivity in human language. This talk also makes a more general case for probing deep neural networks with raw speech data, as dependencies in speech are often better understood than those in the visual domain and because behavioral data on speech acquisition are relatively easily accessible.

SRPP: Long-distance coarticulation in Arabic: Vowels, pharyngealization and gemination

This study investigated anticipatory vowel-to-vowel coarticulation in Arabic, and sought to determine the degree to which it is affected by the pharyngealization and length of intervening consonants. Speakers of Egyptian Arabic were recorded saying sentences containing nonsense sequences of the form /baɁabaCV:/, where C was chosen from {/t/, /tˤ/, /t:/, /tˤ:/} and V was a long vowel /i:/, /a:/ or /u:/. Analysis of the first and second formants of the recorded vowels revealed that (a) vowel-to-vowel coarticulatory effects could sometimes extend to a distance of three vowels before the context vowel; (b) the consonant-to-vowel effects associated with pharyngealization were consistently seen at similar distances, while also decreasing in magnitude at greater distances from the triggering consonant; and (c) effects related to intervening consonant length were idiosyncratic, and in particular did not lead to consistent blocking of vowel-to-vowel effects. An exception was one speaker who showed significant vowel-to-vowel effects at all three measured distances that were effectively blocked in the pharyngealized consonant condition.
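One common way to quantify anticipatory vowel-to-vowel coarticulation of the kind described above is to compare a target vowel’s formant values across context-vowel conditions at each distance from the trigger. The sketch below illustrates the idea with made-up F2 values (the numbers and the distance-indexed layout are purely illustrative, not the study’s data):

```python
# Sketch: anticipatory V-to-V coarticulation measured as the F2 difference of a
# target vowel between front (/i:/) and back (/u:/) context-vowel conditions,
# at 1, 2 and 3 vowels' distance from the context vowel.
# All values are hypothetical means in Hz, chosen only to show a decreasing effect.
f2_hz = {
    1: {"i:": 1450.0, "u:": 1250.0},  # adjacent to the context vowel
    2: {"i:": 1400.0, "u:": 1310.0},
    3: {"i:": 1380.0, "u:": 1350.0},  # three vowels away
}

def coarticulation_effect(by_context: dict) -> float:
    """Effect size at one distance: F2 in the front-vowel context minus
    F2 in the back-vowel context. Larger values = stronger coarticulation."""
    return by_context["i:"] - by_context["u:"]

for distance in sorted(f2_hz):
    print(distance, coarticulation_effect(f2_hz[distance]))
# With these illustrative numbers the effect shrinks with distance: 200, 90, 30 Hz.
```

A non-zero difference at distance 3 would correspond to finding (a) above; blocking by an intervening consonant would show up as the difference collapsing toward zero in that condition.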

SRPP: Effects of background noise on speech communication across the lifespan

When conversing in less than ideal or “challenging” conditions, such as in background noise, talkers continuously monitor the success of the communication. In cases of communication breakdown, they modify their speech in an attempt to make themselves more intelligible to the listener. These modifications include a range of acoustic-phonetic (e.g., slower, more intense and hyper-articulated speech) and linguistic adaptations (e.g., higher-frequency words, shorter and simpler sentences), often broadly referred to as “clear speech”. It has been shown that these speech modifications are modulated by complex interactions between various talker-related (e.g., age, regional accent), listener-related (e.g., age, hearing acuity, linguistic competence) and environment-related factors (e.g., room acoustics, background noise type; Mattys et al., 2012).

Our recent Economic and Social Research Council funded research project at University College London focused on a few of these factors, namely how speech modifications and communication difficulty vary as a function of age and background noise type. In this project, we collected sensory, cognitive, speech production and perception data as well as self-evaluations of speaking and listening effort from 114 healthy Southern British English speaking participants aged between 8 and 80 years. For speech production, we recorded age- and sex-matched pairs while they carried out the “spot the difference” diapix task using the DiapixUK picture sets (Baker and Hazan, 2011) in conditions varying in the amount of informational masking (three voices in the background) and energetic masking (speech-shaped noise) present. A secondary task (pressing a bell when hearing a dog barking but withholding the response when hearing a car horn honking) was added to make the task more cognitively demanding, thus reflecting real-life multitasking situations. After completing each diapix task, both participants completed a paper-based questionnaire, answering questions about communicative difficulty and listening/speaking effort using an 11-point Likert scale. Baseline sensory and cognitive measures of hearing (pure tone audiogram), speech perception (coordinate response measure task, CCRM), cognitive function (tests of expressive vocabulary, letter-number sequencing, letter-digit substitution) and a standardised questionnaire of auditory disability (SSQ) were also collected.

In this talk, I will summarize the main results from the speech production and perception tasks (the diapix and CCRM tasks) and from the self-report measures (ratings of speaking and listening effort) that (some, not all!) show distinct developmental trajectories for different types of noise (speech vs. non-speech). Overall, our results suggest that when the background noise has a higher cognitive load, as in the case of others’ speech, children and older talkers need to exert more vocal effort to ensure successful communication. I will discuss these findings within the communication effort framework.

SRPP: The phonology of Zwara Berber and its silent stress

Berber is a typological treasure chest. While it has a conventional Afro-Asiatic syllable structure, the distribution of its segments over syllable positions is striking. In this talk, I will illustrate this on the basis of the dialect spoken in Zwara (Zuwarah), a coastal city in western Libya. While vowels only occupy syllable peaks, consonants appear in both C-positions and V-positions without exception. Since both /j w/ and /i u/ exist, this means that vowels and glides contrast in syllable peaks. In addition, the dialect has geminate versions of all consonants. While always requiring a mora, geminates can appear in nearly all positions in syllable structure. Notably, they cannot appear in an onset-plus-peak position, making the beginning of the syllable rime an unbridgeable boundary for them.

A frequent location for geminates is the rime-plus-onset location, whereby the rime favours a vowelless first half of the geminate. Rimes do not contrast /əC/ and /C/; the realization of [ə] depends on the type of C, with voiceless obstruents typically lacking a preceding schwa. These first halves of geminates frequently occur in the stressed syllable, so that many words have ‘silent stress’. For instance, in /a.ˈws.su/ ‘humid period’ and /m.ˈmˁχ.χrˁ/ ‘late’ the stressed syllables are /ws/ and /mˁχ/ respectively, where the obstruent is the syllable peak. The voicelessness of the syllable peak and the following onset will interrupt the f0 contour at a point where a pitch peak is expected.

A question that arises is whether speakers apply ‘segmental intonation’, a shift in the spectrum of voiceless fricatives and plosive bursts detected by Oliver Niebuhr in German. To investigate this, we recorded four repetitions of 12 words with /χ f s ʃ k q/ in stressed position in four carrier sentences intended to elicit three intonation conditions and a stress shift condition. In addition, we added one word as a control condition for word-final stressed /s/, a situation in which Niebuhr found ‘segmental intonation’ effects between declaratives and interrogatives. Each of the friction portions in words with /χ f s ʃ/ was segmented into three equal parts (4 repetitions × 9 words × 3 friction portions × 4 conditions = 432 friction segments), while the bursts of /k q/ were treated as single portions (4 repetitions × 4 words × 4 conditions = 64), giving 496 friction/burst portions in all. These were rated for perceived pitch in an AX task using a 7-point scale by five judges in a pilot experiment. Results show segmental intonation effects, indicating that these spectral shifts are not a purely automatic consequence of articulation.
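As a sanity check on the stimulus counts above, the factorial design can be enumerated programmatically (a minimal sketch; the variable names are illustrative labels, not the authors’ terminology):

```python
# Enumerate the segmented portions in the recording design described above.
REPETITIONS = 4
CONDITIONS = 4  # three intonation conditions + one stress-shift condition

# 9 words contain the fricatives /χ f s ʃ/; each friction phase is cut into 3 parts.
fricative_words, friction_parts = 9, 3
friction_segments = REPETITIONS * fricative_words * friction_parts * CONDITIONS

# 4 words contain the plosives /k q/; each token contributes one burst portion.
plosive_words = 4
burst_segments = REPETITIONS * plosive_words * CONDITIONS

total = friction_segments + burst_segments
print(friction_segments, burst_segments, total)  # 432 64 496
```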

SRPP: Final obstruent voicing in Lakota: Phonetic evidence and phonological implications

Juliette Blevins (The Graduate Center, CUNY), Ander Egurtzegi (CNRS – IKER UMR-5478) & J. Ullrich (The Language Conservancy)

Final obstruent devoicing is a common sound pattern in the world’s languages, found in languages as diverse as Catalan, Dutch, Lithuanian, and Zaza (Blevins 2006; Iverson & Salmons 2011). This sound pattern constitutes a clear case of parallel or convergent phonological evolution. In contrast, final obstruent voicing is claimed to be extremely rare, with some approaches explicitly predicting its non-existence (Kiparsky 2006, 2008). Phonetic-historical accounts, on the other hand, explain skewed patterns of voicing in terms of common phonetically-based devoicing tendencies, allowing for rare cases of final-obstruent voicing under special conditions (Blevins 2006, 2015).

In this talk, phonetic and phonological evidence is offered for final-obstruent voicing in Lakota, an indigenous Siouan language of the Great Plains of North America. In Lakota, oral stops /p/, /t/, and /k/ are regularly pronounced as [b], [l], and [ɡ] in word- and syllable-final position when phrase-final devoicing and pre-obstruent devoicing do not occur (e.g. tópa ‘four’, tób ‘four (cont.)’, tóbtopa ‘by fours’). We first present a phonetic study that tests whether /p/ and /k/ show phonetic voicing in syllable-final position as well as properties of oral stops, in order to rule out interpretations of voicing as a secondary feature of lenition. Then, we offer a historical account of this unlikely sound pattern of final stop voicing, and an explanation for its rarity: final voicing is a consequence of an earlier, conditioned intervocalic voicing of *p, *t, *k to [b], [d], [ɡ], preserved only when the final vowel was devoiced or lost. Under this account, the historical origins for final stop voicing are tied to retiming of the final vowel gesture.

SRPP: Studying speech rate cross-linguistically: Resource building and case studies on final lengthening and pause probabilities

In the first part of this talk, I will introduce DoReCo, an initiative to create a multilingual reference corpus, consisting of at least 10,000 words for at least 50 languages. DoReCo extracts from fieldwork-based language documentation collections narrative texts that are already transcribed, translated into a major language, and morphologically analyzed. Within DoReCo, we convert these data to a common file format and time-align them at the phoneme level using the MAUS software. In the second part of this talk, I will present two cross-linguistic studies on a subset of this corpus: One study investigates word lengthening as a function of utterance-final position. Another, still ongoing study investigates pause probabilities before nouns vs. verbs and relates findings to the fact that, typologically, there are fewer prefixes on nouns vs. verbs.
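The pause-probability study described above amounts to conditioning the presence of a pre-word pause on the word’s part of speech. The sketch below illustrates the computation over a toy token list (the tuple format and the example tokens are hypothetical, not DoReCo’s actual file schema):

```python
# Sketch: pause probability before nouns vs. verbs from time-aligned annotations.
# Each token is (word, part_of_speech, preceded_by_pause); in real aligned data
# "preceded_by_pause" would be derived from a silent interval before word onset.
from collections import Counter

tokens = [
    ("the", "DET", False), ("hunter", "NOUN", True), ("saw", "VERB", False),
    ("a", "DET", False), ("bird", "NOUN", True), ("and", "CONJ", False),
    ("left", "VERB", True),
]

pauses, totals = Counter(), Counter()
for _, pos, has_pause in tokens:
    totals[pos] += 1
    pauses[pos] += has_pause  # booleans sum as 0/1

for pos in ("NOUN", "VERB"):
    print(pos, pauses[pos] / totals[pos])  # NOUN 1.0, VERB 0.5 on this toy data
```

On real corpus data one would of course aggregate per language and test the noun/verb difference statistically rather than compare raw proportions.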