What’s in a song? A Comparison of the Form and Function of Infant-Directed Speech and Song

In the last 30 years, many research studies have explored a special register of vocal communication primarily directed to infant listeners. However, infant-directed (ID) speech has received disproportionately more research attention than ID singing, despite the fact that singing to preverbal infants is a universal caregiving practice. Studies also show that singing to infants may regulate infant states and facilitate bonding and dyadic coordination better than speech.

To reach a better understanding of the role and structure of these two ID inputs, I discuss recent work on how the acoustics of ID singing and speech, and infants’ perception of them, converge or differ. When comparing infants’ perception of and attentional preferences for song and speech, we find that infants discriminate between infant-directed speech and song, and are highly attracted to properties of ID singing. Another recent study found that songbirds discriminate between the same speech and song stimuli that infant listeners can discriminate, providing further support that infant-directed speech and song comprise two distinct acoustic categories. In a study examining infants’ attention to the two stimulus levels in song (lyrics vs. melody), we found that infants are clearly able to differentiate between the two levels and, furthermore, that the melodic level may actually facilitate infants’ memory for specific words in the lyrics.

In this talk, I explore the idea that ID singing should be considered a distinct vocal communication category, which may be optimally adapted to preverbal infants’ perceptual needs and capacities. Furthermore, I argue that ID singing complements ID speech as a critically important stimulus for infants’ perception and learning in the first year of life.

The contribution of dynamics to the perception of tonal alignment: a preliminary model

Whereas in classical models of speech perception the laws governing perceptual behaviour transform an input signal into an output corresponding to a percept, dynamical models operate via laws of change that modify the current state of the perceptual system based on its previous state and on the external input. Models following this basic principle have proved quite successful at explaining how the perceptual system combines the flexibility and stability required by interaction with a rich, ever-changing and noisy environment. Experimental evidence supporting this processing principle comes from experiments showing how the behaviour of the perceptual system changes as its exposure to particular stimuli and tasks is systematically manipulated through time. This same principle is also at the heart of many classical neural network models of speech perception (such as TRACE or the ART family of models). In this presentation I will review this line of research and discuss ongoing work, conducted in collaboration with C. Portes of the LPL (Aix-en-Provence), aimed at extending this approach to the perception of intonation. More specifically, we propose that known effects of the shape of the f0 contour on the perception of tonal alignment are due to the dynamical nature of speech perception. To demonstrate our claim we replicated the results obtained in various published works (e.g. D’Imperio, 2000; Barnes, Veilleux, Brugos and Shattuck-Hufnagel, 2012) through simulations of a very simple dynamical model addressing how the pitch curve is transformed into a sequence of discrete intonational categories:

dy/dt = α · df0/dt - β · y + ε

In this model the current state of the perceptual system is mapped onto the value of the variable y in the following way: if y exceeds a positive threshold, an upward pitch movement is perceived, while if it crosses a negative threshold, a downward pitch movement is perceived. The model states that at a given instant the derivative of y (i.e. its amount of change) is equal to the derivative of the f0 value weighted by the free parameter α, minus a damping term represented by the current value of y weighted by the parameter β (which determines how fast the perceptual system returns to a neutral state when the input is removed), plus a random term ε (representing the noisy component of the perceptual process).
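As an illustration of how such a model runs, here is a minimal simulation sketch in Python. The variable names (y, f0) follow the verbal description above; the parameter values, the noise implementation and the threshold bookkeeping are illustrative assumptions, not the settings of the actual model.

    import numpy as np

    def perceive_pitch_movements(f0, dt=0.01, alpha=0.05, beta=5.0,
                                 noise_sd=0.1, threshold=0.5, seed=0):
        """Euler-Maruyama integration of dy/dt = alpha*df0/dt - beta*y + noise.
        A rise (fall) is reported when y first exceeds +threshold (-threshold).
        All parameter values are illustrative, not fitted."""
        rng = np.random.default_rng(seed)
        df0 = np.gradient(f0, dt)           # derivative of the input pitch curve
        y, region, percepts = 0.0, "neutral", []
        for i, d in enumerate(df0):
            y += ((alpha * d - beta * y) * dt
                  + noise_sd * np.sqrt(dt) * rng.standard_normal())
            if y > threshold and region != "up":
                percepts.append((round(i * dt, 2), "rise"))
                region = "up"
            elif y < -threshold and region != "down":
                percepts.append((round(i * dt, 2), "fall"))
                region = "down"
            elif abs(y) < threshold / 2:    # damping has pulled y back to rest
                region = "neutral"
        return percepts

    # Example: a rise-fall f0 contour sampled at 100 Hz
    t = np.linspace(0.0, 1.0, 100)
    f0 = 120 + 40 * np.sin(np.pi * t)       # Hz
    print(perceive_pitch_movements(f0))     # expect one "rise", then one "fall"

Because of the damping term, y tracks recent changes in f0 rather than its absolute value, which is what makes the perceived category sensitive to the shape of the contour.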

Ageing in speech motor control: An EMA study

Ageing is an inevitable natural process which entails changes at several physiological levels, including the central nervous system, the musculoskeletal system, the cardiovascular system and the respiratory system. Crucially, increasing age affects motor control in general, involving a slowing down of, for example, the movements of the limbs (Brown, 1996). However, very little is known about how ageing affects speech production. Speech motor control almost exclusively involves fine motor control of the articulators, with the millimetre precision and split-second timing required to perform this highly complex task. As in motor control in general, a commonly reported effect of ageing on speech is a general slowing down of articulation rate (Amerman & Parnell, 1992). However, our knowledge of how ageing affects specific patterns of speech motor control, such as coordination patterns within the oral system in the production of consonants and vowels, is limited by the fact that most studies are based primarily on acoustics. This study investigates ageing effects in speech motor control using Electromagnetic Articulography (EMA).

Machine ABX: agnostic, automatic, and large-scale measures of contrastiveness

Phonetic analyses often involve studies of contrastiveness, the degree to which two elements (such as atomic sounds or words) differ along some perceptually relevant dimension(s). To take just one example, when studying sound changes in progress, the diachronic linguist may observe that a section of the population ceases to use an acoustic dimension, which may or may not reduce the discriminability of a sound contrast, leading to (never-ending) debates regarding the completeness of neutralization.
When seeking to measure contrastiveness, phoneticians today typically have only two alternative approaches within their reach. One is to make an informed decision regarding which phonetic dimension may be involved in the contrast at hand, and then hand-annotate, or find an automatic way of measuring, this dimension in all the tokens under study. The second alternative is to fall back on naïve human judgments, and set up a discrimination or classification experiment. There are pros and cons to both of these alternatives, including a common disadvantage: both are fairly resource-intensive, either in terms of trained annotators or in terms of the effort required for collecting experimental judgments, a disadvantage that is particularly salient for large corpora.
I will present a third method that, like human judgments, can provide global contrastiveness estimates, but at a much lower cost and in a manner that facilitates replication and extension. Specifically, we developed a machine-based ABX task (Schatz et al. 2013; Schatz 2016), which works as follows: a first stage identifies all possible ABX triplets in a corpus, where A and B are tokens from two different categories and X is another token of one of the two categories (e.g., /ta1/-/ti/-/ta2/). Each token is represented by a set of acoustic or articulatory dimensions. The algorithm then compares the representation of X against those for A and B, and returns an “A” response if X is closer in this (multidimensional) space to the A token than to the B token, and “B” otherwise. This response is evaluated against the true category membership (in the example, the correct response is indeed “A”); the procedure is repeated for all possible triplets, and the responses are averaged into an accuracy score.
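As a concrete illustration, the following minimal Python sketch computes such a score; the Euclidean distance and the two-dimensional token representation are placeholder assumptions, not the specifics of the published implementation.

    import numpy as np
    from itertools import product

    def abx_accuracy(cat_a, cat_b, distance):
        """Fraction of ABX triplets answered correctly, where A and X are
        (distinct) tokens of one category and B a token of the other."""
        correct = total = 0
        for a, b, x in product(cat_a, cat_b, cat_a):
            if x is a:                      # X must be a different token than A
                continue
            correct += distance(x, a) < distance(x, b)
            total += 1
        return correct / total

    def euclidean(u, v):
        return np.linalg.norm(u - v)

    # Toy example: tokens of two categories, each represented by two
    # (arbitrary) phonetic dimensions, e.g. VOT (ms) and f0 (Hz)
    rng = np.random.default_rng(0)
    ta = list(rng.normal([30.0, 120.0], 5.0, size=(20, 2)))   # /ta/ tokens
    ti = list(rng.normal([45.0, 140.0], 5.0, size=(20, 2)))   # /ti/ tokens
    score = 0.5 * (abx_accuracy(ta, ti, euclidean)
                   + abx_accuracy(ti, ta, euclidean))
    print(f"ABX accuracy: {score:.2f}")     # 0.5 = chance, 1.0 = fully contrastive

Averaging over both A/B directions keeps the measure symmetric between the two categories.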
The crucial innovative aspect of this task is that it is completely agnostic to the choice of input representation, the only requirement being that the user provide a reasonable way of measuring similarity between tokens. The task can thus be run on phonetically defined dimensions (such as VOT and f0, extracted by hand) as well as on more holistic acoustic measures (such as mel-based spectral representations). It is agnostic not only because it provides a contrastiveness measure that is mathematically well-defined for essentially any input format, but also because, in practice, given a finite sample of speech stimuli, our ability to reliably estimate this ideal measure is not affected by the particular choice of input format. To be more precise, we exhibited a computationally tractable estimator of our measure of contrastiveness whose form and rate of convergence do not depend on the choice of representation and dissimilarity function, and which is unbiased with minimal variance among all unbiased estimators (Schatz, 2016).
I will provide examples of applications to the study of variability factors in speech, such as speaker, phonetic context or speech register (Martin et al. 2015; Bergmann et al. 2016; Schatz et al. 2017), and to the comparison of speech representations (Schatz et al. 2013; Schatz et al. 2014; Schatz et al., in preparation).


References

Schatz, T., Bach, F., & Dupoux, E. (2017). Automatic speech recognition systems as quantitative models of phonetic category perception. Manuscript in preparation.

Schatz, T., Turnbull, R., Bach, F., & Dupoux, E. (2017). A Quantitative Measure of the Impact of Coarticulation on Phone Discriminability. Proc. Interspeech 2017.

Schatz, T. (2016). ABX-discriminability measures and applications. Doctoral dissertation, Université Paris 6 (UPMC).

Bergmann, C., Cristia, A., & Dupoux, E. (2016). Discriminability of sound contrasts in the face of speaker variation quantified. Proc. CogSci 2016.

Martin, A., Schatz, T., Versteegh, M., Miyazawa, K., Mazuka, R., Dupoux, E., & Cristia, A. (2015). Mothers speak less clearly to infants than to adults: A comprehensive test of the hyperarticulation hypothesis. Psychological Science, 26(3), 341-347.

Schatz, T., Peddinti, V., Cao, X. N., Bach, F., Hermansky, H., & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. Proc. Interspeech 2014.

Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. Proc. Interspeech 2013.

SRPP: Multiple Speakers

Speaker 1: Amazouz Djegdjiga (with Martine Adda-Decker, Lori Lamel)

Title: Addressing Code-Switching in French/Algerian Arabic Speech

Abstract:
The presentation focuses on code-switching (CS) in French/Algerian Arabic bilingual communities and investigates how speech technologies, such as automatic data partitioning, language identification and automatic speech recognition (ASR), can serve to analyze and classify this type of bilingual speech. A preliminary study carried out on a corpus of Maghrebian broadcast data revealed a relatively high presence of CS in Algerian Arabic as compared to the neighboring countries Morocco and Tunisia. This study therefore focuses on code-switching produced by bilingual Algerian speakers who can be considered native speakers of both Algerian Arabic and French. A specific corpus of four hours of speech from 8 bilingual French-Algerian speakers was collected. This corpus contains read speech and conversational speech in both languages and includes stretches of code-switching. We provide a linguistic description of the code-switching stretches in terms of intra-sentential and inter-sentential switches, as well as the speech duration in each language. We report on initial studies to locate French, Arabic and code-switched stretches, using ASR system word posteriors for this pair of languages.

Speaker 2: Yaru Wu (with Martine Adda-Decker, Cecile Fougeron, Lori Lamel)

Title: Schwa Realization in French: Using Automatic Speech Processing to Study Phonological and Socio-linguistic Factors in Large Corpora

Abstract:
The study investigates different factors influencing schwa realization in French: phonological factors, speech style, gender, and socio-professional status. Three large corpora are used: two of public journalistic speech (ESTER and ETAPE) and one of casual speech (NCCFr). The absence or presence of schwa is decided automatically via forced alignment, which achieves a correct-decision rate of 95%. Only polysyllabic words with a potential schwa in the word-initial syllable are studied, in order to control for variability in word structure and position.
The effect of the left context is studied, grouped into three classes: a word-final vowel, a word-final consonant, or a pause. Words preceded by a vowel (V#) tend to favor schwa deletion. Interestingly, words preceded by a consonant or a pause behave similarly: speakers tend to maintain schwa in both contexts.
As can be expected, the more casual the speech, the more frequently schwa is dropped. Males tend to delete more schwas than females, and journalists are more likely to delete schwa than politicians. These results suggest that beyond phonology, other factors such as gender, style and socio-professional status influence the realization of schwa.
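With per-token alignment decisions in hand, analyses like those above reduce to simple aggregation. A hypothetical Python sketch follows; the file name and column names are invented for illustration, as the actual corpora come with their own annotation formats.

    import pandas as pd

    # Hypothetical per-token output of the forced alignment: one row per
    # candidate schwa site, with invented column names for illustration:
    #   schwa_present (0/1), left_context ("V#", "C#" or "pause"),
    #   corpus ("ESTER", "ETAPE" or "NCCFr"), gender, profession
    tokens = pd.read_csv("schwa_tokens.csv")

    # Deletion rate by left context: V# is expected to show the highest rate
    print(1 - tokens.groupby("left_context")["schwa_present"].mean())

    # Deletion rate by style (corpus), gender and socio-professional status
    print(1 - tokens.groupby(["corpus", "gender", "profession"])
                    ["schwa_present"].mean())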

Speaker 3: Giuseppina Turco, Karim Shoul, Rachid Ridouane

Title: How are four-level length distinctions produced? Evidence from Moroccan Arabic

Abstract:
We investigate the durational properties of Moroccan Arabic identical consonant sequences contrasting singleton (S) and geminate (G) dental fricatives, in six combinations of four-level length contrasts across word boundaries (#): one timing slot for #S, two for #G and S#S, three for S#G and G#S, and four for G#G. The aim is to determine the nature of the mapping between discrete phonological timing units and phonetic durations. Acoustic results show that the largest and most systematic jump in duration occurs between the singleton fricative on the one hand and all other sequences on the other. Among these sequences, S#S is shown to have the same duration as #G. When a geminate is within the sequence, a temporal reorganization is observed: G#S is not significantly longer than S#S and #G, and G#G is only slightly longer than S#G. Instead of a four-way hierarchy, our data point towards a possible upper limit of three-way length contrasts for consonants: S < G = S#S = G#S < S#G = G#G. The interplay of factors resulting in this mismatch between phonological length and phonetic duration is discussed, and a working hypothesis is provided for why duration contrasts are rarely ternary, and almost never quaternary.

Large(r)-scale cross-linguistic study of speech sounds: case studies of laryngeal stop contrasts and segmental influences on pitch

Decades of work on speech variability by linguists and speech scientists have shed much light on its structure and sources, but have largely consisted of fine-grained studies of a handful of phonetic cues and languages (e.g. VOT, English), whose scope is limited by the fact that collecting and annotating speech data is time-consuming and expensive. This talk describes two studies of sound systems in large cross-linguistic corpora of read speech, which scale up relative to previous work in terms of cross-linguistic coverage, sample size, and acoustic cues considered. We are able to scale up in part by using innovative speech analysis software enabling “large-scale studies” of sound systems, a direction also being pursued in a new Digging Into Data project on variation in English sounds across three countries, which I will briefly describe.

Study 1: How stop consonants are realized in terms of acoustic cues, including VOT and closure voicing, differs greatly between languages, and by position within a language. This variability in the phonetic realization of the “same” phonological contrasts (laryngeal contrasts, e.g. “voicing contrasts”) has long been of interest for reasoning about phonological representation, especially feature specification. For example, “laryngeal realism” theories hypothesize a close tie between phonological features and phonetic realization cross-linguistically, based on several phonetic criteria, such as speech-rate correlations with VOT. These criteria have mostly been tested in isolation, on one or two languages. Using data from seven languages, we test whether these criteria hold and give convergent evidence. We find that they broadly do, supporting a close relationship between feature specification and phonetic realization, but with interesting exceptions.

Study 2: Sound change commonly arises from “phonetic precursors”: small phonetic effects assumed to hold across languages and individuals, which evolve into full-blown contrasts over time. Relatively little is known about the robustness of most phonetic precursors, i.e. the variability in their effect size across languages and speakers, which matters for establishing which precursors are robust enough to plausibly lead to change. Two widely-studied precursors, which also form a good test case for an automated analysis, are the effects of vowel height and of preceding consonant voicing on F0 (VF0 and CF0). We assess the degree of cross-linguistic and interspeaker variability in VF0 and CF0 effects across 14 languages. We find that VF0 and CF0 effects are relatively robust across languages, confirming that they are possible phonetic precursors to sound changes, but their robustness across speakers is less clear, possibly helping explain why they rarely do lead to sound change. A methodological finding is that VF0 and CF0 effects can be detected in non-laboratory speech with minimal statistical controls, despite not accounting for many factors greatly affecting F0 (e.g. intonation).
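As a rough illustration of such a minimal-controls analysis, one could fit a per-language mixed-effects regression along the following lines. This is a sketch only: the data layout and column names are hypothetical, and it is not the authors' actual pipeline.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical per-vowel table: f0 (e.g. in semitones), vowel_height
    # ("high"/"low"), prev_voicing ("voiced"/"voiceless"), speaker ID
    df = pd.read_csv("vowel_f0_measurements.csv")

    # Random intercepts per speaker; the two fixed effects estimate the
    # per-language VF0 and CF0 effect sizes with minimal further controls
    model = smf.mixedlm("f0 ~ C(vowel_height) + C(prev_voicing)",
                        data=df, groups=df["speaker"])
    print(model.fit().summary())

Fitting the same model separately per language, and inspecting the spread of the per-speaker random effects, gives the cross-linguistic and interspeaker robustness comparison described above.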

Articulation of English prominence by L1 (English) and L2 (French) speakers

This study examines jaw and tongue blade (TB) articulation of prominence in two English sentences (one with all low vowels, and one with all mid front vowels) by L1 and L2 (French) speakers of English. The results show that even though the phonological target vowels are kept constant within each sentence, the amount of jaw lowering, as well as the corresponding tongue position, varies for each word. This is true for both L1 and L2 speakers. However, the patterns of the L2 speakers can differ from those of the L1 speakers; for instance, the L1 speakers show rather consistent patterns of low-high jaw position for each word in a phrase, with a step-wise lowering of TB position and one word produced at the lowest jaw position in the utterance, whereas the L2 speakers generally do not have a consistent word with the lowest jaw position.

The Fall and Rise of Vowel Length in Bantu

Although Proto-Bantu had a vowel length contrast on roots which survives in many daughter languages today, many other Bantu languages have modified the inherited system. In this talk I distinguish between four types of Bantu languages: (1) those which maintain the free occurrence of the vowel length contrast inherited from the proto-language; (2) those which maintain the contrast, but have added restrictions which shorten long vowels in pre-(ante-)penultimate word position and/or on head nouns and verbs that are not final in their XP; (3) those which have lost the contrast, with or without creating new long vowels (e.g. from the loss of an intervocalic consonant flanked by identical vowels); (4) those which have lost the contrast but have added phrase-level penultimate lengthening. I will propose that the positional restrictions fed into the ultimate loss of the contrast in types (3) and (4), with a concomitant shift from root prominence (at the word level) to penultimate prominence (at the intonational and phrase level). In the course of covering the above typology and historical developments in Bantu, I will show that there are some rather interesting Bantu vowel length systems that may or may not be duplicated elsewhere in the world.

See online: https://lpp.ilpga.fr/annie/Hyman_Pa…

Acoustic cues in speech: linking (acoustic) form and (linguistic) substance

As for any communication system, the decoding of speech by the human auditory system relies on a code associating a physical input with linguistic representations. Finding which auditory primitives (acoustic cues) human listeners rely on to decode speech sounds is an important step toward a better understanding of speech comprehension and acquisition.
In this talk I will describe two projects aiming at uncovering perceptually relevant acoustic cues in speech. The first part will focus on the identification of the acoustic cues underpinning phoneme comprehension, through the example of a /ba/-/da/ categorization task, using the newly developed Auditory Classification Image method (Varnet et al., 2013, 2015, 2016). In the second part, we will turn to the encoding of higher-level linguistic properties in the speech signal, with a comparison of different language groups (stress-timed vs. syllable-timed languages, and head-complement vs. complement-head languages) on the basis of their temporal modulation content (Varnet et al., 2017).
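To give a flavour of the classification-image logic, here is a minimal sketch on synthetic data. The published Auditory Classification Image method uses a penalized GLM with smoothness priors over time-frequency representations of the trial noise; the plain L2-penalized logistic regression below is a simplified stand-in, and all sizes and the cue region are invented for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_trials, n_freq, n_time = 2000, 64, 32   # trials and time-frequency grid

    # Synthetic experiment: on each trial a noise field is added to the
    # stimulus; the simulated listener's /ba/-vs-/da/ response depends on
    # the noise falling in one (arbitrary) cue region
    noise = rng.standard_normal((n_trials, n_freq * n_time))
    template = np.zeros(n_freq * n_time)
    template[500:550] = 1.0                   # the cue region the listener uses
    responses = (noise @ template + rng.standard_normal(n_trials)) > 0

    # The classification image is the weight map of a penalized logistic
    # regression of the responses on the trial-wise noise fields
    clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
    clf.fit(noise, responses)
    aci = clf.coef_[0].reshape(n_freq, n_time)  # back to time-frequency layout
    print("largest |weights| at bins:", np.argsort(np.abs(clf.coef_[0]))[-5:])
    # -> bins inside the 500-550 cue region: the ACI recovers the cue used

The recovered weight map plays the role of the listener's acoustic cue: bins with large weights are those whose noise content systematically pushed responses toward one phoneme category.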