Creating a complete three-dimensional digital talking head that includes the vocal tract from the vocal folds to the lips as well as the face, and that integrates numerical simulation of aeroacoustic phenomena.

LPP lead: Angelique Amelot

Funding: ANR programme: CE23 – Artificial Intelligence

Project reference: ANR-20-CE23-0008

Partners:

  • Gipsa-lab Grenoble Images Parole Signal Automatique – UMR 5216
  • LPP Laboratoire de phonétique et phonologie – UMR 7018
  • LORIA Laboratoire Lorrain de Recherche en Informatique et ses applications – UMR 7503 (Project coordinator: Yves Laprie)

Project duration: May 2021 – 42 months


The objective is to create a complete three-dimensional digital talking head that includes the vocal tract from the vocal folds to the lips as well as the face, and that integrates numerical simulation of aeroacoustic phenomena.

The project has three main aims: learning articulatory gestures from data corpora covering the vocal tract (real-time MRI), the face (motion capture) and subglottic pressure; identifying latent articulatory variables that are relevant to the control of speech production; and running aeroacoustic simulations that make it possible to explore speech production and to learn how to control a replica simulating the vocal tract. The project will make extensive use of deep learning techniques in interaction with physical simulations, which is an important innovation.

The consortium is made up of 4 highly complementary research teams with internationally leading theoretical and practical experience in AI (particularly deep learning techniques for automatic speech processing), acoustics, experimental phonetics, MRI and automatic speech processing.

The project is organized into 5 main tasks:
1) acquisition of a data corpus covering 3 hours of speech (with several expressions) for one male and one female speaker (plus two speakers with less complete data), comprising dynamic MRI, facial deformation and subglottic pressure data.
2) corpus pre-processing to track the contours of the articulators in the MRI films, align the modalities, denoise the speech data, and reconstruct the vocal tract in 3D from dynamic 2D data and static 3D MRI data.
3) development of the control of the temporal evolution of the vocal tract shape, the face and the glottal opening, based on the sequence of phonemes to be articulated and on suprasegmental information. The approach will rely on deep learning using the project corpus and will aim in particular to bring out latent variables that allow the talking head to be controlled and expressions to be rendered.
4) learning how to control a simplified physical model of the vocal tract using a large number of measurements. Deep learning will allow the development of production strategies for plosives, which involve phenomena that are too rapid to be imaged with sufficient precision.
5) adaptation of the talking head to other speakers based on anatomical landmarks, and study of the acoustic impact of articulatory perturbations using the talking head.

Given a sentence to be pronounced, the talking head will generate the temporal evolution of the complete shape of the vocal tract and face, together with the speech signal produced by acoustic simulation. It will also be possible to produce the audiovisual signal without the acoustic simulation, but at the cost of losing the ability to introduce perturbations into production and thus to study speech production in depth, which is the main interest of this project.

The first result is the development of a radically new approach to the modelling of speech production. Until now, production models, and in particular those used for articulatory synthesis, have relied on numerical models whose formal framework limits their ability to account for real data such as real-time MRI.
The fields of application include the exploitation of dynamic MRI data, the diagnosis of speech pathologies, real-time feedback inside the MRI scanner, the rehabilitation of articulatory gestures, the deployment of realistic talking heads covering the entire vocal tract, and improved lip rendering in talking heads.