EPFL AI Center Research Seminar - Moshi: a foundation model for conversational speech - Edouard Grave
Author: EPFL AI Center
Uploaded: 2025-01-15
Views: 512
Description:
This talk is part of the Research Seminar series organized by the EPFL AI Center.
The seminar was held on December 16, 2024, on the EPFL campus.
Abstract
In this talk, I will present Moshi, a joint speech-text foundation model and full-duplex spoken dialogue system. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning—such as emotion or non-speech sounds—is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections.
Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model, Moshi generates speech as tokens from the quantizer of a neural audio codec, and separately models its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We extend the hierarchical semantic-to-acoustic token generation of previous work, by predicting time-aligned text tokens as a prefix to audio tokens. Our resulting model is the first real-time full-duplex spoken large language model, with a latency of around 200 ms in practice.
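The stream layout described in the abstract can be illustrated with a toy sketch. This is not Moshi's implementation: the `Frame` class, the `flatten_frame` helper, the token values, and the choice of two audio tokens per stream are all hypothetical placeholders. The sketch only shows the ordering idea, where each time step carries a time-aligned text token as a prefix, followed by codec tokens for the model's own speech and for the user's speech as two parallel streams.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One time step of the dialogue (all fields are illustrative)."""
    text_token: int         # time-aligned text token, emitted before the audio
    moshi_audio: List[int]  # codec tokens for the model's own speech
    user_audio: List[int]   # codec tokens for the user's speech


def flatten_frame(frame: Frame) -> List[int]:
    # Text token comes first (the "prefix"), then the two audio streams.
    # Both speakers are present at every step, so no explicit turn
    # boundaries are needed: overlap is just two non-silent streams.
    return [frame.text_token, *frame.moshi_audio, *frame.user_audio]


def flatten_dialogue(frames: List[Frame]) -> List[List[int]]:
    # One flattened token group per time step, in temporal order.
    return [flatten_frame(f) for f in frames]


# Hypothetical token ids, chosen only to make the ordering visible.
dialogue = [
    Frame(text_token=7, moshi_audio=[1, 2], user_audio=[3, 4]),
    Frame(text_token=8, moshi_audio=[5, 6], user_audio=[9, 10]),
]
steps = flatten_dialogue(dialogue)
```

Because both streams exist at every step, an interruption or an interjection is simply a step in which both speakers' audio tokens encode speech, rather than a special case.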
Bio
Edouard Grave is a researcher and a member of the founding team at Kyutai, where he works on artificial intelligence, natural language processing and large language models (LLMs). Before joining Kyutai, he spent eight years in industry, first at Facebook AI Research and then at Apple MLR. Edouard also completed a postdoc at Columbia University, where he worked with Noémie Elhadad and Chris Wiggins, and at UC Berkeley, where he worked with Laurent El Ghaoui. He received his PhD in computer science from Université Paris VI and graduated from École Polytechnique with an M.Sc. in machine learning and computer vision.