EPFL AI Center Research Seminar- Moshi: a foundation model for conversational speech - Edouard Grave

Author: EPFL AI Center

Uploaded: 2025-01-15

Views: 512

Description: This talk is part of the Research Seminar series organized by the EPFL AI Center.

The seminar was held on December 16, 2024, on the EPFL campus.

Abstract
In this talk, I will present Moshi, a joint speech-text foundation model and full-duplex spoken dialogue system. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning—such as emotion or non-speech sounds—is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections.

Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model, Moshi generates speech as tokens from the quantizer of a neural audio codec, and separately models its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We extend the hierarchical semantic-to-acoustic token generation of previous work, by predicting time-aligned text tokens as a prefix to audio tokens. Our resulting model is the first real-time full-duplex spoken large language model, with a latency of around 200 ms in practice.
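
The multi-stream layout described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the Moshi implementation: the names `build_step` and `build_sequence` are hypothetical, the number of audio tokens per stream is arbitrary, and details of the real model (such as the exact token ordering or any delay between codebooks) are not modeled. It shows one way that, at each time step, a time-aligned text token can be placed as a prefix before the audio tokens of Moshi's own stream and of the user's stream, with no explicit speaker turns.

```python
from typing import List, Sequence, Tuple

# One time step = (text token, Moshi audio tokens, user audio tokens).
# All names and shapes here are illustrative assumptions.
Step = Tuple[int, Sequence[int], Sequence[int]]


def build_step(text_token: int,
               moshi_audio: Sequence[int],
               user_audio: Sequence[int]) -> List[int]:
    """Lay out one time step: text token first, then both audio streams."""
    if len(moshi_audio) != len(user_audio):
        raise ValueError("both streams need the same number of audio tokens")
    # The text token acts as a prefix to the audio tokens of this step.
    return [text_token, *moshi_audio, *user_audio]


def build_sequence(steps: Sequence[Step]) -> List[List[int]]:
    """Stack time steps; both speakers are present at every step,
    so overlapping speech needs no turn segmentation."""
    return [build_step(t, m, u) for (t, m, u) in steps]


if __name__ == "__main__":
    # Two time steps with two audio tokens per stream (toy values).
    demo = [(7, [101, 102], [201, 202]), (8, [103, 104], [203, 204])]
    for row in build_sequence(demo):
        print(row)
```

Because both streams are always present in each step, an interruption or interjection is simply a step where both speakers carry non-silent audio tokens.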

Bio
Edouard Grave is a researcher and a member of the founding team at Kyutai, where he works on artificial intelligence, natural language processing and large language models (LLMs). Before joining Kyutai, he spent eight years in industry, first at Facebook AI Research and then at Apple MLR. Edouard also completed postdocs at Columbia University, where he worked with Noémie Elhadad and Chris Wiggins, and at UC Berkeley, where he worked with Laurent El Ghaoui. He received his PhD in computer science from Université Paris VI and graduated from École Polytechnique with an M.Sc. in machine learning and computer vision.
