Moshi: a speech-text foundation model for real-time dialogue (Paper Explained)

Author: Julien Hauret

Uploaded: 2024-10-30

Views: 1427

Description: Review of: https://arxiv.org/abs/2410.00037

Abstract:
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at this URL https://github.com/kyutai-labs/moshi
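
The abstract mentions speech tokens coming from "the residual quantizer of a neural audio codec". As a rough illustration of that idea only, here is a minimal residual vector quantization sketch in plain Python. It is not the codec's actual implementation: the function names, the random toy codebooks, their sizes, and the zero entry added to each codebook (so the error never grows from one level to the next) are all assumptions made for this example.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Return one token per level; each level quantizes the previous residual."""
    residual = list(vec)
    tokens = []
    for book in codebooks:
        idx = nearest(book, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entries of each level to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


def make_toy_codebooks(levels, size, dim, seed=0):
    """Hypothetical random codebooks; a real codec learns these from data."""
    rng = random.Random(seed)
    books = []
    for level in range(levels):
        scale = 0.5 ** level  # finer corrections at deeper levels
        book = [[0.0] * dim]  # zero entry: a level may leave the residual as is
        book += [
            [rng.uniform(-scale, scale) for _ in range(dim)]
            for _ in range(size - 1)
        ]
        books.append(book)
    return books


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy sizes chosen arbitrarily for the example.
codebooks = make_toy_codebooks(levels=4, size=16, dim=4)
frame = [0.3, -0.7, 0.1, 0.5]  # stand-in for one latent audio frame
tokens = rvq_encode(frame, codebooks)
errors = [
    sq_error(frame, rvq_decode(tokens[:k], codebooks[:k]))
    for k in range(1, len(tokens) + 1)
]
print("tokens:", tokens)
print("squared error after each level:", errors)
```

Each frame becomes a short stack of integer tokens, one per quantizer level, and decoding with more levels refines the reconstruction. In this toy setup the printed error list is non-increasing because every codebook contains the zero vector.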

Authors:
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour

Chapters:
0:00 Introduction
2:38 Mimi, neural audio codec
14:54 Helium, 7B text LLM
20:05 RQ-Transformer, Temporal and Depth transformer
28:45 Inner Monologue setup
33:58 Conclusion
35:35 Live local demo
