Atrass #7: A multistream multimodal foundation model for real-time voice-based applications
Author: European Trustworthy AI Association
Uploaded: 2025-10-01
Views: 41
Description:
By Patrick Perez, Kyutai, France
Speech, a uniquely seamless way for humans to exchange information and emotion, should be a key means for us to communicate with and through machines. This is not yet the case. As a step toward this goal, we introduce a versatile speech-text decoder-only model that can serve a number of voice-based applications. In particular, it has allowed us to build Moshi, the first-ever full-duplex spoken-dialogue system (with low latency and no imposed speaker turns), as well as Hibiki, the first simultaneous voice-to-voice translation model with voice preservation able to run on a mobile phone. This multistream multimodal model can also be turned into a visual-speech model (VSM) via cross-attention with visual information, which allows Moshi to freely discuss an image while maintaining its natural conversational style and low latency. This talk will provide an illustrated tour of this research.
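For intuition only, here is a minimal PyTorch sketch of the two ideas the abstract names: multistream modeling, where a text stream and several audio-token streams are fused per timestep and decoded jointly by one causal decoder, and visual conditioning via cross-attention into image features. All class names, dimensions, and the sum-fusion choice are illustrative assumptions, not Kyutai's actual Moshi/Hibiki implementation.

```python
import torch
import torch.nn as nn

class MultistreamDecoderLayer(nn.Module):
    """One decoder layer: causal self-attention over the fused stream,
    plus optional cross-attention into visual features (the VSM idea)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, visual=None, attn_mask=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        if visual is not None:  # attend to image features when present
            h = self.norm2(x)
            x = x + self.cross_attn(h, visual, visual, need_weights=False)[0]
        return x + self.ff(self.norm3(x))

class MultistreamSpeechTextModel(nn.Module):
    """Hypothetical multistream decoder: at each timestep the text-token
    embedding and the embeddings of every audio stream (e.g. user and
    system codec tokens) are summed into one vector, so the decoder
    models all streams in parallel instead of taking turns."""
    def __init__(self, text_vocab=4000, audio_vocab=2048,
                 n_audio_streams=2, d_model=512, n_layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.audio_embs = nn.ModuleList(
            nn.Embedding(audio_vocab, d_model) for _ in range(n_audio_streams)
        )
        self.layers = nn.ModuleList(
            MultistreamDecoderLayer(d_model) for _ in range(n_layers)
        )
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_heads = nn.ModuleList(
            nn.Linear(d_model, audio_vocab) for _ in range(n_audio_streams)
        )

    def forward(self, text_tokens, audio_tokens, visual=None):
        # text_tokens: (B, T); audio_tokens: (B, n_audio_streams, T)
        B, T = text_tokens.shape
        x = self.text_emb(text_tokens)
        for s, emb in enumerate(self.audio_embs):
            x = x + emb(audio_tokens[:, s])  # per-step fusion by summation
        mask = torch.triu(  # causal mask: True = position not attendable
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        for layer in self.layers:
            x = layer(x, visual=visual, attn_mask=mask)
        return self.text_head(x), [head(x) for head in self.audio_heads]

# Toy usage: 10 frames, 2 audio streams, 7 "image" feature vectors.
model = MultistreamSpeechTextModel()
text = torch.randint(0, 4000, (1, 10))
audio = torch.randint(0, 2048, (1, 2, 10))
image_feats = torch.randn(1, 7, 512)
text_logits, audio_logits = model(text, audio, visual=image_feats)
print(text_logits.shape, audio_logits[0].shape)  # (1, 10, 4000) (1, 10, 2048)
```

The design point the sketch illustrates is that full-duplex behavior falls out of the representation: because both parties' audio streams advance at every frame, the model can listen and speak simultaneously rather than waiting for an explicit turn boundary.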