tinyML EMEA - Baptiste Pouthier: Audio-Visual Active Speaker Detection on Embedded Devices
Author: EDGE AI FOUNDATION
Uploaded: 2023-07-13
Views: 725
Description:
Audio-Visual Active Speaker Detection on Embedded Devices
Baptiste POUTHIER
PhD Student
NXP Semiconductors
Active Speaker Detection (ASD) is the task of identifying active speakers in a video by analyzing both visual and audio features. It is a key component of human-robot interaction, speech enhancement, and video re-targeting in video-conferencing systems. Over the last decade, advances in machine learning have paved the way for highly reliable ASD methods. However, since both the visual and audio signals must be processed and analyzed, these methods are extremely computationally demanding and therefore impractical for microcontrollers. For instance, most ASD models have tens of millions of parameters. Moreover, in standard use-cases like video conferencing, the model needs to run in real time (at least 25 video frames per second, i.e., a compute budget of at most 40 ms per frame) while tracking and processing multiple potential talkers. To meet this challenge, we have developed a set of state-of-the-art ASD models with drastically reduced computational cost. The originality of our approach is to leverage multi-objective optimization (sketched after the list below) and a novel modality-fusion scheme. In particular, we focused on building two models featuring additional architectural and optimization changes to fit two hardware configurations:
– A model that runs on a high-end NXP MPU featuring a quad-core Arm Cortex-A53 processor and a Neural Processing Unit (NPU)
– A tiny model that runs on the dual-core i.MX RT1170 MCU, with an Arm Cortex-M7 core at 1 GHz and an Arm Cortex-M4 core at 400 MHz
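The abstract does not detail the multi-objective optimization. A common formulation scalarizes the task loss and a compute-cost proxy into one training objective; the following is a minimal sketch of that idea, not NXP's actual scheme. The hinge-style budget penalty, the `lam` weight, and the MAC-count inputs are all hypothetical:

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(logits, labels, macs, macs_budget, lam=0.1):
    """Weighted-sum scalarization of accuracy and compute objectives
    (illustrative only; the talk's exact scheme is not specified)."""
    # Task objective: per-face binary "speaking / not speaking" loss.
    task = F.binary_cross_entropy_with_logits(logits, labels)
    # Compute objective: penalize exceeding the hardware MAC budget.
    # `macs` must be differentiable if the architecture itself is being
    # searched (e.g., estimated from a supernet's gating variables).
    cost = torch.relu(macs / macs_budget - 1.0)
    return task + lam * cost

# Example call with dummy values (4 face crops, 3.2M estimated MACs
# against a 2.0M budget):
loss = multi_objective_loss(torch.randn(4, 1), torch.ones(4, 1),
                            torch.tensor(3.2e6), 2.0e6)
```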
The models are end-to-end deep learning architectures following the same block diagram. The network is based on a two-branch architecture, with each branch processing either the audio or the visual signal. The audio and visual embeddings are finally combined within the “fusion” block, which outputs the probability that an individual is speaking. This information is used by downstream algorithms, such as speech enhancement and video re-targeting, which are beyond the scope of our presentation.
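That block diagram maps onto a compact two-branch network. Below is a minimal PyTorch sketch of the structure; the layer sizes, input shapes, and the concatenation-plus-MLP fusion are illustrative assumptions, not the actual NXP architecture (whose novel fusion scheme is not specified in the abstract):

```python
import torch
import torch.nn as nn

class AudioVisualASD(nn.Module):
    """Two-branch audio-visual active speaker detector (hypothetical sketch)."""
    def __init__(self, embed_dim=64):
        super().__init__()
        # Audio branch: 2-D convs over a log-mel spectrogram (1 x mels x frames).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Visual branch: 2-D convs over a short stack of grayscale face crops
        # (T frames treated as input channels to avoid costly 3-D convs).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(5, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Fusion block: concatenation + MLP standing in for the talk's
        # unspecified fusion scheme; outputs a speaking probability.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, audio, video):
        a = self.audio_branch(audio)   # (B, embed_dim)
        v = self.visual_branch(video)  # (B, embed_dim)
        logit = self.fusion(torch.cat([a, v], dim=1))
        return torch.sigmoid(logit)    # probability the face is speaking

model = AudioVisualASD()
audio = torch.randn(2, 1, 64, 100)  # batch of log-mel spectrograms
video = torch.randn(2, 5, 96, 96)   # batch of 5-frame face-crop stacks
print(model(audio, video).shape)    # torch.Size([2, 1])
```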
All network components are designed around the hardware constraints: the fusion block, the convolutional layers, and the temporal-sequence modeling are modified to optimize model performance. The input signals are processed accordingly: the data resolution and the temporal context used by the network are adapted to the capabilities of each target. Our presentation covers the whole optimization and porting process, from the model design changes to quantization and integration on NXP devices. For each model, we focus the performance analysis on the trade-off between computational burden and system accuracy.
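The abstract does not name the quantization toolchain. One common route to NXP targets is full-integer post-training quantization with TensorFlow Lite, which NXP's eIQ stack can deploy to the NPU or, via TFLite Micro, to the MCU. The sketch below uses a hypothetical stand-in model and random calibration data purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Stand-in two-input Keras model; in practice this would be the trained
# ASD network (this tiny architecture is a placeholder).
audio_in = tf.keras.Input(shape=(64, 100, 1), name="audio")
video_in = tf.keras.Input(shape=(96, 96, 5), name="video")
a = tf.keras.layers.GlobalAveragePooling2D()(tf.keras.layers.Conv2D(16, 3)(audio_in))
v = tf.keras.layers.GlobalAveragePooling2D()(tf.keras.layers.Conv2D(16, 3)(video_in))
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([a, v]))
model = tf.keras.Model([audio_in, video_in], out)

# Calibration samples let the converter pick int8 scales/zero-points;
# a real pipeline would draw these from the training data.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 64, 100, 1).astype(np.float32),
               np.random.rand(1, 96, 96, 5).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the graph to int8 ops so it can run fully on the NPU or on
# the MCU's integer kernels, with no float fallback.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
with open("asd_int8.tflite", "wb") as f:
    f.write(converter.convert())
```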