Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech
Author: DOST-Advanced Science and Technology Institute
Uploaded: 2024-08-20
Views: 104
Description:
Jessan Rendell Belenzo
MEng AI Student
Artificial Intelligence Program
UP Diliman
Speech foundation models are reinventing the way humans and computers communicate: thanks to the advent of large language models, they can understand inputs and respond with outputs in text, in speech, or in both modalities. SpeechGPT is a speech foundation model based on the Llama architecture, pre-trained on Libri-Light, a large open-source speech corpus collected from audiobook recordings, and instruction-tuned on other speech datasets. However, because the model is pre-trained on narrated speech only, its performance on other speech domains and styles is suboptimal. We measure the effectiveness of continual pre-training and instruction tuning of SpeechGPT on four large-scale diverse speech datasets sampled at 16 kHz: GigaSpeech, a 10,000-hour multi-domain speech corpus with quality transcriptions collected from audiobooks, podcasts, and YouTube videos; LibriSpeech, a narrated speech dataset with 960 hours of recordings based on the audiobooks from the LibriVox project; VoxPopuli, a collection of oratory speech samples sourced from European Parliament event recordings; and SPGISpeech, a large-scale dataset with 5,000 hours of labelled financial audio derived from earnings calls. We convert the speech samples to discrete tokens using Multilingual HuBERT to generate the datasets, perform continual pre-training and instruction tuning of SpeechGPT, and evaluate model performance on the automatic speech recognition (ASR) task using word error rate (WER). The experimental results show that the model's WER improved by 80.8% on SPGISpeech, 76.7% on LibriSpeech, 59.3% on GigaSpeech, and 52.6% on VoxPopuli. The results suggest that continual training on diverse speech helps speech foundation models adapt to different speaking styles and domains.
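The evaluation metric above, word error rate, and the relative-improvement figures reported (e.g. 80.8% on SPGISpeech) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code; the example sentences and baseline/adapted WER values below are hypothetical placeholders, and the relative-improvement formula (baseline minus adapted, divided by baseline) is the standard convention assumed here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction, as in the percentages reported above."""
    return (baseline_wer - adapted_wer) / baseline_wer

# One deleted word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
# A hypothetical baseline WER of 0.50 dropping to 0.10 is an 80% relative improvement
print(relative_improvement(0.50, 0.10))
```

In practice, ASR evaluations typically apply text normalization (casing, punctuation, number formatting) before scoring, which this sketch omits.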
As part of the National Electrical, Electronics and Computer Engineering Conference (NEECECON 2024), this technical session is organized by the UP Electrical and Electronics Engineering Institute with the theme "National Development through Sustainable Industrialization."
NEECECON 2024 is co-located with the Advanced Science, Technology, and Innovation Convention (ASTICON) 2024, held from 18 to 19 July 2024 at the Novotel Manila Araneta City in Quezon City.
ASTICON 2024 showcased DOST-ASTI and UP EEEI's pioneering contributions to the ICT landscape while celebrating the partnerships that drive technological advancement and societal progress in the country.
For more info about the event, visit https://neececon2024.eee.upd.edu.ph.