Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech
Author: DOST-Advanced Science and Technology Institute
Uploaded: 2024-08-20
Views: 104
Description:
Jessan Rendell Belenzo
MEng AI Student
Artificial Intelligence Program
UP Diliman
Speech foundation models are reinventing the way humans and computers communicate: thanks to the advent of large language models, they can understand inputs and respond with outputs in text, in speech, or in both modalities. SpeechGPT is a speech foundation model based on the Llama architecture, pre-trained on Libri-Light, a large open-source speech corpus collected from audiobook recordings, and instruction-tuned on other speech datasets. However, because the model is pre-trained on narrated speech only, its performance on other speech domains and styles is suboptimal. We measure the effectiveness of continual pre-training and instruction tuning of SpeechGPT on four large-scale diverse speech datasets sampled at 16 kHz: GigaSpeech, a 10,000-hour multi-domain speech corpus with quality transcriptions collected from audiobooks, podcasts, and YouTube videos; LibriSpeech, a narrated speech dataset with 960 hours of recordings based on the audiobooks from the LibriVox project; VoxPopuli, a collection of oratory speech samples sourced from European Parliament event recordings; and SPGISpeech, a large-scale dataset with 5,000 hours of labelled financial audio derived from earnings calls. We convert the speech samples to discrete tokens using Multilingual HuBERT to generate the datasets, perform continual pre-training and instruction tuning of SpeechGPT, and evaluate model performance on the automatic speech recognition (ASR) task using word error rate (WER). The experimental results show that the model's WER improved by 80.8% on SPGISpeech, 76.7% on LibriSpeech, 59.3% on GigaSpeech, and 52.6% on VoxPopuli. The results suggest that continual training on diverse speech helps speech foundation models adapt to different speaking styles and domains.
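The evaluation metric above, word error rate, and the relative-improvement figures reported (e.g. 80.8% on SPGISpeech) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code; the example sentences and baseline/adapted WER values below are hypothetical placeholders, and the relative-improvement formula (baseline minus adapted, divided by baseline) is the standard convention assumed here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction, as in the percentages reported above."""
    return (baseline_wer - adapted_wer) / baseline_wer

# One deleted word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
# A hypothetical baseline WER of 0.50 dropping to 0.10 is an 80% relative improvement
print(relative_improvement(0.50, 0.10))
```

In practice, ASR evaluations typically apply text normalization (casing, punctuation, number formatting) before scoring, which this sketch omits.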
As part of the National Electrical, Electronics and Computer Engineering Conference (NEECECON 2024), this technical session is organized by the UP Electrical and Electronics Engineering Institute with the theme "National Development through Sustainable Industrialization."
NEECECON 2024 is co-located with the Advanced Science, Technology, and Innovation Convention (ASTICON) 2024, held from 18 to 19 July 2024 at the Novotel Manila Araneta City in Quezon City.
ASTICON 2024 showcased DOST-ASTI and UP EEEI's pioneering contributions to the ICT landscape while celebrating the partnerships that drive technological advancement and societal progress in the country.
For more info about the event, visit https://neececon2024.eee.upd.edu.ph.