Multimodal Pretraining for Dense Video Captioning
Author: Gabriel Huang
Uploaded: 2020-12-05
Views: 350
Description:
Presentation of our AACL 2020 paper "Multimodal Pretraining for Dense Video Captioning".
Slides: https://tinyurl.com/multimodal-pretra...
arXiv: https://arxiv.org/abs/2011.11760
More on the project:
https://gabrielhuang.github.io/#multi...
Abstract:
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
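For intuition, here is a minimal sketch of what a multimodal sequence-to-sequence captioner can look like: a transformer encoder consumes projected video-frame features concatenated with embedded ASR-text tokens, and a transformer decoder generates caption tokens. All module names, dimensions, and the concatenation-based fusion scheme are illustrative assumptions, not the architecture or pretraining strategy from the paper.

# Hypothetical sketch (PyTorch); not the paper's model.
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, video_feat_dim=1024):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project frame features
        self.text_embed = nn.Embedding(vocab_size, d_model)   # ASR / caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, asr_tokens, caption_in):
        # Fuse modalities by concatenating along the sequence axis.
        src = torch.cat([self.video_proj(video_feats),
                         self.text_embed(asr_tokens)], dim=1)
        tgt = self.text_embed(caption_in)
        # Causal mask so the decoder only attends to past caption tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)  # logits over the caption vocabulary

# Usage: one training step on random stand-in data.
model = MultimodalCaptioner()
video = torch.randn(2, 16, 1024)         # batch of 2 clips, 16 frames each
asr = torch.randint(0, 10000, (2, 20))   # 20 ASR tokens per clip
cap = torch.randint(0, 10000, (2, 12))   # target caption tokens
logits = model(video, asr, cap[:, :-1])  # teacher forcing: shifted input
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), cap[:, 1:].reshape(-1))
loss.backward()

In this sketch, pretraining and finetuning would share the same objective (next-token cross-entropy) and differ only in the data fed in; the actual strategies explored in the paper are described at the arXiv link above.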