DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)

Author: Yannic Kilcher

Uploaded: 2021-02-25

Views: 22612

Description: #deberta #bert #huggingface

DeBERTa by Microsoft is the next iteration of BERT-style self-attention Transformer models, surpassing RoBERTa's state-of-the-art results on multiple NLP tasks. DeBERTa brings two key improvements: First, it treats content and position information separately in a new disentangled attention mechanism. Second, it uses relative positional encodings throughout the body of the transformer and injects absolute positional encodings only at the very end. The resulting model is both more accurate on downstream tasks and needs fewer pretraining steps to reach good accuracy. Models are also available on Huggingface and on GitHub.
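
As a rough illustration of the disentangled attention described above, here is a minimal single-head sketch in NumPy, assuming the three-term decomposition from the paper (content-to-content, content-to-position, position-to-content; the position-to-position term is dropped). Variable names, shapes, and the clipping function are simplified for illustration and are not the reference implementation.

```python
# Minimal single-head sketch of DeBERTa-style disentangled attention.
# H holds content states, P holds shared relative-position embeddings.
import numpy as np

def relative_bucket(i, j, k):
    """Clip the relative distance i - j into [0, 2k), as in the paper's delta()."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k

def disentangled_attention(H, P, Wq_c, Wk_c, Wv_c, Wq_r, Wk_r, k=4):
    """H: (N, d) content states, P: (2k, d) relative-position embeddings."""
    N, d = H.shape
    Qc, Kc, Vc = H @ Wq_c, H @ Wk_c, H @ Wv_c          # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r                        # position projections

    scores = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            c2c = Qc[i] @ Kc[j]                         # content -> content
            c2p = Qc[i] @ Kr[relative_bucket(i, j, k)]  # content -> position
            p2c = Kc[j] @ Qr[relative_bucket(j, i, k)]  # position -> content
            scores[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return A @ Vc

# Toy usage with random weights
rng = np.random.default_rng(0)
N, d, k = 6, 8, 4
H = rng.normal(size=(N, d))
P = rng.normal(size=(2 * k, d))
W = lambda: rng.normal(size=(d, d)) / np.sqrt(d)
out = disentangled_attention(H, P, W(), W(), W(), W(), W(), k=k)
print(out.shape)  # (6, 8)
```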

OUTLINE:
0:00 - Intro & Overview
2:15 - Position Encodings in Transformer's Attention Mechanism
9:55 - Disentangling Content & Position Information in Attention
21:35 - Disentangled Query & Key construction in the Attention Formula
25:50 - Efficient Relative Position Encodings
28:40 - Enhanced Mask Decoder using Absolute Position Encodings
35:30 - My Criticism of EMD
38:05 - Experimental Results
40:30 - Scaling up to 1.5 Billion Parameters
44:20 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.03654
Code: https://github.com/microsoft/DeBERTa
Huggingface models: https://huggingface.co/models?search=...
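
The released checkpoints can be loaded through the Huggingface transformers library; a minimal sketch, assuming "microsoft/deberta-base" is one of the published model IDs:

```python
# Load a DeBERTa checkpoint and run a single sentence through the encoder.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa disentangles content and position.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```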

Abstract:
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
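
The disentangled attention score mentioned in the abstract can be written out as follows (a paraphrase of the paper's formulation; Q^c, K^c denote content projections, Q^r, K^r projections of the shared relative-position embeddings, and δ(i, j) the clipped relative distance between positions i and j):

```latex
\tilde{A}_{i,j}
  = \underbrace{Q^{c}_{i}\,{K^{c}_{j}}^{\top}}_{\text{content}\to\text{content}}
  + \underbrace{Q^{c}_{i}\,{K^{r}_{\delta(i,j)}}^{\top}}_{\text{content}\to\text{position}}
  + \underbrace{K^{c}_{j}\,{Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position}\to\text{content}},
\qquad
A = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right)
```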

Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube:    / yannickilcher  
Twitter:   / ykilcher  
Discord:   / discord  
BitChute: https://www.bitchute.com/channel/yann...
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn:   / yannic-kilcher-488534136  
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon:   / yannickilcher  
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
