Lec 06 | Tokenization
Author: LCS2
Uploaded: 2025-09-19
Views: 439
Description:
How do language models understand text? It all starts with tokenization! In this lecture from August 13, 2025, we explore the fundamental step of breaking down text into smaller units (tokens) that a model can process. We'll move beyond simple word splitting to cover powerful subword tokenization algorithms that are essential for modern LLMs. Specifically, we'll dive into the mechanics of Byte-Pair Encoding (BPE), Google's WordPiece, and the probabilistic Unigram model, understanding how each one helps models efficiently handle vast vocabularies and rare words. 🧩
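To make the BPE mechanics concrete, here is a minimal sketch of the merge-learning loop described in the Sennrich et al. paper: starting from words as character sequences, repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The function names and the toy corpus are illustrative, not from the lecture.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {symbol-tuple: count} vocab."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

# Toy corpus in the style of the BPE paper; </w> marks the end of a word
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}
merges, final = learn_bpe(corpus, 3)
# The first merges capture the frequent "est" suffix:
# [('e', 's'), ('es', 't'), ('est', '</w>')]
```

WordPiece follows the same greedy loop but scores candidate merges by the likelihood gain of a language model over the corpus rather than raw pair frequency, and the Unigram model works in the opposite direction, starting from a large seed vocabulary and pruning tokens that contribute least to the corpus likelihood.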
Resources 📚
For slides and other course materials, please visit the website:
Course Website (lcs2.in/llm2501)
Suggested Readings 📖
(BPE) Neural Machine Translation of Rare Words with Subword Units (https://arxiv.org/abs/1508.07909)
(WordPiece) Japanese and Korean Voice Search (https://static.googleusercontent.com/...)
(Unigram) Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (https://arxiv.org/abs/1804.10959)
#Tokenization #BPE #WordPiece #Unigram #Subword #NLP #LargeLanguageModels