Lec 06 | Tokenization
Author: LCS2
Uploaded: 2025-09-19
Views: 439
Description:
How do language models understand text? It all starts with tokenization! In this lecture from August 13, 2025, we explore the fundamental step of breaking down text into smaller units (tokens) that a model can process. We'll move beyond simple word splitting to cover powerful subword tokenization algorithms that are essential for modern LLMs. Specifically, we'll dive into the mechanics of Byte-Pair Encoding (BPE), Google's WordPiece, and the probabilistic Unigram model, understanding how each one helps models efficiently handle vast vocabularies and rare words. 🧩
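To make the BPE mechanics concrete, here is a minimal sketch of the merge-learning loop described in the Sennrich et al. paper: starting from words as character sequences, repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The function names and the toy corpus are illustrative, not from the lecture.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {symbol-tuple: count} vocab."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

# Toy corpus in the style of the BPE paper; </w> marks the end of a word
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}
merges, final = learn_bpe(corpus, 3)
# The first merges capture the frequent "est" suffix:
# [('e', 's'), ('es', 't'), ('est', '</w>')]
```

WordPiece follows the same greedy loop but scores candidate merges by the likelihood gain of a language model over the corpus rather than raw pair frequency, and the Unigram model works in the opposite direction, starting from a large seed vocabulary and pruning tokens that contribute least to the corpus likelihood.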
Resources 📚
For slides and other course materials, please visit the website:
Course Website (lcs2.in/llm2501)
Suggested Readings 📖
(BPE) Neural Machine Translation of Rare Words with Subword Units (https://arxiv.org/abs/1508.07909)
(WordPiece) Japanese and Korean Voice Search (https://static.googleusercontent.com/...)
(Unigram) Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (https://arxiv.org/abs/1804.10959)
#Tokenization #BPE #WordPiece #Unigram #Subword #NLP #LargeLanguageModels