DeepOCR: Reproduction of Optical Context Compression. vision-language model - VLM. VILA based.

Author: AI Podcast Series. Byte Goose AI.

Uploaded: 2025-11-17

Views: 27

Description: DeepOCR: Reproduction of Optical Context Compression

The podcast gives a technical overview of DeepSeek-OCR and DeepOCR. DeepSeek-OCR is a vision-language model designed to explore and validate optical context compression for long documents. The approach compresses large amounts of text into visual representations, achieving compression ratios between 7× and 20× while maintaining high Optical Character Recognition (OCR) accuracy. The core technology is the DeepEncoder, an architecture that combines a window attention component (SAM-base) for high-resolution perception with a global attention component (CLIP-large), bridged by a 16× convolutional compressor that reduces the number of vision tokens. One source details the original research and its performance metrics, reporting state-of-the-art results on benchmarks such as OmniDocBench with fewer vision tokens than competing models. The other sources present DeepOCR, an open-source reproduction of the architecture built on the VILA framework with a Qwen2-7B decoder, which confirms the feasibility and efficiency of the compression hypothesis for addressing long-context challenges in Large Language Models.
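The token arithmetic behind the 16× compressor and the compression ratio can be sketched in a few lines of Python. This is an illustrative calculation only, not code from DeepSeek-OCR or DeepOCR: the function names are hypothetical, the 16-pixel patch size and the 1024×1024 input are assumptions, and the compression ratio is taken to mean text tokens divided by vision tokens.

```python
def vision_tokens(image_side: int, patch_size: int = 16, compressor_factor: int = 16) -> int:
    """Vision tokens left after the convolutional compressor.

    Assumes a square image split into square patches (patch_size is an
    assumption, not a value stated in the description above).
    """
    patches_per_side = image_side // patch_size
    patch_tokens = patches_per_side ** 2          # tokens seen by the window-attention stage
    return patch_tokens // compressor_factor      # tokens passed on to the global-attention stage


def compression_ratio(text_tokens: int, n_vision_tokens: int) -> float:
    """Text tokens represented per vision token (assumed definition)."""
    return text_tokens / n_vision_tokens


if __name__ == "__main__":
    n = vision_tokens(1024)                       # hypothetical 1024x1024 page image
    print("vision tokens:", n)
    # A hypothetical page holding 2560 text tokens:
    print("compression ratio:", compression_ratio(2560, n))
```

Under these assumptions, a page image is reduced from thousands of patch tokens to a few hundred vision tokens before it reaches the decoder, which is where the reported 7× to 20× ratios come from when a page carries a few thousand text tokens.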
