LTX-2: New Joint Text-to-Audiovisual Model
Author: AI Research Roundup
Uploaded: 2026-01-07
Views: 49
Description:
In this AI Research Roundup episode, Alex discusses the paper 'LTX-2: Efficient Joint Audio-Visual Foundation Model'. LTX-2 is a new open-source foundation model designed to generate high-quality video with synchronized audio from text prompts. The researchers use an asymmetric dual-stream Diffusion Transformer architecture that couples a 14B-parameter video stream with a 5B-parameter audio stream. Bidirectional cross-attention between the two streams, combined with temporal Rotary Positional Embeddings, lets the model align sound and picture closely enough for demanding tasks such as lip-syncing. Unlike sequential pipelines, which generate one modality first and then condition the other on it, this approach captures dependencies between visual cues and acoustics in both directions. LTX-2 also uses Gemma 3-12B as a multilingual text encoder to improve prompt understanding.
Paper URL: https://arxiv.org/abs/2601.03233
#AI #MachineLearning #DeepLearning #VideoGeneration #Audiovisual #DiffusionTransformer #Gemma3 #OpenSource
Resources:
GitHub: https://github.com/Lightricks/LTX-2
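
The bidirectional cross-attention idea mentioned in the description can be illustrated with a toy NumPy sketch. This is not the LTX-2 implementation: the token counts, widths, weight initialization, and every function name below are hypothetical, chosen only to show two streams of different width attending to each other in one step. Temporal Rotary Positional Embeddings, multiple heads, and normalization layers are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v, w_o):
    # Single-head cross-attention: tokens in `queries` read from `context`.
    q = queries @ w_q                      # (n_q, d_attn)
    k = context @ w_k                      # (n_c, d_attn)
    v = context @ w_v                      # (n_c, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_q, n_c)
    return (weights @ v) @ w_o, weights    # update has the query width

def make_weights(rng, d_query, d_context, d_attn):
    # Hypothetical random projections; a trained model would learn these.
    return (rng.normal(0, 0.1, (d_query, d_attn)),
            rng.normal(0, 0.1, (d_context, d_attn)),
            rng.normal(0, 0.1, (d_context, d_attn)),
            rng.normal(0, 0.1, (d_attn, d_query)))

def bidirectional_step(video, audio, v_from_a, a_from_v):
    # Both updates are computed from the same pre-update states, so
    # neither modality is generated "first" as in a sequential pipeline.
    video_update, va_weights = cross_attend(video, audio, *v_from_a)
    audio_update, av_weights = cross_attend(audio, video, *a_from_v)
    return video + video_update, audio + audio_update, va_weights, av_weights

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 64))    # 8 video tokens, wider stream (toy size)
audio = rng.normal(size=(20, 32))   # 20 audio tokens, narrower stream (toy size)
v_from_a = make_weights(rng, d_query=64, d_context=32, d_attn=16)
a_from_v = make_weights(rng, d_query=32, d_context=64, d_attn=16)

new_video, new_audio, va_weights, av_weights = bidirectional_step(
    video, audio, v_from_a, a_from_v)
print(new_video.shape, new_audio.shape)
```

The asymmetry shows up only in the projection shapes: each stream keeps its own width, and the shared attention dimension is where the two meet.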