How I Tamed 2 × RTX 5090 + 2 × 4090 with Llama.cpp fork
Author: Mukul Tripathi
Uploaded: 2025-06-20
Views: 816
Description:
In this video, I tackle the challenge of setting up a heterogeneous multi-GPU system with two NVIDIA RTX 5090s and two RTX 4090s (100GB+ VRAM total). We dive deep into running 200B+ parameter models like DeepSeek R1 and Qwen3 using two frameworks:
🦙 llama.cpp (82k stars)
🦙 ik-llama.cpp (fork with insane multi-GPU support)
Key Highlights:
ik-llama.cpp Setup: How to clone, build, and configure for mixed GPUs (CUDA arch flags, VRAM allocation).
Performance Benchmarks:
700 tokens/sec prompt processing with ik-llama.cpp (vs 400-450 on vanilla llama.cpp).
10-23 tokens/sec generation across frameworks.
80K context length support (vs 24K on KTransformers).
Multi-GPU Layer Offloading: Custom scripts to distribute model layers across RTX 5090s/4090s.
Live Crash Demo: Lessons on VRAM limits and avoiding OOM errors.
Benchmarking Tools: Use llama-bench to test your config.
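The layer-offloading idea above can be sketched in a few lines: split a model's transformer layers across mixed GPUs in proportion to their VRAM, then emit one tensor-override regex per device. This is a hypothetical sketch, not the script from the video; the 32 GB/24 GB figures, the 94-layer count, and the exact `-ot "regex=CUDAn"` pattern format are assumptions based on how ik_llama.cpp's `--override-tensor` flag is typically used.

```python
# Hypothetical sketch: distribute N transformer layers across mixed GPUs
# proportionally to VRAM, then emit ik_llama.cpp-style "-ot" overrides.
# GPU sizes, layer count, and regex format are illustrative assumptions.

def split_layers(n_layers, vram_gb):
    """Assign each GPU a contiguous block of layers, proportional to its VRAM."""
    total = sum(vram_gb)
    counts = [round(n_layers * v / total) for v in vram_gb]
    counts[-1] = n_layers - sum(counts[:-1])  # absorb rounding error on the last GPU
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c - 1))
        start += c
    return ranges

def override_args(ranges):
    """Build one '-ot' regex per GPU matching its block indices, e.g. blk\\.(0|1|...)\\.=CUDA0."""
    args = []
    for dev, (lo, hi) in enumerate(ranges):
        idx = "|".join(str(i) for i in range(lo, hi + 1))
        args.append(rf'-ot "blk\.({idx})\.=CUDA{dev}"')
    return args

# 2x RTX 5090 (32 GB) + 2x RTX 4090 (24 GB), 94-layer model (illustrative)
ranges = split_layers(94, [32, 32, 24, 24])
for arg in override_args(ranges):
    print(arg)
```

With those numbers the 5090s each take 27 layers and the 4090s 20, which mirrors the "bigger card, more layers" strategy discussed in the video.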
Timestamps:
0:00 Intro & hardware overview
1:17 Why multi-GPU with mixed cards is painful in KTransformers
2:25 Llama.cpp vs ik_llama.cpp at a glance (stars aren’t everything)
3:55 Live VRAM read-out: 2×5090 + 2×4090 (more than 100 GB)
7:23 First speed test: 120 TPS → 700 TPS after tuning
14:09 Building ik_llama.cpp for Ada-Lovelace & Blackwell (-DCMAKE_CUDA_ARCHITECTURES=86;89;120)
18:00 Regex-based layer off-loading explained (-ot "blk\.[0-9]+\.ffn.*=CUDA0")
29:40 Crash & recover: finding the VRAM sweet spot
38:02 llama-sweep-bench: automate prompt/gen benchmarks
41:55 Context length show-down: 24 K (KTransformers) vs 40 K / 80 K / 128 K (ik_llama.cpp / llama.cpp)
48:10 Single-GPU fallback test (one 4090)
51:15 Community resources & my startup scripts
53:14 Final thoughts & when to stick with vanilla Llama.cpp (function calling)
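The build step at 14:09 can be summarized as a short command sequence. Treat this as a sketch under assumptions: the `GGML_CUDA` option name follows upstream llama.cpp conventions and may differ in the fork, and only the CUDA architecture list (86;89;120) comes from the video itself.

```shell
# Sketch of the build steps shown at 14:09; option names assumed from
# upstream llama.cpp conventions, architecture list taken from the video.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="86;89;120"   # Ampere, Ada (4090), Blackwell (5090)
cmake --build build --config Release -j
```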
Resources:
ik-llama.cpp GitHub: https://github.com/ikawrakow/ik_llama...
HuggingFace Models: https://huggingface.co/ubergarm/Qwen3...
My GPU Layer Offloading Strategy: https://github.com/ikawrakow/ik_llama...
Tags: #AI #MachineLearning #MultiGPU #RTX5090 #llama.cpp #ikllama #LargeLanguageModels #DL #TechTutorial