TensorRT vs vLLM on DGX Spark: Why Benchmarks Alone Don’t Work
Author: Superhuman Unlocked
Uploaded: 2026-01-22
Views: 343
Description:
**40 tokens per second is useless if you lose your train of thought waiting 4 minutes for the model to load.**
Project Gepetto, Log Entry 02: We push the NVIDIA DGX Spark to its absolute limits. With the new Christmas 2025 software update, NVIDIA's DGX Spark finally got native support for **NVFP4 quantization**. The promise? Massive speed and reduced memory usage.
I wanted to floor it. I wanted to replace my reliable Ollama setup with a high-performance TensorRT-LLM stack.
The benchmarks looked incredible: 39.5 tok/s on a 30B model.
But then reality hit.
We discovered that raw speed comes with a massive "commitment tax." We ran into the "Configuration Wall," struggled with the open *MXFP4* standard on the massive **GPT-OSS-120B**, and learned a hard lesson about software maturity vs. hardware capability.
*In this video, we debug the assumptions of Local AI:*
*The Productive Stack:* Why we use Qwen3, Phi-4, and Llama-3.3 for different cognitive gears.
*The Crash:* How running three TensorRT containers in parallel made performance collapse, roughly a 3× slowdown.
*The vLLM Surprise:* Why the "industry darling" failed at first (110GB VRAM leak) but redeemed itself with the 120B Architect model.
This is not a benchmark review. This is a field report on engineering a thinking environment that actually works for me.
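If you want to sanity-check tok/s numbers like these on your own box, here is a minimal sketch of measuring real-world throughput against a local OpenAI-compatible endpoint (both trtllm-serve and vLLM expose one). The port, served-model name, and prompt are illustrative assumptions, not the exact setup from the video:

```python
# Minimal throughput check against a local OpenAI-compatible server.
# Assumptions: server on localhost:8000, served-model name "qwen3-30b-a3b".
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3-30b-a3b",  # hypothetical served-model name
    messages=[{"role": "user", "content": "Explain NVFP4 in two sentences."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```

Note that this end-to-end number includes time-to-first-token, which is exactly the latency the intro complains about and which steady-state tok/s benchmarks hide.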
---
*⏱️ Timestamps*
0:00 - Intro: Explorer vs. Caretaker
0:19 - Act I - The Itch
0:55 - Intermezzo - The New Landscape
1:35 - Act II - One Human, Many Gears
4:21 - Act IIa - The Euphoric Part
7:10 - Act IIb - The Clash of the Architects
9:10 - Act III - The Configuration Wall
10:57 - Final Curtain
---
*🛠️ The Stack & Hardware*
*System:* NVIDIA DGX Spark (Blackwell Architecture, 128GB Unified Memory)
*Worker Fast:* Qwen3-30B-A3B (NVFP4) - MoE Throughput King
*Worker Heavy:* Qwen3-32B (NVFP4) - Dense Anchor
*Thinker:* Phi-4-Reasoning-Plus (NVFP4) - Logic Specialist
*Architect:* GPT-OSS-120B (MXFP4) & Llama-3.3-70B (NVFP4)
*Runtimes tested:* TensorRT-LLM (v0.12.0rc6), vLLM (v25.12.post1-py3)
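For reference, loading one of these pre-quantized checkpoints through vLLM's offline Python API looks roughly like this. The local model path and memory settings are illustrative assumptions, and vLLM infers the quantization scheme from the checkpoint config rather than from a flag:

```python
# Minimal sketch: serving a pre-quantized checkpoint with vLLM's offline API.
# The model path is a hypothetical local directory, not the video's exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./models/Qwen3-30B-A3B-NVFP4",  # hypothetical local path
    gpu_memory_utilization=0.80,           # leave headroom on unified memory
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What makes MoE models fast at inference?"], params)
print(outputs[0].outputs[0].text)
```

Keeping gpu_memory_utilization well below 1.0 matters on a unified-memory box like the Spark, where the GPU and OS share the same 128GB pool.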
---
*🔗 Links & Resources*
NVIDIA Spark Playbook vLLM: https://build.nvidia.com/spark/vllm
NVIDIA Spark Playbook TensorRT-LLM: https://build.nvidia.com/spark/trt-llm
Previous Episode (Building Stability): • Running Local LLMs on NVIDIA DGX Spark – A...
#LocalLLM #AI #NVIDIA #MachineLearning #Engineering #DevLog
#TensorRT #vLLM #DGXSpark #Blackwell #NVFP4 #MXFP4 #Qwen #Llama3 #Phi4 #GPTOSS #Ollama
#ProjectGepetto #SystemArchitecture #Benchmark #MadScientist