Fight fire with fire AI assisted test debug flow and log analysis for AI GPU systems
Автор: Open Compute Project
Загружено: 2025-10-22
Просмотров: 54
Описание:
Wenxi Yan Pri System Engineer - Microsoft, Anna Mary Mathew Director, Hardware Engineering - Microsoft
This talk presents AI-powered test and debug workflows for modern AI GPU datacenters. We capture exhaustive logs during test cycles, decode mixed formats, and apply AI-driven pre-analysis (sanitization, deduplication, enrichment) to isolate error signatures. By integrating project docs, bug databases, and fleet-wide repositories via retrieval-augmented generation, we enable contextual matching, whitelisting, and deep post-analysis using domain specs, historical bugs, and occurrence metrics. Enriched signatures feed large language models for precise root cause analysis, while stop-on-fail triggers forward key alerts and diagnostics to team channels. Case studies—GPU link integrity failures, transient power inrush PGOOD anomalies, and driver reset issues—highlight new AI systems failure modes. Were planning to open source core log_analyze code too.
Повторяем попытку...
Доступные форматы для скачивания:
Скачать видео
-
Информация по загрузке: