From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob - Andrey Velichkevich & Yuki Iwai
Author: CNCF [Cloud Native Computing Foundation]
Uploaded: 2025-04-15
Views: 333
Description:
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon events in Hong Kong, China (June 10-11); Tokyo, Japan (June 16-17); Hyderabad, India (August 6-7); Atlanta, US (November 10-13). Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at https://kubecon.io
From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.
Message Passing Interface (MPI) is a foundational technology in distributed computing and is essential for ML frameworks such as MLX, DeepSpeed, and NVIDIA NeMo. It powers efficient communication for large-scale AI workloads over high-speed interconnects such as InfiniBand. However, running MPI on Kubernetes presents challenges: ensuring high-throughput pod-to-pod communication, managing MPI job initialization in containerized environments, and supporting diverse MPI implementations, including OpenMPI, IntelMPI, and MPICH.
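To make the communication pattern concrete, here is a minimal sketch of an MPI collective written with mpi4py. It is an illustrative example only (assuming mpi4py and NumPy are installed), not code from the talk; the allreduce shown is the pattern frameworks like DeepSpeed rely on for gradient aggregation.

    # Minimal allreduce sketch with mpi4py (illustrative only, not the talk's code).
    # Each rank holds a local "gradient"; Allreduce sums it across all ranks.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD              # communicator spanning all launched ranks
    rank = comm.Get_rank()             # this process's ID within the job
    size = comm.Get_size()             # total number of processes

    local_grad = np.full(4, float(rank))        # pretend per-rank gradient
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

    if rank == 0:
        print(f"summed gradients across {size} ranks: {global_grad}")

Launched with, for example, "mpirun -np 4 python allreduce_demo.py", every rank ends up with the same summed array; on Kubernetes, an MPI runtime has to arrange exactly this kind of multi-pod launch, which is where the challenges above come from.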
This talk will introduce the Kubeflow MPI Runtime integrated with Kubeflow TrainJob, featuring distributed training with MLX and LLM fine-tuning with DeepSpeed on Kubernetes. The speakers will highlight SSH-based optimizations that boost MPI performance. Attendees will discover how this solution simplifies, scales, and optimizes AI workloads, addressing key challenges by combining MPI's efficiency with Kubernetes' orchestration power.
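As a rough sketch of what submitting such a job from Python could look like, the snippet below creates a TrainJob custom resource with the standard Kubernetes client. The API group/version, runtime name, image, and spec fields are assumptions made for illustration and may not match the actual Kubeflow Trainer schema described in the talk.

    # Rough sketch: submit a TrainJob that references an MPI-based training runtime.
    # apiVersion, kind, runtime name, and spec fields are illustrative assumptions.
    from kubernetes import client, config

    config.load_kube_config()              # or load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    train_job = {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",   # assumed group/version
        "kind": "TrainJob",
        "metadata": {"name": "deepspeed-finetune", "namespace": "default"},
        "spec": {
            # Points at a training runtime that wires up the MPI launcher, SSH
            # between pods, and hostfile generation; the name is hypothetical.
            "runtimeRef": {"name": "deepspeed-distributed"},
            "trainer": {
                "image": "example.com/llm-finetune:latest",   # placeholder image
                "numNodes": 4,
            },
        },
    }

    api.create_namespaced_custom_object(
        group="trainer.kubeflow.org",      # assumed CRD group
        version="v1alpha1",
        namespace="default",
        plural="trainjobs",
        body=train_job,
    )

The design idea, as the abstract describes it, is that the runtime referenced by the job (rather than the job itself) carries the MPI-specific details, so users choose an MPI implementation and launch mechanics by name instead of configuring them per job.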