Breaking the Host Memory Bottleneck: How Peer Direct Transformed Gaudi’s Cloud Performance
Towards Data Science
by Maria Piterberg, February 25, 2026
AI-Generated Deep Dive Summary
Introducing Gaudi accelerators to Amazon’s EC2 DL1 instances initially revealed a critical performance bottleneck. When scaling across multiple nodes for distributed training, models experienced up to 50% performance degradation due to reliance on standard host NICs. Unlike Gaudi’s built-in RDMA-capable network interfaces, these NICs forced all data through host memory, creating latency and bandwidth constraints that undermined the accelerators’ efficiency. This issue was particularly problematic for large-scale AI training, where even minor performance losses translate into significant time and cost increases.
To address this challenge, Peer Direct was developed as a breakthrough solution. By leveraging technologies like libfabric, DMA-BUF, and HCCL, Peer Direct enabled direct memory access between Gaudi devices without involving host memory or CPU processing. This innovation restored RDMA-like performance over cloud NICs, bypassing the traditional bottlenecks of data duplication, TCP/IP overhead, and host CPU involvement. The result was a dramatic improvement in scalability for distributed training workloads, allowing models to achieve near-optimal performance when running across multiple nodes.
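To make the contrast concrete, here is a small illustrative model (not Intel's implementation, and no real libfabric, DMA-BUF, or HCCL calls) that simply counts the memory copies a payload makes on each path. The function names and hop labels are assumptions chosen for clarity; the point is only that staging through host memory doubles the bytes moved per transfer, which is the data-duplication overhead Peer Direct eliminates by letting the NIC DMA directly to and from device memory.

```python
# Toy model (illustrative only): count memory copies on each transfer path.

def host_staged_path(payload_bytes):
    """Classic host-NIC path: device -> host bounce buffer -> NIC on the
    sender, mirrored on the receiver. Each hop is one full copy."""
    return [
        ("device_to_host", payload_bytes),   # DMA into host staging buffer
        ("host_to_nic", payload_bytes),      # NIC reads from host memory
        ("nic_to_host", payload_bytes),      # receiver stages in host memory
        ("host_to_device", payload_bytes),   # DMA into remote accelerator
    ]

def peer_direct_path(payload_bytes):
    """Peer-Direct-style path: the NIC DMAs straight to/from device
    memory, skipping the host staging buffers on both sides."""
    return [
        ("device_to_nic", payload_bytes),
        ("nic_to_device", payload_bytes),
    ]

def bytes_moved(copies):
    """Total bytes transferred across all hops of a path."""
    return sum(n for _, n in copies)

if __name__ == "__main__":
    gib = 1 << 30  # 1 GiB payload
    staged = bytes_moved(host_staged_path(gib))
    direct = bytes_moved(peer_direct_path(gib))
    print(staged // gib, direct // gib)  # 4 2 : half the bytes moved
```

In the real system the savings come not just from fewer copies but also from removing TCP/IP processing and host-CPU involvement from the data path; this sketch captures only the copy count.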
The importance of this advancement is hard to overstate. In a world where large language models like GPT-5 require weeks or even months of training on massive clusters, efficiency is paramount. By recovering the up-to-50% multi-node performance loss, Peer Direct can cut training time dramatically while also reducing energy consumption and carbon emissions, aligning with growing demands for sustainable AI practices. Moreover, faster iteration cycles give organizations a competitive edge in developing advanced models, potentially accelerating innovation across industries.
For readers interested in AI, this breakthrough highlights the critical role that interconnect and memory-path design plays in scaling training workloads beyond a single node.