Optics in AI Clusters - Meta Perspective
Andrew Alduino, Optical Engineer, Meta
AI workloads continue to scale aggressively in complexity and size, requiring more accelerators as well as more high-speed memory capacity and more FLOPS per accelerator. The wide range of newly evolving AI model requirements drives a desire for optionality, and potentially composability, in our future hardware designs to address this uncertainty. Meta has recently announced two 24k-GPU clusters to support the training of our next-generation Llama 3 LLM. AI cluster design is becoming more challenging, with evolving requirements pointing toward high-power racks with large GPU counts, very high per-GPU power, and very high IO bandwidth demands, all of which create challenges for data center deployments.
In this talk we describe the AI workload requirements driving larger GPU clusters and connect those to the IO demands of future accelerator packages. Effectively escaping IO from future accelerator packages is a critical technology challenge. Scaling electrical signaling is becoming harder, and integrated optics solutions, with their high bandwidth and high bandwidth density, show promise in addressing these package- and rack-scale challenges. Future AI cluster architectures will require co-design of GPU hardware, system IO, rack designs, power delivery, cooling technologies, memory architectures, software paradigms, and more. We see optical interconnects as part of this optimization effort.
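The "escaping IO" constraint can be made concrete with a shoreline bandwidth-density calculation: the total off-package bandwidth an accelerator needs, divided by the package perimeter available for IO escape. The sketch below uses purely illustrative numbers (package size, target bandwidth, per-lane electrical rate are assumptions, not figures from the talk).

```python
# Illustrative shoreline bandwidth-density sketch.
# All numbers are hypothetical assumptions, not Meta figures.

def edge_density_gbps_per_mm(total_bw_tbps: float, perimeter_mm: float) -> float:
    """Off-package bandwidth demand per mm of package edge (Gb/s per mm)."""
    return total_bw_tbps * 1000 / perimeter_mm

# Assume a 55 mm x 55 mm accelerator package and 100 Tb/s of off-package IO.
perimeter_mm = 4 * 55
demand = edge_density_gbps_per_mm(100, perimeter_mm)

# Assume electrical SerDes escape supplies roughly one 112 Gb/s lane per mm
# of usable edge once routing and crosstalk constraints are counted.
electrical_supply_gbps_per_mm = 112

print(f"demand: {demand:.0f} Gb/s/mm, electrical supply: "
      f"{electrical_supply_gbps_per_mm} Gb/s/mm")
```

Under these assumed numbers, demand (~455 Gb/s per mm) exceeds the electrical escape supply by roughly 4x; a gap of this kind is what motivates integrated optics, whose bandwidth density is not tied to the package shoreline in the same way.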