NSDI '20 - Measuring Congestion in High Performance Datacenter Interconnects
Saurabh Jha and Archit Patke, University of Illinois at Urbana-Champaign; Jim Brandt and Ann Gentile, Sandia National Lab; Benjamin Lim, University of Illinois at Urbana-Champaign; Mike Showerman and Greg Bauer, National Center for Supercomputing Applications; Larry Kaplan, Cray Inc.; Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign; William Kramer, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Ravi Iyer, University of Illinois at Urbana-Champaign
While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.