AI/ML DC Design - Part 2

The 0x2 Nerds continue with Petr Lapukhov for part 2 of their discussion on AI/ML data center design.
Comments

Thanks for hosting this series, Jeff, legendary :)

LibertypopUK

### Key Discussion Points:

#### 1. Upcoming IETF Meeting in Vancouver (00:00:34 - 00:00:42)
- The hosts mention an upcoming IETF meeting in Vancouver, highlighting that many relevant topics will be discussed, including in-network computing for AI training, congestion control, and other new working group topics. (00:00:34 - 00:01:01)
- It is noted that one of the routing sessions will likely occur during the same time they would normally record their show. (00:01:14 - 00:01:22)
- The hosts encourage listeners to check the agenda and participate online for free if they cannot attend in person. (00:01:36 - 00:01:49)

#### 2. Focus on Fundamentals of Routing (00:03:54 - 00:04:00)
- The discussion shifts to routing, a fundamental aspect of networking. (00:03:45 - 00:03:54)
- They decide to revisit first principles, rather than diving into advanced topics right away. (00:04:00 - 00:04:13)

#### 3. The Criticality of Packet Loss in Clusters (00:04:34 - 00:04:49)
- Unlike enterprise networks where packet loss is often tolerable, dropping packets in a cluster environment has severe consequences. (00:04:34 - 00:04:49)
- This leads to a discussion about load balancing. (00:04:58 - 00:05:05)

#### 4. Load Balancing and Routing (00:05:05 - 00:05:22)
- Load balancing is typically a forwarding function, but routing can be leveraged for more than just loop freeness. (00:05:05 - 00:05:10)
- In a classical leaf-spine architecture, routing is mainly used to provide a set of equal-cost multipath (ECMP) next hops; a minimal hashing sketch follows this list. (00:05:22 - 00:05:35)
- Routing is usually unaware of load or other semantics. (00:05:50 - 00:06:00)
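
A minimal sketch of the hash-based ECMP selection mentioned above, assuming a generic 5-tuple hash. Real switch ASICs use their own hash functions and field sets; the uplink names and addresses here are illustrative only.

```python
import hashlib

# Illustrative ECMP next-hop selection: hash the flow's 5-tuple and pick one
# of the equal-cost uplinks, so all packets of a flow stay on one path while
# different flows spread across the spines.
def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port, uplinks):
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return uplinks[int.from_bytes(digest[:4], "big") % len(uplinks)]

uplinks = ["spine1", "spine2", "spine3", "spine4"]
# RoCEv2 uses UDP destination port 4791; the UDP source port commonly carries
# the only per-flow entropy, so poor entropy can pile many flows onto one link.
print(pick_uplink("10.0.1.7", "10.0.9.3", 17, 49152, 4791, uplinks))
```

A scheme like this is oblivious to load, which is exactly the gap the rest of the discussion tries to address.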

#### 5. BGP Metadata and Congestion Signaling (00:06:00 - 00:07:54)
- There's work to introduce metadata into BGP to convey information about the quality of reachability, not just reachability itself. (00:06:00 - 00:06:18)
- A proposal exists for BGP to signal congestion beyond the next hop using a new path attribute called "next-next hop" in a Clos architecture. (00:06:37 - 00:07:54)
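
The sketch below is purely conceptual and is not the proposal itself or any real BGP implementation: it only illustrates the idea that, in a Clos fabric where a leaf normally sees just its directly connected spines as next hops, a congestion metric describing the path beyond the spine could be used to weight ECMP members. The metric scale and weighting function are hypothetical.

```python
# Hypothetical illustration: weight ECMP members by a per-path congestion
# metric (0 = idle, 100 = saturated) that would have been signaled for the
# hops beyond the immediate spine.
def weighted_ecmp(paths):
    weights = {spine: max(1, 100 - congestion) for spine, congestion in paths}
    total = sum(weights.values())
    return {spine: round(weight / total, 2) for spine, weight in weights.items()}

# spine2's onward path is congested, so it receives a smaller share of flows.
print(weighted_ecmp([("spine1", 10), ("spine2", 80), ("spine3", 10)]))
```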

#### 6. Congestion Control in AI Workloads (00:07:55 - 00:10:28)
- Congestion control is crucial for AI workloads, helping to adjust traffic transmission rates. (00:07:55 - 00:08:03)
- With classical Data Center Quantized Congestion Notification (DCQCN), the feedback loop takes roughly one round trip (an ECN mark at the switch, a congestion notification packet back from the receiver) plus processing time; a simplified sketch follows this list. (00:08:22 - 00:08:32)
- BGP's feedback loop for congestion is much slower compared to DCQCN. (00:09:39 - 00:10:28)
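
A highly simplified sketch of DCQCN-style sender rate control, to make the feedback-loop timing concrete: the switch marks packets with ECN, the receiver returns a Congestion Notification Packet (CNP), and only then does the sender cut its rate. The constants are illustrative, and the real algorithm also has additive and hyper-increase stages, byte counters, and separate timers.

```python
# Simplified DCQCN-like sender (illustrative constants, not the full spec).
class DcqcnSender:
    def __init__(self, line_rate_gbps):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered for recovery
        self.alpha = 1.0           # running estimate of congestion severity
        self.g = 1 / 16            # EWMA gain for alpha

    def on_cnp(self):
        # Receiver saw ECN-marked packets and sent a CNP: cut the rate in
        # proportion to alpha and remember where we were.
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer_without_cnp(self):
        # No CNPs recently: decay alpha and recover halfway toward the target.
        self.alpha *= 1 - self.g
        self.rc = (self.rc + self.rt) / 2

sender = DcqcnSender(line_rate_gbps=400)
sender.on_cnp()                 # feedback arrives ~1 RTT after the ECN mark
sender.on_timer_without_cnp()
print(round(sender.rc, 1), "Gbps")   # 300.0
```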

#### 7. Importance of Testing with Realistic Workloads (00:10:28 - 00:11:30)
- AI workloads are not uniform. (00:10:28 - 00:10:50)
- Network behavior differs entirely depending on message sizes and other factors. (00:10:50 - 00:11:02)
- Theoretical assumptions about network design are often incorrect, so testing with real equipment is essential. (00:11:02 - 00:11:30)

#### 8. BGP Scalability vs. Speed (00:11:30 - 00:12:08)
- BGP is immensely scalable but also notoriously slow, which poses a challenge in environments where microsecond latency matters. (00:11:30 - 00:11:53)
- Separation of concerns is key, with reachability handled differently from the quality of reachability. (00:12:08 - 00:12:18)

#### 9. IGP Extensions for Signaling (00:12:18 - 00:13:40)
- There are new extensions to IGP to signal various attributes like available bandwidth and latency, but they require careful management due to frequent fluctuations. (00:12:18 - 00:12:53)
- Updating IGP too often can overload the network, and updating it too slowly might render the information irrelevant. (00:13:17 - 00:13:40)
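
One common way to manage that trade-off, shown here as a generic sketch rather than any specific IGP implementation, is to re-advertise a link attribute only when it changes by a significant fraction and no more often than some minimum interval. The threshold and interval values are arbitrary examples.

```python
import time

# Illustrative dampening of IGP attribute advertisements (e.g. available
# bandwidth): suppress small changes and rate-limit flooding.
class LinkAttributeAdvertiser:
    def __init__(self, change_threshold=0.10, min_interval_s=5.0):
        self.change_threshold = change_threshold
        self.min_interval_s = min_interval_s
        self.last_value = None
        self.last_sent = 0.0

    def maybe_advertise(self, available_bw_gbps, now=None):
        now = time.monotonic() if now is None else now
        if self.last_value is not None:
            change = abs(available_bw_gbps - self.last_value) / max(self.last_value, 1e-9)
            if change < self.change_threshold:
                return False                      # not a significant change
            if now - self.last_sent < self.min_interval_s:
                return False                      # rate-limit the flooding
        self.last_value = available_bw_gbps
        self.last_sent = now
        return True                               # caller floods a new LSA/LSP

adv = LinkAttributeAdvertiser()
print(adv.maybe_advertise(350.0, now=0.0))   # True: first sample
print(adv.maybe_advertise(348.0, now=1.0))   # False: ~0.6% change, suppressed
print(adv.maybe_advertise(200.0, now=2.0))   # False: big change, but too soon
print(adv.maybe_advertise(200.0, now=6.0))   # True: significant and past the interval
```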

#### 10. Traffic Engineering (00:16:58 - 00:18:03)
- RSVP is considered too fiddly and complex for most modern deployments. (00:16:58 - 00:17:19)
- Modern traffic engineering instead involves an IGP within the network, BGP toward the controller, and segment routing. (00:17:01 - 00:17:17)
- Reoptimization of network paths is done in tens of seconds, which is a long time in most networks. (00:17:20 - 00:18:03)

#### 11. Cooperation Between Network and Host Technologies (00:18:03 - 00:18:30)
- The cooperation between network-based and host-based technologies is complex. (00:18:03 - 00:18:20)
- The best solution varies based on transmit rates, message sizes, and other factors. (00:18:20 - 00:18:30)

#### 12. Data Center Solutions and AI (00:18:30 - 00:20:00)
- Good data center design is generally applicable, but its importance is heightened in AI environments. (00:18:30 - 00:19:00)
- Poor network performance in AI/ML can lead to significant costs and inefficiencies. (00:19:00 - 00:20:00)

#### 13. Current Solutions and Development Areas (00:20:00 - 00:22:08)
- Congestion control signaling needs improvement because current methods are designed for storage, not AI. (00:20:00 - 00:20:42)
- There is a lot of development in the area of network-assisted telemetry. (00:21:04 - 00:21:14)
- Combining basic marking with round-trip time (RTT) measurements can help detect incast congestion; a toy illustration follows this list. (00:22:08 - 00:22:30)
- It's crucial to run networks at high utilization in AI environments. (00:22:30 - 00:24:34)
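
A toy illustration of the marking-plus-RTT idea above, not any particular product's algorithm: ECN marks alone say a queue formed somewhere, while a simultaneous RTT spike suggests the queue is persisting, which is the signature of incast. The thresholds and responses are made up for the example.

```python
# Illustrative classification combining the ECN-marked fraction of packets
# with RTT inflation relative to an uncongested baseline.
def classify_congestion(marked_fraction, rtt_us, baseline_rtt_us,
                        mark_threshold=0.3, rtt_inflation_threshold=2.0):
    rtt_inflation = rtt_us / baseline_rtt_us
    if marked_fraction >= mark_threshold and rtt_inflation >= rtt_inflation_threshold:
        return "likely incast: cut the rate aggressively"
    if marked_fraction >= mark_threshold:
        return "transient queue: gentle rate reduction"
    return "no action"

print(classify_congestion(0.6, rtt_us=45.0, baseline_rtt_us=12.0))
print(classify_congestion(0.4, rtt_us=14.0, baseline_rtt_us=12.0))
```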

#### 14. Key Principles for AI Networks (00:25:10 - 00:27:14)
- AI networks are highly optimized for RDMA over IP, specifically RoCEv2, which is routed traffic. (00:25:10 - 00:25:39)
- They require low latency, minimal buffering, and fast convergence. (00:25:41 - 00:26:54)
- Avoiding deep buffers is crucial, as data spends a significant amount of time traveling through different memory tiers; see the back-of-the-envelope numbers after this list. (00:25:55 - 00:26:30)
- Follow best practices, avoid loop hunting, and let protocols handle convergence. (00:27:00 - 00:27:14)
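
To put "minimal buffering" in rough perspective, here is a back-of-the-envelope bandwidth-delay product calculation; the 400 Gbps port speed and the RTT values are illustrative assumptions, not figures from the episode.

```python
# Bandwidth-delay product: the amount of data "in flight" that a buffer may
# need to absorb. Microsecond fabric RTTs keep this small; WAN RTTs do not.
port_speed_bps = 400e9     # 400 Gbps port (assumed)
fabric_rtt_s = 12e-6       # ~12 microsecond round trip inside the fabric (assumed)
wan_rtt_s = 40e-3          # 40 millisecond WAN round trip, for contrast

bdp_fabric = port_speed_bps * fabric_rtt_s / 8
bdp_wan = port_speed_bps * wan_rtt_s / 8
print(f"fabric BDP per port: {bdp_fabric / 1e3:.0f} KB")   # ~600 KB
print(f"WAN BDP per port:    {bdp_wan / 1e9:.1f} GB")      # ~2.0 GB
```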

#### 15. Hyperscale and GPU as a Service (00:27:14 - 00:28:30)
- Large AI clusters are mainly built by hyperscalers or those offering GPU as a service. (00:27:14 - 00:28:30)
- These companies prioritize high performance infrastructure and leverage the experience of those coming from large tech companies. (00:28:30 - 00:29:00)

#### 16. Importance of Following Best Practices (00:28:30 - 00:30:00)
- It’s vital to follow best practices and learn from the mistakes of others. (00:28:30 - 00:29:00)
- It is important to distinguish marketing claims from technical reality. (00:30:00 - 00:30:09)
- Deep-buffer switches are commonly sold, but they do not work for all applications, so you must choose the right vendor for the workload. (00:30:09 - 00:31:52)

#### 17. Buffer Sizing and Vendor Information (00:31:52 - 00:33:00)
- There's a lot of misinformation about buffer sizing. (00:31:52 - 00:32:08)
- Vendors often promote their specific solutions and the importance of deep buffers for all use cases. (00:30:00 - 00:33:00)
- A discussion with outside, unbiased parties would be useful to determine the right approach. (00:33:00 - 00:33:30)

#### 18. Focus on Transport and Evolution (00:33:59 - 00:34:08)
- The evolution of transport is more interesting than hardware. (00:33:59 - 00:34:08)

#### 19. Self-Contained Building Blocks and Scalability (00:40:37 - 00:41:53)
- Networks should be built using self-contained, repeatable building blocks that allow for clear abstraction of details. (00:40:37 - 00:41:53)
- Reducing the amount of state is essential, which is achieved through summarization and avoiding randomness in IP allocation. (00:41:19 - 00:41:53)
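
A small illustration of why structured, non-random address allocation enables summarization, using Python's standard ipaddress module; the prefixes are made-up example addresses for a single pod.

```python
import ipaddress

# If every rack in a pod is numbered out of one contiguous block, the whole
# pod collapses to a single summary route advertised toward the upper tiers.
rack_prefixes = [ipaddress.ip_network(f"10.8.{rack}.0/24") for rack in range(16)]

summary = list(ipaddress.collapse_addresses(rack_prefixes))
print(summary)   # [IPv4Network('10.8.0.0/20')] -- one route instead of 16
```

Randomly allocated prefixes would not collapse, so every rack route would have to be carried everywhere, which is exactly the state growth the building-block approach avoids.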

#### 20. Overlay and Underlay Separation (00:41:53 - 00:43:01)
- The overlay, which is dynamic and involves tenants and virtual functions, should be completely decoupled from the immutable underlay. (00:41:53 - 00:42:01)
- Separate route distribution schemas for overlay and underlay can improve reliability and stability. (00:42:01 - 00:43:01)

#### 21. Importance of Routability (00:43:01 - 00:43:30)
- Workloads must be routed with IP, and Layer 2 solutions should be avoided. (00:43:01 - 00:43:30)
- Summarize wherever possible to reduce the amount of state. (00:43:15 - 00:43:30)

#### 22. Network Design Principles for Scalable Data Centers (00:43:30 - 00:48:30)
- Abstract details as you move up in the network. (00:43:30 - 00:44:17)
- Use pods (self-contained deployment units) for managing upgrades and summarization. (00:44:17 - 00:46:32)
- Aggregation at the different tiers of the network, using best practices such as summarization, is very important for scaling the network. (00:46:32 - 00:48:30)
- Self-contained, repeatable building blocks are key to scalability, avoiding snowflake configurations. (00:48:30 - 00:49:15)

#### 23. Key Takeaways (00:49:15 - 00:50:00)
- Networks are critical and require careful design to avoid severe performance issues and ensure high availability. (00:49:15 - 00:50:00)
- Focus on building good networking solutions that are repeatable. (00:50:00 - 00:50:09)

#### 24. Show Summary (00:50:09 - 00:51:24)
- The discussion emphasized fundamental principles, which the hosts have discussed in previous episodes. (00:50:09 - 00:51:24)

#### 25. Importance of Troubleshooting and Operational Experience (00:51:24 - 00:51:36)
- Actual troubleshooting helps crystallize first principles and emphasizes the need for clear, organized network design. (00:51:24 - 00:51:36)

#### 26. Mental Picture of the Network (00:51:36 - 00:52:07)
- It is important to have a mental picture of the network to aid in troubleshooting. (00:51:36 - 00:52:07)

#### 27. Next Steps and IETF (00:52:07 - 00:54:49)
- The hosts will be attending the IETF meeting in Vancouver, where they will present "BGP over QUIC". (00:53:43 - 00:54:09)
- Listeners are encouraged to participate virtually at IETF meetings to stay updated on networking developments. (00:54:09 - 00:54:49)

samyogdhital

I wish there were a virtual whiteboard or something to map out the ideas in the discussion,
because it packs in a lot of information, which is too much for a viewer's brain like mine.

imnothingyouarebetter