Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service

Показать описание

Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Mayank Bansal, Bo Yang

A presentation from ApacheCon @Home 2020

Zeus is an efficient, highly scalable and distributed shuffle as a service which is powering all Data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in industry which leads to many issues such as hardware failures (Burn out Disks), reliability and scalability challenges. Zeus is built ground up to support hundreds of thousands of jobs and millions of containers which shuffles petabytes of shuffle data. Zeus has changed the paradigm of current external shuffle which resulted in far better performance for shuffle. Although the shuffle data is getting written Remote, the performance is better or the same for most of the Jobs. In this talk we’ll take a deep dive into the Zeus architecture and describe how it’s deployed at Uber. We will then describe how it’s integrated to run shuffle for Spark, and contrast it with Spark’s built-in sort-based shuffle mechanism. We will also talk about future roadmap and plans for Zeus.

Mayank Bansal:
Mayank Bansal is currently working as a Staff engineer at Uber in data infrastructure team. He is co-author of Peloton. He is Apache Hadoop Committer and Oozie PMC and Committer. Previously he was working at ebay in hadoop platform team leading YARN and MapReduce effort. Prior to that he was working at Yahoo and worked on Oozie.
Bo Yang:
Bo is Sr. Software Engineer II in Uber and working on Spark team. In the past he worked on many streaming technologies.