Spark SQL 2.0 Experiences Using TPC-DS (Berni Schiefer)

This talk summarizes the results of using the TPC-DS workload to characterize the SQL capability, performance, and scalability of Apache Spark SQL 2.0 at multi-terabyte scale in both single-user dedicated and multi-user concurrent execution modes. We track the evolution of Spark SQL across versions 1.5, 1.6, and 2.0 to underscore the pace of improvement in its capability and performance. We also provide best practices and configuration tuning parameters to support the concurrent execution of the 99 TPC-DS queries at scale. The key takeaways are: 1) the substantial progress made by Spark SQL 2.0; 2) what TPC-DS is and why it has become the preferred workload for SQL-on-Hadoop systems; 3) experimental results supporting the optimized execution of multi-user, multi-terabyte TPC-DS-based workloads; 4) the tuning and configuration changes used to attain excellent performance with Spark SQL.