Scale By The Bay 2019: Jacques Nadeau, Vectorized Query Processing for CPUs using Apache Arrow

Title: Vectorized Query Processing for CPUs and GPUs using Apache Arrow

Query processing technology has developed rapidly since the iconic C-Store paper was published in 2005. The focus has been on designing query processing algorithms and data structures that use the CPU efficiently and take advantage of changing hardware trends to deliver optimal performance. In this talk we will explore different types of vectorized query processing in Dremio using Apache Arrow.

Abstract: Columnar data has become the de facto format for building high-performance query engines that run analytical workloads. Apache Arrow is an in-memory columnar data format that houses canonical in-memory representations for both flat and nested data structures. It is a natural complement to on-disk formats like Apache Parquet and Apache ORC.

Data stored in a columnar format is amenable to processing with the vectorized (SIMD) instructions available on all modern architectures. Query processing algorithms can be implemented as simple, efficient code that operates on the columnar values in a tight loop, providing fast and CPU cache-friendly access patterns. Operations like SUM, FILTER, COUNT, MIN and MAX on columnar data can be made more efficient by exploiting the data-level parallelism of SIMD instructions.

Columnar data can be encoded using lightweight algorithms like dictionary encoding, run-length encoding, bit packing and delta encoding, which are far more CPU-efficient than general-purpose compression algorithms like LZO and ZLIB. Furthermore, vectorized query processing algorithms can be written to be aware of column-level encoding and, in some cases, can operate directly on the compressed column values. This saves CPU-memory bandwidth, since only the necessary column values need to be decompressed.

A columnar format also allows efficient use of CPU and GPU caches by filling cache lines with related data (column values from an in-memory vector).
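To make the idea of operating on encoded column values concrete, here is a minimal sketch in plain Python. It is an illustration only: it does not use the Apache Arrow API and is not Dremio's implementation, and the helper names (rle_encode, rle_sum, dict_encode, dict_filter_count) are hypothetical. It shows a SUM computed per run on a run-length-encoded column, and a filtered COUNT on a dictionary-encoded column where the predicate is evaluated once per distinct value and the per-row loop touches only integer codes.

```python
from array import array


def rle_encode(values):
    """Run-length encode integers into parallel (run_values, run_lengths) arrays."""
    run_values, run_lengths = array("q"), array("q")
    for v in values:
        if run_values and run_values[-1] == v:
            run_lengths[-1] += 1          # extend the current run
        else:
            run_values.append(v)          # start a new run
            run_lengths.append(1)
    return run_values, run_lengths


def rle_sum(run_values, run_lengths):
    """SUM over the encoded column: one multiply-add per run, no decompression."""
    total = 0
    for v, n in zip(run_values, run_lengths):
        total += v * n
    return total


def dict_encode(values):
    """Dictionary encode: a list of distinct values plus one integer code per row."""
    dictionary, index, codes = [], {}, array("i")
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def dict_filter_count(dictionary, codes, predicate):
    """COUNT rows matching `predicate`, evaluating it once per distinct value."""
    matches = [bool(predicate(v)) for v in dictionary]
    count = 0
    for code in codes:                    # tight loop over integer codes only
        count += matches[code]
    return count


quantities = [5, 5, 5, 2, 2, 9]
run_values, run_lengths = rle_encode(quantities)
print(rle_sum(run_values, run_lengths))   # 28

cities = ["SF", "NY", "SF", "LA", "SF"]
dictionary, codes = dict_encode(cities)
print(dict_filter_count(dictionary, codes, lambda c: c == "SF"))  # 3
```

In a real engine the inner loops would run over fixed-width buffers in compiled code so that SIMD instructions can apply; the Python loops here only show which values each loop needs to read.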
With the increasing use of GPUs and FPGAs, efficient use of the smaller on-chip memory available in these architectures is especially important. In addition, Apache Arrow allows zero-copy, shared access to buffers so that multiple processes can operate on the same data more efficiently. On the storage side, a columnar on-disk representation lets analytical queries read only the columns they need, which makes efficient use of disk I/O bandwidth.

Dremio's query processing engine uses the columnar formats of Apache Arrow and Parquet for in-memory and on-disk representations, respectively. We have vectorized implementations of operators such as hash join and hash aggregation, to name a few.
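The zero-copy idea mentioned above can be sketched with Python's standard library. This is an in-process analogy only, under stated assumptions: a plain array stands in for an Arrow value buffer, memoryview slices stand in for consumers sharing it, and the helper names (split_batches, batch_sum) are hypothetical. Sharing between separate processes, as Arrow supports, would need shared memory or memory-mapped files, which this sketch does not show.

```python
from array import array


def split_batches(view, batch_size):
    """Return views over consecutive batches of `view` without copying any data."""
    return [view[i:i + batch_size] for i in range(0, len(view), batch_size)]


def batch_sum(batch):
    """Sum one batch by reading values straight from the shared buffer."""
    total = 0
    for v in batch:
        total += v
    return total


# One contiguous buffer of 64-bit integers stands in for a column's value buffer.
column = array("q", range(1000))

# A memoryview exposes the same memory; slicing it creates more views, not copies.
batches = split_batches(memoryview(column), 500)
print([batch_sum(b) for b in batches])    # [124750, 374750]

# A write through the original buffer is visible through every view,
# which confirms that the batches share memory with `column`.
column[0] = 42
print(batches[0][0])                      # 42
```

Because each batch is only a window onto the same bytes, handing a batch to another consumer costs the same regardless of how many values it covers.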

FunctionalTV

Related videos

Scale By The Bay 2019: Bill Venners, In Types We Trust

Scale By The Bay 2019: Justin Heyes-Jones, A Gentle Introduction to Comonads

Scale By The Bay 2019: Ahir Reddy & Li Haoyi, Speedy Scala Builds at Databricks

Scale By The Bay 2019: Evan Chan, Rust and Scala, Sitting in a Tree….

Scale By The Bay 2019: David Andrzejewski, Reliable Machine Learning

Scale By The Bay 2019 Highlights

Scale By The Bay 2019: James Earl Douglas, Functional Electromagnetism

Scale By The Bay 2019: Jeremy Smith & Jonathan Indig, Solving the Scala Notebook Experience

Scale By The Bay 2019: Jason Swartz, High Performance Serverless Functions in Scala

Scale By The Bay 2019: Tikhon Jelvis, What is Functional Reactive Programming?

Scale By The Bay 2019: Oli Makhasoeva & Andy Scott, Recursion schemes with Higherkindness

Scale By The Bay 2019: Oscar Boykin Interview

Scale By The Bay 2019: Alexander Ioffe Interview

Scale By The Bay 2019: Yifan Xing, Growing the Scala Community

Scale By The Bay 2019: Oli Makhasoeva, The Art of Asking Questions

Scale By The Bay 2019: Ville Tuulos Interview

Scale By The Bay 2019: Heather Miller Interview

Scale By The Bay 2019: Thursday Keynote, Heather Miller, The Times Are A-Changin'

Scale By The Bay 2019: Paul Cleary, Re-programming the programmer, from Actors to FP

Scale By The Bay 2019: Kavita Laddad, Taming complex webapps with Scala and React

Scale By The Bay 2019: Bryan Cantrill Interview

Scale By The Bay 2019: Bryan Cantrill, Was He Wright All Along? Software After Moore's Law

Scale By The Bay 2018: Yunsup Lee, Leveraging Scala to Build Hardware at Scale