How Generic Programming and C++ Portability Give Great Performance and Knowledge of What Performs Well

---
How Generic Programming and C++ Portability Give Great Performance and Knowledge of What Performs Well - Eduardo Madrid, Scott Bruce - CppNow 2023
---
This presentation is about how C++'s support for the Generic Programming Paradigm, together with its portability, enabled us to achieve very good performance and, almost by accident, to build a powerful analysis instrument for observing the performance of programming techniques and the real execution capabilities of the "silicon".
We will start by showing how our hash tables are "adaptive". By "adaptive" we mean that template parameters control memory layouts, hashing options, structure organization, and, crucially, the bit widths of substructures. Then, at runtime, a table can transform itself into another table with different parameters. We applied the Generic Programming Paradigm from the start, resulting in stunning configurability; this configurability is the instrument that lets us explore which configuration is superior in which execution regime. In contrast, most other high-performance code of any type hardcodes almost all performance-crucial parameters.
Then, we will show what we get by using Generic Programming:
1. It helps us write better code: the effort put into expressing the design in more abstract ways yields a better understanding of what the code should do, which makes the code easier to write, giving better results for no extra effort.
2. It avoids premature commitment: not just to "hardcoded" parameters, but to programming techniques. This is especially useful in our case: each execution regime has many near-optimal choices and there is no global best; we have the luxury of deferring these choices until after we have gained a solid understanding of the merits of the options, and we can even defer them to runtime!
Then, we will answer questions like:
- "At what point does multiplication become the bottleneck?" (critical for deciding how many non-power-of-two bit sizes we can get away with)
- "If we space out multiplications, do we get better performance?" (finding the best compromise between the entropy of the key distribution and latencies)
- "Are we flooding the CPU with very simple bitwise ops, making the decoder the bottleneck?"
- "What seems to be the best memory layout for these two choices? How does each memory layout affect cache behaviors?"
Finally, we will share insights we've gotten:
1. The configurability of our hash tables allows wide exploration, with ease and a good level of detail, of the space of choices and their impact on performance.
2. Hash table operations require different types of work. With regard to memory access patterns, the operations range from unpredictable, as in accessing the home slot for a key, to all of the other activities, which ought to be very predictable. With regard to instruction complexity, they range from SIMD instructions and multiplications down to the simplest bitwise instructions. Thus, we are confident these workloads are representative of real computation, so our performance observations are not misleading.
3. We lean hard on C++ portability, which enables us to observe different micro-architectures with the same code. This allows apples-to-apples comparisons, matching code to silicon capabilities, and even 'reversing the lens' to verify that the silicon has the expected general performance.
Benchmarking (especially microbenchmarking) involves some level of synthetic load generation and (potentially excessive) focus on unrepresentative use cases.
We've seen prior projects get misleading microbenchmarking results entirely too easily. Our hash table workloads provide a reliable foundation for performance analysis, and Generic Programming and portability further enhance its value by providing an easy way to explore the whole space of choices!
---
Scott Bruce
Scott has been writing software professionally for 20+ years. He worked in distributed systems, search, AdWords, AdSense, and ML at Google for 13 years. He worked 4.5 years at Snap on monetization systems, performance, and advanced mobile telemetry systems. He is currently an engineer at Rockset, working on real-time analytics. He has presented production software talks at Google, Snap, UCLA, USC, Caltech, and others over the last ten years.
Eduardo Madrid
Author of open-source libraries and components such as the Zoo type-erasure framework. Prior experience as tech lead at Snap, Inc. and in fintech, including automated trading. Several-time presenter at CppCon, C++Now, and C++ London, and once at C++ Munich.
---
Video Sponsors: think-cell and Bloomberg Engineering
Audience Audio Sponsors: Innoplex and Maryland Research Institute
---
---
CppNow 2023
---
#boost #cpp