How Generic Programming and C++ Portability Give Great Performance and Knowledge of What Performs Well

---

How Generic Programming and C++ Portability Give Great Performance and Knowledge of What Performs Well - Eduardo Madrid, Scott Bruce - CppNow 2023
---

This presentation is about how support for the Generic Programming paradigm in C++, together with its portability, has enabled us to achieve very good performance and, almost accidentally, to create a powerful analysis instrument for observing the performance of programming techniques and the real execution capabilities of the "silicon".

We will start by showing how our hash tables are "adaptive". By "adaptive" we mean that there are template parameters to modify memory layouts, hashing options, structure organization, and, crucially, the bit widths of substructures; then, at runtime, a table can transform itself into another table with different parameters. We applied the Generic Programming paradigm from the start, resulting in stunning configurability; this configurability is the instrument that lets us explore which configuration is superior in which execution regime (see the sketch below). In contrast, most other high-performance code hardcodes almost all performance-crucial parameters.
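To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' actual library; all names are invented) of a table whose performance-critical parameters are template parameters instead of hardcoded constants:

```cpp
// Hypothetical sketch: configuration choices become template parameters,
// so each configuration is a distinct, fully optimizable type.
#include <cstddef>
#include <functional>

struct FlatLayout {};            // hypothetical layout policy: slots and metadata interleaved
struct SplitMetadataLayout {};   // hypothetical layout policy: metadata kept in a separate array

template<
    typename Key, typename Value,
    typename Layout = FlatLayout,     // memory layout choice
    typename Hash = std::hash<Key>,   // hashing option
    std::size_t HomeBits = 57,        // hash bits used to pick the home slot
    std::size_t MetadataBits = 7      // bit width of the per-slot metadata substructure
>
class AdaptiveTable {
    static_assert(HomeBits + MetadataBits <= 64, "the hash bits are partitioned");
    // ...storage and operations elided in this sketch...
public:
    // "Adaptive": at runtime a table can rebuild itself as a table with other
    // parameters, e.g. wider per-slot metadata as the load factor grows.
    template<std::size_t NewMetadataBits>
    AdaptiveTable<Key, Value, Layout, Hash, 64 - NewMetadataBits, NewMetadataBits>
    rebuildWithMetadataWidth() const;   // definition elided in this sketch
};
```

Because every choice is a type or a compile-time constant, the compiler can specialize the hot paths for each configuration, while the runtime "transform into another table" operation amounts to rebuilding into a different instantiation.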

Then, we will show what we get by using Generic Programming:
1. It helps us write better code: the effort put into expressing the design in more abstract ways results in a better understanding of what the code should do; it then becomes easier to write the code, yielding better results for no extra effort.
2. It avoids premature commitment, not just to "hardcoded" parameters but to programming techniques. Our case finds this especially useful: each execution regime has many near-optimal choices and there is no global best choice, so we have the luxury of deferring these choices until after we have gained a solid understanding of the merits of the options; in fact, we can even defer them to runtime, as the sketch after this list shows.
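A small, hypothetical illustration of deferring a choice to runtime (assuming C++17; the names are invented, not from the talk): a handful of instantiations are compiled up front, and the program picks one once the execution regime is known.

```cpp
// Sketch: a "hardcoded" parameter (here, the hash width) becomes a runtime
// selection among a few pre-compiled instantiations held in a std::variant.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <variant>

template<unsigned ShiftBits>
struct FibonacciHasher {                        // multiplicative hashing, width as a parameter
    std::uint64_t operator()(std::uint64_t k) const {
        return (k * 0x9E3779B97F4A7C15ull) >> (64 - ShiftBits);
    }
};

using AnyHasher = std::variant<FibonacciHasher<16>, FibonacciHasher<21>, FibonacciHasher<28>>;

AnyHasher pickHasher(std::size_t expectedElements) {
    if (expectedElements < (1u << 16)) return FibonacciHasher<16>{};
    if (expectedElements < (1u << 21)) return FibonacciHasher<21>{};
    return FibonacciHasher<28>{};
}

int main() {
    AnyHasher h = pickHasher(100'000);          // the choice is made at runtime
    auto slot = std::visit([](auto const &hash) { return hash(12345); }, h);
    std::printf("home slot: %llu\n", static_cast<unsigned long long>(slot));
}
```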

Then, we will answer questions like the following (the microbenchmark sketch after the list shows how such questions can be probed):
- "At what point does multiplication become the bottleneck?" (critical for deciding how many non-power-of-two bit sizes we can get away with)
- "If we space out multiplications, do we get better performance?" (obtaining the best compromise between the entropy of the key distribution and latencies)
- "Are we flooding the CPU with very simple bitwise ops, making the decoder the bottleneck?"
- "What seems to be the best memory layout for these two choices? How does each memory layout affect cache behavior?"

Finally, we will share the insights we have gained:
1. The configurability of our hash tables allows wide exploration, with ease and a good level of detail, of the space of choices and their impact on performance.
2. Hash table operations require different types of work. With regard to memory access patterns, the operations range from unpredictable, as in accessing the home slot for a key, to all of the other activities, which ought to be very predictable. With regard to assembler instruction complexity, they range from SIMD instructions and multiplications to the simplest bitwise instructions. Thus, we are confident these workloads are representative of real computation, so our performance observations are not misleading.
3. We lean hard on C++ portability, which enables us to observe different microarchitectures with the same code. This allows apples-to-apples comparisons, matching code to silicon capabilities, and "reversing the lens" to verify that the silicon has the expected general performance.

Benchmarking (especially microbenchmarking) involves some level of synthetic load generation and a (potentially excessive) focus on unrepresentative use cases.
We have seen prior projects get misleading microbenchmarking results entirely too easily. We provide a reliable foundation for performance analysis because we use hash table workloads; Generic Programming and portability further enhance the value by providing an easy way to explore the whole space of choices!
---

Scott Bruce

Scott has been writing software professionally for 20+ years. He worked on distributed systems, search, AdWords, AdSense, and ML at Google for 13 years. He worked for 4.5 years at Snap on monetization systems, performance, and advanced mobile telemetry systems. He is currently an engineer at Rockset, working on realtime analytics. He has presented production software talks at Google, Snap, UCLA, USC, Caltech, and others over the last ten years.

Eduardo Madrid

Eduardo is the author of open source libraries and components such as the Zoo type-erasure framework. He has prior experience as a tech lead at Snap, Inc. and in fintech, including automated trading. He has presented several times at CppCon, CppNow, and C++ London, and once at C++ Munich.

---

Video Sponsors: think-cell and Bloomberg Engineering
Audience Audio Sponsors: Innoplex and Maryland Research Institute
---

---

CppNow 2024
---

#boost #cpp
Comments

Eduardo here.
Thanks for the comments about sort requiring random access:
The point I was making is that there is no need to require random access to guarantee n*log(n) complexity in the worst case.
In any good implementation of the concept of sort there ought to be some metaprogramming to identify useful characteristics to use better algorithms, including identifying that the iterators are random access and activating some better mechanism to sort.
Otherwise, what the standard has done is not to implement the concept of sorting in the best way, but to implement a different concept, "sorting when we can take advantage of random access", leaving the plainer concept of "sort" weakened; hence what I called a "compounding of the error" of not having the concept of sorting implemented.
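A minimal sketch of that kind of dispatch (hypothetical, assuming C++17; not code from the talk): detect the iterator category and pick a mechanism, so the plain concept of "sort" still guarantees O(n log n) without requiring random access.

```cpp
#include <algorithm>
#include <forward_list>
#include <iterator>
#include <type_traits>
#include <vector>

template<typename It>
void generic_sort(It first, It last) {
    using category = typename std::iterator_traits<It>::iterator_category;
    if constexpr (std::is_base_of_v<std::random_access_iterator_tag, category>) {
        std::sort(first, last);                       // best mechanism when random access is available
    } else {
        // Fallback that preserves O(n log n): sort copies in a buffer, then write back.
        std::vector<typename std::iterator_traits<It>::value_type> buffer(first, last);
        std::sort(buffer.begin(), buffer.end());
        std::copy(buffer.begin(), buffer.end(), first);
    }
}

int main() {
    std::forward_list<int> values{3, 1, 2};
    generic_sort(values.begin(), values.end());       // works without random access
}
```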

eduardomadrid

@00:40:00 From my limited understanding, std::sort requires a random-access iterator because not all iterator types support the arithmetic operations (addition, subtraction, and comparison) that are important in essential steps of sorting algorithms, such as partitioning. Here are a few other reasons I found online: for the efficiency of the introsort algorithm that powers std::sort; to facilitate constant-time access; and for consistency within the C++ standard, so that developers get guaranteed performance characteristics of the implementation across all valid types.

abdulshabazz

In one of his master classes at A9 (available on YouTube), Alex Stepanov talks about his experience implementing sort generically for a non-random-access sequence; I am not sure now if he was talking about bidirectional iterators or about a linked list in particular. I can't find it now because I would have to go through all the lectures to search for it, but what I remember is that he said he was excited after implementing it to some degree at the time... but eventually he found out that the performance was unacceptable, and for a surprising reason (surprising for himself at least). The reason was not the cost of iterator increments but that, as the sequence approached sortedness, the comparisons between "adjacent" elements became relatively more and more expensive, because the reads still had to be done on distant data. The idea is that efficient sorting depends on data access (reads for comparison) becoming cheaper and cheaper due to data proximity.

One of the attendees in this video points out that `std::list` has a `.sort()` method, to which the speaker, Eduardo, says that this makes his point further. However, in my opinion, even if `list.sort()` has `n log n` complexity (as documented), I would claim that it is a *completely different beast* than the `std::sort` algorithm. The former exploits the fact that elements do not need to be copied (or moved), and the latter exploits the fact that reads become cheaper as the sequence approaches the sorted state. This completely orthogonal requirement on the element type and the data structure itself might justify keeping the algorithms separate, even though both have the same formal complexity and both are called `sort`; maybe the list-container method (or a free function for linked-list-based iterators) should be called `.order` to make the distinction. (An important point also is that bidirectional does not necessarily imply a linked structure; therefore, the refinement does not precisely match the somewhat oversimplified STL categories of iterators, and that is the further reason I believe `.sort` is a method of the container and not a free function on iterators of a certain category.)

AlfredoCorrea