CppCon 2015: Chandler Carruth 'Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!'

A primary use case for C++ is low latency, low overhead, high performance code. But C++ does not give you these things for free; it gives you the tools to control these things and achieve them where needed. How do you realize this potential of the language? How do you tune your C++ code and achieve the necessary performance metrics?

This talk will walk through the process of tuning C++ code from benchmarking to performance analysis. It will focus on small scale performance problems ranging from loop kernels to data structures and algorithms. It will show you how to write benchmarks that effectively measure different aspects of performance even in the face of advanced compiler optimizations and bedeviling modern CPUs. It will also show how to analyze the performance of your benchmark, understand its behavior as well as the CPU's behavior, and use a wide array of tools available to isolate and pinpoint performance problems. The tools and some processor details will be Linux and x86 specific, but the techniques and concepts should be broadly applicable.
--
Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
--

Comments

1:29:40 PGO (profile-guided optimization) is what you're looking for: it lets the program gather data on how likely each branch is, and the compiler can use this data to optimize on the second compilation.

syferpolski
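
To make the PGO suggestion above concrete, here is a minimal sketch of a function with a skewed branch, with the build steps noted in comments. The function name `checked_sum`, the file name `pgo_demo.cpp`, and the choice of GCC are assumptions for illustration; the talk does not show this code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical hot function: the "rejected" path is assumed to be rare in
// representative input, which is exactly what a profile would record.
//
// Assumed GCC workflow (file and binary names are placeholders), with a
// driver main() that calls checked_sum on representative data:
//   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo   (instrumented build)
//   ./pgo_demo                                            (run writes profile data)
//   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo        (rebuild using the profile)
inline long checked_sum(const std::vector<int>& values, int limit, int& rejected) {
    long total = 0;
    for (int v : values) {
        if (v > limit) {  // rare branch in the assumed training data
            ++rejected;
            continue;
        }
        total += v;       // common path
    }
    return total;
}
```

With the profile in hand, the compiler can lay out the common path as the fall-through without any manual hints in the source. Whether that helps a given loop is something to measure, not assume.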

For my own notes (and anyone else's!) Chandler's recommended flags for record and report are at 32:52

jgyb

BTW, `perf` has a shortcut for the inverted call graph: `perf report -G`. You still need to record with `perf record -g`, of course, and have some debug info present in the binary (if needed, use -Og and/or -ggdb and/or -fno-omit-frame-pointer, or some combination, but be careful, as some of these options might affect the measurements themselves).

movaxh

I agree with perf's decision to display the most expensive callees first (see the discussion at ~32 minutes). It's what you need to initially know. If your reaction is "WTF this function isn't supposed to show up in the list at all", you immediately know that it's way too expensive or called in places you don't expect.

SuareHead

This is a great talk, even though I don't understand 50% or more of it.

kamilziemian

Counts for "cycles" often seem to get attributed to the instruction that gets stuck waiting for the result (the one that reads that register), not just the next instruction in program order. At least that's true for cache-miss loads. At 1:09:30 we can see high counts on jmp instructions back into the main loop, and they directly follow idiv.

Out-of-order execution makes it hard to blame any specific instruction for a cycle because multiple instructions can be in flight at once. To make matters worse, the "cycles" event doesn't usually use PEBS.


The blame might go to the oldest instruction in the ROB, which would explain cache-miss loads blaming the instruction that's stuck waiting. The load itself may be able to retire, leaving the data to arrive via a load buffer and not blocking retirement until an instruction needs the load result. Or not, I forget if that's true.


BTW, the "cache miss" explanation for the benefit of using UNLIKELY when unrolling is implausible in these tight loops. More likely the benefit is taken-branch throughput effects on the front-end vs. not-taken. When idiv doesn't have to run, instructions per clock is potentially quite high.

Peter_Cordes
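
For readers following the UNLIKELY discussion above, here is a minimal sketch of such a macro applied to a branch that guards a division. It assumes GCC or Clang, which provide `__builtin_expect`; the `fastmod` helper is an illustration and may not match the code on screen in the talk.

```cpp
#include <cassert>

// Assumption: GCC or Clang. __builtin_expect tells the compiler which
// outcome to treat as the common one when laying out the code.
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))

// Reduces n modulo ceil, skipping the division when n is already in range.
// The hint marks the division path as the rare one.
inline int fastmod(int n, int ceil) {
    if (UNLIKELY(n >= ceil)) {
        n %= ceil;  // slow path: integer division (idiv on x86)
    }
    return n;
}
```

The hint mainly influences which side of the branch becomes the fall-through path, which fits the comment's point about taken versus not-taken branch throughput in the front end.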

I'm on my way to watching all of Chandler Carruth's talks on YT.

kamilziemian

Imagine being at Google for the second week and you get Ken Thompson's code that was basically addressed to you. I'm sure it was beautiful.

Isn't it fascinating how long it takes to explain everything you need to do to get some numbers out of performance benchmarking, numbers that should more or less come from just running the program by default? I'm sure this perf is made by and for lvl 99 Linux wizard engineers, to have a secret language to cast incantations and keep their secrets safe from normal engineers and programmers. It is not meant to be a nice tool for "normal" people; it just gives them some useful features as an accidental byproduct, once you figure out the spell to call them out.

Yupppi

This is what I want. Superb in how deeply it dives into the matter.

joripiira

If we are interested in the performance of vector::push_back() alone, and we normally do not call vector::reserve() beforehand, then by benchmarking vector::push_back() after vector::reserve() as in the sample given, aren't we measuring the performance of vector::push_back() only partially?

XiongZou
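
One way to look at the question above is to time both variants side by side. The sketch below is a rough illustration using `std::chrono`, not the benchmark harness used in the talk; the helper names `fill_without_reserve`, `fill_with_reserve`, and `time_ns` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Grows from an empty vector: timing this includes every reallocation
// and element move that push_back triggers along the way.
inline std::vector<int> fill_without_reserve(std::size_t n) {
    std::vector<int> v;
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    return v;
}

// Allocates once up front, so the push_back calls never reallocate.
inline std::vector<int> fill_with_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    return v;
}

// Rough wall-clock timer (hypothetical helper): runs fn once and
// returns the elapsed time in nanoseconds.
template <typename Fn>
long long time_ns(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling `time_ns([] { fill_without_reserve(1000000); })` and the reserved counterpart gives two numbers: the difference is roughly the growth cost, while the reserved run approximates the cost of the stores alone. A real harness would also need to keep the results observable so the optimizer cannot delete the work.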

Go to C++ talk and ask what kind of vim setup he uses :D

dreadlock

Can anyone suggest a book that talks about the stuff mentioned in the video in C++? Thanks.

EngBandar

Why does my perf (ver 3.15.5, running on Gentoo) look totally different? I cannot select the line item to expand (it is already expanded), so I cannot hit 'a' to see the assembly.

topgun

I wonder which optimization passes deleted the v.reserve(1); push_back(42); calls at 48:13. Actually, gcc doesn't optimize that away.

Peregringlk
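
For context on the comment above, the usual way to stop an optimizer from deleting such calls is a pair of empty inline-assembly barriers. The sketch below is modeled on the `escape` and `clobber` helpers discussed in the talk and assumes GCC or Clang extended asm; the wrapper `push_one` is a hypothetical name added for illustration.

```cpp
#include <cassert>
#include <vector>

// Tells the compiler the pointer may be read or written by unknown code,
// so the memory behind it must actually exist and hold its contents.
// Assumption: GCC/Clang extended inline assembly.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Tells the compiler that all memory may have been read or written,
// so pending stores cannot be dropped as dead.
static void clobber() {
    asm volatile("" : : : "memory");
}

// The reserve/push_back pair from the comment, with barriers added so the
// allocation and the store have to be materialized.
inline int push_one() {
    std::vector<int> v;
    v.reserve(1);
    escape(v.data());  // the buffer is now visible to "unknown code"
    v.push_back(42);
    clobber();         // the store of 42 must be performed
    return v[0];
}
```

Neither helper emits any instructions of its own; they only constrain what the optimizer is allowed to assume, which is why different compilers can disagree about the unprotected version.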

Can't find perf for MSYS2. Is this not available for it?

jackadrianzappa

Fast forward. I like his speed as I drink beer while watching.

llothar

What is that Vim plugin he's using?

julienl

Does anyone know what kind of iTerm2/vim setup he uses?

Espicen

I never get what a piece of C++ code is doing at first look.

mithunkumar-hsni

How is he running perf on OS X? Is he remoting into a Linux box?

jakeflynn