CppCon 2015: Chandler Carruth 'Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!'

A primary use case for C++ is low latency, low overhead, high performance code. But C++ does not give you these things for free; it gives you the tools to control these things and achieve them where needed. How do you realize this potential of the language? How do you tune your C++ code and achieve the necessary performance metrics?

This talk will walk through the process of tuning C++ code from benchmarking to performance analysis. It will focus on small scale performance problems ranging from loop kernels to data structures and algorithms. It will show you how to write benchmarks that effectively measure different aspects of performance even in the face of advanced compiler optimizations and bedeviling modern CPUs. It will also show how to analyze the performance of your benchmark, understand its behavior as well as the CPU's behavior, and use a wide array of tools available to isolate and pinpoint performance problems. The tools and some processor details will be Linux and x86 specific, but the techniques and concepts should be broadly applicable.
--
Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
--



Comments

1:29:40 PGO (Profile-Guided Optimization) is what you're looking for: it lets the program gather data on how likely each branch is to be taken, and that data can then be used for optimization in a second compilation.

syferpolski

For my own notes (and anyone else's!) Chandler's recommended flags for record and report are at 32:52

jgyb

BTW, there is a shortcut in `perf` for the inverted call graph: `perf report -G`. You still need to record with `perf record -g`, of course, and have some debug info present in the binary (if needed, use -Og and/or -ggdb and/or -fno-omit-frame-pointer, or some combination, but be careful, as some of these options might affect the measurements themselves).

movaxh

I agree with perf's decision to display the most expensive callees first (see the discussion at ~32 minutes). It's what you need to initially know. If your reaction is "WTF this function isn't supposed to show up in the list at all", you immediately know that it's way too expensive or called in places you don't expect.

SuareHead

This is a great talk, even though I don't understand 50% or more of it.

kamilziemian

Counts for "cycles" often seem to get attributed to the instruction that gets stuck waiting for the result (the one that reads that register), not just the next instruction in program order. At least that's true for cache-miss loads. At 1:09:30 we can see high counts on jmp instructions back into the main loop, and they directly follow idiv.

Out-of-order execution makes it hard to blame any specific instruction for a cycle because multiple instructions can be in flight at once. To make matters worse, the "cycles" event doesn't usually use PEBS.

The blame might go to the oldest instruction in the ROB, which would explain cache-miss loads blaming the instruction that's stuck waiting. The load itself may be able to retire, leaving the data to arrive via a load buffer without blocking retirement until an instruction needs the load's result. Or not; I forget whether that's true.

BTW, the "cache miss" explanation for the benefit of using UNLIKELY when unrolling is implausible in these tight loops. More likely the benefit is taken-branch throughput effects on the front-end vs. not-taken. When idiv doesn't have to run, instructions per clock is potentially quite high.

Peter_Cordes
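The kind of loop this comment discusses can be sketched roughly as below. This is only an illustration, not the exact code from the talk: `UNLIKELY` is assumed to be a macro over the GCC/Clang builtin `__builtin_expect`, and `reduce` and `sum_reduced` are hypothetical helper names. The hint only influences how the compiler lays out the branch; it does not change the result.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: GCC or Clang. Other compilers need a different spelling.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical helper: reduce n modulo m, skipping the expensive
// division whenever n is already smaller than m.
inline std::uint32_t reduce(std::uint32_t n, std::uint32_t m) {
  if (UNLIKELY(n >= m)) {
    n %= m;  // slow path: a hardware divide
  }
  return n;
}

// Hypothetical tight loop: when most inputs are below m, the divide is
// rarely executed and the loop is dominated by the branch itself.
inline std::uint64_t sum_reduced(const std::uint32_t* data, std::size_t count,
                                 std::uint32_t m) {
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < count; ++i) {
    total += reduce(data[i], m);
  }
  return total;
}
```

Whether the hint helps on a given machine is something to measure rather than assume, for example by comparing the `perf` annotations of builds with and without it.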

I'm on my way to watching all of Chandler Carruth's talks on YT.

kamilziemian

Imagine being at Google for the second week and you get Ken Thompson's code that was basically addressed to you. I'm sure it was beautiful.

Isn't it fascinating how long it takes to explain everything you need to do to get some numbers out of performance benchmarking, numbers that should more or less come from just running the program with default settings? I'm sure perf is made by and for level 99 Linux wizard engineers, to give them a secret language for casting incantations and keeping their secrets safe from normal engineers and programmers. It is not meant to be a nice tool for "normal" people; it just happens, as an accidental byproduct, to give them some useful features once they figure out the spell that summons them.

Yupppi

This is what I want. Superb in how deeply it dives into the matter.

joripiira

If we are interested in the performance of vector::push_back() alone, and we normally do not call vector::reserve() beforehand, then by benchmarking vector::push_back() after vector::reserve() as in the sample given, aren't we measuring the performance of vector::push_back() only partially?

XiongZou
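The two cases this question contrasts can be timed side by side. The following is a minimal sketch using `std::chrono` rather than the benchmark library shown in the talk; `PushBackTiming` and `time_push_back` are hypothetical names, and a single timed run like this is far noisier than a proper benchmark harness.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: the elapsed time plus the final vector state,
// returned so that the work has an observable result.
struct PushBackTiming {
  std::chrono::nanoseconds elapsed;
  std::size_t size;
  std::size_t capacity;
};

// Times n push_back calls. With reserve_first == true the reallocation
// cost is paid up front in reserve(); with false it is spread across the
// push_back calls that trigger growth.
inline PushBackTiming time_push_back(std::size_t n, bool reserve_first) {
  const auto start = std::chrono::steady_clock::now();
  std::vector<int> v;
  if (reserve_first) {
    v.reserve(n);
  }
  for (std::size_t i = 0; i < n; ++i) {
    v.push_back(static_cast<int>(i));
  }
  const auto stop = std::chrono::steady_clock::now();
  return {std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start),
          v.size(), v.capacity()};
}
```

Comparing `time_push_back(n, true)` against `time_push_back(n, false)` for the same `n` gives a rough idea of how much of the cost comes from growth and reallocation rather than from storing the element.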

Go to C++ talk and ask what kind of vim setup he uses :D

dreadlock

Can anyone suggest a book that talks about the stuff mentioned in the video in C++? Thanks.

EngBandar

Why does my perf (ver 3.15.5, running on Gentoo) look totally different? I cannot select the line item to expand it (it is already expanded), so I cannot hit 'a' to see the assembly.

topgun

I wonder which optimization passes deleted the v.reserve(1); push_back(42); calls at 48:13. Actually, gcc doesn't optimize that away.

Peregringlk
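For context, the usual way to keep an optimizer from deleting such calls is a pair of empty inline-assembly barriers, in the style of the `escape` and `clobber` helpers presented in the talk. The sketch below assumes GCC or Clang extended asm syntax; `push_back_once` is a hypothetical wrapper added here for illustration.

```cpp
#include <cstddef>
#include <vector>

// Tells the compiler that p has been observed by code it cannot see,
// so the memory p points to must really exist and be up to date.
inline void escape(void* p) {
  asm volatile("" : : "g"(p) : "memory");
}

// Tells the compiler that all memory may have been read or written,
// so pending stores cannot be dropped as dead.
inline void clobber() {
  asm volatile("" : : : "memory");
}

// Hypothetical wrapper around the sequence discussed in the comment.
inline std::size_t push_back_once() {
  std::vector<int> v;
  v.reserve(1);
  escape(v.data());  // the allocation is now visible to "unknown" code
  v.push_back(42);
  clobber();         // the store of 42 must actually happen
  return v.size();
}
```

Neither barrier emits an instruction; they only constrain what the optimizer may assume, which is why they are cheap enough to leave inside a timed loop.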

Can't find perf for MSYS2. Is this not available for it?

jackadrianzappa

Fast forward. I like his speed, as I drink beer while watching.

llothar

What is that Vim plugin he's using?

julienl

Does anyone know what kind of iTerm2/vim setup he uses?

Espicen

I never get what a piece of C++ code is doing at first glance.

mithunkumar-hsni

How is he running perf on osx? Is he remoting into a Linux box?

jakeflynn
