'Performance Matters' by Emery Berger

Performance clearly matters to users. For example, the most common software update on the App Store is "Bug fixes and performance enhancements." Now that Moore's Law has ended, programmers have to work hard to get high performance for their applications. But why is performance hard to deliver?

I will first explain why current approaches to evaluating and optimizing performance don't work, especially on modern hardware and for modern applications. I then present two systems that address these challenges. Stabilizer is a tool that enables statistically sound performance evaluation, making it possible to understand the impact of optimizations and to reach conclusions such as that the difference between the -O2 and -O3 optimization levels is indistinguishable from noise (sadly true).
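As a rough illustration of what "indistinguishable from noise" means in practice, the sketch below compares two sets of run times with a permutation test. This is not Stabilizer's implementation: the permutation test is a technique chosen here for simplicity, and the function name `permutation_test` and all timing values are hypothetical.

```python
import random

def _mean(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, trials=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in mean run time.

    Shuffles the pooled samples many times and counts how often a random
    split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    n = len(a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if abs(_mean(pooled[:n]) - _mean(pooled[n:])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical run times in seconds for two builds of the same program.
build_a = [10.2, 9.9, 10.4, 10.1, 9.8, 10.3, 10.0, 10.2]
build_b = [10.1, 10.0, 10.3, 9.9, 10.2, 10.1, 9.7, 10.4]
print("estimated p-value:", permutation_test(build_a, build_b))
```

A large p-value means the observed difference could plausibly arise from run-to-run variation alone; a small one suggests a real effect, provided the runs were collected under comparable conditions.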

Since compiler optimizations have run out of steam, we need better profiling support, especially for modern concurrent, multi-threaded applications. Coz is a new "causal profiler" that lets programmers optimize for throughput or latency, and which pinpoints and accurately predicts the impact of optimizations. Coz's approach unlocks previously unknown optimization opportunities. Guided by Coz, we improved the performance of Memcached (9%), SQLite (25%), and accelerated six other applications by as much as 68%; in most cases, this involved modifying less than 10 lines of code and took under half an hour (without any prior understanding of the programs!). Coz now ships as part of standard Linux distros (apt install coz-profiler).
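To see why predicting the impact of an optimization is harder in concurrent programs, consider a toy model in which two stages run in parallel and the program finishes when the slower one does. This is only an illustration of the question a causal profiler asks ("if this code were X% faster, how much faster would the whole program be?"), not Coz's actual mechanism; the stage names and timings are hypothetical.

```python
def end_to_end_time(stage_times):
    """Toy model: stages run in parallel; the slowest one sets the total time."""
    return max(stage_times.values())

def predicted_gain(stage_times, stage, speedup):
    """Fractional drop in end-to-end time if `stage` ran `speedup` (0..1) faster."""
    baseline = end_to_end_time(stage_times)
    faster = dict(stage_times)
    faster[stage] = stage_times[stage] * (1.0 - speedup)
    return (baseline - end_to_end_time(faster)) / baseline

# Hypothetical per-stage times in seconds.
times = {"parse": 4.0, "compress": 10.0}

for stage in times:
    gain = predicted_gain(times, stage, 0.2)
    print(f"speeding up {stage} by 20% -> {gain:.0%} end-to-end gain")
```

In this model, speeding up the stage that is off the critical path yields no end-to-end gain, even though a conventional profiler would still report time spent there.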

Emery Berger
University of Massachusetts Amherst
@emeryberger

Emery Berger is a Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research (where he is currently on sabbatical), the University of Washington, and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He and his collaborators have created a number of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Crédit Suisse, Reuters, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, the Distinguished Artifact Award for PLDI 2014, Most Influential Paper Awards at OOPSLA, PLDI, and ASPLOS, three CACM Research Highlights, a Google Research Award, a Microsoft SEIF Award, and Best Paper Awards at FAST, OOPSLA, and SOSP; he was named an ACM Distinguished Member in 2018. Professor Berger is currently serving his second term as an elected member of the SIGPLAN Executive Committee; he served for a decade (2007-2017) as Associate Editor of the ACM Transactions on Programming Languages and Systems, and was Program Chair for PLDI 2016.

Comments

This is definitely one of the best conference talks I've ever seen!

azymohliad

Extremely interesting research and a great presentation, thanks!

jotun

That guy just explained what the p-value is and how it works in just a few seconds - that was an entire lecture at uni. Wow

ralph

I'm not even a programmer and enjoyed this presentation. I wish I had teachers like him!

Eduardo

Concerning O1 vs. O2 optimization: fitting in the L1 and L2 caches is a big deal. If the O1 binary happens to fit in L1/L2 and the O2 binary does not, then the O1 binary could perform better than the O2 one.
The big thing today is that a memory round trip takes a couple of hundred CPU cycles. Try to avoid too much pointer-chasing code. Prefetch memory when possible.

Note that Intel Core iX processors up to generation 9 have 256K of L2. The Xeon SP lines have 1M of L2 at a cost of 2 additional cycles of access time. 10th-gen Core processors have 512K of L2.

Be aware that Intel processors since about mid-2000 have had a cache line size of 64 bytes. Prior to that, it was 32 bytes.

In my view, too many software people have a purist view of the world, thinking they can achieve great performance without considering the details of the underlying hardware.

joechang

This is an awesome talk, with some great, novel information (at least to me). The name of the program, "Stabilizer," is humorous, as it is actually more of an "unstabilizer". Excellent work, Emery. I would love to see an example program that demonstrates a significant performance delta between memory layouts.

jknight

This talk is incredible, great job to everyone involved.

RaidenFreeman

The SQLite example surprises me a little. Indirect calls seem like something I would expect the compiler to optimize already.

swapode

This was an amazing talk! Lots of counterintuitive things to correct in our mental models about performance, thank you for the knowledge! Haha, I loved the "eyeball statistics". Wonderful.

georgepantazes

I've never done anything with programming, but still understood almost everything Emery said. Great video!

xonarofficial

This is the only helpful performance analysis talk I have ever seen. Spectacular work, and thank you for making the tools available. I think it'd be spectacular to integrate the layout randomization and causal profiling directly into the Rust toolchain, and I can't see why not.
Edit: seems somebody has ported or begun porting Coz to Rust, very cool. :- )

microcolonel

I really like the causal analysis technique! I might be misunderstanding what "layout" is, but I was confused as to why you would randomize it every .5 seconds. It seems like this could wash out optimizations that are actually valid, e.g. optimizations that reduce the probability of cache misses, because out in the wild layout isn't being randomized all the time. It seems like the fact that you get unexpected distributions when only randomizing once per execution could be indicating that different codes do have different performance characteristics across different layouts, meaning that there are potentially useful code-level optimizations. An extreme example of this could be a data structure that monitors its own timing information and adapts to optimize latency assuming static memory layout, because then randomizing the layout could make that structure look way worse than a more naive approach that doesn't bias itself for any particular layout.

GeoffreyChurchilley

Wow. He went from theory and background knowledge to full-blown applied uses at a really nice pace. This is a great lecture for any student in software engineering. Love the reminders that certain optimizations can cause slowdowns, and that rolling your own naive hash table can have disastrous consequences for performance (37:23)

ehhhhhhhhhh

To install coz on Debian or Ubuntu:
% sudo apt-get install coz-profiler

Papers:
* "Stabilizer: Statistically Sound Performance Evaluation" [ASPLOS 13]

* "Coz: Finding Code that Counts with Causal Profiling" [SOSP 15 Best Paper, CACM Research Highlight]

* Mentioned during talk: "Producing Wrong Data Without Doing Anything Obviously Wrong!" [ASPLOS 09]
Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter Sweeney

Slides are here (Keynote):

EmeryBerger

This is one of the best talks I have ever seen, full stop

seanmacfoy

This was a bloody amazing speech and great content. Now the coz-profiler just needs to be ported to macOS. If it already works on Linux, it probably isn't too big a leap.

casperes

One of the better talks on "Memory Layout" within a given program or application that has a direct effect on Performance. The jokes and puns are great! No more malloc! Run it as me and it's faster... lol! Great content!

skilz

As a data scientist I take personal offence to the R slander, but on a serious note this was a great talk and I really enjoyed listening!

AdamGaffney

Excellent talk. Informative, engaging, clear.

ianchristensen

Enlightening talk and amazing results. If only coz existed for every language.

kkiller
