'Performance Matters' by Emery Berger

Performance clearly matters to users. For example, the most common software update on the App Store is "Bug fixes and performance enhancements." Now that Moore's Law has ended, programmers have to work hard to get high performance for their applications. But why is performance hard to deliver?

I will first explain why current approaches to evaluating and optimizing performance don't work, especially on modern hardware and for modern applications. I then present two systems that address these challenges. Stabilizer is a tool that enables statistically sound performance evaluation, making it possible to understand the impact of optimizations and to draw conclusions such as that the difference between the -O2 and -O3 optimization levels is indistinguishable from noise (sadly true).
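To make "indistinguishable from noise" concrete, here is a minimal sketch of one way to compare two sets of measured run times. It uses a permutation test, which is a swapped-in illustration and not necessarily the test Stabilizer itself applies. The function name `permutation_p_value`, its parameters, and the sample timings are all hypothetical.

```python
import random


def permutation_p_value(a, b, shuffles=2000, seed=0):
    """Estimate a two-sided p-value for the difference in mean run time.

    a, b: lists of run times (e.g. seconds) for two builds of a program.
    The p-value is the share of random relabelings whose difference in
    means is at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (shuffles + 1)


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    build_a = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 10.1]
    build_b = [10.2, 10.0, 10.1, 10.3, 9.9, 10.2, 10.0, 10.1]
    print("p-value:", permutation_p_value(build_a, build_b))
```

A large p-value means the measured difference is the kind that shuffling the labels produces routinely, so the data give no grounds to claim one build is faster.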

Since compiler optimizations have run out of steam, we need better profiling support, especially for modern concurrent, multi-threaded applications. Coz is a new "causal profiler" that lets programmers optimize for throughput or latency, and which pinpoints and accurately predicts the impact of optimizations. Coz's approach unlocks previously unknown optimization opportunities. Guided by Coz, we improved the performance of Memcached (9%), SQLite (25%), and accelerated six other applications by as much as 68%; in most cases, this involved modifying less than 10 lines of code and took under half an hour (without any prior understanding of the programs!). Coz now ships as part of standard Linux distros (apt install coz-profiler).
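The question a causal profiler answers ("if this code ran faster, how much faster would the whole program be?") can be illustrated with a toy model. The sketch below is not how Coz works internally; Coz estimates the answer experimentally on the running program. Here the answer is simply computed for a hypothetical program whose threads start together and are joined at the end. All names and timings are invented for illustration.

```python
def run_time(threads):
    """Wall-clock time when all threads start together and are joined at the end."""
    return max(sum(parts.values()) for parts in threads.values())


def predicted_gain(threads, thread, part, speedup):
    """Fraction of wall-clock time saved if `part` of `thread` took
    (1 - speedup) times as long, with speedup between 0 and 1."""
    baseline = run_time(threads)
    faster = {name: dict(parts) for name, parts in threads.items()}
    faster[thread][part] *= (1.0 - speedup)
    return (baseline - run_time(faster)) / baseline


if __name__ == "__main__":
    # Hypothetical program: seconds spent per code region, per thread.
    program = {
        "worker_a": {"parse": 6.0, "hash": 4.0},
        "worker_b": {"compress": 7.0},
    }
    for thread, part in [("worker_b", "compress"), ("worker_a", "hash")]:
        gain = predicted_gain(program, thread, part, 0.5)
        print(f"50% faster {part}: {gain:.0%} end-to-end gain")
```

In this model, `compress` is the single most expensive region, yet speeding it up saves nothing because `worker_a` determines the finish time. A profile that only ranks regions by time spent would point at the wrong place.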

Emery Berger
University of Massachusetts Amherst
@emeryberger

Emery Berger is a Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst, the flagship campus of the UMass system. He graduated with a Ph.D. in Computer Science from the University of Texas at Austin in 2002. Professor Berger has been a Visiting Scientist at Microsoft Research (where he is currently on sabbatical), the University of Washington, and at the Universitat Politècnica de Catalunya (UPC) / Barcelona Supercomputing Center (BSC). Professor Berger's research spans programming languages, runtime systems, and operating systems, with a particular focus on systems that transparently improve reliability, security, and performance. He and his collaborators have created a number of influential software systems including Hoard, a fast and scalable memory manager that accelerates multithreaded applications (used by companies including British Telecom, Cisco, Crédit Suisse, Reuters, Royal Bank of Canada, SAP, and Tata, and on which the Mac OS X memory manager is based); DieHard, an error-avoiding memory manager that directly influenced the design of the Windows 7 Fault-Tolerant Heap; and DieHarder, a secure memory manager that was an inspiration for hardening changes made to the Windows 8 heap. His honors include a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, the Distinguished Artifact Award for PLDI 2014, Most Influential Paper Awards at OOPSLA, PLDI, and ASPLOS, three CACM Research Highlights, a Google Research Award, a Microsoft SEIF Award, and Best Paper Awards at FAST, OOPSLA, and SOSP; he was named an ACM Distinguished Member in 2018. Professor Berger is currently serving his second term as an elected member of the SIGPLAN Executive Committee; he served for a decade (2007-2017) as Associate Editor of the ACM Transactions on Programming Languages and Systems, and was Program Chair for PLDI 2016.
Comments

This is definitely one of the best conference talks I've ever seen!

azymohliad

Extremely interesting research and a great presentation, thanks!

jotun

That guy just explained what the p-value is and how it works in just a few seconds. That was an entire lecture at uni. Wow.

ralph

I'm not even a programmer and enjoyed this presentation. I wish I had teachers like him!

Eduardo

Regarding the O1 and O2 optimization levels: fitting in the L1 and L2 caches is a big deal. If the O1 binary happens to fit in L1/L2 and the O2 binary does not, then the O1 binary could perform better than the O2 binary.
The big thing today is that a round trip to memory takes a couple of hundred CPU cycles. Try to avoid too much pointer-chasing code, and prefetch memory when possible.

Note: Intel Core iX processors up to generation 9 have 256K of L2. The Xeon SP lines have 1M of L2 at 2 additional cycles of access time. 10th-gen Core processors have 512K of L2.


Be aware that Intel processors since about mid-2000 have had a cache line size of 64 bytes. Prior to that, it was 32 bytes.

In my view, too many software people have a purist view of the world, thinking they can achieve great performance without considering the details of the underlying hardware.
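The cache-line point in the comment above can be made concrete with a little arithmetic. This is a minimal sketch that assumes 64-byte lines, as cited in the comment. The helper `cache_lines_touched` and the example addresses are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: the line size cited in the comment


def cache_lines_touched(address, size, line_bytes=CACHE_LINE_BYTES):
    """Number of distinct cache lines covered by `size` bytes starting at `address`."""
    if size <= 0:
        return 0
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return last_line - first_line + 1


if __name__ == "__main__":
    # The same 64-byte object, placed at two different (made-up) addresses.
    for address in (0x1000, 0x1020):
        lines = cache_lines_touched(address, 64)
        print(f"object at {address:#x} spans {lines} cache line(s)")
```

The same object can span one line or two depending only on where it happens to be placed, which is one small way memory layout can change performance without any change to the code.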

joechang

This is an awesome talk, with some great, novel information (at least to me). The name of the program, "Stabilizer", is humorous, as it is actually more of an "unstabilizer". Excellent work, Emery. I would love to see an example program that demonstrates a significant performance delta between memory layouts.

jknight

This talk is incredible, great job to everyone involved.

RaidenFreeman

The SQLite example surprises me a little. Indirect calls seem like something I would expect the compiler to optimize already.

swapode

This was an amazing talk! Lots of counterintuitive things to correct in our mental models about performance, thank you for the knowledge! Haha, I loved the "eyeball statistics". Wonderful.

georgepantazes

I've never done anything with programming, but still understood almost everything Emery said. Great video!

xonarofficial

This is the only helpful performance analysis talk I have ever seen. Spectacular work, and thank you for making the tools available. I think it'd be spectacular to integrate the layout randomization and causal profiling directly into the Rust toolchain, and I can't see why not.
Edit: seems somebody has ported or begun porting Coz to Rust, very cool. :- )

microcolonel

I really like the causal analysis technique! I might be misunderstanding what "layout" is, but I was confused as to why you would randomize it every .5 seconds. It seems like this could wash out optimizations that are actually valid, e.g. optimizations that reduce the probability of cache misses, because out in the wild layout isn't being randomized all the time. It seems like the fact that you get unexpected distributions when only randomizing once per execution could be indicating that different codes do have different performance characteristics across different layouts, meaning that there are potentially useful code-level optimizations. An extreme example of this could be a data structure that monitors its own timing information and adapts to optimize latency assuming static memory layout, because then randomizing the layout could make that structure look way worse than a more naive approach that doesn't bias itself for any particular layout.
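One reason for re-randomizing many times within a run is statistical: each run's time then becomes a sum of many independent layout samples, and by the central limit theorem such sums are approximately normally distributed, which is what standard tests assume. The simulation below is a sketch under a made-up model, in which each layout contributes an exponentially distributed cost. The function names and parameters are hypothetical.

```python
import random


def skewness(xs):
    """Sample skewness: near 0 for a symmetric, normal-like distribution."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    third_moment = sum((x - mean) ** 3 for x in xs) / n
    return third_moment / variance ** 1.5


def simulate_runs(runs, layouts_per_run, seed=0):
    """Total time of each run when it passes through `layouts_per_run` random layouts.

    Assumed model: every layout adds an independent, exponentially
    distributed (heavily skewed) cost.
    """
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(layouts_per_run))
        for _ in range(runs)
    ]


if __name__ == "__main__":
    print("one layout per run:  ", skewness(simulate_runs(2000, 1)))
    print("60 layouts per run:  ", skewness(simulate_runs(2000, 60)))
```

With one layout per run the run times inherit the skew of the layout effect. With many layouts per run the skew largely averages out, so comparisons based on means and variances become better justified.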

GeoffreyChurchilley

Wow. He went from theory and background knowledge to full-blown applied uses at a really nice pace. This is a great lecture for any student in software engineering. Love the reminders that certain optimizations can cause slowdowns, as well as the reminder that rolling your own naive hash table can have disastrous consequences for performance (37:23).

ehhhhhhhhhh

To install coz on Debian or Ubuntu:
% sudo apt-get install coz-profiler

Papers:
* "Stabilizer: Statistically Sound Performance Evaluation" [ASPLOS 13]

* "Coz: Finding Code that Counts with Causal Profiling" [SOSP 15 Best Paper, CACM Research Highlight]

* Mentioned during talk: "Producing Wrong Data Without Doing Anything Obviously Wrong!" [ASPLOS 09]
Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter Sweeney

Slides are here (Keynote):


EmeryBerger

This is one of the best talks I have ever seen, full stop

seanmacfoy

This was a bloody amazing speech and great content. Now the coz-profiler just needs to be ported to macOS. If it already works on Linux, that probably isn't too big a leap.

casperes

One of the better talks on "Memory Layout" within a given program or application that has a direct effect on Performance. The jokes and puns are great! No more malloc! Run it as me and it's faster... lol! Great content!

skilz

As a data scientist I take personal offence to the R slander, but on a serious note this was a great talk and I really enjoyed listening!

AdamGaffney

Excellent talk. Informative, engaging, clear.

ianchristensen

Enlightening talk and amazing results. If only coz existed for every language.

kkiller