ops-class.org | Performance and Benchmarking

Technical Women

Figure 1. Vicki Hanson

Catch by Dresses

Highway to Hell by AC/DC

Today

Performance and benchmarking.

$ cat announce.txt

Questions about the collaboration policy?
ASST3.3 is due Friday May 6th @ 5PM
The final exam is two weeks from today: Monday, May 9 @ 3:30PM.
Recitations resume this week covering
- ASST3 swapping (this week)
- Exam review (next week)

Operating System Performance

Why do we care?

Understandable but interesting fact: people prefer a mostly-correct but extremely fast system to a completely correct and slow system.
Thought experiment: would crashes bother you at all if your system rebooted instantaneously?

THE Last Correct Operating System

We have found it is possible to design a refined multiprogramming system in such a way that its logical soundness can be proved a priori and that its implementation admits exhaustive testing. The only errors that showed up during testing were trivial coding errors (occurring with a density of only one error per 500 instructions), each of them located within 10 minutes (classical) inspection at the machine and each of them correspondingly easy to remedy.

— Edsger Dijkstra

Operating System Performance

OK, so we care about performance… but improving operating system performance is not so different from improving the performance of any other computer system or program.

Here’s what to do:

Measure your system.
Analyze the results.
Improve the slow parts.
Drink celebratory beer.
Goto 1.

Next Time

Hints for improving operating system performance.

But Seriously…

What’s so hard here?

Measure your system. How? And doing what?
Analyze the results. You mean statistics? Ugh.
Improve the slow parts. How? And which slow parts?
Drink celebratory beer. Which beer? And where? (But I’m ready for one after the statistics.)
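
One common way to approach "which slow parts?" is to profile before changing anything. The sketch below uses Python's standard cProfile and pstats modules on a made-up workload. The names slow_part, fast_part, workload, and hottest_function are hypothetical and exist only for illustration, and get_stats_profile assumes Python 3.9 or later.

```python
import cProfile
import pstats


def slow_part():
    # Hypothetical hot spot: a long pure-Python loop.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total


def fast_part():
    # Hypothetical cheap step: a single built-in call.
    return sum(range(1_000))


def workload():
    slow_part()
    fast_part()


def hottest_function(fn) -> str:
    """Run fn under the profiler and return the name of the function
    with the largest own time (tottime, excluding callees)."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    profile = pstats.Stats(profiler).get_stats_profile()
    name, _ = max(profile.func_profiles.items(),
                  key=lambda item: item[1].tottime)
    return name


if __name__ == "__main__":
    print("Spend your effort on:", hottest_function(workload))
```

The profiler adds overhead of its own, so the output is better read as a ranking of where time goes than as absolute timings.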

Measuring Your System: How?

Should be easy to measure time passed on a single computer, right?

Welcome to the wonderful world of hardware:

High-level software counters may not have fine enough resolution to measure extremely fast events.
Low-level hardware counters may have extremely device-specific interfaces, making cross-platform measurement more difficult.
All counters roll over eventually, and the higher the resolution, the sooner they do.
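
The resolution and rollover points above can be made concrete with a little arithmetic and one probe. This is a minimal sketch using Python's standard time module. The counter widths and tick rates are illustrative assumptions, not properties of any particular hardware, and rollover_seconds and observed_resolution_ns are hypothetical helper names.

```python
import time


def rollover_seconds(bits: int, ticks_per_second: float) -> float:
    """Seconds until an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) / ticks_per_second


def observed_resolution_ns(samples: int = 100_000):
    """Smallest nonzero step seen between successive clock reads.

    Returns None if the clock never advanced during sampling.
    """
    smallest = None
    previous = time.perf_counter_ns()
    for _ in range(samples):
        current = time.perf_counter_ns()
        delta = current - previous
        if delta > 0 and (smallest is None or delta < smallest):
            smallest = delta
        previous = current
    return smallest


if __name__ == "__main__":
    # Assumed example: a 32-bit counter ticking once per millisecond
    # wraps after 2**32 / 1000 seconds, roughly 49.7 days.
    print(rollover_seconds(32, 1e3) / 86_400, "days at 1 kHz")
    # The same counter ticking once per nanosecond wraps after about 4.3 seconds.
    print(rollover_seconds(32, 1e9), "seconds at 1 GHz")
    # What the clock claims about itself versus what is actually observed.
    print(time.get_clock_info("perf_counter").resolution, "s advertised resolution")
    print(observed_resolution_ns(), "ns smallest observed step")
```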

Measurements should be repeatable, right?

Wrong!

You are trying to measure the present, but the rest of the system is trying to use the past to predict the future.
In general, real systems are almost never in exactly the same state they were in the last time you measured whatever you are trying to measure.
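
Because a single timing is not repeatable, one common response is to run many trials and report the spread as well as a central value. The sketch below assumes a few warm-up runs are acceptable for the thing being measured. The functions measure, time_once, and workload are hypothetical names for illustration.

```python
import statistics
import time


def time_once(fn) -> int:
    """Elapsed nanoseconds for one call to fn."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start


def measure(fn, trials: int = 30, warmup: int = 3) -> dict:
    """Time fn repeatedly and summarize the distribution of results."""
    for _ in range(warmup):
        fn()  # let caches, allocators, and so on settle before recording
    samples = [time_once(fn) for _ in range(trials)]
    return {
        "trials": trials,
        "min_ns": min(samples),
        "median_ns": statistics.median(samples),
        "max_ns": max(samples),
        "stdev_ns": statistics.stdev(samples),  # needs trials >= 2
    }


def workload():
    # Stand-in for whatever is being measured.
    sum(i * i for i in range(50_000))


if __name__ == "__main__":
    for key, value in measure(workload).items():
        print(f"{key}: {value}")
```

The gap between min_ns and max_ns on an otherwise idle machine gives a rough sense of how much the rest of the system is moving underneath the measurement.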

Couldn’t understand results

Blamed cache effects

Measuring Real Systems

Yet another problem: measurement tends to affect the thing you are trying to measure!

This has three results:

Measurement may destroy the problem you are trying to measure.
Must separate results from the noise produced by measurement.
Measurement overhead may limit your access to real systems.
- Vendor: "No way am I running your instrumented binary. Your software is slow enough already!"
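
One way to separate results from measurement noise is to estimate the cost of the measurement itself and subtract it. The sketch below does this for a pair of clock reads. It assumes the overhead is roughly constant, which may not hold on a busy machine, and timer_overhead_ns and timed are hypothetical helper names.

```python
import time


def timer_overhead_ns(samples: int = 10_000) -> int:
    """Median cost of two back-to-back clock reads with nothing in between."""
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        deltas.append(end - start)
    deltas.sort()
    return deltas[len(deltas) // 2]


def timed(fn, overhead_ns: int) -> int:
    """Elapsed nanoseconds for fn with the estimated timer cost removed."""
    start = time.perf_counter_ns()
    fn()
    elapsed = time.perf_counter_ns() - start
    # Clamp at zero: a very fast fn can measure below the overhead estimate.
    return max(elapsed - overhead_ns, 0)


if __name__ == "__main__":
    overhead = timer_overhead_ns()
    print("estimated timer overhead (ns):", overhead)
    print("corrected time for sorted() (ns):",
          timed(lambda: sorted(range(1_000)), overhead))
```

If the corrected time is not much larger than the overhead, the event is probably too fast to measure this way and needs to be batched or measured with a finer-grained counter.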

(Aside) Measuring Operating Systems

You can imagine that this is even more fraught given how central the operating system is to the operation of the computer itself.

Difficult to find the appropriate places to insert debugging hooks.
Operating systems can generate a lot of debugging output: imagine tracing every page fault. (And imagine how slow that system would be!)

Let’s Measure Something Else…

OK, benchmarking real systems seems really hard.

What else can we do?

Build a model: abstract away all of the low-level details and reason analytically.
Build a simulator: write some additional code to perform a simplified simulation of more complex parts of the system—particularly hardware.

Distinguishing models from simulations is pretty easy:

"I see equations." That’s a model.
"I see code." That’s a simulation.

Choosing the Right Virtual Reality

Models:

Pro: can make strong mathematical guarantees about system performance…
Con: …usually after making a bunch of unrealistic assumptions.

Simulations:

Pro: in the best case, experimental speedup outweighs lack of hardware details…
Con: …and in the worst case bugs in the simulator lead you in all sorts of wrong directions.
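
To make the model versus simulation distinction concrete, here is a minimal sketch for one quantity: the average memory access time when some fraction of accesses page fault. The model is one equation and the simulation is a loop. The fault probability and latencies are made-up numbers, and model_eat and simulate_eat are hypothetical names. Both share the unrealistic assumption that faults are independent, and a trace-driven simulator would drop that assumption.

```python
import random


def model_eat(p_fault: float, t_mem_ns: float, t_fault_ns: float) -> float:
    """Model ("I see equations"): expected access time as a weighted average."""
    return (1 - p_fault) * t_mem_ns + p_fault * t_fault_ns


def simulate_eat(p_fault: float, t_mem_ns: float, t_fault_ns: float,
                 accesses: int = 200_000, seed: int = 1) -> float:
    """Simulation ("I see code"): replay random accesses and average the cost."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(accesses):
        total += t_fault_ns if rng.random() < p_fault else t_mem_ns
    return total / accesses


if __name__ == "__main__":
    # Assumed values: 1% fault rate, 100 ns memory access, 8 ms fault service.
    p, t_mem, t_fault = 0.01, 100.0, 8_000_000.0
    predicted = model_eat(p, t_mem, t_fault)
    simulated = simulate_eat(p, t_mem, t_fault)
    print("model     (ns):", predicted)
    print("simulation (ns):", simulated)
    print("relative gap:", abs(simulated - predicted) / predicted)
```

Agreement between the two is only a sanity check on the code, since they encode the same assumptions. It says nothing yet about whether either matches a real system.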

Measuring Your System: What?

What metric do I use to compare:

Two disk drives?
Two scheduling algorithms?
Two page replacement algorithms?
Two file systems?

Benchmarks: Definitions

Microbenchmarks:

try to isolate one aspect of system performance.

Macrobenchmarks:

measure one operation involving many parts of the system working together.

Application benchmarks:

focus on the performance of the system as observed by one application.

Benchmarks: Examples

Let’s say we are interested in improving our virtual memory system. (Just a random choice.)

Microbenchmarks:

Time to handle single page fault. (Micro enough?)
Time to look up page in the page table?
Time to choose a page to evict?

Macrobenchmarks:

Aggregate time to handle page faults on a heavily-loaded system?
Page fault rate?

Application benchmarks:

triplesort?
parallelvm?
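
As a small illustration of the page fault rate metric, the sketch below counts faults for two replacement policies on a short reference string. This is a simulation on a made-up trace, not a benchmark of a real system. The trace, the frame count, and the function names are assumptions chosen for illustration.

```python
from collections import OrderedDict, deque


def fifo_faults(trace, frames: int) -> int:
    """Count page faults when evicting the page that was loaded earliest."""
    resident = set()
    load_order = deque()
    faults = 0
    for page in trace:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.discard(load_order.popleft())
        resident.add(page)
        load_order.append(page)
    return faults


def lru_faults(trace, frames: int) -> int:
    """Count page faults when evicting the least recently used page."""
    resident = OrderedDict()  # oldest use first, most recent use last
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)
        resident[page] = True
    return faults


if __name__ == "__main__":
    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    for name, count in (("FIFO", fifo_faults(trace, 3)),
                        ("LRU", lru_faults(trace, 3))):
        print(f"{name}: {count} faults, rate {count / len(trace):.2f}")
    # FIFO: 9 faults, rate 0.75
    # LRU: 10 faults, rate 0.83
```

On this particular trace FIFO happens to beat LRU, which is a reminder that a single small benchmark can point in a misleading direction.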

Benchmarks: Problems

Microbenchmarks:

Tree, meet forest. May not be studying the right thing!

Macrobenchmarks:

Forest, meet trees. Introduces many, many variables that can complicate analysis.

Application benchmarks:

Who cares about your stupid application? Improvements for it may harm others!

Benchmark Bias

Bigger problem with benchmarking in practice.

People choosing and running the benchmarks may be trying to justify some change by making their system look faster.
Alternatively, they may have chosen a benchmark and done a lot of work to improve its performance while ignoring other effects on the system.

Benchmarking Blues

This isn’t (all) just human folly. There’s a fundamental tension here.

The most useful system is a general-purpose system.
The fastest system is a single-purpose system.

What TO Do

Have a goal in mind more specific than "I want to make this blob of code faster." This helps choose measurement techniques and benchmarks.
Validate your models and simulator before you start changing things.
- Do their results match your intuition? If not, something is wrong.
- Do their results match reality? If not, something is really wrong.
Use modeling, simulation, and real experiments as appropriate.
- If you can’t convince yourself analytically that a new approach is an improvement, don’t bother simulating.
- If your simulator doesn’t show improvement, don’t bother implementing.

Next Time

Butler Lampson on how to make things fast.
