Performance and Benchmarking
Operating System Performance
-
Understandable but interesting fact: people prefer a mostly-correct but extremely fast system to a completely correct and slow system.
-
Thought experiment: would crashes bother you at all if your system rebooted instantaneously?
THE Last Correct Operating System
We have found it is possible to design a refined multiprogramming system in such a way that its logical soundness can be proved a priori and that its implementation admits exhaustive testing. The only errors that showed up during testing were trivial coding errors (occurring with a density of only one error per 500 instructions), each of them located within 10 minutes (classical) inspection at the machine and each of them correspondingly easy to remedy.
But Seriously…
-
Measure your system.
-
Analyze the results.
-
Improve the slow parts.
-
Drink celebratory beer.
Measuring Your System: How?
Should be easy to measure elapsed time on a single computer, right?
-
High-level software counters may not have fine enough resolution to measure extremely fast events.
-
Low-level hardware counters may have extremely device-specific interfaces, making cross-platform measurement more difficult.
-
All counters roll over eventually, and the higher the resolution, the sooner they wrap around.
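As a concrete sketch of the two options (assuming Linux/POSIX clock_gettime for the software counter, and an x86-64 compiler that exposes the time-stamp counter as __rdtsc via <x86intrin.h> for the hardware one; other architectures expose their cycle counters differently):

/* Hedged sketch: software timer vs. raw cycle counter. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* x86-only: provides __rdtsc() */

int main(void) {
    /* High-level software counter: portable, nanosecond-ish resolution,
       but each call itself costs tens of nanoseconds. */
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... event being measured ... */
    clock_gettime(CLOCK_MONOTONIC, &end);
    long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
              (end.tv_nsec - start.tv_nsec);

    /* Low-level hardware counter: cycle resolution, but device-specific,
       and like every counter it eventually wraps around. */
    uint64_t c0 = __rdtsc();
    /* ... event being measured ... */
    uint64_t c1 = __rdtsc();

    printf("%ld ns, %llu cycles\n", ns, (unsigned long long)(c1 - c0));
    return 0;
}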
Measurements should be repeatable, right?
-
You are trying to measure the present, but the rest of the system (caches, TLBs, branch predictors) is busy using the past to predict the future.
-
In general real systems are almost never in the exact same state that they were the last time you measured whatever you are trying to measure.
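A small sketch of why (assuming POSIX clock_gettime; the buffer size is arbitrary): the identical operation timed five times in a row rarely produces the same number, because the first run pays for cold caches, a cold TLB, and demand-faulted pages that later runs get for free.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N (1 << 20)            /* 1 MB: fits comfortably in most last-level caches */
char src[N], dst[N];           /* globals, so the compiler cannot discard the copies */

int main(void) {
    for (int run = 0; run < 5; run++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        memcpy(dst, src, N);   /* the "same" operation every time... */
        clock_gettime(CLOCK_MONOTONIC, &b);
        long us = (b.tv_sec - a.tv_sec) * 1000000L +
                  (b.tv_nsec - a.tv_nsec) / 1000;
        printf("run %d: %ld us\n", run, us);   /* ...but run 0 is usually slowest */
    }
    return 0;
}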
Measuring Real Systems
Yet another problem: measurement tends to affect the thing you are trying to measure!
-
Measurement may perturb the system enough to make the problem you are trying to observe disappear.
-
Must separate real results from the noise added by the measurement itself (a sketch of measuring that noise floor follows this list).
-
Measurement overhead may limit your access to real systems.
-
Vendor: "No way am I running your instrumented binary. Your software is slow enough already!"
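One way to cope with that noise is to measure the measurement itself, as in this minimal sketch (assumes POSIX clock_gettime): two back-to-back clock reads with nothing in between establish the overhead the probe adds, and anything you "measure" near or below that floor is mostly the probe.

#include <stdio.h>
#include <time.h>

int main(void) {
    long min_ns = 1000000000L;
    for (int i = 0; i < 100000; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);    /* nothing happens in between */
        long ns = (b.tv_sec - a.tv_sec) * 1000000000L +
                  (b.tv_nsec - a.tv_nsec);
        if (ns < min_ns) min_ns = ns;          /* keep the best case as the floor */
    }
    printf("measurement floor: ~%ld ns per probe\n", min_ns);
    return 0;
}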
(Aside) Measuring Operating Systems
You can imagine that this is even more fraught given how central the operating system is to the operation of the computer itself.
-
Difficult to find the appropriate places to insert debugging hooks.
-
Operating systems can generate a lot of debugging output: imagine tracing every page fault. (And imagine how slow that system would be!)
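The usual compromise is to count instead of log. Here is a hedged user-space sketch of that idea (the function names are hypothetical, not a real kernel API): the hot path bumps a cheap counter, and a slow path reads it out occasionally.

#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong page_fault_count;   /* one counter instead of one log line per fault */

/* Imagine this called from the page-fault handler's hot path. */
static inline void count_page_fault(void) {
    atomic_fetch_add_explicit(&page_fault_count, 1, memory_order_relaxed);
}

/* Called rarely, e.g. when someone asks for statistics. */
static void report_page_faults(void) {
    printf("page faults so far: %lu\n",
           atomic_load_explicit(&page_fault_count, memory_order_relaxed));
}

int main(void) {
    for (int i = 0; i < 1000; i++)      /* stand-in for 1000 faults */
        count_page_fault();
    report_page_faults();
    return 0;
}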
Let’s Measure Something Else…
OK, benchmarking real systems seems real hard.
-
Build a model: abstract away all of the low-level details and reason analytically.
-
Build a simulator: write some additional code to perform a simplified simulation of more complex parts of the system—particularly hardware.
-
"I see equations." That’s a model.
-
"I see code." That’s a simulation.
Choosing the Right Virtual Reality
-
Pro (models): can make strong mathematical guarantees about system performance…
-
Con (models): …usually after making a bunch of unrealistic assumptions.
-
Pro (simulators): in the best case, experimental speedup outweighs the lack of hardware details…
-
Con (simulators): …and in the worst case, bugs in the simulator lead you in all sorts of wrong directions.
Measuring Your System: What?
-
Two disk drives?
-
Two scheduling algorithms?
-
Two page replacement algorithms?
-
Two file systems?
Benchmarks: Definitions
-
Microbenchmarks try to isolate one aspect of system performance.
-
Macrobenchmarks measure one operation involving many parts of the system working together.
-
Application benchmarks focus on the performance of the system as observed by one application.
Benchmarks: Examples
Let’s say we are interested in improving our virtual memory system. (Just a random choice.)
-
Time to handle a single page fault. (Micro enough? A rough measurement sketch follows this list.)
-
Time to look up page in the page table?
-
Time to choose a page to evict?
-
Aggregate time to handle page faults on a heavily-loaded system?
-
Page fault rate?
-
triplesort?
-
parallelvm?
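As one possible microbenchmark for the first question above, here is a rough sketch (assumes Linux: mmap, getrusage, clock_gettime, and 4 KB pages): touch every page of a fresh anonymous mapping once, then divide the elapsed time by the number of minor faults reported by getrusage to roughly estimate the cost of servicing one soft page fault.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <time.h>

int main(void) {
    size_t pages = 100000, psize = 4096;   /* assumed page size; use sysconf in real code */
    char *buf = mmap(NULL, pages * psize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    struct rusage before, after;
    struct timespec a, b;
    getrusage(RUSAGE_SELF, &before);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < pages; i++)
        buf[i * psize] = 1;                /* first touch faults each page in */
    clock_gettime(CLOCK_MONOTONIC, &b);
    getrusage(RUSAGE_SELF, &after);

    long faults = after.ru_minflt - before.ru_minflt;
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%ld minor faults, ~%.0f ns each (including loop overhead)\n",
           faults, faults ? ns / faults : 0.0);
    munmap(buf, pages * psize);
    return 0;
}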
Benchmarks: Problems
-
Tree, meet forest: a microbenchmark may not be studying the right thing!
-
Forest, meet trees: a macrobenchmark introduces many, many variables that can complicate analysis.
-
Who cares about your stupid application? Improvements for it may harm others!
Benchmark Bias
A bigger problem with benchmarking in practice: bias.
-
People choosing and running the benchmarks may be trying to justify some change by making their system look faster.
-
Alternatively, they may choose one benchmark and do a lot of work to improve its performance while ignoring other effects on the system.
Benchmarking Blues
This isn’t (all) just human folly. There’s a fundamental tension here.
-
The most useful system is a general-purpose system.
-
The fastest system is a single-purpose system.
What TO Do
-
Have a goal in mind more specific than "I want to make this blob of code faster." This helps choose measurement techniques and benchmarks.
-
Validate your models and simulator before you start changing things.
-
Do their results match your intuition? If not, something is wrong.
-
Do their results match reality? If not, something is really wrong.
-
Use modeling, simulation, and real experiments as appropriate.
-
If you can’t convince yourself analytically that a new approach is an improvement, don’t bother simulating.
-
If your simulator doesn’t show improvement, don’t bother implementing.
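A tiny sketch of what that validation habit can look like in code (the two hooks here are hypothetical stand-ins, not a real model or simulator): run both on a configuration the model can answer exactly, and complain loudly if they disagree.

#include <math.h>
#include <stdio.h>

/* Hypothetical hooks: in a real project these would invoke your actual
   analytical model and your actual simulator for the same configuration. */
static double model_eat(double p)    { return (1 - p) * 100e-9 + p * 5e-3; }
static double simulate_eat(double p) { return (1 - p) * 100e-9 + p * 5e-3; /* stand-in */ }

int main(void) {
    double p = 0.0001;
    double m = model_eat(p), s = simulate_eat(p);
    if (fabs(m - s) / m > 0.05)   /* more than 5% apart: something is wrong */
        printf("simulator disagrees with model: %.3g vs %.3g\n", s, m);
    else
        printf("simulator matches model within 5%%: %.3g vs %.3g\n", s, m);
    return 0;
}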