A Scheduling Story
Oracular Spectacular
-
Control flow is unpredictable.
-
Users are unpredictable.
-
What did the thread do recently? It will probably keep doing that.
Multi-Level Feedback Queues
-
Choose a scheduling quantum. This is the maximum amount of time any thread will be able to run at one time.
-
Establish some number of queues; each represents a level.
-
Threads from the highest-level queues are chosen first.
-
Then:
-
Choose and run a thread from the highest-level non-empty queue.
-
If the thread blocks or yields, promote it to a higher level queue.
-
If the thread must be preempted at the end of a quantum, demote it to a lower level queue.
-
-
CPU-bound threads? They descend to the depths.
-
I/O-bound threads? They rise to the heights!
-
Starvation! Threads trapped in the lower queues may never have a chance to run.
Establishing Priorities
Priorities are a scheduling abstraction that allows the user, or the system itself, to assign relative important to different tasks on the system.
-
Backup task: low priority.
-
Video encoding: low priority.
-
Video playback: high priority.
-
Interactive applications: medium priority.
Priority Starvation
Strict priorities can lead to starvation when low-priority threads are constantly blocked by high-priority threads with work to do.
-
Give each thread a number of tickets proportional to their priority.
-
Choose a ticket at random. The thread holding the ticket gets to run.
The Semi-Recent Recent History of the Linux Scheduler
The recent evolution of the Linux scheduler provides a fascinating view into:
-
the difficulties in supporting a huge and growing number of hardware platforms. (Servers, desktops, Android, embedded devices, etc.)
-
the tensions that arise in collaborative development.
-
the difficulties is steering large software projects fueled by volunteers.
-
the challenges of interactive benchmarking.
-
incredibly contributions made by hackers with day jobs.
Linux By The Numbers
-
9.2 million lines of code. (55% device drivers.)
-
2,400 developers.
-
New kernel every 2.75 months.
Linux Development Process
Big caveat: I am not speaking from experience!
-
Each file has a maintainer.
-
Each subsystem has a maintainer.
-
And then there’s Linus/Andrew.
Kernel Development
-
Linus, Andrew, and others maintain the mainline Linux kernel tree used to generate official releases.
-
Other developers may maintain their own trees to experiment with new features. Some code will eventually move into the mainline, other things may not.
-
(Git was built to facilitate this working model. And to not suck.)
Desktop v. Server
Just one example of tensions in a large community that now supports many devices.
-
Well-defined goals and well-established benchmarks.
-
Corporate clients, many who pay for support.
-
Regular contact with the Linux community.
-
Poorly-defined goals, a scarcity of good benchmarks, and few people savvy enough to run them.
-
Users too cheap to buy Windows.
-
Development? What’s that?
The Linux Kernel Mailing List
-
I tried reading it once. I had no idea what was going on 99% of the time.
-
This is an intimidating place to go for "help", and you probably won’t get it.
And, it’s sometimes a male-dominated mean environment:
Seriously, guys? Is this what we need in order to get improve -stable? Linus Torvalds is advocating for physical intimidation and violence. Ingo Molnar and Linus are advocating for verbal abuse.
Not fucking cool. Violence, whether it be physical intimidation, verbal threats or verbal abuse is not acceptable. Keep it professional on the mailing lists.
Let’s discuss this at Kernel Summit where we can at least yell at each other in person. Yeah, just try yelling at me about this. I’ll roar right back, louder, for all the people who lose their voice when they get yelled at by top maintainers. I won’t be the nice girl anymore.
Linux Scheduling Pre-2.6
-
Pre-2.6 Linux systems used a scheduler that scaled poorly, requiring O(n) time to schedule tasks where n is the number of runnable threads.
-
The Linux 2.6 scheduler aimed to fix this.
The Linux 2.6 O(1) Scheduler
-
For Linux 2.6, Ingo Molnar, the Linux kernel scheduler maintainer, implemented a new O(1) scheduler to address the scalability problems inherent to the earlier approach.
-
The O(1) scheduler combines a static and dynamic priority.
-
Static priority: set by the user or system using nice.
-
Dynamic priority: a potential "boost" to the static priority intended to reward interactive threads.
-
Problems With The O(1) Scheduler
-
The interactivity boost code got very complex and difficult to understand. It was not clear that it was doing the right thing.
-
Not surprisingly, the scheduler was also difficult to model, useful in order to convince yourself that it is doing "the right thing."
Enter Con Kolivas
-
An Australian anaesthetist turned Linux kernel programmer.
-
Con became interested in scheduling, and in particular scheduling for interactive workloads and systems.
Interactivity, what is it?
There has been a lot of talk about what makes up a nice feeling desktop under linux. It comes down to two different but intimately related parameters which are not well defined. We often use the terms responsiveness and interactivity in the same sentence, but I’d like to separate the two. As there is no formal definition I prefer to define them as such:
Responsiveness: The rate at which your workloads can proceed under different load conditions.
Interactivity: The scheduling latency and jitter present in tasks where the user would notice a palpable deterioration under different load conditions.
Responsiveness would allow you to continue using your machine without too much interruption to your work, whereas interactivity would allow you to play audio or video without any dropouts, or drag a gui window across the screen and have it render smoothly across the screen without jerks.
2004: Kolivas Releases The Rotating Staircase Scheduler
-
Removes 498 lines of "black magic" aimed at improving interactivity and replaces them with 200 lines of new code implementing a simple, rank-based approach.
-
Some similarities to MLFQ.
2007: Kolivas Releases The Rotating Staircase Deadline Scheduler
A starvation free, strict fairness O(1) scalable design with interactivity as good as the above restrictions can provide. There is no interactivity estimator, no sleep/run measurements and only simple fixed accounting. The design has strict enough a design and accounting that task behaviour can be modelled and maximum scheduling latencies can be predicted by the virtual deadline mechanism that manages runqueues. The prime concern in this design is to maintain fairness at all costs determined by nice level, yet to maintain as good interactivity as can be allowed within the constraints of strict fairness.
The Rotating Staircase Deadline Scheduler (RSDL)
(Complete description here.)
-
One parameter: the round-robin interval (
RR_INTERVAL
). -
One input: a thread priority.
-
Priority defines the levels at which each task can run.
-
High priority tasks: more levels, more chances to run.
-
Low priority tasks: fewer levels, fewer chances to run.
-
-
Tasks can run for at most a fixed amount of time per level.
-
Each level can also run for at most a fixed amount of time—ensures bounded latency.
RSDL
-
Put all threads in a queue determined by their priority.
-
Then:
-
Run threads from the highest non-empty queue round-robin
-
If a thread blocks or yields, it remains at that level
-
If a thread runs out of its quota, it moves to the next level down.
-
If the level runs out of its quota, all threads move to the next level down.
-
Continue until all quotas are exhausted or no threads are runnable, then restart another epoch.
-
Nice Features of RSDL
-
Can easily calculate how long it will be before a thread at a certain priority level runs given the number of runnable tasks in the system and their priorities. Easy to model.
-
Simple, fixed accounting. Scheduling can be done in O(1).
-
More recent versions use interleaving to further reduce the delay between tasks scheduling with different priorities:
Achieving Interactivity
This design relies on the fact that interactive tasks, by their nature, sleep often. Most fair scheduling designs end up penalising such tasks indirectly giving them less than their fair possible share because of the sleep, and have to use a mechanism of bonusing their priority to offset this based on the duration they sleep.
Sounds Great! So Now What?
-
Kolivas struggles to demonstrate that his scheduler is better than the native scheduler. (Remember about desktop benchmarks?)
-
Despite popularity with users, RSDL never leaves Con’s private -ck Linux kernel tree.
-
Simultaneously, Ingo Molnar (remember him?) develops and integrates a new Completely Fair Scheduler based on Kolivas' work.
-
Con withdraws from the Linux community, citing the difficulty in having multiple patches, including RSDL, accepted.
To Rule Them All
-
Interesting design question: does it make sense to rely on a single scheduler given the diversity of devices Linux runs on?
-
Kolivas proposed a pluggable scheduler architecture, parts of which were later adopted.
Moral?
-
Find something to do that’s fun.
-
Get good at it.
-
If you have good ideas, they will carry the day, even if your implementation does not.
…He’s Baaaccckkk!
-
In 2009 Kolivas returns with the Brain Fuck Scheduler, which continues to try and achieve interactivity without sacrificing simplicity.
BFS
BFS is the Brain Fuck Scheduler. It was designed to be forward looking only, make the most of lower spec machines, and not scale to massive hardware. ie it is a desktop orientated scheduler, with extremely low latencies for excellent interactivity by design rather than "calculated", with rigid fairness, nice priority distribution and extreme scalability within normal load levels.
Why Brain Fuck?
-
Because it throws out everything about what we know is good about how to design a modern scheduler in scalability.
-
Because it’s so ridiculously simple.
-
Because it performs so ridiculously well on what it’s good at despite being that simple.
-
Because it’s designed in such a way that mainline would never be interested in adopting it, which is how I like it.
-
Because it will make people sit up and take notice of where the problems are in the current design.
-
Because it throws out the philosophy that one scheduler fits all and shows that you can do a lot better with a scheduler designed for a particular purpose. I don’t want to use a steamroller to crack nuts
-
Because it actually means that more CPUs means better latencies.
-
Because I must be fucked in the head to be working on this again.