
A Scheduling Story

Oracular Spectacular

Normally we cannot predict the future:
  • Control flow is unpredictable.

  • Users are unpredictable.

Instead, use the past to predict the future.
  • What did the thread do recently? It will probably keep doing that.

Multi-Level Feedback Queues

  • Choose a scheduling quantum. This is the maximum amount of time any thread will be able to run at one time.

  • Establish some number of queues; each represents a level.

  • Threads from the highest-level queues are chosen first.

  • Then:

    1. Choose and run a thread from the highest-level non-empty queue.

    2. If the thread blocks or yields, promote it to a higher level queue.

    3. If the thread must be preempted at the end of a quantum, demote it to a lower level queue.
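A minimal sketch of this loop in C. Everything here (queue_t, thread_t, run_for_quantum(), and the other helpers) is a hypothetical stand-in for real runqueue and context-switch machinery, not code from any actual kernel; the point is only to make the promote/demote rules concrete.

/* Toy MLFQ core loop. All types and helpers below are hypothetical. */
#define NUM_LEVELS 8

enum stop_reason { BLOCKED, YIELDED, PREEMPTED, EXITED };

typedef struct thread thread_t;
typedef struct queue queue_t;

/* Assumed helpers (not real APIs): */
int  queue_empty(queue_t *q);
thread_t *queue_pop(queue_t *q);
void queue_push(queue_t *q, thread_t *t);
enum stop_reason run_for_quantum(thread_t *t);        /* run for <= one quantum */
void park_until_wakeup(thread_t *t, int wake_level);  /* for blocked threads */
void idle(void);

queue_t *levels[NUM_LEVELS];          /* levels[0] is the highest level */

void mlfq_schedule(void) {
    for (;;) {
        /* Choose the highest-level non-empty queue. */
        int level = 0;
        while (level < NUM_LEVELS && queue_empty(levels[level])) {
            level++;
        }
        if (level == NUM_LEVELS) {
            idle();                   /* nothing is runnable */
            continue;
        }

        thread_t *t = queue_pop(levels[level]);
        enum stop_reason why = run_for_quantum(t);

        switch (why) {
        case BLOCKED:
            /* Promote: when it wakes, it re-enters one level higher. */
            park_until_wakeup(t, level > 0 ? level - 1 : 0);
            break;
        case YIELDED:
            /* Promote: reward apparent interactivity. */
            queue_push(levels[level > 0 ? level - 1 : 0], t);
            break;
        case PREEMPTED:
            /* Demote: the thread burned its whole quantum. */
            queue_push(levels[level < NUM_LEVELS - 1 ? level + 1 : level], t);
            break;
        case EXITED:
            break;                    /* nothing more to do */
        }
    }
}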

(Figure: MLFQ example, animation frames 1–16.)
What happens to:
  • CPU-bound threads? They descend to the depths.

  • I/O-bound threads? They rise to the heights!

Can anyone spot any problems with this approach?
  • Starvation! Threads trapped in the lower queues may never have a chance to run.

One solution is to periodically rebalance the levels by tossing everyone back to the top level. (Seems like kind of a hack.)

Establishing Priorities

Priorities are a scheduling abstraction that allows the user, or the system itself, to assign relative importance to different tasks on the system.

  • Backup task: low priority.

  • Video encoding: low priority.

  • Video playback: high priority.

  • Interactive applications: medium priority.

Priorities are always relative, and so are host to a variety of problems.

Priority Starvation

Strict priorities can lead to starvation when low-priority threads are constantly blocked by high-priority threads with work to do.

One solution: lottery scheduling:
  1. Give each thread a number of tickets proportional to their priority.

  2. Choose a ticket at random. The thread holding the ticket gets to run.
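A minimal sketch of the draw in C, assuming each thread carries a ticket count proportional to its priority; the thread_t list and field names here are illustrative, not taken from a real scheduler.

#include <stddef.h>
#include <stdlib.h>

/* Toy lottery draw over a linked list of runnable threads. */
typedef struct thread {
    int tickets;                 /* proportional to the thread's priority */
    struct thread *next;
} thread_t;

thread_t *lottery_pick(thread_t *runnable, int total_tickets) {
    if (runnable == NULL || total_tickets <= 0) {
        return NULL;
    }
    int winner = rand() % total_tickets;   /* draw a winning ticket */
    int counted = 0;
    for (thread_t *t = runnable; t != NULL; t = t->next) {
        counted += t->tickets;
        if (winner < counted) {
            return t;            /* this thread holds the winning ticket */
        }
    }
    return NULL;                 /* unreachable if total_tickets is correct */
}

Over many draws each thread's share of the CPU converges to its share of the tickets, so even low-priority threads run occasionally rather than starving.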

Priorities may also be used to determine how long threads are allowed to run, i.e., by dynamically adjusting their time quantum.

The Semi-Recent History of the Linux Scheduler

The recent evolution of the Linux scheduler provides a fascinating view into:

  • the difficulties in supporting a huge and growing number of hardware platforms. (Servers, desktops, Android, embedded devices, etc.)

  • the tensions that arise in collaborative development.

  • the difficulties in steering large software projects fueled by volunteers.

  • the challenges of interactive benchmarking.

  • incredible contributions made by hackers with day jobs.

Linux By The Numbers

  • 9.2 million lines of code. (55% device drivers.)

  • 2,400 developers.

  • New kernel every 2.75 months.

Linux Development Process

Big caveat: I am not speaking from experience!

  • Each file has a maintainer.

  • Each subsystem has a maintainer.

  • And then there’s Linus/Andrew.

Kernel Development

  • Linus, Andrew, and others maintain the mainline Linux kernel tree used to generate official releases.

  • Other developers may maintain their own trees to experiment with new features. Some code will eventually move into the mainline, other things may not.

  • (Git was built to facilitate this working model. And to not suck.)

Desktop v. Server

Just one example of tensions in a large community that now supports many devices.

Server Linux:
  • Well-defined goals and well-established benchmarks.

  • Corporate clients, many of whom pay for support.

  • Regular contact with the Linux community.

Desktop Linux:
  • Poorly-defined goals, a scarcity of good benchmarks, and few people savvy enough to run them.

  • Users too cheap to buy Windows.

  • Development? What’s that?

The Linux Kernel Mailing List

  • I tried reading it once. I had no idea what was going on 99% of the time.

  • This is an intimidating place to go for "help", and you probably won’t get it.

And it’s sometimes a male-dominated, mean environment:

Seriously, guys? Is this what we need in order to get improve -stable? Linus Torvalds is advocating for physical intimidation and violence. Ingo Molnar and Linus are advocating for verbal abuse.

Not fucking cool. Violence, whether it be physical intimidation, verbal threats or verbal abuse is not acceptable. Keep it professional on the mailing lists.

Let’s discuss this at Kernel Summit where we can at least yell at each other in person. Yeah, just try yelling at me about this. I’ll roar right back, louder, for all the people who lose their voice when they get yelled at by top maintainers. I won’t be the nice girl anymore.

— Sarah Sharp

Linux Scheduling Pre-2.6

  • Pre-2.6 Linux systems used a scheduler that scaled poorly, requiring O(n) time to schedule tasks where n is the number of runnable threads.

  • The Linux 2.6 scheduler aimed to fix this.
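A simplified sketch, in the spirit of the pre-2.6 scheduler, of why each decision costs O(n): every runnable task is scored on every pass. The task_t and goodness() definitions below are illustrative stand-ins, not the actual 2.4 kernel code.

#include <stddef.h>

/* Illustrative types and scoring only; not the real kernel's. */
typedef struct task {
    int static_prio;             /* e.g., derived from nice */
    int time_slice;              /* time remaining in this epoch */
    struct task *next;
} task_t;

static int goodness(const task_t *t) {
    /* Stand-in heuristic: favor tasks with time slice left and a
     * higher static priority. */
    return t->time_slice + t->static_prio;
}

task_t *pick_next_task(task_t *runqueue) {
    task_t *best = NULL;
    int best_score = -1;

    /* The whole runnable list is scanned on every scheduling decision,
     * so the cost grows linearly with the number of runnable tasks. */
    for (task_t *t = runqueue; t != NULL; t = t->next) {
        int score = goodness(t);
        if (score > best_score) {
            best_score = score;
            best = t;
        }
    }
    return best;
}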

The Linux 2.6 O(1) Scheduler

  • For Linux 2.6, Ingo Molnar, the Linux kernel scheduler maintainer, implemented a new O(1) scheduler to address the scalability problems inherent to the earlier approach.

  • The O(1) scheduler combines a static and dynamic priority.

    • Static priority: set by the user or system using nice.

    • Dynamic priority: a potential "boost" to the static priority intended to reward interactive threads.
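A rough sketch of the static-plus-dynamic idea in C. The sleep-versus-run heuristic and the ±5 bonus range below are illustrative, in the spirit of the O(1) scheduler rather than its actual code.

/* Illustrative only: combine a static priority (lower is better, set via
 * nice) with a dynamic bonus earned by sleeping. The constants and the
 * heuristic are stand-ins, not the O(1) scheduler's real logic. */
#define MAX_BONUS 5

static int interactivity_bonus(long sleep_ms, long run_ms) {
    long total = sleep_ms + run_ms;
    if (total == 0) {
        return 0;
    }
    /* Mostly-sleeping threads look interactive and earn a boost;
     * mostly-running threads earn a penalty. */
    return (int)((2 * MAX_BONUS * sleep_ms) / total) - MAX_BONUS;
}

static int effective_priority(int static_prio, long sleep_ms, long run_ms) {
    return static_prio - interactivity_bonus(sleep_ms, run_ms);
}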

Problems With The O(1) Scheduler

  • The interactivity boost code got very complex and difficult to understand. It was not clear that it was doing the right thing.

  • Not surprisingly, the scheduler was also difficult to model, and modeling is useful when trying to convince yourself that a scheduler is doing "the right thing."

Enter Con Kolivas

  • An Australian anaesthetist turned Linux kernel programmer.

  • Con became interested in scheduling, and in particular scheduling for interactive workloads and systems.

Figure 2. Con Kolivas

Interactivity, what is it?

There has been a lot of talk about what makes up a nice feeling desktop under linux. It comes down to two different but intimately related parameters which are not well defined. We often use the terms responsiveness and interactivity in the same sentence, but I’d like to separate the two. As there is no formal definition I prefer to define them as such:

Responsiveness: The rate at which your workloads can proceed under different load conditions.

Interactivity: The scheduling latency and jitter present in tasks where the user would notice a palpable deterioration under different load conditions.

Responsiveness would allow you to continue using your machine without too much interruption to your work, whereas interactivity would allow you to play audio or video without any dropouts, or drag a gui window across the screen and have it render smoothly across the screen without jerks.

— Con Kolivas

2004: Kolivas Releases The Rotating Staircase Scheduler

  • Removes 498 lines of "black magic" aimed at improving interactivity and replaces them with 200 lines of new code implementing a simple, rank-based approach.

  • Some similarities to MLFQ.

2007: Kolivas Releases The Rotating Staircase Deadline Scheduler

A starvation free, strict fairness O(1) scalable design with interactivity as good as the above restrictions can provide. There is no interactivity estimator, no sleep/run measurements and only simple fixed accounting. The design has strict enough a design and accounting that task behaviour can be modelled and maximum scheduling latencies can be predicted by the virtual deadline mechanism that manages runqueues. The prime concern in this design is to maintain fairness at all costs determined by nice level, yet to maintain as good interactivity as can be allowed within the constraints of strict fairness.

— Con Kolivas

The Rotating Staircase Deadline Scheduler (RSDL)

(Complete description here.)

  • One parameter: the round-robin interval (RR_INTERVAL).

  • One input: a thread priority.

  • Priority defines the levels at which each task can run.

    • High priority tasks: more levels, more chances to run.

    • Low priority tasks: fewer levels, fewer chances to run.

  • Tasks can run for at most a fixed amount of time per level.

  • Each level can also run for at most a fixed amount of time, which ensures bounded latency.

RSDL

To begin a scheduling epoch:
  • Put all threads in a queue determined by their priority.

  • Then:

    1. Run threads from the highest non-empty queue round-robin.

    2. If a thread blocks or yields, it remains at that level.

    3. If a thread runs out of its quota, it moves to the next level down.

    4. If the level runs out of its quota, all threads move to the next level down.

    5. Continue until all quotas are exhausted or no threads are runnable, then restart another epoch.
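A condensed sketch of one epoch in C, following the rules above. The queue helpers, quota bookkeeping, and constants are hypothetical; Kolivas' actual implementation uses per-CPU runqueues and priority bitmaps, none of which appear here.

/* Toy sketch of one RSDL epoch. All types and helpers are hypothetical. */
#define NUM_LEVELS  40                /* one queue per priority level */
#define RR_INTERVAL  6                /* ms: the single tunable */

typedef struct thread thread_t;
typedef struct queue queue_t;

/* Assumed helpers (not real APIs): */
int  queue_empty(queue_t *q);
thread_t *queue_pop(queue_t *q);
void queue_push(queue_t *q, thread_t *t);
int  run_for(thread_t *t, int max_ms);     /* returns ms actually run */
int  blocked_or_yielded(thread_t *t);
int  per_level_quota(thread_t *t);         /* per-thread quota per level */
int  thread_quota_left(thread_t *t);
void set_thread_quota(thread_t *t, int ms);
void charge_thread(thread_t *t, int ms);

typedef struct {
    queue_t *threads;                 /* runnable threads at this level */
    int quota_ms;                     /* time left for the whole level */
} level_t;

void rsdl_epoch(level_t levels[NUM_LEVELS]) {
    for (int level = 0; level < NUM_LEVELS; level++) {
        level_t *lq = &levels[level];

        /* Round-robin within the highest non-empty level until either
         * the level's quota or its threads are exhausted. */
        while (!queue_empty(lq->threads) && lq->quota_ms > 0) {
            thread_t *t = queue_pop(lq->threads);

            int budget = thread_quota_left(t);
            int ran = run_for(t, budget < RR_INTERVAL ? budget : RR_INTERVAL);
            charge_thread(t, ran);
            lq->quota_ms -= ran;

            if (blocked_or_yielded(t)) {
                continue;             /* stays at this level; re-queued on wake */
            }
            if (thread_quota_left(t) <= 0 && level + 1 < NUM_LEVELS) {
                /* Thread quota spent: drop one level with a fresh quota. */
                set_thread_quota(t, per_level_quota(t));
                queue_push(levels[level + 1].threads, t);
            } else {
                queue_push(lq->threads, t);
            }
        }

        /* Level quota spent: everyone still here drops one level too. */
        while (!queue_empty(lq->threads) && level + 1 < NUM_LEVELS) {
            queue_push(levels[level + 1].threads, queue_pop(lq->threads));
        }
    }
    /* All quotas are exhausted (or nothing is runnable): the caller
     * refills quotas and starts a new epoch with every thread back at
     * the level set by its priority. */
}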

(Figure: RSDL example, animation frames 1–38.)

Nice Features of RSDL

  • Can easily calculate how long it will be before a thread at a certain priority level runs given the number of runnable tasks in the system and their priorities. Easy to model.

  • Simple, fixed accounting. Scheduling can be done in O(1).

  • More recent versions use interleaving to further reduce the delay between scheduling tasks with different priorities:

PRIORITY:0..................20.................39
nice -20 0000000000000000000000000000000000000000
nice -10 1000100010001000100010001000100010010000
nice   0 1010101010101010101010101010101010101010
nice   5 1011010110110101101101011011010110110110
nice  10 1110111011101110111011101110111011101110
nice  15 1111111011111110111111101111111011111110
nice  19 1111111111111111111111111111111111111110

Achieving Interactivity

This design relies on the fact that interactive tasks, by their nature, sleep often. Most fair scheduling designs end up penalising such tasks indirectly giving them less than their fair possible share because of the sleep, and have to use a mechanism of bonusing their priority to offset this based on the duration they sleep.

— Con Kolivas

Sounds Great! So Now What?

  • Kolivas struggles to demonstrate that his scheduler is better than the native scheduler. (Remember the problem with desktop benchmarks?)

  • Despite popularity with users, RSDL never leaves Con’s private -ck Linux kernel tree.

  • Simultaneously, Ingo Molnar (remember him?) develops and integrates a new Completely Fair Scheduler based on Kolivas' work.

  • Con withdraws from the Linux community, citing the difficulty in having multiple patches, including RSDL, accepted.

To Rule Them All

  • Interesting design question: does it make sense to rely on a single scheduler given the diversity of devices Linux runs on?

  • Kolivas proposed a pluggable scheduler architecture, parts of which were later adopted.

Moral?

  • Find something to do that’s fun.

  • Get good at it.

  • If you have good ideas, they will carry the day, even if your implementation does not.

…He’s Baaaccckkk!

  • In 2009 Kolivas returns with the Brain Fuck Scheduler, which continues to try and achieve interactivity without sacrificing simplicity.

BFS

BFS is the Brain Fuck Scheduler. It was designed to be forward looking only, make the most of lower spec machines, and not scale to massive hardware. ie it is a desktop orientated scheduler, with extremely low latencies for excellent interactivity by design rather than "calculated", with rigid fairness, nice priority distribution and extreme scalability within normal load levels.

— Con Kolivas

Why Brain Fuck?

  • Because it throws out everything about what we know is good about how to design a modern scheduler in scalability.

  • Because it’s so ridiculously simple.

  • Because it performs so ridiculously well on what it’s good at despite being that simple.

  • Because it’s designed in such a way that mainline would never be interested in adopting it, which is how I like it.

  • Because it will make people sit up and take notice of where the problems are in the current design.

  • Because it throws out the philosophy that one scheduler fits all and shows that you can do a lot better with a scheduler designed for a particular purpose. I don’t want to use a steamroller to crack nuts

  • Because it actually means that more CPUs means better latencies.

  • Because I must be fucked in the head to be working on this again.

