File System Caching and Consistency

Making File Systems Fast

How do we make a big slow thing look faster?

Use a cache! (Put a smaller, faster thing in front of it.)

In the case of the file system the smaller, faster thing is memory. We call the memory used to cache file system data the buffer cache.

Putting Spare Memory To Work

Operating systems use memory:

as memory (duh), but also
to cache file data in order to improve performance.

These two uses of memory compete with each other.

Big buffer cache, small main memory: file access is fast, but potential thrashing in the memory subsystem…
Small buffer cache, large main memory: little swapping occurs but file access is extremely slow.

On Linux, the vm.swappiness kernel parameter controls how aggressively the kernel swaps out process memory pages instead of shrinking its file caches, and hence the balance between process memory and the buffer cache.
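
As a quick illustration, the current setting can be read from procfs on a Linux machine. This is a minimal sketch: the helper name read_swappiness is made up for this example, and it returns None on systems that do not expose /proc/sys/vm/swappiness.

```python
from pathlib import Path

# Location of the setting in Linux procfs.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")

def read_swappiness(path=SWAPPINESS_PATH):
    """Return the swappiness value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no procfs, or unexpected file contents.
        return None
```

Calling read_swappiness() on a typical Linux system returns a small integer; higher values make the kernel more willing to swap process pages out in favor of cached file data.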

Where To Put the Buffer Cache?

Above the File System

What do we cache?

Entire files and directories!

What is the buffer cache interface?

open, close, read, write. (Same as the file system call interface.)

Above the File System: Operations

open

Pass down to underlying file system.

read

If file is not in the buffer cache, pass down to underlying file system and load contents into the buffer cache.
If the file is in the cache, return the cached contents.

write

If file is not in the buffer cache, load contents into the buffer cache and then modify them.
If the file is in the cache, modify the cached contents.

close

Remove from the cache (if necessary) and flush contents through the file system.
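
The operations above can be sketched as a small whole-file cache. This is an illustration only: ToyFileSystem and WholeFileCache are hypothetical names, the "file system" is just a dictionary, and open is left out because it simply passes through.

```python
class ToyFileSystem:
    """Stand-in for the underlying file system: maps file names to bytes."""
    def __init__(self):
        self.files = {}
        self.reads = 0    # how many times the cache had to go below
        self.writes = 0

    def read_file(self, name):
        self.reads += 1
        return self.files.get(name, b"")

    def write_file(self, name, data):
        self.writes += 1
        self.files[name] = data


class WholeFileCache:
    """Caches entire files in front of a ToyFileSystem."""
    def __init__(self, fs):
        self.fs = fs
        self.cached = {}     # name -> bytearray holding the whole file
        self.dirty = set()   # names modified since they were loaded

    def _load(self, name):
        # Miss: pass down to the file system and keep the contents.
        if name not in self.cached:
            self.cached[name] = bytearray(self.fs.read_file(name))
        return self.cached[name]

    def read(self, name):
        return bytes(self._load(name))

    def write(self, name, offset, data):
        buf = self._load(name)
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # pad a hole with zeros
        buf[offset:offset + len(data)] = data
        self.dirty.add(name)

    def close(self, name):
        # Flush modified contents through the file system, then drop the entry.
        if name in self.dirty:
            self.fs.write_file(name, bytes(self.cached[name]))
            self.dirty.discard(name)
        self.cached.pop(name, None)
```

Note that between write and close the underlying file system never sees the modification, which is exactly the visibility problem listed under the cons.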

Above the File System: Pros and Cons

Pros:

Buffer cache sees file operations, may lead to better prediction or performance.

Cons:

Hides many file operations from the file system, preventing it from providing consistency guarantees.
Can’t cache file system metadata: inodes, superblocks, etc.

Below the File System

What do we cache?

Disk blocks!

What is the buffer cache interface?

readblock, writeblock. (Same as the disk interface.)
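
A block-level cache with this interface might look like the following sketch. The names ToyDisk and BlockCache, the 512-byte block size, and the least-recently-used eviction policy are all assumptions made for the example; the slides do not prescribe any of them.

```python
from collections import OrderedDict

BLOCK_SIZE = 512  # assumed block size for this sketch


class ToyDisk:
    """In-memory stand-in for a disk that counts its reads and writes."""
    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.reads = 0
        self.writes = 0

    def readblock(self, n):
        self.reads += 1
        return self.blocks[n]

    def writeblock(self, n, data):
        assert len(data) == BLOCK_SIZE
        self.writes += 1
        self.blocks[n] = bytes(data)


class BlockCache:
    """Same readblock/writeblock interface as the disk, with LRU eviction."""
    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.entries = OrderedDict()  # block number -> [data, dirty flag]

    def _insert(self, n, data, dirty):
        self.entries[n] = [data, dirty]
        self.entries.move_to_end(n)  # mark as most recently used
        while len(self.entries) > self.capacity:
            victim, (vdata, vdirty) = self.entries.popitem(last=False)
            if vdirty:
                # Modified blocks must reach the disk before being dropped.
                self.disk.writeblock(victim, vdata)

    def readblock(self, n):
        if n in self.entries:
            self.entries.move_to_end(n)
            return self.entries[n][0]
        data = self.disk.readblock(n)
        self._insert(n, data, dirty=False)
        return data

    def writeblock(self, n, data):
        assert len(data) == BLOCK_SIZE
        self._insert(n, bytes(data), dirty=True)
```

Because the cache only ever sees block numbers, it treats an inode block, a bitmap block, and a file data block identically.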

Below the File System: Pros and Cons

Pros:

Can cache all blocks including file system data structures, inodes, superblocks, etc.
Allows file system to see all file operations even if they eventually hit the cache.

Cons:

Cannot observe file semantics or relationships.

This is what modern operating systems do.

Caching and Consistency

How can the cache cause consistency problems?

Objects in the cache are lost on failures!

Remember: almost every file system operation involves modifying multiple disk blocks.

Example of creating a new file in an existing directory:

Allocate an inode, mark the used inode bitmap.
Allocate data blocks, mark the used data block bitmap.
Associate data blocks with the file by modifying the inode.
Add inode to the given directory by modifying the directory file.
Write data blocks.

How Caching Exacerbates Consistency

Observation: file system operations that modify multiple blocks may leave the file system in an inconsistent state if partially completed.

How does caching exacerbate this situation?

May increase the time span between when the first write of the operation hits the disk and the last is completed.

What Can Go Wrong?

What kinds of inconsistency can arise if the system is interrupted between the multiple disk writes needed to complete an operation such as the file creation above?

Allocate an inode, mark the used inode bitmap. Inode incorrectly marked in use.
Allocate data blocks, mark the used data block bitmap. Data blocks incorrectly marked in use.
Associate data blocks with the file by modifying the inode. Dangling file not present in any directory.
Add inode to the given directory by modifying the directory file.
Write data blocks. Data loss!
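
To make these failure modes concrete, here is a toy simulation that performs the five creation steps one at a time, stops after a chosen number of them, and then checks the result. Everything in it is an assumption for illustration: the dictionary layout, the file name notes.txt, and the function names new_fs, create_file, check, and crash_after.

```python
from itertools import islice

def new_fs():
    """Toy on-disk state with one existing directory (inode 33)."""
    return {
        "inode_bitmap": {33},   # inodes marked in use
        "block_bitmap": set(),  # data blocks marked in use
        "inodes": {33: []},     # inode number -> list of data blocks
        "dirs": {33: {}},       # directory inode -> {name: inode}
        "data": {},             # block number -> contents
    }

def create_file(fs, inode, blocks, dir_inode, name):
    """Generator: each yield marks one completed on-disk update."""
    fs["inode_bitmap"].add(inode)
    yield "mark inode bitmap"
    fs["block_bitmap"].update(blocks)
    yield "mark data block bitmap"
    fs["inodes"][inode] = list(blocks)
    yield "write inode"
    fs["dirs"][dir_inode][name] = inode
    yield "update directory"
    for b in blocks:
        fs["data"][b] = b"payload"
    yield "write data blocks"

def check(fs):
    """Return a list of inconsistencies found in the toy state."""
    problems = []
    for i in sorted(fs["inode_bitmap"] - set(fs["inodes"])):
        problems.append(f"inode {i} marked in use but never written")
    referenced = {b for blocks in fs["inodes"].values() for b in blocks}
    for b in sorted(fs["block_bitmap"] - referenced):
        problems.append(f"block {b} marked in use but owned by no inode")
    linked = {i for d in fs["dirs"].values() for i in d.values()}
    for i in sorted(set(fs["inodes"]) - linked - set(fs["dirs"])):
        problems.append(f"inode {i} is in no directory")
    for b in sorted(referenced - set(fs["data"])):
        problems.append(f"block {b} has no data")
    return problems

def crash_after(steps):
    """Run only the first `steps` updates of a file creation, then check."""
    fs = new_fs()
    list(islice(create_file(fs, 567, [5, 87, 98], 33, "notes.txt"), steps))
    return check(fs)
```

For example, crash_after(1) reports an inode marked in use but never written, crash_after(3) reports an inode that is in no directory, and crash_after(5) reports nothing because every step completed.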

Maintaining File System Consistency

What’s the safest approach?

Don’t buffer writes!
We call this a write-through cache because every write goes straight through the cache to the disk.

What’s the most dangerous approach?

Buffer all operations until blocks are evicted.
We call this a write-back cache.

Which approach is better for

performance?
safety?
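
One way to explore that trade-off is a small simulation that counts disk writes and reports what a crash would lose under each policy. This is a sketch under simple assumptions: the disk is a dictionary, and the class name PolicyCache and its crash method are invented for the example.

```python
class PolicyCache:
    """Block cache with a selectable write policy."""
    def __init__(self, disk, write_through):
        self.disk = disk                  # dict: block number -> bytes
        self.write_through = write_through
        self.cache = {}
        self.dirty = set()                # blocks newer in memory than on disk
        self.disk_writes = 0

    def _to_disk(self, n):
        self.disk[n] = self.cache[n]
        self.disk_writes += 1

    def write(self, n, data):
        self.cache[n] = data
        if self.write_through:
            self._to_disk(n)              # every write reaches the disk now
        else:
            self.dirty.add(n)             # write-back: defer until flush

    def flush(self):
        for n in sorted(self.dirty):
            self._to_disk(n)
        self.dirty.clear()

    def crash(self):
        """Simulate power loss: memory vanishes. Returns the blocks whose
        latest contents never reached the disk."""
        lost = sorted(self.dirty)
        self.cache.clear()
        self.dirty.clear()
        return lost


def overwrite_three_times(write_through):
    cache = PolicyCache({}, write_through)
    for version in (b"v1", b"v2", b"v3"):
        cache.write(7, version)
    writes = cache.disk_writes
    return writes, cache.crash()
```

Overwriting the same block three times costs three disk writes and loses nothing with write-through, while write-back performs no disk writes before the crash and loses block 7.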

What about a middle ground?

Write important file system metadata structures (superblock, inode maps, bitmaps, etc.) immediately, but delay data writes.

File systems also give user processes some control through sync (sync the entire file system) and fsync (sync one file).
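
From a user process these calls look roughly like the sketch below, which uses Python's os.fsync and os.sync wrappers. The helper names write_durably and sync_everything are made up for the example.

```python
import os

def write_durably(path, data):
    """Write bytes to path and request that they reach stable storage."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's buffer, then the OS buffer cache
        f.flush()              # push Python's userspace buffer down to the OS
        os.fsync(f.fileno())   # fsync: ask the OS to write out this one file

def sync_everything():
    """Flush all dirty buffers system-wide, where the platform supports it."""
    if hasattr(os, "sync"):    # os.sync() is only available on Unix
        os.sync()
```

Without the fsync call, a successful write only means the data reached the buffer cache; it may still be lost if the machine fails before the block is flushed.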

Another Approach to Consistency

What’s not atomic? Writing multiple disk blocks.
What is atomic? Writing one disk block.

Journaling

Track pending changes to the file system in a special area on disk called the journal.
Following a failure, replay the journal to bring the file system back to a consistent state.

Creation example:

Dear Journal, here’s what I’m going to do today:

Allocate inode 567 for a new file.

Associate data blocks 5, 87, and 98 with inode 567.

Add inode 567 to the directory with inode 33.

That’s it!
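
In code, a journal entry like this one could be appended as a bracketed group of records. The sketch below is only one possible encoding: the JSON-lines file format, the begin/op/commit record types, and the function name append_transaction are assumptions, not the layout of any real file system's journal.

```python
import json
import os

def append_transaction(journal_path, txid, ops):
    """Append one transaction to the journal file and force it to disk.

    The "begin" record plays the role of "Dear Journal, here's what I'm
    going to do today" and the "commit" record the role of "That's it!".
    """
    records = [{"type": "begin", "tx": txid}]
    records += [{"type": "op", "tx": txid, "op": op} for op in ops]
    records.append({"type": "commit", "tx": txid})
    with open(journal_path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())   # the journal itself must not sit in the cache

def creation_ops():
    """The three updates from the creation example."""
    return [
        {"action": "alloc_inode", "inode": 567},
        {"action": "set_blocks", "inode": 567, "blocks": [5, 87, 98]},
        {"action": "link", "dir_inode": 33, "inode": 567},
    ]
```

Calling append_transaction("journal.log", 1, creation_ops()) would add five lines to the file: one begin record, three operations, and one commit record.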

Journaling: Checkpoints

What happens when we flush cached data to disk?

Update the journal!
This is called a checkpoint.

Dear Journal, here’s what I’m going to do today:

Allocate inode 567 for a new file.

Associate data blocks 5, 87, and 98 with inode 567.

Add inode 567 to the directory with inode 33.

That’s it!

Dear Journal, I already did everything mentioned above! Checkpoint!

Journaling: Recovery

What happens on recovery?

Start at the last checkpoint and work forward, updating on-disk structures as needed.

Dear Journal, I already did everything mentioned above! Checkpoint!

Dear Journal, here’s what I’m going to do today:

Allocate inode 567 for a new file. Did this already!

Associate data blocks 5, 87, and 98 with inode 567. Didn’t do this… OK, done!

Add inode 567 to the directory with inode 33. Didn’t do this either! OK, done.

That’s it! All caught up!

What about incomplete journal entries?

These are ignored, because replaying them could leave the file system in an inconsistent state.

What would happen if we processed the following incomplete journal entry?

Dear Journal, here’s what I’m going to do today:

Allocate inode 567 for a new file.

Associate data blocks 5, 87, and 98 with inode 567.
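
The recovery rules can be sketched in a few lines: find the last checkpoint, replay every entry after it that was closed with a commit, and skip anything left open. The tuple-based journal format, the dictionary file system, and the names apply_op and recover are assumptions made for this illustration.

```python
def apply_op(fs, op):
    """Apply one journaled update. Each update is idempotent, so redoing
    something that already reached the disk is harmless."""
    kind = op[0]
    if kind == "alloc_inode":
        fs["inode_bitmap"].add(op[1])
    elif kind == "set_blocks":
        _, inode, blocks = op
        fs["block_bitmap"].update(blocks)
        fs["inodes"][inode] = list(blocks)
    elif kind == "link":
        _, dir_inode, name, inode = op
        fs["dirs"].setdefault(dir_inode, {})[name] = inode

def recover(journal, fs):
    """Replay committed transactions recorded after the last checkpoint.

    Returns (replayed transaction ids, ignored incomplete transaction ids).
    """
    start = 0
    for i, rec in enumerate(journal):
        if rec[0] == "checkpoint":
            start = i + 1          # everything before this is already on disk
    pending = {}                   # txid -> ops seen so far
    replayed = []
    for rec in journal[start:]:
        if rec[0] == "begin":
            pending[rec[1]] = []
        elif rec[0] == "op":
            pending.setdefault(rec[1], []).append(rec[2])
        elif rec[0] == "commit":
            for op in pending.pop(rec[1], []):
                apply_op(fs, op)
            replayed.append(rec[1])
    return replayed, sorted(pending)   # whatever is still pending never committed

journal = [
    ("begin", 0), ("op", 0, ("alloc_inode", 12)), ("commit", 0),
    ("checkpoint",),
    ("begin", 1),
    ("op", 1, ("alloc_inode", 567)),
    ("op", 1, ("set_blocks", 567, [5, 87, 98])),
    ("op", 1, ("link", 33, "notes.txt", 567)),
    ("commit", 1),
    ("begin", 2),                  # interrupted: no commit record follows
    ("op", 2, ("alloc_inode", 568)),
    ("op", 2, ("set_blocks", 568, [9])),
]
# State found on disk after the crash: inode 567 was allocated, nothing else.
fs = {"inode_bitmap": {12, 33, 567}, "block_bitmap": set(),
      "inodes": {33: []}, "dirs": {33: {}}}
replayed, ignored = recover(journal, fs)
```

Here transaction 1 is replayed in full and transaction 2 is ignored. Replaying transaction 2 anyway would allocate inode 568 and its data block without ever adding the file to a directory.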

Journaling: Implications

Observation: metadata updates (allocate inode, free data block, add to directory, etc.) can be represented compactly and probably written to the journal atomically.

What about data blocks themselves changed by write()?

We could include them in the journal, meaning that each data block would potentially be written twice (ugh).
We could exclude them from the journal, meaning that file system structures are protected but file data is not.
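
A rough count makes the cost of the first option concrete. The sketch below assumes, purely for illustration, that each journaled block occupies one journal block and is later written once more in place; real journals batch and pack records differently, and the function name blocks_written is hypothetical.

```python
def blocks_written(data_blocks, metadata_blocks, journal_data):
    """Count block writes for one operation under a simplified journaling model."""
    in_place = data_blocks + metadata_blocks           # final home of every block
    journaled = metadata_blocks + (data_blocks if journal_data else 0)
    return in_place + journaled

# A write touching 100 data blocks and 3 metadata blocks:
full = blocks_written(100, 3, journal_data=True)        # 206 block writes
meta_only = blocks_written(100, 3, journal_data=False)  # 106 block writes
```

Under these assumptions, journaling the data nearly doubles the write traffic, while journaling only metadata adds a few blocks per operation.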