Today
-
Caching and consistency.
-
Journaling.
The Next Few Classes
Moving forward we’ll have a mixture of traditional lectures and meetings where we review an operating systems research paper.
-
Today: caching and consistency.
-
Friday: FFS and LFS.
-
Monday: RAID. Please read the paper and look at the online resources posted on Discourse early next week.
Making File Systems Fast
-
Use a cache! (Put a smaller, faster thing in front of it.)
In the case of the file system the smaller, faster thing is memory. We call the memory used to cache file system data the buffer cache.
Putting Spare Memory To Work
-
as memory (duh), but also
-
to cache file data in order to improve performance.
-
Big buffer cache, small main memory: file access is fast, but potential thrashing in the memory subsystem…
-
Small buffer cache, large main memory: little swapping occurs but file access is extremely slow.
On Linux the swappiness kernel parameter controls how aggressively the operating system prunes unused process memory pages and hence the balance between memory and buffer cache.
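A minimal sketch of inspecting the current setting, assuming the standard Linux procfs location /proc/sys/vm/swappiness. The helper name read_swappiness is made up for illustration, and it returns None on systems where the file does not exist.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the tunable (absent on other systems).
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path: Path = SWAPPINESS_PATH) -> Optional[int]:
    """Return the current swappiness value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```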
Where To Put the Buffer Cache?
Above the File System
-
Entire files and directories!
-
open, close, read, write. (Same as the file system call interface.)
Above the File System: Operations
open
-
Pass down to underlying file system.
read
-
If file is not in the buffer cache, pass down to underlying file system and load contents into the buffer cache.
-
If the file is in the cache, return the cached contents.
write
-
If the file is not in the buffer cache, load its contents into the buffer cache and then modify them.
-
If the file is in the cache, modify the cached contents.
close
-
Remove from the cache (if necessary) and flush contents through the file system.
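The four operations above can be sketched as a whole-file cache. This is only an illustration: the class name FileLevelCache is hypothetical, and the underlying "file system" is assumed to be a plain dict mapping file names to bytes.

```python
class FileLevelCache:
    """Whole-file cache sitting above a toy file system (a dict)."""

    def __init__(self, fs):
        self.fs = fs          # hypothetical underlying file system
        self.cache = {}       # name -> bytearray of cached contents
        self.dirty = set()    # names modified since they were loaded

    def open(self, name):
        # Pass down to the underlying file system (create if missing).
        self.fs.setdefault(name, b"")

    def _load(self, name):
        if name not in self.cache:
            self.cache[name] = bytearray(self.fs[name])
        return self.cache[name]

    def read(self, name):
        # Miss: load from the file system. Hit: return cached contents.
        return bytes(self._load(name))

    def write(self, name, offset, data):
        # Load on a miss, then modify only the cached copy.
        buf = self._load(name)
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data
        self.dirty.add(name)

    def close(self, name):
        # Flush modified contents through the file system and evict.
        if name in self.dirty:
            self.fs[name] = bytes(self.cache[name])
            self.dirty.discard(name)
        self.cache.pop(name, None)


fs = {"notes.txt": b"hello"}
cache = FileLevelCache(fs)
cache.open("notes.txt")
cache.write("notes.txt", 5, b" world")
print(fs["notes.txt"])    # b'hello': the file system has not seen the write
cache.close("notes.txt")
print(fs["notes.txt"])    # b'hello world'
```

Note that between write and close the file system has no idea the file changed, which is exactly the visibility problem with caching at this level.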
Above the File System: Pros and Cons
-
Buffer cache sees file operations, may lead to better prediction or performance.
-
Hides many file operations from the file system, preventing it from providing consistency guarantees.
-
Can’t cache file system metadata: inodes, superblocks, etc.
Below the File System
-
Disk blocks!
-
readblock, writeblock. (Same as the disk interface.)
Below the File System: Pros and Cons
-
Can cache all blocks including file system data structures, inodes, superblocks, etc.
-
Allows file system to see all file operations even if they eventually hit the cache.
-
Cannot observe file semantics or relationships.
This is what modern operating systems do.
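A rough sketch of a block-level cache, assuming a toy in-memory Disk class and LRU eviction. Both are choices made for this example, not something these slides prescribe. The cache exposes the same readblock and writeblock interface as the disk, so any block, data or metadata, is cached the same way.

```python
from collections import OrderedDict


class Disk:
    """Toy disk: a list of fixed-size blocks that counts its reads."""

    def __init__(self, nblocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self.reads = 0

    def readblock(self, n):
        self.reads += 1
        return self.blocks[n]

    def writeblock(self, n, data):
        self.blocks[n] = data


class BlockCache:
    """LRU cache exposing the same readblock/writeblock interface."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.cached = OrderedDict()   # block number -> data, oldest first

    def _remember(self, n, data):
        self.cached[n] = data
        self.cached.move_to_end(n)
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)   # evict least recently used

    def readblock(self, n):
        if n in self.cached:
            self.cached.move_to_end(n)        # hit: no disk access
            return self.cached[n]
        data = self.disk.readblock(n)         # miss: go to the disk
        self._remember(n, data)
        return data

    def writeblock(self, n, data):
        # Simplest choice for this sketch: update the cache and the disk.
        self._remember(n, data)
        self.disk.writeblock(n, data)


disk = Disk(nblocks=8)
cache = BlockCache(disk, capacity=2)
for n in [0, 1, 0, 0, 2, 1]:   # block 0 could just as well hold an inode
    cache.readblock(n)
print(disk.reads)              # 4 disk reads for 6 requests
```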
Review: Data Blocks: Multilevel Index
Observation: most files are small, but some can get very large.
-
some pointers to blocks, which we refer to as direct blocks.
-
some pointers to blocks containing pointers to blocks, which we refer to as indirect blocks.
-
some pointers to blocks containing pointers to blocks containing pointers to blocks, which we refer to as doubly indirect blocks.
-
etc…
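To see why this layout handles both small and very large files, here is a small calculation of the largest file one inode can address. The block size, pointer size and pointer counts in the example call are illustrative assumptions, not values taken from any particular file system.

```python
def max_file_size(block_size, pointer_size, direct, indirect, double_indirect):
    """Largest file, in bytes, addressable by one inode."""
    per_block = block_size // pointer_size        # pointers in one block
    blocks = (direct
              + indirect * per_block
              + double_indirect * per_block ** 2)
    return blocks * block_size


# Assumed: 4 KB blocks, 4-byte pointers, 12 direct, 1 indirect, 1 doubly indirect.
print(max_file_size(4096, 4, 12, 0, 0))   # 49152 bytes via direct blocks only
print(max_file_size(4096, 4, 12, 1, 1))   # 4299210752 bytes, roughly 4 GB
```

Small files never touch the indirect blocks, so they pay no extra lookups, while each added level of indirection multiplies the reachable size by the number of pointers per block.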
Buffer Cache v. Process Pages
-
as memory (duh), but also
-
to cache file data in order to improve performance.
-
Big buffer cache, small main memory: file access is fast, but potential thrashing in the memory subsystem…
-
Small buffer cache, large main memory: little swapping occurs but file access is extremely slow.
Buffer Cache Location
-
Below.
-
Complete disk blocks, including file system metadata.
File System Structures: Questions?
Caching and Consistency
-
Objects in the cache are lost on failures!
Remember: almost every file system operation involves modifying multiple disk blocks.
-
Allocate an inode, mark the used inode bitmap.
-
Allocate data blocks, mark the used data block bitmap.
-
Associate data blocks with the file by modifying the inode.
-
Add inode to the given directory by modifying the directory file.
-
Write data blocks.
How Caching Exacerbates Consistency
Observation: file system operations that modify multiple blocks may leave the file system in an inconsistent state if partially completed.
-
Caching may increase the time span between when the first write of the operation hits the disk and when the last is completed.
What Can Go Wrong?
What kinds of inconsistency can take place if the system is interrupted between the multiple operations necessary to complete a write?
-
Allocate an inode, mark the used inode bitmap. Inode incorrectly marked in use.
-
Allocate data blocks, mark the used data block bitmap. Data blocks incorrectly marked in use.
-
Associate data blocks with the file by modifying the inode. Dangling file not present in any directory.
-
Add inode to the given directory by modifying the directory file.
-
Write data blocks. Data loss!
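The failure cases above can be explored with a toy simulation. Everything here is hypothetical: the dict-based "disk", inode number 7, block numbers 5 and 87, and the small consistency scan are invented to illustrate how stopping after each step leaves a different kind of damage.

```python
def empty_disk():
    return {"inode_bitmap": set(), "block_bitmap": set(),
            "inodes": {}, "directory": {}, "data": {}}


def create_file(disk, crash_after):
    """Apply the writes for one file creation, stopping after crash_after."""
    steps = [
        lambda: disk["inode_bitmap"].add(7),
        lambda: disk["block_bitmap"].update({5, 87}),
        lambda: disk["inodes"].update({7: [5, 87]}),
        lambda: disk["directory"].update({"a.txt": 7}),
        lambda: disk["data"].update({5: b"hello", 87: b"world"}),
    ]
    for step in steps[:crash_after]:
        step()


def check(disk):
    """Return a list of inconsistencies found by a tiny consistency scan."""
    problems = []
    for ino in sorted(disk["inode_bitmap"] - set(disk["inodes"])):
        problems.append(f"inode {ino} marked in use but never initialized")
    referenced = {b for blocks in disk["inodes"].values() for b in blocks}
    for blk in sorted(disk["block_bitmap"] - referenced):
        problems.append(f"block {blk} marked in use but not in any inode")
    for ino in sorted(set(disk["inodes"]) - set(disk["directory"].values())):
        problems.append(f"inode {ino} not present in any directory")
    for blk in sorted(referenced - set(disk["data"])):
        problems.append(f"block {blk} has no data written")
    return problems


for n in range(6):
    disk = empty_disk()
    create_file(disk, crash_after=n)
    print(n, check(disk))
```

Only crash_after=0 (nothing happened) and crash_after=5 (everything happened) report no problems. Every point in between leaves the structures disagreeing with each other.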
Maintaining File System Consistency
-
Don’t buffer writes!
-
We call this a write-through cache because every write passes straight through the cache to the disk.
-
Buffer all operations until blocks are evicted.
-
We call this a write back cache.
-
performance?
-
safety?
-
Write important file system metadata structures (superblock, inode maps, bitmaps, etc.) immediately, but delay data writes.
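A small sketch contrasting the first two policies. The CachedDisk class and its crash method are hypothetical, written only to make the performance and safety trade-off concrete: it counts disk writes and reports which blocks had updates that existed only in memory when the failure hit.

```python
class CachedDisk:
    """Block cache with a selectable write policy ('through' or 'back')."""

    def __init__(self, policy):
        if policy not in ("through", "back"):
            raise ValueError("policy must be 'through' or 'back'")
        self.policy = policy
        self.disk = {}         # block number -> data that survives a crash
        self.cache = {}        # block number -> data held in memory
        self.dirty = set()     # cached blocks newer than the disk copy
        self.disk_writes = 0

    def _to_disk(self, n):
        self.disk[n] = self.cache[n]
        self.disk_writes += 1
        self.dirty.discard(n)

    def writeblock(self, n, data):
        self.cache[n] = data
        self.dirty.add(n)
        if self.policy == "through":
            self._to_disk(n)          # every write reaches the disk now

    def evict(self, n):
        if n in self.dirty:
            self._to_disk(n)          # write-back: disk updated only here
        self.cache.pop(n, None)

    def crash(self):
        """Lose everything in memory; return blocks whose updates were lost."""
        lost = sorted(self.dirty)
        self.cache.clear()
        self.dirty.clear()
        return lost


for policy in ("through", "back"):
    d = CachedDisk(policy)
    for version in range(10):
        d.writeblock(3, version)      # rewrite the same block ten times
    print(policy, "disk writes:", d.disk_writes, "lost on crash:", d.crash())
```

Write-through pays for ten disk writes and loses nothing. Write-back pays for none until eviction, and a crash before then loses every update to block 3.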
File systems also give user processes some control through sync (sync the entire file system) and fsync (sync one file).
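From a user process this might look like the following sketch, which uses Python's os.fsync and os.sync wrappers around the corresponding system calls. The helper name durable_write and the file name are made up for the example.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push this one file to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # Python's user-space buffer -> kernel
        os.fsync(f.fileno())    # kernel buffer cache -> disk, this file only


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "important.txt")
    durable_write(path, b"do not lose me\n")
    if hasattr(os, "sync"):     # os.sync is only available on Unix
        os.sync()               # flush dirty buffers for all file systems
```

The flush call matters: fsync only covers data that has already reached the kernel, so bytes still sitting in the language runtime's own buffer would not be synced.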
Another Approach to Consistency
-
What’s not atomic? Writing multiple disk blocks.
-
What is atomic? Writing one disk block.
Journaling
-
Track pending changes to the file system in a special area on disk called the journal.
-
Following a failure, replay the journal to bring the file system back to a consistent state.
Creation example:
Dear Journal, here’s what I’m going to do today:
Allocate inode 567 for a new file.
Associate data blocks 5, 87, and 98 with inode 567.
Add inode 567 to the directory with inode 33.
That’s it!
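One way to picture this entry as data is a list of records bracketed by a begin marker ("Dear Journal...") and a commit marker ("That's it!"). The record layout and field names below are assumptions for illustration. Real journals use compact binary formats.

```python
import json


def journal_entry(txid, operations):
    """Return the records for one transaction, in the order they are logged."""
    records = [{"tx": txid, "type": "begin"}]             # "Dear Journal, ..."
    records += [{"tx": txid, "type": "op", **op} for op in operations]
    records.append({"tx": txid, "type": "commit"})        # "That's it!"
    return records


entry = journal_entry(1, [
    {"do": "alloc_inode", "inode": 567},
    {"do": "attach_blocks", "inode": 567, "blocks": [5, 87, 98]},
    {"do": "add_to_dir", "inode": 567, "dir_inode": 33},
])
for record in entry:
    print(json.dumps(record))
```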
Journaling: Checkpoints
-
Update the journal!
-
This is called a checkpoint.
Dear Journal, here’s what I’m going to do today:
Allocate inode 567 for a new file.
Associate data blocks 5, 87, and 98 with inode 567.
Add inode 567 to the directory with inode 33.
That’s it!
Dear Journal, I already did everything mentioned above! Checkpoint!
Journaling: Recovery
-
Start at the last checkpoint and work forward, updating on-disk structures as needed.
Dear Journal, I already did everything mentioned above! Checkpoint!
Dear Journal, here’s what I’m going to do today:
Allocate inode 567 for a new file. Did this already!
Associate data blocks 5, 87, and 98 with inode 567. Didn’t do this… OK, done!
Add inode 567 to the directory with inode 33. Didn’t do this either! OK, done.
That’s it! All caught up!
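The recovery walk above can be sketched as follows. The record format and the in-memory state dict are hypothetical, and the sketch assumes every logged entry is complete. The key property is that each operation is idempotent, so redoing a step that already reached the disk ("Did this already!") is harmless.

```python
def apply_op(op, state):
    """Idempotent: redoing an operation that already happened changes nothing."""
    if op["do"] == "alloc_inode":
        state["inodes"].setdefault(op["inode"], [])
    elif op["do"] == "attach_blocks":
        state["inodes"][op["inode"]] = list(op["blocks"])
    elif op["do"] == "add_to_dir":
        state["dirs"].setdefault(op["dir_inode"], set()).add(op["inode"])


def recover(journal, state):
    """Replay every operation logged after the last checkpoint."""
    start = 0
    for i, record in enumerate(journal):
        if record["type"] == "checkpoint":
            start = i + 1
    for record in journal[start:]:
        if record["type"] == "op":
            apply_op(record, state)
    return state


journal = [
    {"type": "checkpoint"},
    {"type": "begin"},
    {"type": "op", "do": "alloc_inode", "inode": 567},
    {"type": "op", "do": "attach_blocks", "inode": 567, "blocks": [5, 87, 98]},
    {"type": "op", "do": "add_to_dir", "inode": 567, "dir_inode": 33},
    {"type": "commit"},
]
# On-disk state after the crash: inode 567 was allocated, nothing else was done.
state = {"inodes": {567: []}, "dirs": {33: set()}}
print(recover(journal, state))
```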
Journaling: Recovery
-
Incomplete journal entries are ignored, as replaying them may leave the file system in an inconsistent state.
What would happen if we processed the following incomplete journal entry?
Dear Journal, here’s what I’m going to do today:
Allocate inode 567 for a new file.
Associate data blocks 5, 87, and 98 with inode 567.
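A sketch of how recovery might filter such entries, assuming each record carries a transaction id and that a complete entry ends with a commit record. Both are assumptions of this example, and the first transaction shown is invented for contrast.

```python
def committed_ops(journal):
    """Return operations only from transactions that logged a commit."""
    committed = {r["tx"] for r in journal if r["type"] == "commit"}
    return [r for r in journal if r["type"] == "op" and r["tx"] in committed]


journal = [
    {"tx": 1, "type": "begin"},
    {"tx": 1, "type": "op", "do": "alloc_inode", "inode": 566},
    {"tx": 1, "type": "commit"},
    # Transaction 2 was cut off by the failure: no commit record.
    {"tx": 2, "type": "begin"},
    {"tx": 2, "type": "op", "do": "alloc_inode", "inode": 567},
    {"tx": 2, "type": "op", "do": "attach_blocks", "inode": 567,
     "blocks": [5, 87, 98]},
]
for op in committed_ops(journal):
    print(op)
```

Replaying transaction 2 would allocate inode 567 and attach its blocks without ever adding it to a directory, leaving a file that nothing points to. Skipping the whole entry keeps the file system as it was before the operation started.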
Journaling: Implications
Observation: metadata updates (allocate inode, free data block, add to directory, etc.) can be represented compactly and probably written to the journal atomically.
-
We could include data blocks in the journal, meaning that each data block would potentially be written twice (ugh).
-
We could exclude data blocks from the journal, meaning that file system structures are maintained but not file data.
Next Time
Two very different file system designs:
-
The Berkeley Fast File System (FFS)
-
And log-structured file systems (LFS).