COMS W4118: Operating Systems I

Dept. of Computer Science, Columbia University

Spring 2013 — Kaustubh Joshi

Homework 5

Homework 5 is due Monday 5/6 at 12:01 AM ET.

All written problems in this assignment are to be done alone, while the programming assignment is a group assignment. Submit the written and programming assignments separately using the instructions on the homeworks page. Periodically check back on the course page for any errata or updates to submission instructions.

Written Assignment (40 pts)

Exercise numbers X.Y refer to the exercise problem Y in Chapter X in the course textbook, Operating Systems Concepts 9e. Each problem is worth 5 points unless stated otherwise. Make your answers concise. You'll lose points for verbosity. Your written assignment directory should contain a single ASCII file called hw5.txt with your solutions and a completed teammate evaluation form for the programming assignment.

  1. 10.12, 10.13, 10.14 modified (10 points). In this problem, we're going to begin to see how disk geometry impacts I/O scheduling. Suppose that a disk drive has 5,000 cylinders, numbered 0 to 4999.

    1. The drive is currently serving a request at cylinder 2150, and the previous request was at cylinder 1805. The queue of pending requests, in FIFO order, is: 2069, 1212, 2296, 2800, 544, 1618, 356, 1523, 4965, 3681. Starting from the current head position, what is the total distance (in cylinders) that the disk arm moves to satisfy all the pending requests for each of the following disk-scheduling algorithms? a) FCFS, b) SSTF, c) LOOK (Elevator), and d) C-LOOK (Elevator).

    2. During a seek, the disk accelerates the disk arm at a constant rate for the first half of the seek, and then decelerates it at the same rate for the second half. Given that the distance d travelled by a body undergoing constant acceleration a for time t is d = (1/2)at^2, compute the relation for seek time as a function of seek distance in cylinders. Assume that the disk can perform a seek to an adjacent cylinder in 1 ms and a full-stroke seek over all 5,000 cylinders in 18 ms. The relation should be of the form t = x + y√L, where t is the time in ms and L is the seek distance in cylinders.

    3. Calculate the total seek time for each of the schedules in Part 1 of this problem. Determine which schedule has the smallest total seek time.

    4. Assume that the disk rotates at 7200 RPM. What seek distance (in cylinders) can be covered in the time of an average rotational latency (half a rotation)? This number will give you a sense of the cylinder range over which it makes more sense for the disk controller to do the scheduling as opposed to the OS's more coarse-grained scheduling policies.

  2. 11.19. Read section 11.5.3 of the textbook to understand what consistency semantics are. What are the implications of supporting UNIX consistency semantics for shared access for those files that are stored on remote file systems?

  3. 12.12 (modified). Consider a system where free space is kept in a free-space list.

    1. Suppose that the pointer to the free-space list is lost. Can the system recover by reconstructing the free-space list? Explain your answer.
    2. Suggest a scheme to ensure that the pointer is never lost as a result of power, memory, or disk block failure.

  4. 12.13. Some file systems allow disk storage to be allocated at different levels of granularity. For instance, a file system could allocate 4 KB of disk space as a single 4-KB block or as eight 512-byte blocks. How could we take advantage of this flexibility to improve performance? What modifications would have to be made to the free-space management scheme in order to support this feature?

  5. Consider a file system on a disk that has both logical and physical block sizes of 512 bytes. The current file pointer is at offset 4096 (the last block read). For each of the allocation strategies (contiguous, linked, FAT, and indexed), how many physical blocks must be read from the disk if we want to access data at a) offset 2048 and b) offset 8192 next? Assume that all inode information and systemwide file allocation tables are already in memory, and that each inode can store at most 10 direct block pointers.

  6. 10.18 (modified). Consider a RAID Level 5 organization comprising five disks, with the parity for sets of four blocks on four disks stored on the fifth disk. How many blocks are accessed in order to perform the following? What is the write amplification for each scenario? (Write amplification is the ratio of total data and parity writes to the number of data writes generated by the OS).

    1. A write of one block of data.
    2. A write of seven contiguous blocks of data (starting at a 4-block boundary).
    3. A write of eight contiguous blocks of data (starting at a 4-block boundary).
    4. A write of eight non-contiguous blocks of data.
    5. Based on these calculations, what can you say about the suitability of using RAID-5 for SSD drives?

  7. Consider a guest virtual machine (VM) running on a host OS on a laptop. The VM's virtual disk image is stored as a file on the host OS's filesystem.

    1. If the laptop has a hard disk, which I/O scheduler (FCFS/Noop, SSTF, Elevator) would you pick for the guest OS? host OS? Why?
    2. For the remaining questions, assume the laptop has an SSD drive. Which I/O scheduler (FCFS/Noop, SSTF, Elevator) would you pick for the guest OS? host OS? Why?
    3. Would you pick a log-structured FS for the guest OS? host OS? Why or why not?
    4. Is it useful to have TRIM support in the guest OS? host OS? Why or why not?

Group Programming Assignment: Cloud File System (75 pts)

In this programming assignment, we'll be working with the Android kernel again, using the same emulator environment you constructed for Homework 2 with these Android installation instructions. In your checked-out git repository, make sure you check out the hw5 branch before starting work on this assignment: git checkout hw5.

As the deliverable, you will be required to submit a kernel patch and a diff against the hw5_clean tag in your git repo. The patch shows us all commits you have made to the kernel during development, while the diff reflects only your latest code. To produce the patch and diff files, make sure you've committed all your changes, then go to the root of the kernel tree and type git format-patch hw5_clean.. --stdout > groupN_hw5.patch, where N is your group number. This command should create a single groupN_hw5.patch file to submit. Similarly, use the command git diff hw5_clean.. > groupN_hw5.diff to create a single diff file to submit. In addition to these files, you should include all user-mode test programs (.c, .h, and Makefile files), the executables for your test programs, and a README file documenting your files and code, explaining any way in which your solution differs from what was assigned and any assumptions you made. You are also required to fill out and include this cover sheet to indicate whether your code has any problems and what attempts you have made to fix them.

The usual coding instructions apply. All code must be written in C, and a Makefile must be provided with a default target that compiles all user-mode programs. When using gcc, the -Wall switch must be used to display warnings. In addition, you must check your kernel diff against the Linux coding guidelines using the scripts/checkpatch.pl groupN_hw5.diff command in the kernel tree. You can find a tutorial on how to use checkpatch, along with general kernel hacking tips, here. Before submitting, make sure you remove any files produced during compilation (.o files, executables).

Description

Phones and tablets today often ship in multiple models with flash-based storage of varying sizes, such as 8GB, 16GB, or 32GB. However, because flash is expensive compared to disks, one often pays a hefty premium for a model with more than the minimum amount of storage. Consumers buy higher-capacity devices anyway because they have lots of media (e.g., pictures, videos, and music), but they often end up actively using only a small fraction of it on a regular basis.

In this assignment, we are going to remedy this situation by modifying an Android filesystem so that it can store a device's data partly on the device and partly on cheaper cloud-based storage, giving the user the illusion that their device has "infinite storage" irrespective of the amount of flash storage available. The idea is to use the local filesystem as a cache: when its utilization increases beyond a watermark W, old files that haven't been used in a while will be evicted and stored in a cloud store. If a file that is stored on the cloud store is accessed by the user, it will transparently be fetched back (evicting other files if necessary). Metadata, such as directory files and inodes, usually takes very little space compared to data and allows the user to quickly browse the contents of the filesystem, so it should never be evicted. In a sense, we'll be implementing a swapping subsystem that works between flash and remote disks, rather than between RAM and disk.

To keep things simple, we will modify the Ext2 filesystem rather than Android's more complex default YAFFS2 flash-optimized log-structured filesystem. Also, rather than working at the block level, we will work with files. A file's inode should always remain on the mobile device; when the file is evicted, only its data blocks are stored on the cloud server and deleted from the local filesystem. For simplicity, we will rely on the Clock (second chance) algorithm to determine which files to evict.

Deliverables

  1. Part I. Cloud Server (20 points): Write a cloud server clfs_server.c that runs on your host machine (outside the Android emulator) and stores files evicted from your filesystem. Your server should use only the standard libc API (no external libraries). On execution, it listens for TCP connections from the Cloud FS client on port 8888. On receiving a new connection, it should spawn a new thread to service the request. Requests have the following format:

      struct clfs_req {
              enum {
                      CLFS_PUT,
                      CLFS_GET,
                      CLFS_RM
              } type;
              int inode;
              int size;
      };
      
    After receiving a CLFS_PUT request, the server should then expect an unsigned char data[size] block of file data. The server should always send an int status code back to the client on the same socket, indicating whether it was able to execute the operation successfully. If the request is of type CLFS_GET and there is no error, the server should subsequently send back the requested variable-sized unsigned char data[size] data block. The following status codes are allowed:
      enum clfs_status {
              CLFS_OK = 0,            /* Success */
              CLFS_INVAL = EINVAL,    /* Invalid address */
              CLFS_ACCESS = EACCES,   /* Could not read/write file */
              CLFS_ERROR              /* Other errors */
      };
      
    Run the cloud server on the host machine on which you are running the emulator. The host can always be reached at IP address 10.0.2.2 from within the emulator. The cloud server should store each file's data as soon as it is received, in a directory named clfs_store located in the same directory as the cloud server executable. Each file should be named inodenum.dat. If the cloud server receives a CLFS_RM request, it should delete the appropriate file from the store.
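
    To make the request handling concrete, here is one possible sketch of a per-connection handler, reusing the clfs_req and clfs_status definitions above and only standard libc calls. It is a starting point, not required code: error handling is abbreviated, and the main() that binds to port 8888, accepts connections, and hands each connection to a new thread (e.g., via pthread_create) is omitted.

      /* Sketch of a per-connection handler for clfs_server.c; assumes the
       * clfs_req and clfs_status definitions shown above are in scope. */
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/socket.h>

      static void handle_request(int fd)
      {
              struct clfs_req req;
              unsigned char *data = NULL;
              char path[64];
              int status = CLFS_OK;
              int file = -1;

              if (recv(fd, &req, sizeof(req), MSG_WAITALL) != sizeof(req))
                      return;
              snprintf(path, sizeof(path), "clfs_store/%d.dat", req.inode);

              switch (req.type) {
              case CLFS_PUT:
                      data = malloc(req.size);
                      if (!data || recv(fd, data, req.size, MSG_WAITALL) != req.size ||
                          (file = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
                              status = CLFS_ERROR;
                      else if (write(file, data, req.size) != req.size)
                              status = CLFS_ACCESS;
                      send(fd, &status, sizeof(status), 0);
                      break;
              case CLFS_GET:
                      data = malloc(req.size);
                      file = open(path, O_RDONLY);
                      if (!data || file < 0 || read(file, data, req.size) != req.size)
                              status = CLFS_ACCESS;
                      send(fd, &status, sizeof(status), 0);
                      if (status == CLFS_OK)
                              send(fd, data, req.size, 0);
                      break;
              case CLFS_RM:
                      if (unlink(path) < 0)
                              status = CLFS_ACCESS;
                      send(fd, &status, sizeof(status), 0);
                      break;
              default:
                      status = CLFS_INVAL;
                      send(fd, &status, sizeof(status), 0);
              }

              if (file >= 0)
                      close(file);
              free(data);
      }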

  2. Part II. Initializing an FS (5 points): Before modifying the Ext2 filesystem, you must first enable support for it by reconfiguring the kernel. Following the instructions for kernel configuration in step 2.7 of the Android installation tutorial, enable Second extended fs support and Ext2 extended attributes from the File systems menu of the kernel configuration tool. You should leave any other Ext2 options besides these unselected. Save your new configuration as arch/x86/configs/goldfish_ext2_defconfig and use it to configure your kernel. Next, you must create and mount an instance of the FS in the emulator. The instructions to do this are very similar to the procedure for creating swap space in homework 4.

    1. The tools to work with disks and Ext2 filesystems aren't included in the default Android distribution. To get them, we will use busybox, a minimal, self-contained collection of standard UNIX utilities packaged as a single binary. First, get a precompiled version of busybox from here; download the busybox-i686 binary, because our emulator runs x86.

    2. Before starting the emulator, edit the avd settings (android avd) and enable the "SDCard" option to create an sdcard of size 10MB. This image will be located in ~/.android/avd/<name_of_your_avd>.avd/sdcard.img. Alternatively, you may create (additional) images yourself using the mksdcard tool in the tools/ subdirectory of the Android SDK, with the following options: mksdcard -l HW5 10M hw5.img. If you create your own image, you must ask the emulator to load it by passing the -sdcard <img_name.img> command line parameter to the emulator. In either case, the image will appear in the Android device as the /dev/block/mmcblk0 block device.

    3. Start the emulator and push the busybox-i686 binary to the emulator: adb push busybox-i686 /data/busybox and adb shell chmod 777 /data/busybox

    4. From the adb shell, initialize a partition on the sdcard:

      • First, if the sdcard gets automounted (check via the mount command), unmount it by going to the Settings app, going to the Storage menu, and selecting Unmount SD card.
      • Run adb shell
      • cd /data
      • ./busybox fdisk /dev/block/mmcblk0
      • In fdisk, use m to see help. You need to do the following:
        • o (create new partition table)
        • n (add a new primary partition 1 starting at cylinder 1 and ending at cylinder 320 - the partition should have 10232 blocks of 512 bytes each)
        • t (make sure the partition is set to Linux: type 83)
        • v (verify the partition table)
        • w (write partition table to disk)

    5. At this point, you should see a /dev/block/mmcblk0p1 partition. Create an Ext2 filesystem on it with 1k blocks and HW5 label using the command:

      ./busybox mke2fs -b 1024 -L HW5 /dev/block/mmcblk0p1
      Mount the filesystem to /mnt/sdcard using the mount command. Once the filesystem is mounted, print the mount table by calling the mount command with no options.

    6. Your deliverables for this part of the homework are the output of the mke2fs command and the output of the mount command showing the mount table, both included in your cover sheet.

  3. Part III. Eviction Mechanism (30 points): Extend the Ext2 filesystem to evict files to the cloud server and to fetch them when needed. Once Ext2 is enabled, implement the following two calls in the file fs/ext2/ext2_evict.c.

          /* Evict the file identified by i_node to the cloud server,
           * freeing its disk blocks and removing any page cache pages.
           * The call should return when the file is evicted. Besides
           * the file data pointers, no other metadata, e.g., access time,
           * size, etc. should be changed. Appropriate errors should
           * be returned. In particular, the operation should fail if the
           * inode currently maps to an open file. Lock the inode
           * appropriately to prevent a file open operation on it while
           * it is being evicted.
           */ 
          int ext2_evict(struct inode *i_node);
    
          /* Fetch the file specified by i_node from the cloud server.
           * The function should allocate space for the file on the local
           * filesystem. No other metadata of the file should be changed.
           * Lock the inode appropriately to prevent concurrent fetch
           * operations on the same inode, and return appropriate errors.
           */
          int ext2_fetch(struct inode *i_node);
      
    For the assignment, we do not want to change the physical layout of inodes or the superblock, to avoid having to use modified tools to create the Ext2 filesystem. So, you should record whether a file is evicted or not using the extended attributes (xattr) API. Add a new xattr named evicted whose value is 1 if the file is evicted, and 0 otherwise. If the xattr for a file is missing, assume its value is 0 and add the xattr to the file. Any space allocated to an evicted file, including its direct and indirect blocks, must be freed, and any page-cache pages that refer to that file must be removed. Modify the Ext2 implementation of the VFS file_operations.open function to check whether a file that is being opened is evicted or not. If the file is evicted when opened by a program, fetch it from the cloud server by calling ext2_fetch before allowing the open call to proceed.
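
    As a rough illustration of this open-path check (one possible approach, not prescribed code), the sketch below uses ext2's in-tree xattr helpers from fs/ext2/xattr.h. The function name ext2_cloud_open, the choice of the trusted xattr namespace, and the single ASCII-byte encoding of the evicted flag are assumptions you may change, and the inode locking needed against a concurrent eviction is omitted:

      /* fs/ext2/file.c (sketch): fetch an evicted file before open proceeds.
       * ext2_fetch() is the function you implement in fs/ext2/ext2_evict.c
       * and needs a declaration visible here (e.g., in ext2.h). */
      #include <linux/fs.h>
      #include "ext2.h"
      #include "xattr.h"

      static int ext2_cloud_open(struct inode *inode, struct file *filp)
      {
              char val = '0';
              int ret;

              ret = ext2_xattr_get(inode, EXT2_XATTR_INDEX_TRUSTED,
                                   "evicted", &val, sizeof(val));
              if (ret == -ENODATA)            /* missing => not evicted; add it */
                      ext2_xattr_set(inode, EXT2_XATTR_INDEX_TRUSTED,
                                     "evicted", "0", 1, 0);
              else if (ret < 0)
                      return ret;

              if (val == '1') {
                      ret = ext2_fetch(inode);        /* blocks until data is local */
                      if (ret)
                              return ret;
                      ext2_xattr_set(inode, EXT2_XATTR_INDEX_TRUSTED,
                                     "evicted", "0", 1, 0);
              }
              return generic_file_open(inode, filp);  /* fall through to open */
      }

    You would then point the .open member of ext2_file_operations at this function; since ext2's stock open handler differs between kernel versions, chain to whichever handler your tree already uses rather than assuming generic_file_open.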

    Register and mount your filesystem: Because we are creating a new filesystem, we would normally change the FS magic number (found as EXT2_SUPER_MAGIC in include/linux/magic.h). However, we do not want to change any on-disk structures, so we will leave the magic number unmodified. Instead, by modifying the option parsing code in fs/ext2/super.c, your FS should accept the following new mount options:

    1. srv=server_name_or_ip:port: the cloud server location. Return from the mount operation with an error (without mounting) if no value for this option is specified.
    2. wh=num: the high watermark. Should be a number from 0 to 100; its default value is 95. When the utilization of the FS exceeds wh percent, any new write call should trigger an eviction operation (see the next section) before the write is allowed to proceed.
    3. wl=num: the low watermark. Should be a number between 0 and wh; its default value is 85. When the utilization of the FS exceeds wl percent, the evictor daemon (next section) should asynchronously run an eviction operation on the filesystem.
    4. evict=num: the eviction target. Should be a number between 0 and wl; its default value is 70. Any eviction operation should leave the disk evict percent utilized.
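
    The existing parse_options() in fs/ext2/super.c already uses the token-matching helpers from linux/parser.h, and the new options can be handled the same way. The sketch below shows one possible shape; the parse_cloud_option() helper and the s_cloud_* fields added to struct ext2_sb_info are hypothetical names, and applying the defaults (95/85/70), the evict < wl < wh ordering checks, and the mandatory srv check would happen in ext2_fill_super once all options have been parsed:

      /* fs/ext2/super.c (sketch): token table and parser for the new options. */
      #include <linux/errno.h>
      #include <linux/parser.h>

      enum {
              Opt_cloud_srv, Opt_cloud_wh, Opt_cloud_wl, Opt_cloud_evict,
              Opt_cloud_err
      };

      static const match_table_t cloud_tokens = {
              {Opt_cloud_srv,   "srv=%s"},
              {Opt_cloud_wh,    "wh=%u"},
              {Opt_cloud_wl,    "wl=%u"},
              {Opt_cloud_evict, "evict=%u"},
              {Opt_cloud_err,   NULL}
      };

      /* Called from parse_options() for each comma-separated token p that the
       * existing ext2 switch does not recognize; returns 0 on success. */
      static int parse_cloud_option(char *p, struct ext2_sb_info *sbi)
      {
              substring_t args[MAX_OPT_ARGS];
              int option;

              switch (match_token(p, cloud_tokens, args)) {
              case Opt_cloud_srv:
                      sbi->s_cloud_srv = match_strdup(&args[0]);  /* "host:port" */
                      return sbi->s_cloud_srv ? 0 : -ENOMEM;
              case Opt_cloud_wh:
                      if (match_int(&args[0], &option) || option < 0 || option > 100)
                              return -EINVAL;
                      sbi->s_cloud_wh = option;
                      return 0;
              case Opt_cloud_wl:
                      if (match_int(&args[0], &option) || option < 0 || option > 100)
                              return -EINVAL;
                      sbi->s_cloud_wl = option;
                      return 0;
              case Opt_cloud_evict:
                      if (match_int(&args[0], &option) || option < 0 || option > 100)
                              return -EINVAL;
                      sbi->s_cloud_evict = option;
                      return 0;
              default:
                      return -EINVAL;
              }
      }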

  4. Part IV. Eviction Policy (20 points): Implement an Ext2-specific eviction policy as a function ext2_evict_fs(struct super_block *super) in the file fs/ext2/ext2_evict.c. This function should implement the Clock (second chance) algorithm on the supplied Ext2 FS. If the utilization of the filesystem is greater than the low watermark (wl) mount option, it should scan inodes in sequence starting from the current clock hand, skipping directory inodes and storing the last time it scanned each inode, until it identifies an inode whose access time is older than its last scan time. The evictor should then evict this file using the evict call, and repeat the process until the disk usage drops below the eviction target given by the evict mount option (see the options in Part III above). Store the last scan time for each file as an xattr named scantime; if the xattr is missing, create it with an initial value of 0. The position of your clock hand must be persistent across reboots, i.e., you must store the inode number the hand is pointing to on disk as an extended attribute named clockhand associated with the root directory (default value of inode 0 if the xattr is missing).

    Add the following VFS superblock operation to super_operations in linux/fs.h.

    int evict_fs(struct super_block *super);
    Your modified Ext2 filesystem must implement this operation by invoking the ext2_evict_fs call. The evict_fs VFS call must be invoked from two places: first, from the Ext2-specific function that allocates a new disk block, whenever the FS utilization after the allocation exceeds the wh parameter; and second, from a new kernel thread you will create called kfs_evictd, implemented in the file fs/evictd.c. This evictor thread must be registered to start when the system starts, using the fs_initcall(fn) macro defined in include/linux/init.h (see migration_init in kernel/sched.c for an example of initcall registration). Once started, the evictor should run once every minute and scan each mounted filesystem, invoking the evict_fs VFS call if the FS defines it.
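
    A bare-bones sketch of the daemon is shown below. It assumes iterate_supers() is available in your kernel tree (if it is not, walk the super_blocks list under sb_lock instead), and the thread never exits since it is meant to live for the lifetime of the system:

      /* fs/evictd.c (sketch): periodic evictor thread started at boot. */
      #include <linux/delay.h>
      #include <linux/err.h>
      #include <linux/fs.h>
      #include <linux/init.h>
      #include <linux/kthread.h>

      static void evict_one_sb(struct super_block *sb, void *unused)
      {
              if (sb->s_op && sb->s_op->evict_fs)     /* only FSes that define it */
                      sb->s_op->evict_fs(sb);
      }

      static int kfs_evictd(void *unused)
      {
              while (!kthread_should_stop()) {
                      iterate_supers(evict_one_sb, NULL);
                      msleep_interruptible(60 * 1000);        /* run once a minute */
              }
              return 0;
      }

      static int __init evictd_init(void)
      {
              struct task_struct *task = kthread_run(kfs_evictd, NULL, "kfs_evictd");

              return IS_ERR(task) ? PTR_ERR(task) : 0;
      }
      fs_initcall(evictd_init);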

  5. Part V. Testing your cloud filesystem: For this last assignment of the course, we ask you to test your filesystem yourselves. During the last week before the submission deadline, we will provide you with a grading workload in the form of two scripts client_test.sh and host_test.sh that you have to run on the host and on the emulator on a freshly created image to test your filesystem implementation. In addition to your kernel diffs, userspace cloud server source code, and cover sheet, you must also submit the following:
    1. The output of the grading program client_test.sh run in the emulator.
    2. The kernel log after the grading program completes (output of adb dmesg).
    3. The final disk image file hw5.img after running the test workload.
    4. The output of the ls -all command executed in the clfs_store directory on your cloud server.
    5. The output of the cloud FS host grading program host_test.sh run on the cloud server host.

Important Hints

  1. We are going to implement the network client code that evicts files to the cloud server directly in the kernel, to avoid having this become a semester-long project. You can find a tutorial on how to write socket code in the kernel here. Normally, though, you should think many times about why you might want to put networking code in the kernel, especially when talking to untrusted servers. Imagine what could happen if a hacker got a direct backdoor through your server by replacing the contents of a root-owned evicted file with a malicious script - at the very least, we would want some encryption and authentication.
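
    As a starting point, the sketch below shows roughly what sending a clfs_req from kernel code might look like, using the four-argument sock_create_kern() found in 2.6/3.x kernels. The function name clfs_send_req and its pre-converted address arguments are illustrative only, and reading the int status reply back with kernel_recvmsg() follows the same pattern:

      /* Sketch: connect to the cloud server from kernel code and send one
       * clfs_req (shared with the server via a common header). Uses the
       * 2.6/3.x four-argument sock_create_kern(); error paths abbreviated. */
      #include <linux/in.h>
      #include <linux/net.h>
      #include <linux/string.h>
      #include <linux/uio.h>
      #include <net/sock.h>

      static int clfs_send_req(__be32 srv_ip, __be16 srv_port, struct clfs_req *req)
      {
              struct socket *sock;
              struct sockaddr_in addr;
              struct kvec vec = { .iov_base = req, .iov_len = sizeof(*req) };
              struct msghdr msg = { .msg_flags = 0 };
              int err;

              err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
              if (err)
                      return err;

              memset(&addr, 0, sizeof(addr));
              addr.sin_family = AF_INET;
              addr.sin_addr.s_addr = srv_ip;  /* e.g. in_aton("10.0.2.2") */
              addr.sin_port = srv_port;       /* e.g. htons(8888) */

              err = kernel_connect(sock, (struct sockaddr *)&addr, sizeof(addr), 0);
              if (!err)
                      err = kernel_sendmsg(sock, &msg, &vec, 1, sizeof(*req));

              sock_release(sock);
              return err < 0 ? err : 0;
      }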

  2. Check the return values of all functions utilizing system resources. Your code should handle errors properly. Many failed function calls should not be fatal to a program. Typically, a system call will return -1 in the case of an error (malloc will return NULL). If a function call sets the errno variable (see the function's man page to find out if it does), you should use the result of strerror(errno) as part of your error message. As far as system calls are concerned, you will want to use one of the mechanisms described in Reporting Errors or Error Reporting.