`
`Canon Exhibit 1111
`
`
`
`
`
`The Design and Implementation of the
`4.4BSD
`Operating System
`
`Marshall Kirk McKusick
`Consultant
`
`Keith Bostic
`Berkeley Software Design, Inc.
`
`Michael J. Karels
`Berkeley Software Design, Inc.
`
`John S. Quarterman
`Texas Internet Consulting
`
`ADDISON-WESLEY
`Boston • San Francisco • New York • Toronto • Montreal
`London • Munich • Paris • Madrid
`Capetown • Sydney • Tokyo • Singapore • Mexico City
`
`
`
`
`This book is in the Addison-Wesley UNIX and Open Systems Series
`
`Series Editors: Marshall Kirk McKusick and John S. Quarterman
`Publishing Partner: Peter S. Gordon
`Associate Editor: Deborah R. Lafferty
`Associate Production Supervisor: Patricia A. Oduor
`Marketing Manager: Bob Donegan
`Senior Manufacturing Manager: Roy E. Logan
`Cover Designer: Barbara Atkinson
`Troff Macro Designer: Jaap Akkerhuis
`Copy Editor: Lyn Dupre
`Cover Art: John Lasseter
`
`UNIX is a registered trademark of X/Open in the United States and other countries. Many of
`the designations used by manufacturers and sellers to distinguish their products are claimed
`as trademarks. Where those designations appear in this book, and Addison-Wesley was
`aware of a trademark claim, the designations have been printed in initial caps or all caps.
`
The programs and applications presented in this book have been included for their
instructional value. They have been tested with care, but are not guaranteed for any
particular purpose. The publisher offers no warranties or representations, nor does it
accept any liabilities with respect to the programs or applications.
`
`Library of Congress Cataloging-in-Publication Data
`
The design and implementation of the 4.4BSD operating system /
Marshall Kirk McKusick ... [et al.].
`p. cm.
`Includes bibliographical references and index.
`ISBN 0-201-54979-4
`1. UNIX (Computer file) 2. Operating systems (Computers)
`I. McKusick, Marshall Kirk.
`QA76.76.063D4743 1996
`005.4'3--dc20
`
`96-2433
`CIP
`
`Copyright © 1996 by Addison-Wesley Longman, Inc.
`
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher. Printed in
the United States of America. Published simultaneously in Canada.
`
`Text printed on recycled and acid-free paper.
`ISBN 0201549794
11 12 13 14 15 MA 03 02 01

10th Printing  February 2001
`
`
`
`
`36
`
`Chapter 2
`
`Design Overview of 4.4BSD
`
`This facility allows buffers in different parts of a process address space to be
`written atomically, without the need to copy them to a single contiguous buffer.
`Atomic writes are necessary in the case where the underlying abstraction is record
`based, such as tape drives that output a tape block on each write request. It is also
`convenient to be able to read a single request into several different buffers (such as
a record header into one place and the data into another). Although an application
can simulate the ability to scatter data by reading the data into a large buffer and
then copying the pieces to their intended destinations, the cost of memory-to-memory
copying in such cases often would more than double the running time of
the affected application.
`Just as send and recv could have been implemented as library interfaces to
`sendto and recvfrom, it also would have been possible to simulate read with readv
`and write with writev. However, read and write are used so much more frequently
`that the added cost of simulating them would not have been worthwhile.
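The gather side of this interface can be sketched in user code with writev(2), which collects the record header and the payload from two separate buffers into a single atomic write. The file path and record layout below are illustrative, not from the text:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t
write_record(int fd, const char *hdr, const char *data)
{
	struct iovec iov[2];

	iov[0].iov_base = (void *)hdr;	/* header from one buffer... */
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = (void *)data;	/* ...payload from another */
	iov[1].iov_len = strlen(data);
	return (writev(fd, iov, 2));	/* both emitted as one write */
}
```

A corresponding readv(2) call would split a single incoming record back into separate header and data buffers without an intermediate copy.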
`
`Multiple Filesystem Support
`With the expansion of network computing, it became desirable to support both
`local and remote filesystems. To simplify the support of multiple filesystems, the
`developers added a new virtual node or vnode interface to the kernel. The set of
`operations exported from the vnode interface appear much like the filesystem
`operations previously supported by the local filesystem. However, they may be
`supported by a wide range of filesystem types:
`
• Local disk-based filesystems
`
`• Files imported using a variety of remote filesystem protocols
`
`• Read-only CD-ROM filesystems
`
• Filesystems providing special-purpose interfaces (for example, the /proc
filesystem)
`
`A few variants of 4.4BSD, such as FreeBSD, allow filesystems to be loaded
`dynamically when the filesystems are first referenced by the mount system call.
`The vnode interface is described in Section 6.5; its ancillary support routines are
`described in Section 6.6; several of the special-purpose filesystems are described
`in Section 6.7.
`
2.7 Filesystems
`
`A regular file is a linear array of bytes, and can be read and written starting at any
`byte in the file. The kernel distinguishes no record boundaries in regular files,
`although many programs recognize line-feed characters as distinguishing the ends
of lines, and other programs may impose other structure. No system-related
information about a file is kept in the file itself, but the filesystem stores a small
amount of ownership, protection, and usage information with each file.
`
`
`
`
`
`A filename component is a string of up to 255 characters. These filenames are
`stored in a type of file called a directory. The information in a directory about a
`file is called a directory entry and includes, in addition to the filename, a pointer to
`the file itself. Directory entries may refer to other directories, as well as to plain
files. A hierarchy of directories and files is thus formed, and is called a filesystem;
`a small one is shown in Fig. 2.2. Directories may contain subdirectories, and there
`is no inherent limitation to the depth with which directory nesting may occur. To
`protect the consistency of the filesystem, the kernel does not permit processes to
`write directly into directories. A filesystem may include not only plain files and
`directories, but also references to other objects, such as devices and sockets.
`The filesystem forms a tree, the beginning of which is the root directory,
sometimes referred to by the name slash, spelled with a single solidus character
(/). The root directory contains files; in our example in Fig. 2.2, it contains
vmunix, a copy of the kernel-executable object file. It also contains directories; in this
example, it contains the usr directory. Within the usr directory is the bin directory,
which mostly contains executable object code of programs, such as the files
ls and vi.
A process identifies a file by specifying that file's pathname, which is a string
composed of zero or more filenames separated by slash (/) characters. The kernel
associates two directories with each process for use in interpreting pathnames. A
process's root directory is the topmost point in the filesystem that the process can
access; it is ordinarily set to the root directory of the entire filesystem. A pathname
beginning with a slash is called an absolute pathname, and is interpreted by
the kernel starting with the process's root directory.
`
`Figure 2.2 A small filesystem tree.
`
`
`
`
`
A pathname that does not begin with a slash is called a relative pathname, and
is interpreted relative to the current working directory of the process. (This
directory also is known by the shorter names current directory or working directory.)
The current directory itself may be referred to directly by the name dot, spelled
with a single period (.). The filename dot-dot (..) refers to a directory's parent
directory. The root directory is its own parent.
A process may set its root directory with the chroot system call, and its current
directory with the chdir system call. Any process may do chdir at any time,
but chroot is permitted only to a process with superuser privileges. Chroot is
normally used to set up restricted access to the system.
Using the filesystem shown in Fig. 2.2, if a process has the root of the filesystem
as its root directory, and has /usr as its current directory, it can refer to the file
vi either from the root with the absolute pathname /usr/bin/vi, or from its current
directory with the relative pathname bin/vi.
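The same file being reachable by an absolute and a relative pathname can be shown from user code. Since the tree in Fig. 2.2 is only an example, the sketch below uses /etc/passwd as a stand-in for /usr/bin/vi:

```c
#include <fcntl.h>
#include <unistd.h>

/* Open one file twice: once by absolute pathname, once by a
 * relative pathname after chdir(2) sets the current directory. */
int
open_both_ways(void)
{
	int fd1, fd2;

	fd1 = open("/etc/passwd", O_RDONLY);	/* absolute: from root */
	if (fd1 < 0)
		return (-1);
	if (chdir("/etc") < 0) {		/* set current directory */
		close(fd1);
		return (-1);
	}
	fd2 = open("passwd", O_RDONLY);		/* relative: from cwd */
	if (fd2 < 0) {
		close(fd1);
		return (-1);
	}
	close(fd1);
	close(fd2);
	return (0);
}
```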
`System utilities and databases are kept in certain well-known directories. Part
of the well-defined hierarchy includes a directory that contains the home directory
for each user; for example, /usr/staff/mckusick and /usr/staff/karels in Fig. 2.2.
`When users log in, the current working directory of their shell is set to the home
`directory. Within their home directories, users can create directories as easily as
`they can regular files. Thus, a user can build arbitrarily complex subhierarchies.
`The user usually knows of only one filesystem, but the system may know that
`this one virtual filesystem is really composed of several physical filesystems, each
`on a different device. A physical filesystem may not span multiple hardware
`devices. Since most physical disk devices are divided into several logical devices,
there may be more than one filesystem per physical device, but there will be no
more than one per logical device. One filesystem, the filesystem that anchors all
absolute pathnames, is called the root filesystem, and is always available. Others
`may be mounted; that is, they may be integrated into the directory hierarchy of the
`root filesystem. References to a directory that has a filesystem mounted on it are
`converted transparently by the kernel into references to the root directory of the
`mounted filesystem.
`The link system call takes the name of an existing file and another name to
create for that file. After a successful link, the file can be accessed by either
filename. A filename can be removed with the unlink system call. When the final
`name for a file is removed (and the final process that has the file open closes it),
`the file is deleted.
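These semantics can be sketched directly with the link and unlink system calls; the /tmp paths here are hypothetical. Note that the file survives the removal of its first name as long as the second name remains:

```c
#include <fcntl.h>
#include <unistd.h>

int
link_demo(void)
{
	int fd = open("/tmp/link_orig", O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return (-1);
	(void)write(fd, "x", 1);
	close(fd);
	unlink("/tmp/link_alias");		/* clear stale name; error ignored */
	if (link("/tmp/link_orig", "/tmp/link_alias") < 0)
		return (-1);			/* second name for the same file */
	if (unlink("/tmp/link_orig") < 0)	/* remove the first name... */
		return (-1);
	fd = open("/tmp/link_alias", O_RDONLY);	/* ...file persists via alias */
	if (fd < 0)
		return (-1);
	close(fd);
	unlink("/tmp/link_alias");		/* last name gone: file deleted */
	return (0);
}
```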
`Files are organized hierarchically in directories. A directory is a type of file,
but, in contrast to regular files, a directory has a structure imposed on it by the
system. A process can read a directory as it would an ordinary file, but only the
kernel is permitted to modify a directory. Directories are created by the mkdir system
call and are removed by the rmdir system call. Before 4.2BSD, the mkdir and
rmdir operations were implemented as a series of link and unlink system calls.
There were three reasons for adding system calls explicitly to create
and delete directories:
`
`
`
`
`
1. The operation could be made atomic. If the system crashed, the directory
would not be left half-constructed, as could happen when a series of link
operations were used.
`
`2. When a networked filesystem is being run, the creation and deletion of files
`and directories need to be specified atomically so that they can be serialized.
`
3. When supporting non-UNIX filesystems, such as an MS-DOS filesystem, on
another partition of the disk, the other filesystem may not support link
operations. Although other filesystems might support the concept of directories,
they probably would not create and delete the directories with links, as the
UNIX filesystem does. Consequently, they could create and delete directories
only if explicit directory create and delete requests were presented.
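The explicit calls are simple to use; each creates or removes the directory in a single atomic operation. The directory name below is a hypothetical example:

```c
#include <sys/stat.h>
#include <unistd.h>

int
mkdir_demo(void)
{
	rmdir("/tmp/mkdir_demo");	/* clear leftovers; error ignored */
	/* mkdir(2) builds the entry atomically: no half-constructed
	 * directory can survive a crash, unlike the old link-based scheme. */
	if (mkdir("/tmp/mkdir_demo", 0755) < 0)
		return (-1);
	return (rmdir("/tmp/mkdir_demo"));	/* removed atomically, too */
}
```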
`
`The chown system call sets the owner and group of a file, and chmod changes
`protection attributes. Stat applied to a filename can be used to read back such
properties of a file. The fchown, fchmod, and fstat system calls are applied to a
descriptor, instead of to a filename, to do the same set of operations. The rename
system call can be used to give a file a new name in the filesystem, replacing one
of the file's old names. Like the directory-creation and directory-deletion operations,
the rename system call was added to 4.2BSD to provide atomicity to name
changes in the local filesystem. Later, it proved useful to be able to export renaming
operations to foreign filesystems and over the network.
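A short round trip through these calls, with a hypothetical path under /tmp, shows the pattern of setting attributes, reading them back with stat, and renaming atomically:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int
attr_demo(void)
{
	struct stat sb;
	int fd = open("/tmp/attr_demo", O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return (-1);
	close(fd);
	if (chmod("/tmp/attr_demo", 0640) < 0)	/* set protection bits */
		return (-1);
	if (stat("/tmp/attr_demo", &sb) < 0)	/* read properties back */
		return (-1);
	if ((sb.st_mode & 0777) != 0640)	/* bits round-tripped? */
		return (-1);
	if (rename("/tmp/attr_demo", "/tmp/attr_demo2") < 0)
		return (-1);			/* atomic name change */
	return (unlink("/tmp/attr_demo2"));
}
```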
The truncate system call was added to 4.2BSD to allow files to be shortened
to an arbitrary offset. The call was added primarily in support of the Fortran
runtime library, which has the semantics that the end of a random-access file is
set to be wherever the program most recently accessed that file. Without the
truncate system call, the only way to shorten a file was to copy the desired part
to a new file, to delete the old file, then to rename the copy to the original
name. As well as being slow, this algorithm could fail on a
full filesystem.
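The descriptor variant, ftruncate, makes the in-place shortening easy to see; the path and sizes below are hypothetical:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

off_t
shorten(const char *path)
{
	struct stat sb;
	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

	if (fd < 0)
		return (-1);
	(void)write(fd, "0123456789", 10);	/* file is now 10 bytes */
	if (ftruncate(fd, 4) < 0) {		/* cut back to 4 bytes,
						 * no copy-and-rename dance */
		close(fd);
		return (-1);
	}
	fstat(fd, &sb);
	close(fd);
	unlink(path);
	return (sb.st_size);			/* the new length */
}
```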
`Once the filesystem had the ability to shorten files, the kernel took advantage
`of that ability to shorten large empty directories. The advantage of shortening
`empty directories is that it reduces the time spent in the kernel searching them
`when names are being created or deleted.
Newly created files are assigned the user identifier of the process that created
them and the group identifier of the directory in which they were created. A
three-level access-control mechanism is provided for the protection of files. These three
levels specify the accessibility of a file to
`
`1. The user who owns the file
`
`2. The group that owns the file
`
`3. Everyone else
`
`
`
`
`
`Each level of access has separate indicators for read permission, write permission,
`and execute permission.
Files are created with zero length, and may grow when they are written.
While a file is open, the system maintains a pointer into the file indicating the
current location in the file associated with the descriptor. This pointer can be moved
about in the file in a random-access fashion. Processes sharing a file descriptor
through a fork or dup system call share the current location pointer. Descriptors
created by separate open system calls have separate current location pointers.
Files may have holes in them. Holes are void areas in the linear extent of the file
where data have never been written. A process can create these holes by
positioning the pointer past the current end-of-file and writing. When read, holes are
treated by the system as zero-valued bytes.
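Creating and reading back a hole takes only a seek past end-of-file followed by a write; the path and hole size here are hypothetical:

```c
#include <fcntl.h>
#include <unistd.h>

int
hole_demo(const char *path)
{
	char buf[100];
	int i;
	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

	if (fd < 0)
		return (-1);
	lseek(fd, 100, SEEK_SET);	/* position past end-of-file... */
	(void)write(fd, "x", 1);	/* ...and write: bytes 0-99 are a hole */
	lseek(fd, 0, SEEK_SET);
	if (read(fd, buf, 100) != 100) { /* the hole reads back normally */
		close(fd);
		return (-1);
	}
	close(fd);
	unlink(path);
	for (i = 0; i < 100; i++)
		if (buf[i] != 0)	/* every hole byte must be zero */
			return (-1);
	return (0);
}
```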
`Earlier UNIX systems had a limit of 14 characters per filename component.
`This limitation was often a problem. For example, in addition to the natural desire
`of users to give files long descriptive names, a common way of forming filenames
`is as basename.extension, where the extension (indicating the kind of file, such as
`.c for C source or .o for intermediate binary object) is one to three characters,
`leaving 10 to 12 characters for the basename. Source-code-control systems and
`editors usually take up another two characters, either as a prefix or a suffix, for
`their purposes, leaving eight to 10 characters. It is easy to use 10 or 12 characters
`in a single English word as a basename (e.g., "multiplexer").
It is possible to keep within these limits, but it is inconvenient or even
dangerous, because other UNIX systems accept strings longer than the limit when
creating files, but then truncate to the limit. A C language source file named
multiplexer.c (already 13 characters) might have a source-code-control file with
s. prepended, producing a filename s.multiplexer that is indistinguishable from
the source-code-control file for multiplexer.ms, a file containing troff source for
documentation for the C program. The contents of the two original files could
easily get confused with no warning from the source-code-control system.
Careful coding can detect this problem, but the long filenames first introduced in
4.2BSD practically eliminate it.
`
2.8 Filestores
`
The operations defined for local filesystems are divided into two parts. Common
to all local filesystems are hierarchical naming, locking, quotas, attribute
management, and protection. These features are independent of how the data will be
stored. 4.4BSD has a single implementation to provide these semantics.
The other part of the local filesystem is the organization and management of
the data on the storage media. Laying out the contents of files on the storage
media is the responsibility of the filestore. 4.4BSD supports three different
filestore layouts:
`
`
`
`
`
`• The traditional Berkeley Fast Filesystem
`
`• The log-structured filesystem, based on the Sprite operating-system design
`[Rosenblum & Ousterhout, 1992]
`
`• A memory-based filesystem
`
Although the organizations of these filestores are completely different, these
differences are indistinguishable to the processes using the filestores.
The Fast Filesystem organizes data into cylinder groups. Files that are likely
to be accessed together, based on their locations in the filesystem hierarchy, are
stored in the same cylinder group. Files that are not expected to be accessed together
are moved into different cylinder groups. Thus, files written at the same time may
be placed far apart on the disk.
The log-structured filesystem organizes data as a log. All data being written
at any point in time are gathered together, and are written at the same disk
location. Data are never overwritten; instead, a new copy of the file is written that
replaces the old one. The old files are reclaimed by a garbage-collection process
that runs when the filesystem becomes full and additional free space is needed.
The memory-based filesystem is designed to store data in virtual memory. It
is used for filesystems that need to support fast but temporary data, such as /tmp.
The goal of the memory-based filesystem is to keep the storage packed as
compactly as possible to minimize the usage of virtual-memory resources.
`
`2.9 Network Filesystem
`
Initially, networking was used to transfer data from one machine to another. Later,
it evolved to allowing users to log in remotely to another machine. The next
logical step was to bring the data to the user, instead of having the user go to the
data, and network filesystems were born. Users working locally do not
experience the network delays on each keystroke, so they have a more responsive
environment.
Bringing the filesystem to a local machine was among the first of the major
client-server applications. The server is the remote machine that exports one or
more of its filesystems. The client is the local machine that imports those
filesystems. From the local client's point of view, a remotely mounted filesystem
appears in the file-tree name space just like any other locally mounted filesystem.
Local clients can change into directories on the remote filesystem, and can read,
write, and execute binaries within that remote filesystem identically to the way
that they can do these operations on a local filesystem.
`When the local client does an operation on a remote filesystem, the request is
`packaged and is sent to the server. The server does the requested operation and
`returns either the requested information or an error indicating why the request was
`
`
`
`
`196
`
Chapter 6
`
I/O System Overview
`
`Interrupt Handling
`
Interrupts are generated by devices to signal that an operation has completed or
that a change in status has occurred. On receiving a device interrupt, the system
invokes the appropriate device-driver interrupt service routine with one or more
parameters that identify uniquely the device that requires service. These
parameters are needed because device drivers typically support multiple devices of the
same type. If the interrupting device's identity were not supplied with each
interrupt, the driver would be forced to poll all the potential devices to identify the
device that interrupted.
The system arranges for the unit-number parameter to be passed to the
interrupt service routine for each device by installing the address of an auxiliary glue
routine in the interrupt-vector table. This glue routine, rather than the actual
interrupt service routine, is invoked to service the interrupt; it takes the following
actions:
`
`1. Save all volatile registers.
`
`2. Update statistics on device interrupts.
`
`3. Call the interrupt service routine with the appropriate unit number parameter.
`
`4. Restore the volatile registers saved in step 1.
`
`5. Return from the interrupt.
`
`Because a glue routine is interposed between the interrupt-vector table and the
`interrupt service routine, device drivers do not need to be concerned with saving
`and restoring machine state. In addition, special-purpose instructions that cannot
`be generated from C, which are needed by the hardware to support interrupts, can
`be kept out of the device driver; this interposition of a glue routine permits device
`drivers to be written without assembly language.
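The dispatch part of the glue scheme can be sketched in ordinary C. This is a userspace illustration, not kernel code: the vector table pairs each interrupt source with a handler and a unit number, so one C routine can serve several identical devices. The register save/restore steps, which require assembly on real hardware, are elided, and all names here are hypothetical:

```c
#include <assert.h>

typedef void (*intr_handler_t)(int unit);

struct glue {
	intr_handler_t	handler;	/* driver's interrupt service routine */
	int		unit;		/* which instance of the device */
	long		count;		/* statistic: interrupts seen */
};

static struct glue vector_table[8];	/* one entry per interrupt vector */

static int last_unit = -1;		/* records which unit was serviced */

static void
sample_isr(int unit)
{
	last_unit = unit;		/* a trivial stand-in driver ISR */
}

/* Steps 2 and 3 of the glue routine: bump the statistics counter,
 * then call the ISR with the unit number for this vector. */
void
glue_dispatch(int vector)
{
	struct glue *g = &vector_table[vector];

	g->count++;
	g->handler(g->unit);
}
```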
`
`6.2 Block Devices
`Block devices include disks and tapes. The task of the block-device interface is to
`convert from the user abstraction of a disk as an array of bytes to the structure
`imposed by the underlying physical medium. Although the user may wish to
`write a single byte to a disk, the hardware can read and write only in multiples of
`sectors. Hence, the system must arrange to read in the sector containing the byte
`to be modified, to replace the affected byte, and to write back the sector to the
disk. This operation of converting random access to an array of bytes to reads and
writes of disk sectors is known as block I/O. Block devices are accessible directly
through appropriate device special files, but are more commonly accessed
indirectly through the filesystem (see Section 8.2).
`Processes may read data in sizes smaller than a disk block. The first time that
`a small read is required from a particular disk block, the block will be transferred
`
`
`
`
`
`from the disk into a kernel buffer. Later reads of parts of the same block then
`require only copying from the kernel buffer to the memory of the user process.
`Multiple small writes are treated similarly. A buffer is allocated from the cache
`when the first write to a disk block is made, and later writes to part of the same
`block are then likely to require only copying into the kernel buffer, and no disk I/O.
`In addition to providing the abstraction of arbitrary alignment of reads and
`writes, the block buffer cache reduces the number of disk I/O transfers required by
`filesystem accesses. Because system-parameter files, commands, and directories
`are read repeatedly, their data blocks are usually in the buffer cache when they are
`needed. Thus, the kernel does not need to read them from the disk every time that
`they are requested.
If the system crashes while data for a particular block are in the cache but
have not yet been written to disk, the filesystem on the disk will be incorrect and
those data will be lost. (Critical system data, such as the contents of directories,
however, are written synchronously to disk, to ensure filesystem consistency;
operations requiring synchronous I/O are described in the last subsection of
Section 8.2.) So that lost data are minimized, writes are forced periodically for dirty
buffer blocks. These forced writes are done (usually every 30 seconds) by a user
process, update, which uses the sync system call. There is also a system call,
fsync, that a process can use to force all dirty blocks of a single file to be written to
disk immediately; this synchronization is useful for ensuring database consistency
or before removing an editor backup file.
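The fsync pattern looks like the following sketch: write the data, which lands in the buffer cache, then force the dirty blocks out before relying on them being on disk. The path is a hypothetical example:

```c
#include <fcntl.h>
#include <unistd.h>

int
durable_write(const char *path, const char *msg, int len)
{
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return (-1);
	if (write(fd, msg, len) != len) {	/* data sits in buffer cache */
		close(fd);
		return (-1);
	}
	if (fsync(fd) < 0) {	/* force this file's dirty blocks to disk now,
				 * rather than waiting for update's next sync */
		close(fd);
		return (-1);
	}
	return (close(fd));
}
```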
Most magnetic-tape accesses are done through the appropriate raw tape
device, bypassing the block buffer cache. When the cache is used, tape blocks must
still be written in order, so the tape driver forces synchronous writes for them.
`
`Entry Points for Block-Device Drivers
`
`Device drivers for block devices are described by an entry in the bdevsw table.
`Each bdevsw structure contains the following entry points:
`
`open
`
Open the device in preparation for I/O operations. A device's open
entry point will be called for each open system call on a block special
device file, or, internally, when a device is prepared for mounting a
filesystem with the mount system call. The open() routine will
commonly verify the integrity of the associated medium. For example, it
will verify that the device was identified during the autoconfiguration
phase and, for tape and disk drives, that a medium is present and
online.
`
`strategy Start a read or write operation, and return immediately. I/O requests to
`or from filesystems located on a device are translated by the system
`into calls to the block I/O routines bread() and bwrite(). These block
`I/O routines in turn call the device's strategy routine to read or write
`data not in the cache. Each call to the strategy routine specifies a
`pointer to a buf structure containing the parameters for an I/O request.
`
`
`
`
If the request is synchronous, the caller must sleep (on the address of
the buf structure) until I/O completes.

close

Close a device. The close() routine is called after the final client
interested in using the device terminates. These semantics are defined by
the higher-level I/O facilities. Disk devices have nothing to do when a
device is closed, and thus use a null close() routine. Devices that
support access to only a single client must mark the device as available
once again. Closing a tape drive that was open for writing typically
causes end-of-file marks to be written on the tape and the tape to be
rewound.

dump

Write all physical memory to the device. The dump entry point saves
the contents of memory on secondary storage. The system automatically
takes a dump when it detects an unrecoverable error and is about
to crash. The dump is used in a postmortem analysis of the problem
that caused the system to crash. The dump routine is invoked with the
processor priority at its highest level; thus, the device driver must poll
for device status, rather than wait for interrupts. All disk devices are
expected to support this entry point; some tape devices do as well.

psize

Return the size of a disk-drive partition. The driver is supplied a logical
unit and is expected to return the size of that unit, typically a disk-drive
partition, in DEV_BSIZE blocks. This entry point is used during
the bootstrap procedure to calculate the location at which a crash dump
should be placed and to determine the sizes of the swap devices.
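The shape of a bdevsw entry can be sketched as a struct of function pointers, one row per block-device driver, indexed by major device number. This is a simplified userspace illustration; the real 4.4BSD structure has more members (including dump) and different signatures, and all names below are stand-ins:

```c
#include <assert.h>

struct buf_s {
	int resid;		/* bytes not yet transferred */
};

/* One switch entry: the per-driver entry points described above. */
struct bdevsw_s {
	int	(*d_open)(int dev);
	void	(*d_strategy)(struct buf_s *bp);
	int	(*d_close)(int dev);
	int	(*d_psize)(int dev);
};

/* Trivial stand-in driver: opens always succeed, strategy "completes"
 * the transfer immediately, and the partition is 1024 blocks. */
static int  null_open(int dev)  { (void)dev; return (0); }
static void null_strategy(struct buf_s *bp) { bp->resid = 0; }
static int  null_close(int dev) { (void)dev; return (0); }
static int  null_psize(int dev) { (void)dev; return (1024); }

static struct bdevsw_s bdevsw_sketch[] = {
	{ null_open, null_strategy, null_close, null_psize },
};
```

The kernel dispatches by major number, e.g. `bdevsw_sketch[major].d_strategy(bp)`, so drivers plug in without the block I/O layer knowing their internals.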
`
Sorting of Disk I/O Requests

The kernel provides a generic disksort() routine that can be used by all the disk
device drivers to sort I/O requests into a drive's request queue using an elevator
sorting algorithm. This algorithm sorts requests in a cyclic, ascending, cylinder
order, so that requests can be serviced with a minimal number of one-way scans
over the drive. This ordering was originally designed to support the normal
read-ahead requested by the filesystem as well as to counteract the filesystem's random
placement of data on a drive. With the improved placement algorithms in the
current filesystem, the effect of the disksort() routine is less noticeable; disksort()
produces the largest effect when there are multiple simultaneous users of a drive.
`The disksort() algorithm is shown in Fig. 6.2. A drive's request queue is
`made up of one or two lists of requests ordered by cylinder number. The request
`at the front of the first list indicates the current position of the drive. If a second
`list is present, it is made up of requests that lie before the current position. Each
`new request is sorted into either the first or the second list, according to the
`request's location. When the heads reach the end of the first list, the drive begins
`servicing the other list.
`Disk sorting can also be important on machines that have a fast processor, but
`that do not sort requests within the device driver. In this situation, if a write of
`
`
`
`
`
disksort(dq, bp)
	drive queue *dq;
	buffer *bp;
{
	if (drive queue is empty) {
		place the buffer at the front of the drive queue;
		return;
	}
	if (request lies before the first active request) {
		locate the beginning of the second request list;
		sort bp into the second request list;
	} else
		sort bp into the current request list;
}

Figure 6.2 Algorithm for disksort().
`
several Kbyte is honored in order of queueing, it can block other processes from
accessing the disk while it completes. Sorting requests provides some scheduling,
which more fairly distributes accesses to the disk controller.
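The two-list elevator insertion of Fig. 6.2 can be made concrete with a small sketch keyed on cylinder number alone. This is a userspace illustration under simplifying assumptions (two explicit lists rather than one chained queue, and none of the real buf state):

```c
#include <stddef.h>

struct breq {
	int		cylinder;	/* target cylinder of this request */
	struct breq	*next;
};

struct drivequeue {
	struct breq	*active;	/* requests at/after the heads */
	struct breq	*next_pass;	/* requests behind the heads */
	int		headpos;	/* cylinder currently under the heads */
};

/* Keep each list in ascending cylinder order. */
static void
insert_sorted(struct breq **listp, struct breq *bp)
{
	while (*listp != NULL && (*listp)->cylinder <= bp->cylinder)
		listp = &(*listp)->next;
	bp->next = *listp;
	*listp = bp;
}

/* Requests ahead of the heads join the current sweep; requests that
 * lie before the current position wait for the next one-way scan. */
void
disksort_sketch(struct drivequeue *dq, struct breq *bp)
{
	if (bp->cylinder >= dq->headpos)
		insert_sorted(&dq->active, bp);
	else
		insert_sorted(&dq->next_pass, bp);
}
```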
`
`Disk Labels
`
Many disk controllers require the device driver to identify the location of disk
sectors that are to be transferred by their cylinder, track, and rotational offset. For
maximum throughput efficiency, this information is also needed by the filesystem
when deciding how to lay out files. Finally, a disk may be broken up into several
partitions, each of which may be used for a separate filesystem or swap area.
Historically, the information about the geometry of the disk and about the
layout of the partitions was compiled into the kernel device drivers. This approach
had several flaws. First, it was cumbersome to have all the possible disk types and
partitions compiled into the kernel. Any time that a disk with a new geometry was
added, the driver tables had to be updated and the kernel recompiled. It was also
restrictive in that there was only one choice of partition table for each drive type.
Choosing a different set of tables required modifying the disk driver and
rebuilding the kernel. Installing new tables also required dumping all the disks of that
type on the system, then booting the new kernel and restoring them onto the new
partitions. Disks with different partition layouts could not be moved from one
system to another. An additional problem arose when nonstandard partition tables
were used; new releases from the vendor had to have the partition tables modified
before they could be used on an existing system.
`For all these reasons, 4.4BSD and most commercial UNIX vendors added disk
`labels. A disk label contains detailed geometry information, including cylinder,
`track, and sector layout, along with any other driver-specific information. It also
`contains information about the partition layout and usage, the latter describing
`
`
`
`
`
partition usage: type of filesystem, swap partition, or unused. For the fast
filesystem, the partition usage contains enough additional information to enable
the filesystem check program (fsck) to locate the alternate superblocks for the
filesystem.
Having labels on each disk means that partition information can be different
for each disk, and that it carries over when the disk is moved from one system to
another. It also means that, when previously unknown types of disks are
connected to the system, the system administrator can use them without changing the
disk driver, recompiling, and rebooting the system.
The label is located near the beginning of each drive, usually in block zero.
It must be located in the first track, because the device driver does not know the
geometry of the disk until the driver has read the label. Thus, it must assume that
the label is in cylinder zero, track zero, at some valid offset within that track.
Most architectures have hardware (or first-level) bootstrap code stored in
read-only memory (ROM). When the machine is powered up or the reset button is
pressed, the CPU executes the hardware bootstrap code from the ROM. The
hardware bootstrap code typically reads the first few sectors on the disk into the main
`memory, t