Distributed File Systems


        Two main purposes of using files:

1.     Permanent storage of information on a secondary storage media.

2.     Sharing of information between applications.


        A file system is a subsystem of the operating system that performs file management activities such as organization, storing, retrieval, naming, sharing, and protection of files.


        A file system frees the programmer from concerns about the details of space allocation and layout of the secondary storage device.


The design and implementation of a distributed file system is more complex than a conventional file system due to the fact that the users and storage devices are physically dispersed.



        A distributed file system normally supports the following:

1.     Remote information sharing

A file can thus be accessed from any node, irrespective of its physical location.


2.     User mobility

User should be permitted to work on different nodes.


3.     Availability

For better fault-tolerance, files should be available for use even in the event of temporary failure of one or more nodes of the system. Thus the system should maintain multiple copies of the files, the existence of which should be transparent to the user.


4.     Diskless workstations

A distributed file system, with its transparent remote-file accessing capability, allows the use of diskless workstations in a system.


        A distributed file system provides the following types of services:


1.     Storage service

Allocation and management of space on a secondary storage device thus providing a logical view of the storage system.


2.     True file service

Includes file-sharing semantics, file-caching mechanism, file replication mechanism, concurrency control, multiple copy update protocol etc.


3.     Name/Directory service

Responsible for directory related activities such as creation and deletion of directories, adding a new file to a directory, deleting a file from a directory, changing the name of a file, moving a file from one directory to another etc.

(This chapter deals with the design and implementation issues of the true file service component of distributed file systems).


        Desirable features of a distributed file system:


1.     Transparency

-         Structure transparency

Clients should not know the number or locations of file servers and storage devices. Note: multiple file servers are provided for performance, scalability, and reliability.

-         Access transparency

Both local and remote files should be accessible in the same way. The file system should automatically locate an accessed file and transport it to the client's site.

-         Naming transparency

The name of the file should give no hint as to the location of the file. The name of the file must not be changed when moving from one node to another.

-         Replication transparency

If a file is replicated on multiple nodes, both the existence of multiple copies and their locations should be hidden from the clients.


2.     User mobility

Automatically bring the user's environment (e.g. the user's home directory) to the node where the user logs in.


3.     Performance

Performance is measured as the average amount of time needed to satisfy client requests. This time includes CPU time + time for accessing secondary storage + network access time. It is desirable that the performance of a distributed file system be comparable to that of a centralized file system.

4.     Simplicity and ease of use

The user interface to the file system should be simple and the number of commands should be as small as possible.


5.     Scalability

Growth of nodes and users should not seriously disrupt service.


6.     High availability

A distributed file system should continue to function in the face of partial failures such as a link failure, a node failure, or a storage device crash.

A highly reliable and scalable distributed file system should have multiple and independent file servers controlling multiple and independent storage devices.


7.     High reliability

Probability of loss of stored data should be minimized. System should automatically generate backup copies of critical files.


8.     Data integrity

Concurrent access requests from multiple users who are competing to access the file must be properly synchronized by the use of some form of concurrency control mechanism. Atomic transactions can also be provided.
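As a minimal illustration of the kind of synchronization meant here, a per-file lock serializing concurrent updates (a local sketch, not a distributed concurrency control mechanism):

```python
import threading

class SynchronizedFile:
    """Serializes concurrent appends to shared file data with a per-file lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []

    def append(self, record):
        with self._lock:            # only one writer mutates the data at a time
            self.records.append(record)

f = SynchronizedFile()
threads = [threading.Thread(target=f.append, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock held around each mutation, no concurrent append is lost.
```

A real distributed file system would enforce this at the server or through a lock manager rather than in a single client's address space.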


9.     Security

Users should be confident of the privacy of their data.


10. Heterogeneity

There should be easy access to shared data on diverse platforms (e.g. Unix workstation, Wintel platform etc).


        File Models


1.     Unstructured and structured files

In the unstructured model, a file is an unstructured sequence of bytes. The interpretation of the meaning and structure of the data stored in the files is up to the application (e.g. UNIX and MS-DOS). Most modern operating systems use the unstructured file model.


In structured files (rarely used now) a file appears to the file server as an ordered sequence of records. Records of different files of the same file system can be of different sizes.

2.     Mutable and immutable files

Based on the modifiability criteria, files are of two types, mutable and immutable. Most existing operating systems use the mutable file model. An update performed on a file overwrites its old contents to produce the new contents.


In the immutable model, rather than updating the same file, a new version of the file is created each time a change is made to the file contents and the old version is retained unchanged. The problems in this model are increased use of disk space and increased disk activity.
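The immutable model can be sketched as a versioned store in which every update creates and retains a new version (the class and method names are hypothetical):

```python
class ImmutableFileStore:
    """Each update creates a new, retained version of the file (hypothetical sketch)."""

    def __init__(self):
        self._versions = {}  # name -> list of byte strings, one per version

    def create(self, name, data):
        self._versions[name] = [data]

    def update(self, name, data):
        # Never overwrite: append a new version, keep the old ones unchanged.
        self._versions[name].append(data)

    def read(self, name, version=-1):
        # By default read the latest version; older versions remain accessible.
        return self._versions[name][version]

store = ImmutableFileStore()
store.create("notes.txt", b"A, B")
store.update("notes.txt", b"A, B, C")   # old version is retained, not overwritten
```

The growing `_versions` lists make the model's drawback concrete: disk space and disk activity increase with every change.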


        File Accessing Models

This depends on the method used for accessing remote files and the unit of data access.


1. Accessing remote files

A distributed file system may use one of the following models to service a client's file access request when the accessed file is remote:


a. Remote service model

Processing of a client's request is performed at the server's node. Thus, the client's request for file access is delivered across the network as a message to the server, the server machine performs the access request, and the result is sent to the client. The design needs to minimize the number of messages sent and the overhead per message.


b. Data-caching model

This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node. It takes advantage of the locality found in file accesses. A replacement policy such as LRU (least recently used) is used to keep the cache size bounded.
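A minimal sketch of such a client-side cache with LRU replacement (the `fetch_from_server` callback is a hypothetical stand-in for a request to the server):

```python
from collections import OrderedDict

class FileDataCache:
    """Bounded client-side cache with LRU replacement (illustrative sketch)."""

    def __init__(self, fetch_from_server, capacity=3):
        self._fetch = fetch_from_server   # callback simulating a server round trip
        self._capacity = capacity
        self._cache = OrderedDict()       # key -> data, ordered by recency

    def read(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # cache hit: mark as most recently used
            return self._cache[key]
        data = self._fetch(key)           # cache miss: go to the server
        self._cache[key] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return data
```

Each miss costs one server access; repeated reads of the same data are served locally, which is exactly the locality the model exploits.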


While this model reduces network traffic it has to deal with the cache coherency problem during writes, because the local cached copy of the data needs to be updated, the original file at the server node needs to be updated and copies in any other caches need to be updated.


Advantage of Data-caching model over the Remote service model:


The data-caching model offers the possibility of increased performance and greater system scalability because it reduces network traffic, contention for the network, and contention for the file servers. Hence almost all distributed file systems implement some form of caching.


For example, NFS uses the remote service model but adds caching for better performance.





2. Unit of Data Transfer

In file systems that use the data-caching model, an important design issue is to decide the unit of data transfer. This refers to the fraction of a file that is transferred to and from clients as a result of a single read or write operation.


File-level transfer model

In this model, when file data is to be transferred, the entire file is moved. Advantages: the file needs to be transferred only once in response to a client request, which is more efficient than transferring it page by page with the attendant network protocol overhead. It reduces server load and network traffic since the server is accessed only once per file, giving better scalability. Also, once the entire file is cached at the client site, it is immune to server and network failures.


Disadvantage: requires sufficient storage space on the client machine. This approach fails for very large files, especially when the client runs on a diskless workstation. If only a small fraction of a file is needed, moving the entire file is wasteful.


Block-level transfer model

File transfer takes place in file blocks. A file block is a contiguous portion of a file and is of fixed length (it can also be equal to a virtual memory page size).

Advantages: Does not require client nodes to have large storage space. It eliminates the need to copy an entire file when only a small portion of the data is needed.


Disadvantages: When an entire file is to be accessed, multiple server requests are needed, resulting in more network traffic and more network protocol overhead. NFS uses the block-level transfer model.
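As a sketch of how a byte-range read maps onto block requests (the block size and the `fetch_block` callback are illustrative assumptions, not the actual NFS interface):

```python
BLOCK_SIZE = 4096  # illustrative fixed block length

def read_range(fetch_block, offset, length):
    """Satisfy a byte-range read by fetching whole fixed-size blocks.

    fetch_block(block_no) stands in for one server request returning
    BLOCK_SIZE bytes (a hypothetical transport, not a real NFS call).
    """
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    data = b"".join(fetch_block(n) for n in range(first, last + 1))
    start = offset - first * BLOCK_SIZE   # trim to the requested range
    return data[start:start + length]
```

Note how a read that spans block boundaries issues one request per block, which is the extra traffic the disadvantage above refers to.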


Byte-level transfer model

Unit of transfer is a byte. Model provides maximum flexibility because it allows storage and retrieval of an arbitrary amount of a file, specified by an offset within a file and length. Drawback is that cache management is harder due to the variable-length data for different access requests.


Record-level transfer model

This model is used with structured files and the unit of transfer is the record.


        File-Sharing Semantics

Multiple users may access a shared file simultaneously. An important design issue for any file system is to define when modifications of file data made by a user are observable by other users.


UNIX semantics:

This enforces an absolute time ordering on all operations and ensures that every read operation on a file sees the effects of all previous write operations performed on that file.


Example timeline:

t1: original file contents: A, B
t2: Append C -> new file contents: A, B, C
t3: Read     -> retrieved file contents: A, B, C
t4: Append D -> new file contents: A, B, C, D
t5: Append E -> new file contents: A, B, C, D, E
t6: Read     -> retrieved file contents: A, B, C, D, E


UNIX File-sharing semantics



UNIX semantics is implemented in file systems for single-CPU systems because it is the most desirable semantics and because it is easy to serialize all read/write requests. Implementing UNIX semantics in a distributed file system is not easy. One might think it could be achieved by disallowing files to be cached at client nodes and having each shared file managed by only one file server, which processes all read and write requests for the file strictly in the order in which it receives them. However, even with this approach there is a possibility that, due to network delays, client requests from different nodes arrive and get processed at the server node in an order different from the order in which the requests were actually made.


Also, having all file access requests processed by a single server and disallowing caching on client nodes is not desirable in practice due to poor performance, poor scalability, and poor reliability of the distributed file system.

Hence distributed file systems implement a more relaxed semantics of file sharing. Applications that need to guarantee UNIX semantics should provide mechanisms (e.g. mutex lock etc) themselves and not rely on the underlying semantics of sharing provided by the file system.


        File Caching Schemes


Every distributed file system uses some form of caching. The reasons are:


1. Better performance, since repeated accesses to the same information are served from the cache, avoiding additional network accesses and disk transfers. This works because of locality in file access patterns.

2. It contributes to the scalability and reliability of the distributed file system, since data can be cached on the client node, away from the server.

Key decisions to be made in file-caching scheme for distributed systems:


1.     Cache location

2.     Modification Propagation

3.     Cache Validation


Cache Location


This refers to the place where the cached data is stored. Assuming that the original location of a file is on its server's disk, there are three possible cache locations in a distributed file system:


1.     Serverís main memory

In this case a cache hit costs one network access.

It does not contribute to the scalability and reliability of the distributed file system, since every cache hit requires accessing the server.

Advantages:



a.      Easy to implement

b.     Totally transparent to clients

c.     Easy to keep the original file and the cached data consistent.


2.     Clientís disk

In this case a cache hit costs one disk access. This is somewhat slower than having the cache in the server's main memory. Having the cache in the server's main memory is also simpler.



Advantages:

a.      Provides reliability against crashes, since modifications to cached data are not lost in a client crash (as they would be if the cache were kept in main memory).

b.     Large storage capacity.

c.     Contributes to scalability and reliability because on a cache hit the access request can be serviced locally without the need to contact the server.


3.     Clientís main memory

Eliminates both the network access cost and the disk access cost. This technique is not preferred to a client's disk cache when a large cache size and increased reliability of cached data are desired.



Advantages:

a. Maximum performance gain.

b. Permits workstations to be diskless.

c. Contributes to reliability and scalability.


Modification Propagation


When the cache is located on clients' nodes, a file's data may simultaneously be cached on multiple nodes. Caches can become inconsistent when the file data is changed by one of the clients and the corresponding data cached at other nodes is not changed or discarded.


There are two design issues involved:


1.     When to propagate modifications made to cached data to the corresponding file server.

2.     How to verify the validity of cached data.


The modification propagation scheme used has a critical effect on the system's performance and reliability. Techniques used include:


a. Write-through scheme.

When a cache entry is modified, the new value is immediately sent to the server for updating the master copy of the file.



Advantage: a high degree of reliability and suitability for UNIX-like semantics.

This is because the risk of updated data being lost in the event of a client crash is very low, since every modification is immediately propagated to the server holding the master copy.



Disadvantage: this scheme is only suitable where the ratio of read-to-write accesses is fairly large, and it does not reduce network traffic for writes.

This is because every write access has to wait until the data is written to the master copy on the server. Hence the advantages of data caching apply only to read accesses, because the server is involved in all write accesses.
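A minimal sketch of the write-through idea (the `Server` class is a hypothetical stand-in for the node holding the master copy):

```python
class Server:
    """Illustrative stand-in for the file server holding the master copy."""
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store[key]

class WriteThroughCache:
    """Every write updates the local cache and the master copy immediately."""

    def __init__(self, server):
        self._server = server
        self._cache = {}

    def write(self, key, value):
        self._cache[key] = value
        self._server.put(key, value)   # propagate before the write "completes"

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._server.get(key)
        return self._cache[key]
```

The cost is visible in `write`: the client waits for a server round trip on every write, so only reads benefit from the cache.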



b. Delayed-write scheme.

To reduce network traffic for writes, the delayed-write scheme is used. In this case, the new data value is only written to the cache, and all updated cache entries are sent to the server at a later time.


There are three commonly used delayed-write approaches:

i.        Write on ejection from cache

Modified data in the cache is sent to the server only when the cache-replacement policy has decided to eject it from the client's cache. This can result in good performance, but there can be a reliability problem since some server data may be outdated for a long time.


ii. Periodic write

The cache is scanned periodically and any cached data that has been modified since the last scan is sent to the server.


iii. Write on close

Modification to cached data is sent to the server when the client closes the file. This does not help much in reducing network traffic for those files that are open for very short periods or are rarely modified.
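The delayed-write idea, with `flush()` standing in for any of the three triggers above (ejection, the periodic scan, or file close), can be sketched as:

```python
class DelayedWriteCache:
    """Writes are buffered locally; dirty entries are sent to the server on flush.

    The server object is an illustrative stand-in and only needs a
    put(key, value) method.
    """

    def __init__(self, server):
        self._server = server
        self._cache = {}
        self._dirty = set()   # keys modified since the last flush

    def write(self, key, value):
        self._cache[key] = value
        self._dirty.add(key)          # no network traffic yet

    def delete(self, key):
        # Short-lived data may die in the cache and never reach the server.
        self._cache.pop(key, None)
        self._dirty.discard(key)

    def flush(self):
        for key in self._dirty:       # batch all pending updates together
            self._server.put(key, self._cache[key])
        self._dirty.clear()
```

Note how `delete` lets temporary data vanish without ever being propagated, and how `flush` gathers all pending updates into one batch.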


Advantages of delayed-write scheme:

1.     Write accesses complete more quickly because the new value is written only to the client cache. This results in a performance gain.


2.     Modified data may be deleted before it is time to send it to the server (e.g. temporary data). Since such modifications need not be propagated to the server, this results in a major performance gain.


3.     Gathering of all file updates and sending them together to the server is more efficient than sending each update separately.




Disadvantage of delayed-write scheme:

Reliability can be a problem, since modifications not yet sent to the server from a client's cache will be lost if the client crashes.


Cache Validation schemes


The modification propagation policy only specifies when the master copy of a file on the server node is updated upon modification of a cache entry. It does not tell anything about when the file data residing in the cache of other nodes is updated.


File data may simultaneously reside in the caches of multiple nodes. A client's cache entry becomes stale as soon as some other client modifies the data corresponding to that cache entry in the master copy of the file on the server.


It becomes necessary to verify if the data cached at a client node is consistent with the master copy. If not, the cached data must be invalidated and the updated version of the data must be fetched again from the server.


There are two approaches to verify the validity of cached data: the client-initiated approach and the server-initiated approach.


Client-initiated approach

The client contacts the server and checks whether its locally cached data is consistent with the master copy. Two approaches may be used:


1.     Checking before every access.

This defeats the purpose of caching because the server needs to be contacted on every access.

2.     Periodic checking.

A check is initiated every fixed interval of time.
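Periodic checking can be sketched with version numbers (timestamps would serve equally well); the server interface used here is an assumption for illustration:

```python
import time

class ValidatingCache:
    """Client-side cache that revalidates entries at a fixed interval.

    The server is an illustrative stand-in exposing:
      get(key)     -> (data, version)
      version(key) -> version
    """

    def __init__(self, server, check_interval=30.0):
        self._server = server
        self._interval = check_interval       # seconds between validity checks
        self._cache = {}                      # key -> (data, version, last_checked)

    def read(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key in self._cache:
            data, version, checked = self._cache[key]
            if now - checked < self._interval:
                return data                   # trust the cached copy for now
            if self._server.version(key) == version:
                self._cache[key] = (data, version, now)   # still valid
                return data
            # Stale: fall through and refetch the updated data.
        data, version = self._server.get(key)
        self._cache[key] = (data, version, now)
        return data
```

Lowering `check_interval` tightens consistency but generates exactly the extra network traffic and server load described below.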


Disadvantage of client-initiated approach: If frequency of the validity check is high, the cache validation approach generates a large amount of network traffic and consumes precious server CPU cycles.


Server-initiated approach

A client informs the file server when opening a file, indicating whether a file is being opened for reading, writing, or both. The file server keeps a record of which client has which file open and in what mode.

Thus the server monitors the file usage modes being used by different clients and reacts whenever it detects a potential for inconsistency. E.g. if a file is open for reading, other clients may be allowed to open it for reading, but opening it for writing cannot be allowed. Similarly, a new client cannot open a file in any mode if the file is already open for writing.


When a client closes a file, it notifies the server, along with any modifications made to the file. The server then updates its record of which client has which file open in which mode.


When a new client makes a request to open an already open file and if the server finds that the new open mode conflicts with the already open mode, the server can deny the request, queue the request, or disable caching by asking all clients having the file open to remove that file from their caches.
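The server's bookkeeping and conflict check can be sketched as follows; the deny-on-conflict policy shown is just one of the three reactions mentioned above:

```python
class FileServerState:
    """Stateful server tracking which client has which file open in which mode."""

    def __init__(self):
        self._open = {}   # filename -> list of (client_id, mode)

    def open(self, client, name, mode):
        assert mode in ("read", "write")
        holders = self._open.setdefault(name, [])
        # A writer conflicts with everyone; a reader conflicts only with writers.
        conflict = any(m == "write" or mode == "write" for _, m in holders)
        if conflict:
            return False   # a real server might instead queue or disable caching
        holders.append((client, mode))
        return True

    def close(self, client, name):
        # Drop the client's entry from the record of open files.
        self._open[name] = [(c, m) for c, m in self._open.get(name, []) if c != client]
```

This per-file record is precisely the state that makes such servers stateful, which is the disadvantage noted below.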


Note: On the web, the cache is used in read-only mode so cache validation is not an issue.


Disadvantage: It requires that file servers be stateful. Stateful file servers have a distinct disadvantage over stateless file servers in the event of a failure.


        File Replication


High availability is a desirable feature of a good distributed file system and file replication is the primary mechanism for improving file availability.


A replicated file is a file that has multiple copies, with each copy on a separate file server.


Difference Between Replication and Caching


1.     A replica of a file is associated with a server, whereas a cached copy is normally associated with a client.

2.     The existence of a cached copy is primarily dependent on the locality in file access patterns, whereas the existence of a replica normally depends on availability and performance requirements.

3.     As compared to a cached copy, a replica is more persistent, widely known, secure, available, complete, and accurate.

4.     A cached copy is contingent upon a replica. Only by periodic revalidation with respect to a replica can a cached copy be useful.



Advantages of Replication

1.     Increased Availability.

Alternate copies of replicated data can be used when the primary copy is unavailable.


2.     Increased Reliability.

Due to the presence of redundant data files in the system, recovery from catastrophic failures (e.g. hard drive crash) becomes possible.


3.     Improved response time.

It enables data to be accessed either locally or from a node whose access time is lower than the access time of the primary copy.


4.     Reduced network traffic.

If a file's replica is available with a file server that resides on a client's node, the client's access request can be serviced locally, resulting in reduced network traffic.


5.     Improved system throughput.

Several clients' requests for access to a file can be serviced in parallel by different servers, resulting in improved system throughput.


6.     Better scalability.

Multiple file servers are available to service client requests due to file replication. This improves scalability.


Replication Transparency


Replication of files should be transparent to the users so that multiple copies of a replicated file appear as a single logical file to its users. This calls for the assignment of a single identifier/name to all replicas of a file.


In addition, replication control should be transparent, i.e., the number and locations of replicas of a replicated file should be hidden from the user. Thus replication control must be handled automatically in a user-transparent manner.


Multicopy Update Problem

Maintaining consistency among copies when a replicated file is updated is a major design issue of a distributed file system that supports file replication.


1.     Read-only replication

In this case the update problem does not arise, since the replicas are never modified. However, this method is too restrictive.


2.     Read-Any-Write-All Protocol

A read operation on a replicated file is performed by reading any copy of the file and a write operation by writing to all copies of the file. Before updating any copy, all copies need to be locked, then they are updated, and finally the locks are released to complete the write.


Disadvantage: A write operation cannot be performed if any of the servers having a copy of the replicated file is down at the time of the write operation.
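A minimal sketch of the protocol (the `Replica` objects are illustrative):

```python
import random

class Replica:
    """One copy of a replicated file on one server (illustrative)."""
    def __init__(self):
        self.data = None
        self.up = True

def read_any(replicas):
    # Any single available copy suffices for a read.
    candidates = [r for r in replicas if r.up]
    return random.choice(candidates).data

def write_all(replicas, value):
    # All copies must be reachable: one down server fails the whole write.
    if not all(r.up for r in replicas):
        raise RuntimeError("write blocked: a replica server is down")
    for r in replicas:       # in a real system: lock all, update all, unlock all
        r.data = value
```

The check in `write_all` is the disadvantage above made concrete: reads survive server failures, writes do not.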


3.     Available-Copies Protocol

A read operation on a replicated file is performed by reading any copy of the file and a write operation by writing to all available copies of the file. Thus if a file server with a replica is down, its copy is not updated. When the server recovers after a failure, it brings itself up to date by copying from other servers before accepting any user request.


4.     Primary-Copy Protocol

For each replicated file, one copy is designated as the primary copy and all the others are secondary copies. Read operations can be performed using any copy, primary or secondary. But write operations are performed only on the primary copy. Each server having a secondary copy updates its copy either by receiving notification of changes from the server having the primary copy or by requesting the updated copy from it.


E.g. for UNIX-like semantics, when the primary-copy server receives an update request, it immediately orders all the secondary-copy servers to update their copies. Some form of locking is used and the write operation completes only when all the copies have been updated. In this case, the primary-copy protocol is simply another method of implementing the read-any-write-all protocol.
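A minimal sketch of a primary-copy arrangement with immediate propagation to the secondaries (names are illustrative):

```python
class PrimaryCopy:
    """Writes go only to the primary, which pushes updates to all secondaries."""

    def __init__(self, n_secondaries):
        self.primary = {}
        self.secondaries = [dict() for _ in range(n_secondaries)]

    def write(self, key, value):
        self.primary[key] = value
        # Immediate propagation: order every secondary to update its copy
        # before the write completes (the UNIX-semantics variant above).
        for s in self.secondaries:
            s[key] = value

    def read(self, key, copy=None):
        # Reads may use any copy, primary or secondary.
        source = self.primary if copy is None else self.secondaries[copy]
        return source[key]
```

A lazier variant would let secondaries pull updates on their own schedule, trading consistency for write latency.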