NFS File Locking

Getting the Protocol Specs

NFS file locking is mainly provided by the RPC-based Network Lock Manager (NLM) service. The NLM spec is not freely available; you must fork over some 45 British Pounds, including shipping, to X/Open Ltd. to get it. The document is titled
	X/Open CAE Specification
	Protocols for X/Open Networking: XNFS, Issue 4

	X/Open Document Number: C218
	Order from: xospecs@xopen.co.uk
If you're just curious, here's a brief overview of how NLM works. You can find the RPC interface specification for NLM, KLM and NSM in /usr/include/rpcsvc on most Unixish systems; they're called nlm_prot.x, klm_prot.x and sm_inter.x, respectively.

General Introduction to File Locking

Starting with 1.3, Linux supports two types of file locks. One is flock locking, where a process can only lock the entire file, and the other is POSIX locking, where it can lock specific regions of a file. Both implement two kinds of locks, exclusive and shared. For any file or file region, there may be at most one exclusive lock, but (in the absence of an exclusive one) several shared locks. You usually take an exclusive lock when you intend to write to the file region, and a shared lock otherwise.

However, these locks are only advisory. That means they're a convention, just like the lockfile-based file locking. For instance, if all mail-related programs agree to use kernel file locking when accessing a mailbox, the procedure when writing to the mailbox goes something like this:

	struct flock	fl;
	int		retries = 0;

	fd = open("/var/spool/mail/alex", O_WRONLY);
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;	/* until EOF */
	...
	while (fcntl(fd, F_SETLK, &fl) < 0) {
		sleep(2);
		if (retries++ > 5)
			bail out;
	}

	/* write to file */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);

While the file is locked, other processes that do not bother to check for existing locks can still read from or write to the file. That's why these are called advisory locks.

There is also the concept of mandatory locks (i.e. those for which the no-read/no-write policy is enforced by the kernel), but it hasn't been implemented yet. Here's a quote from the locks.c file in the kernel:

 *  NOTE:
 *  I do not intend to implement mandatory locks unless demand is
 *  *HUGE*. They are not in BSD, and POSIX.1 does not require them. 
 *  I have never seen any public code that relied on them. As Kelly
 *  Carmichael suggests above, mandatory locks requires lots of changes
 *  elsewhere and I am reluctant to start something so drastic for so
 *  little gain.
 *  Andy Walker (andy@keo.kvaerner.no), June 09, 1995

Network Locking

Lock Monitoring

The most important thing to remember about NLM is that it attempts to provide a locking service that keeps the client's and server's idea of locked files in sync, even across server or client crashes. NLM understands only advisory locks.

The program implementing NLM is usually called rpc.lockd, and must be running on both the NFS client and server machine. When a user process wants to lock a part or the whole of a file from an NFS-mounted volume, the kernel sends the lock request to the local lock daemon. This is usually done using the KLM (Kernel Lock Manager) protocol.

The local lockd forwards the request to the server host, which either grants or rejects the request (well, it can also block it, but that's the advanced stuff), and returns the result to the client lockd, which in turn tells the kernel.

Of course, this is just a raw sketch of what's happening. If you look at nlm_prot.x, you will find that NLM has basically 5 operations:

	TEST	check whether a lock request would succeed
	LOCK	establish a lock
	CANCEL	cancel an outstanding lock request
	UNLOCK	release a lock
	GRANTED	server callback notifying the client that a
		previously blocked lock request has been granted

All of these operations also work asynchronously, i.e. with RPC callbacks instead of the usual RPC call-reply scheme. This allows a client to do other things while waiting for the server to process a request. This also helps avoid deadlock situations, for instance when a client places a call to the server at the same time the server sends out a grant notification. With the normal call-reply scheme neither of them would start processing the other's request before it received a timeout.

Apart from that, there are a couple of other functions for DOS file sharing, which I currently care too little about to deal with at all. There are some provisions for them in the code, however.

Internally, lockd has to duplicate a lot of code from the kernel lock implementation because it has to do all the lock conflict checking itself (UNIX does not regard locks held by the same process as conflicting). The same holds for the client, which must remember what locks it holds on what server.

Status Monitoring

Now, we get to status monitoring. NLM was designed to provide file locking even in the face of client or server crashes. For this, clients and servers monitor each other via rpc.statd (the Network Status Monitor protocol, or NSM, described in sm_inter.x).

Status monitor is a bit of a misnomer; NSM is more appropriately described as a status change notification service.

rpc.statd keeps a list of hosts to notify in stable storage (usually /etc/sm/hostname, although I'll probably use /var/sm instead). After a system crash, it sends a status change notification to rpc.statd on each of these hosts. The remote statd, when receiving this message, notifies all interested local parties of the status change.

How does this fit in with NLM? When a lock server grants a lock request from another host, it registers an RPC callback with rpc.statd that should be invoked when the client has a status change. The client, when requesting its first lock on some server, does the same.

When the client crashes and reboots, the client statd notifies the server's statd of this, which in turn calls back lockd, which in turn frees all locks held by the client.

When the server crashes, the client lockd is notified by the NSM mechanism as above. Now comes the really weird part of the protocol. Instead of throwing up and sending a SIGLOST to all affected processes, the client tries to reclaim all locks it held. To allow for this, the server enters a so-called `grace period' after starting up to give all clients a chance to reclaim their locks. During this period, only lock requests with the `reclaim' flag set to true will be granted.

Kernel-Daemon Communication

Normally, the kernel communicates with rpc.lockd using yet another RPC-based protocol called KLM, the Kernel Lock Manager protocol. As this is quite clumsy and requires adding a lot of RPC machinery to the kernel (retransmission, request-reply matching, etc.), I will most likely invent my own protocol using a UNIX domain socket created with socketpair.

What about mandatory locks?

The crucial part about file locks is that they are owned by a process or process group. POSIX locks remember the PID of the process that created the lock, while flock locks don't. Thus, the latter will be inherited by child processes along with the open file, while the former are not. Still, the essence of this is that both of them rely on some notion of the lock being attached to a process, either through an open file or an explicit PID.

Now the problem is that NFS does not transport any of this information. It knows nothing about open files, nor is a process ID passed in the AUTH_UNIX credentials. Moreover, if a client chooses to cache NFS data (e.g. in the kernel buffer cache), any server-side lock checks would be bypassed completely.

The only solution to this problem, as I see it, would require a total overhaul of the locking protocol, NFS, or both.

The modification to NFS would be straightforward: information on the requester would have to be added to each request (NLM has an `owner handle' for this, which is often implemented as the hostname concatenated with the PID). This would allow NFS to check all requests against NLM locks. Still, for locks to be strictly mandatory, the kernel on the server machine would have to implement mandatory locking as well, and a modified NLM would have to support mandatory lock requests.

The second road to mandatory happiness is to shift the responsibility for mandatory locks to the clients. Each client would have to implement them, and check all read/write requests against its own lists before passing the request to the server. Furthermore, the server lockd would have to inform all NFS clients when one of them requests a mandatory lock. Apart from any race conditions this entails, we all know how difficult it is to keep track of which hosts have mounted file systems from an NFS server.

And yet... even if one found a satisfying concept for mandatory locks (and I consider none of the above satisfying), you would still have synchronization problems because of RPC retransmissions. Consider the following scenario:

Admittedly, this problem would not exist if client A also locked the file prior to writing. But cases like these are exactly the reason for having lock enforcement. If all processes had to lock their files anyway to avoid corruption, we would be back where we started, and advisory locking would be enough.


Written and maintained by okir@monad.swb.de.

Last updated July 15, 1995.