Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Thu, 19 Jul 2001 09:19:37 -0400

On Thu, Jul 19, 2001 at 01:43:58AM +0200, Cain Ransbottyn wrote:
> What about qmail ? Will coda or reiserfs have problems with qmail ? What
> should we do about it to get this fixed ?

Like Andrea said, qmail uses maildir, which tries to be NFS safe during
mail delivery. rename() removes the target file if it already exists,
so on the odd chance that maildir picked the same name for two different
email messages, it uses link()/unlink() to move files between
directories; link() fails instead of overwriting an existing file.

However, Coda doesn't allow cross-directory links; we fail them with
EXDEV. The maildir (qmail) patch changes safe_rename to

    safe_rename(old, new)
    {
	err = link(old, new);
+   	if (err == -1 && errno == EXDEV) {
+		err = rename(old, new);
+		return err;
+   	}
	if (err == -1)
	    return err;
	err = unlink(old);
	return err;
    }

As non-Coda filesystems will never report EXDEV when linking within the
same FS, this does not break the existing guarantees. With Coda there is
a chance that names collide and one message is therefore lost, but the
way these names are constructed makes that chance very small indeed.
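
For reference, a compilable sketch of the patched function. This is my
own reading of the snippet above, not the text of the actual qmail
patch; the signature and return convention are assumptions.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Move old to new without clobbering an existing target where possible.
 * Returns 0 on success, -1 on failure with errno set. */
int safe_rename(const char *old, const char *new)
{
    if (link(old, new) == 0)
        return unlink(old);      /* link succeeded: drop the old name */

    if (errno == EXDEV)          /* e.g. Coda refusing a cross-directory link */
        return rename(old, new); /* may replace an existing target */

    return -1;                   /* e.g. EEXIST: target name already taken */
}
```

On a filesystem that allows the link, a name collision still shows up
as EEXIST; only the EXDEV path gives up that check.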

From the maildir specification (http://cr.yp.to/proto/maildir.html):

    Okay, so you're writing messages. A unique name has three pieces,
    separated by dots. On the left is the result of time(). On the right
    is the result of gethostname(). In the middle is something that
    doesn't repeat within one second on a single host. I fork a new
    process for each delivery, so I just use the process ID. If you're
    delivering several messages from one process, use
    starttime.pid_count.host, where starttime is the time that your
    process started, and count is the number of messages you've
    delivered. If your system provides a sequence number syscall, use
    that instead of the pid, preceded by #.

So names can only collide when the system uses a different process to
deliver each mail and there are so many emails that the PID numbers wrap
within a second. I don't believe Coda is fast enough to handle over
32768 (assuming PIDs wrap at 2^15) * 3 (create, store, rename) ~= 100000
RPC calls (and associated server transactions) per second.
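
A small sketch of the name format and the arithmetic above. The function
names are made up for illustration, and the host name in the usage note
is a placeholder; qmail's own code is not shown here.

```c
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

/* Format a maildir-style unique name "time.pid.host" into buf.
 * Returns what snprintf returns (length the full name needs). */
int maildir_name(char *buf, size_t len, time_t now, pid_t pid,
                 const char *host)
{
    return snprintf(buf, len, "%lld.%ld.%s",
                    (long long)now, (long)pid, host);
}

/* RPCs per second needed before the PID space can wrap within one
 * second, given the number of RPCs each delivery costs. */
long rpcs_to_wrap(long pid_space, long rpcs_per_delivery)
{
    return pid_space * rpcs_per_delivery;
}
```

maildir_name(buf, sizeof buf, 995548777, 1234, "mail.example.org") gives
"995548777.1234.mail.example.org", and rpcs_to_wrap(32768, 3) is 98304,
which is the ~100000 figure.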

Now, there are some problems. One serious problem is that the current
format of directories in RVM limits the maximum directory size to about
256KB. Because the maildir names are relatively long, and there is some
additional data stored in the directory for each entry, the directory
size will max out somewhere between 2000 and 4000 messages per mailbox.
I often get more than that from linux-kernel in a month!

The 'quick and dirty' patch is to redefine DIR_PAGESIZE from 2048 to
8192 bytes, which makes the max directory size approximately 1MB and
allows up to 16000 entries per directory (unless all names are longer
than 16 characters, in which case the max is 8000).

!!! This is not advised at all !!!

No one has ever tried it. It breaks both the in-memory and on-the-wire
formats of a directory so much that all clients and servers _have_ to
run the patched version. Also, backups can only be dumped/restored with
patched tools etc. And existing servers cannot be upgraded; they have to
be reinitialized.

Another solution is to redefine DIR_MAXPAGES from 128 to 512. This will
be on-the-wire compatible with existing clients/servers as long as all
directories are smaller than 256KB. However, when a larger directory is
passed to an unpatched client or server it is likely to keel over and
die, or at least do unexpected things like not showing new files.
Because the in-memory format is different, an existing server cannot
simply be upgraded by running a patched version; the server will have to
be reinitialized.
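
The size limits above work out as follows. The per-entry costs of 64 and
128 bytes are an assumption, back-derived from the figures in this
message (1MB / 16000 is roughly 64 bytes); they are not taken from the
Coda source, and per-page header overhead is ignored.

```c
#include <stdio.h>

/* Maximum directory size in bytes for a page size and page count. */
long dir_max_bytes(long pagesize, long maxpages)
{
    return pagesize * maxpages;
}

/* Rough upper bound on entries, given an assumed per-entry cost. */
long dir_max_entries(long pagesize, long maxpages, long bytes_per_entry)
{
    return dir_max_bytes(pagesize, maxpages) / bytes_per_entry;
}
```

With the current values, dir_max_bytes(2048, 128) is 262144 (256KB), and
an assumed 64 to 128 bytes per entry gives 4096 down to 2048 entries.
Both dir_max_bytes(8192, 128) and dir_max_bytes(2048, 512) come to
1048576 (1MB).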

Both solutions will push the limit a bit further, but don't reliably
solve the problem. What if I now want 100000 files per directory? Reinit
the whole server group again??? Also, the forced server reinitialization
is a PITA. I'm still looking for the real solution that will allow
directories to scale to similar sizes as files (about 2^31 bytes).
Directories will have to be stored in container files instead of RVM,
and the current 'copy to VM, modify, copy back to RVM' way of
manipulating directories should change to 'log changes in VM, store log
in RVM, reliably apply to container file' etc.

Problem #2, server performance. Coda servers have significant overhead
to reliably commit changes, or actually the kernel seems to have
problems with its fsync implementation, because that's where we're
blocking a lot of the time. I believe that at some point it was measured
that we couldn't do more than about 100 directory modifying operations
per second (create/rename/delete). So handling an incoming email stream
of 50 mails per second is already pushing it, since each delivery needs
at least a create and a rename. Similarly, when a client has read new
messages, moving them to 'cur' is very slow. It is less noticeable when
the mail client commits the change whenever the user has read a message
(e.g. pine), and more noticeable when it batches these types of updates
until it actually changes folders (e.g. mutt).
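
As a back-of-the-envelope sketch of those numbers: the assumption here
is that a delivery costs two directory modifying operations (create and
rename) and that moving a message to 'cur' costs one rename. The
function names are made up for illustration.

```c
#include <stdio.h>

/* Deliveries per second sustainable at a given directory-op rate. */
double mails_per_second(double dir_ops_per_sec, double ops_per_mail)
{
    return dir_ops_per_sec / ops_per_mail;
}

/* Seconds to move n messages from 'new' to 'cur', one rename each. */
double seconds_to_move(long n, double dir_ops_per_sec)
{
    return (double)n / dir_ops_per_sec;
}
```

At 100 operations per second, mails_per_second(100, 2) is 50, and a
batched move of 1000 read messages, seconds_to_move(1000, 100), takes
about 10 seconds.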

Don't really know how to optimize this. A lot of this overhead could be
a result of the way directories are currently manipulated (allocate VM/
copy to VM/ modify/ copy back to RVM). But I could be wrong as I haven't
actually profiled any clients or servers.

Jan

Coda + qmail (i.e. maildir)