Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Fri, 23 Mar 2007 21:08:22 -0400

On Fri, Mar 23, 2007 at 09:34:21PM +0100, u+codalist-p4pg_at_chalmers.se wrote:
> On Fri, Mar 23, 2007 at 03:22:00PM -0400, Jan Harkes wrote:
> > > The "lack of cache isolation" is a problem which Coda apparently inherited
> > > from AFS. We should get rid of it.
> > 
> > No it is inherited from the fact that the UNIX VFS layer expects UNIX
> > semantics, which means that applications share the same pagecache/memory
> > mappings when they open the same file.
...
> A conditional copy-on-open[*], per uid, might help - of course it would
> imply deep changes in Venus, but it can be made to behave reasonably, both
> from conveniency, security and *nix-kernel-compatibility view.

The first and probably most complicated barrier is the kernel, and I'm
not talking about the Coda kernel module, but the Linux VFS and memory
manager.

Every object is identified by an inode (vnode in BSD), and the memory
mappings are linked off of it through a pointer. Venus can already
(and sometimes does) pass different container files when different
applications open the same file. Reads and writes actually work
correctly, but a problem occurs when both applications try to mmap that
file. To mmap a file we have to redirect the i_mapping pointer to the
mapping of the underlying container file. Because there is only one Coda
inode and N container files only the first mmap succeeds and the others
all fail.
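
To make the "one inode, N container files" conflict concrete, here is a
toy model in Python. It is only a sketch: the class and attribute names
(CodaInode, ContainerFile, i_mapping) echo the discussion above and are
not the real kernel structures, and the assumption that a later mmap of
the same container still succeeds is mine.

```python
# Toy model only: names mirror the discussion, not real Linux structures.


class ContainerFile:
    def __init__(self, name):
        self.name = name
        self.mapping = object()  # stands in for the container's page cache mapping


class CodaInode:
    def __init__(self):
        self.i_mapping = None  # a single slot shared by every opener

    def mmap(self, container):
        # The first mmap redirects i_mapping to the container's mapping.
        if self.i_mapping is None:
            self.i_mapping = container.mapping
            return True
        # Later mmaps can only work if they use that same container.
        return self.i_mapping is container.mapping


inode = CodaInode()
user_a = ContainerFile("container-a")
user_b = ContainerFile("container-b")
results = [inode.mmap(user_a), inode.mmap(user_b), inode.mmap(user_a)]
print(results)  # [True, False, True]
```

The second caller loses simply because the slot is already taken by a
different container file.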

With Linux 2.6 the situation has become somewhat worse. When a file is
opened the VFS initializes file->f_mapping to the value of
inode->i_mapping, which I believe is used by fsync. Because of this we
have to do the remapping on open instead of lazily when mmap is called.

> [*] Venus can always(?) pretend for the kernel that the file has been
> rename()-ed and recreated since the former open(). Then it is up to Venus
> to decide if it can reuse the same container file or has to make a copy or

Use a unique inode for each user context. With a lot of pain this may be
possible. The pain has to do with some of the fairly strict requirements
for inode numbers.

First of all we identify file objects based on a 128-bit identifier,
but the kernel's inode numbers are only 32 bits. And those 32-bit
numbers have some constraints. When two different objects use the same
inode number, applications like tar or rsync think that both are
hardlinks to the same contents. So any collisions may cause problems
when copying data in or out of /coda.
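
As a rough illustration of why a collision matters, the sketch below
groups paths by (st_dev, st_ino), which is roughly how archiving tools
decide that two names are hardlinks. The helper name group_by_inode is
made up for this example, and real tools apply further checks (such as
the link count) that are left out here.

```python
import os
import tempfile
from collections import defaultdict


def group_by_inode(paths):
    """Group paths by (st_dev, st_ino) and return groups with more than one name."""
    groups = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        groups[(st.st_dev, st.st_ino)].append(path)
    return [sorted(g) for g in groups.values() if len(g) > 1]


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original")
    alias = os.path.join(tmp, "alias")
    other = os.path.join(tmp, "other")
    with open(original, "w") as f:
        f.write("data")
    os.link(original, alias)  # a genuine hardlink: same inode number
    with open(other, "w") as f:
        f.write("data")  # same contents, but a different inode
    linked = group_by_inode([original, alias, other])
    linked_names = [[os.path.basename(p) for p in g] for g in linked]

print(linked_names)  # [['alias', 'original']]
```

If two unrelated Coda objects hashed to the same inode number, they
would land in the same group and be treated like 'alias' and 'original'.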

Also, the classic getcwd implementation does a recursive walk to the
root and at each level checks the inode number of the current directory
against the contents of readdir('..'); the entry with the matching
number is used as the name of the current path component. Directory
contents are generated by Venus, so Venus has to be able to predict
which inode numbers the kernel will pick for objects that we may not
even have cached yet.
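
The walk can be sketched in Python as follows. This is a simplified
illustration, not any libc's actual code: the function names are
invented, the walk stops at a caller-supplied ancestor instead of the
root, and a stat-based fallback is added for cases where readdir and
stat disagree.

```python
import os
import tempfile


def name_in_parent(directory):
    """Find the name of `directory` in its parent by matching inode numbers."""
    target = os.stat(directory)
    with os.scandir(os.path.join(directory, "..")) as it:
        entries = list(it)
    # Classic approach: trust the inode number that readdir reports.
    for entry in entries:
        if entry.is_dir(follow_symlinks=False) and entry.inode() == target.st_ino:
            return entry.name
    # Fallback: stat every entry, for when readdir and stat disagree
    # (mount points, for example).
    for entry in entries:
        st = entry.stat(follow_symlinks=False)
        if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
            return entry.name
    raise FileNotFoundError(directory)


def path_components(directory, stop):
    """Walk upward from `directory` until `stop`, collecting component names."""
    stop_st = os.stat(stop)
    names = []
    current = directory
    while True:
        st = os.stat(current)
        if (st.st_dev, st.st_ino) == (stop_st.st_dev, stop_st.st_ino):
            break
        names.append(name_in_parent(current))
        current = os.path.join(current, "..")
    return list(reversed(names))


with tempfile.TemporaryDirectory() as tmp:
    leaf = os.path.join(tmp, "a", "b")
    os.makedirs(leaf)
    components = path_components(leaf, tmp)

print(components)  # ['a', 'b']
```

The first loop in name_in_parent is the part that depends on readdir
handing out the same inode numbers the kernel uses for the directories
themselves.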

Finally, many applications use the inode number to test for possible
modifications. An editor opens a file, reads the contents and closes the
fd. However, it caches the stat() information of the file it read, and
before saving it calls stat() again; if anything has changed
(mtime/size/inode number) the editor assumes someone else also updated
the file and aborts the write. E.g. in vim you'd use :w! to force the
updates to disk, or :e to reopen the new version, losing any local
changes. The same goes for programs like tripwire, aide, samhain and
ossec-hids, which check your system for unwanted changes.
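
A minimal sketch of that check, assuming the editor remembers exactly
the three attributes mentioned above (the function names are
hypothetical, not taken from any particular editor):

```python
import os
import tempfile


def snapshot(path):
    """Remember the attributes compared before saving."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size, st.st_ino)


def changed_since(path, remembered):
    """True if mtime, size or inode number differs from the snapshot."""
    return snapshot(path) != remembered


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w") as f:
        f.write("first version")
    remembered = snapshot(path)
    untouched = changed_since(path, remembered)

    # Someone else replaces the file: same size, but a new inode.
    replacement = os.path.join(tmp, "notes.txt.new")
    with open(replacement, "w") as f:
        f.write("other version")
    os.replace(replacement, path)
    replaced = changed_since(path, remembered)

print(untouched, replaced)  # False True
```

An inode number that changed on its own, without any real update to the
file, would trip this check just the same.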

Currently we rely on knowing a lot about how file identifiers are assigned
by Coda and use a hash over the 128-bit identifier to get a stable
32-bit inode number that has a low probability of collisions. This hash
has to be identical in both the kernel and venus so that readdir data
matches up with the in-kernel Coda inodes.
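
To show the shape of such a mapping, here is a sketch that folds a
128-bit identifier, given as four 32-bit words, into a 32-bit number.
It uses plain FNV-1a as a stand-in; this is not the hash Coda actually
uses, and the function name and the sample fid values are made up.

```python
import struct

FNV_OFFSET = 0x811C9DC5  # 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime


def fid_to_ino(fid_words):
    """Fold four 32-bit words into a 32-bit number (FNV-1a stand-in)."""
    data = struct.pack("<4I", *fid_words)  # raises if not four 32-bit values
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


fid = (0x7F000001, 0x00000001, 0x0000002A, 0x00000003)  # made-up identifier
ino_first = fid_to_ino(fid)
ino_again = fid_to_ino(fid)
print(hex(ino_first), ino_first == ino_again)
```

Whatever function is chosen, the point is that both sides must compute
exactly the same one, since the result is visible to applications.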

One change that I do consider useful now is if an opendir didn't perform
the fid->ino conversion, but returned (name, fid) data to the kernel.
The kernel already needs to do fid->ino mapping when a Coda inode is
allocated so it could perform the same translation when readdir entries
are returned. This would insulate venus from having to know anything
about inode numbers. It also avoids any userspace changes when the
kernel changes to a larger 64- or 128-bit inode number space.

This still doesn't solve your problem. We'd have to define what the
security context is (a user id, a process group id, session id, PAG?)
and mix that into the fid -> ino mapping. Even then there are still
quite a few problems remaining, for instance if venus does return the
same container for different contexts we need to keep attributes such
as i_size in sync between different coda_inodes, and coda_inode->i_mutex
no longer guarantees exclusive access to the container file which could
introduce race conditions.
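
Purely as an illustration of "mixing the context into the mapping",
the sketch below hashes the identifier together with a user id. Using
the uid as the security context is just one of the options listed
above, and the function name, the use of BLAKE2 and the sample values
are all assumptions of this example.

```python
import hashlib
import struct


def context_ino(fid_words, uid):
    """Derive a 32-bit number from a 128-bit fid plus a context id (here: a uid)."""
    data = struct.pack("<4I", *fid_words) + struct.pack("<I", uid)
    digest = hashlib.blake2b(data, digest_size=4).digest()
    return int.from_bytes(digest, "little")


fid = (0x7F000001, 0x00000001, 0x0000002A, 0x00000003)  # made-up identifier
ino_alice = context_ino(fid, 1000)
ino_alice_again = context_ino(fid, 1000)
ino_bob = context_ino(fid, 1001)
print(hex(ino_alice), hex(ino_bob))
```

The number is stable within one context, while different contexts will
almost always see different numbers for the same file. None of this
addresses the attribute and locking issues mentioned above.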

> file differ (that is, hashes fetched via different users' authenticated
> connections surprisingly are different - which means server spoof - Venus

Or a file that was updated on the server, but we haven't received the
callback yet.

Jan

Re: the protection model