On Fri, Mar 23, 2007 at 09:34:21PM +0100, u+codalist-p4pg_at_chalmers.se wrote:
> On Fri, Mar 23, 2007 at 03:22:00PM -0400, Jan Harkes wrote:
> > > The "lack of cache isolation" is a problem which Coda apparently inherited
> > > from AFS. We should get rid of it.
> >
> > No it is inherited from the fact that the UNIX VFS layer expects UNIX
> > semantics, which means that applications share the same pagecache/memory
> > mappings when they open the same file.
...
> A conditional copy-on-open[*], per uid, might help - of course it would
> imply deep changes in Venus, but it can be made to behave reasonably, both
> from conveniency, security and *nix-kernel-compatibility view.

The first and probably most complicated barrier is the kernel, and I'm
not talking about the Coda kernel module, but the Linux VFS and memory
manager. Every object is identified by an inode (vnode in BSD), and the
memory mappings are linked off of it through a pointer.

Venus can already (and sometimes does) pass different container files
when different applications open the same file. Reads and writes
actually work correctly, but a problem occurs when both applications try
to mmap that file. To mmap a file we have to redirect the i_mapping
pointer to the mapping of the underlying container file. Because there
is only one Coda inode and N container files, only the first mmap
succeeds and the others all fail.

With Linux 2.6 the situation has become somewhat worse: when a file is
opened, the VFS initializes file->f_mapping to the value of
inode->i_mapping, which is, I believe, used by fsync. Because of this we
have to do the remapping on open instead of lazily when mmap is called.

> [*] Venus can always(?) pretend for the kernel that the file has been
> rename()-ed and recreated since the former open(). Then it is up to Venus
> to decide if it can reuse the same container file or has to make a copy or

Use a unique inode for each user context. With a lot of pain this may be
possible.
The pain has to do with some of the fairly strict requirements for inode
numbers. First of all, we identify file objects based on a 128-bit
identifier, but the kernel's inode numbers are only 32 bits. And those
32-bit numbers have some constraints.

When two different objects use the same inode number, applications like
tar or rsync think that both are hardlinks to the same contents. So any
collisions may cause problems when copying data in or out of /coda.

Also, the classic getcwd implementation does a recursive walk to the
root and at each level checks the inode number of the current directory
against the contents of readdir('..'); the entry with the matching
number is used as the name of the current path component. Directory
contents are generated by venus, so venus has to be able to predict
which inode numbers the kernel will pick for objects that we may not
even have cached yet.

Finally, many applications use the inode number to test for possible
modifications. An editor opens a file, reads the contents and closes the
fd. However, it caches the stat() information of the file it read, and
before saving it calls stat() again; if anything has changed
(mtime/size/inode number) the editor assumes someone else also updated
the file and aborts the write. E.g. in vim you'd use :w! to force the
update to disk, or :e to reopen the new version, losing any local
changes. Same thing for programs like tripwire, aide, samhain and
ossec-hids, which check your system for unwanted changes.

Currently we rely on knowing a lot about how file identifiers are
assigned by Coda and use a hash over the 128-bit identifier to get a
stable 32-bit inode number that has a low probability of collisions.
This hash has to be identical in both the kernel and venus so that
readdir data matches up with the in-kernel Coda inodes.

One change that I do consider useful now is if an opendir didn't perform
the fid->ino conversion, but returned (name, fid) data to the kernel.
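
For illustration only, the kind of stable fid->ino hash described above
could look roughly like the sketch below. Everything in it is an
assumption: struct fid128, fid_to_ino() and fid_ctx_to_ino() are
hypothetical names, and FNV-1a is swapped in as a generic hash, not the
function Coda actually uses. The second function shows how a context id
(here just a 32-bit number such as a uid) might be mixed in to get a
distinct inode number per user context.

```c
#include <stdint.h>

/* Hypothetical 128-bit file identifier, stored as four 32-bit words. */
struct fid128 {
    uint32_t word[4];
};

/* Fold one 32-bit value into a running FNV-1a hash, low byte first.
 * Extracting bytes with shifts keeps the result identical on
 * little- and big-endian hosts, so kernel and userspace would agree. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;              /* 32-bit FNV prime */
    }
    return h;
}

/* Map a 128-bit fid to a stable 32-bit inode number.
 * Collisions are unlikely but not impossible. */
uint32_t fid_to_ino(const struct fid128 *fid)
{
    uint32_t h = 2166136261u;        /* 32-bit FNV offset basis */
    for (int i = 0; i < 4; i++)
        h = fnv1a_u32(h, fid->word[i]);
    return h;
}

/* Variant that also mixes in a context id (assumed here to be a uid),
 * so the same fid yields a different inode number per context. */
uint32_t fid_ctx_to_ino(const struct fid128 *fid, uint32_t ctx)
{
    return fnv1a_u32(fid_to_ino(fid), ctx);
}
```

Whatever function is chosen, the point is that it is a pure function of
the fid (plus context), so any two parties computing it independently
arrive at the same inode number without talking to each other.
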
The kernel already needs to do the fid->ino mapping when a Coda inode is
allocated, so it could perform the same translation when readdir entries
are returned. This would insulate venus from having to know anything
about inode numbers. It also avoids any userspace changes when the
kernel changes to a larger 64- or 128-bit inode number space.

This still doesn't solve your problem: we'd have to define what the
security context is (a user id, a process group id, session id, pag?)
and mix that into the fid->ino mapping. Even then there are still quite
a few problems remaining. For instance, if venus does return the same
container for different contexts, we need to keep attributes such as
i_size in sync between the different coda_inodes, and
coda_inode->i_mutex no longer guarantees exclusive access to the
container file, which could introduce race conditions.

> file differ (that is, hashes fetched via different users' authenticated
> connections surprisingly are different - which means server spoof - Venus

Or a file that was updated on the server, but we haven't received the
callback yet.

Jan

Received on 2007-03-23 21:10:41