Coda File System

Re: problem reading files from coda when disconnected

From: Jan Harkes <>
Date: Wed, 2 May 2007 19:19:52 -0400
On Tue, May 01, 2007 at 11:15:59PM -0400, wrote:
>    > - hoard it all with
>    >     hoard add /coda/ d+
>    Do you use a single volume on the server, or are there multiple volumes?
> Single volume.

Spreading data across multiple volumes can be useful in some cases. Most
operations lock at volume granularity; there is enough complexity as
is, and this simplifies locking considerably. A clear example is
reintegration, which occurs on a per-volume basis, with all operations
sent back in the same order as they were logged. Having several volumes
therefore allows changes to propagate out of order across volumes. For
instance, a modified latex file in one volume doesn't have to wait until
a large file in another volume has been reintegrated.
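
As a rough illustration of the per-volume ordering described above, here
is a toy model in Python. It is only a sketch, not how venus implements
reintegration; the class name, volume names and operation strings are
all made up for the example.

```python
from collections import defaultdict, deque


class ReintegrationLog:
    """Toy model: one FIFO log of pending operations per volume."""

    def __init__(self):
        # Hypothetical structure: volume name -> queue of logged operations.
        self.logs = defaultdict(deque)

    def record(self, volume, op):
        """Log an operation performed while disconnected."""
        self.logs[volume].append(op)

    def reintegrate(self, volume):
        """Replay one volume's log in exactly the order it was recorded."""
        sent = []
        log = self.logs[volume]
        while log:
            sent.append(log.popleft())
        return sent


log = ReintegrationLog()
log.record("media", "store big.iso")
log.record("docs", "store paper.tex")
log.record("media", "store other.iso")

# The small 'docs' volume can be sent first even though its change was
# logged after the large file in 'media'. Order is only fixed per volume.
docs_sent = log.reintegrate("docs")
media_sent = log.reintegrate("media")
```

With a single volume, the latex file would sit in the same queue behind
the large file and could not overtake it.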

The volume is also used to contain damage: a conflict will only affect
access to a single volume. So if there is a reintegration conflict in
one area, changes in other volumes can still make it safely back to the
server.

Finally, the rapid cache validation works by checking the volume
version. If we know that no file in a volume has changed, we don't have
to check each individual file. So having a volume with mostly inactive
(archived) files will considerably speed up cache revalidation after a
disconnection.
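
The volume-version shortcut can be sketched as follows. This is a
simplified model under assumed data structures (plain dicts with a
per-volume version number), not the actual validatevols/validateattr
protocol.

```python
def revalidate(cached_volumes, server_versions):
    """Return the cached files that still need an individual check.

    cached_volumes:  {volume: {"version": int, "files": [names]}}
    server_versions: {volume: int}
    Both layouts are assumptions made for this sketch.
    """
    to_check = []
    for name, vol in cached_volumes.items():
        if server_versions.get(name) == vol["version"]:
            # Volume version matches: nothing in it changed, skip all files.
            continue
        # Version differs (or is unknown): fall back to per-file checks.
        to_check.extend(vol["files"])
    return to_check


cache = {
    "archive": {"version": 7, "files": ["old1.tar", "old2.tar"]},
    "work": {"version": 41, "files": ["notes.txt"]},
}
server = {"archive": 7, "work": 42}

stale = revalidate(cache, server)
```

Here the inactive 'archive' volume is cleared with a single comparison,
and only the file in 'work' needs a per-object check.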

>    So this would indicate that we never actually cached rights for those
>    objects, which is strange.
> Yes, that does seem strange. 

I looked through the code (fsobj::Access -> fsobj::CheckAcRights) and
didn't see anything obviously incorrect.

> It seems very weird that the hoard wouldn't cache rights.
> I have also had these timeout / no-permission errors when doing a big
> "cp -pr" copy into the coda filesys.

Copying a lot of data tends to stress the client and servers a bit more.
Also, in some cases we mispredict the version stamp that a file will
have once it hits the server. So if we have a temporary disconnection
and hit one of those mispredicted objects, we may see a timeout or
permission denied error.

The disconnections may be caused by the large client cache. At some
point the client tries to order all cached objects so that we throw out
less useful ones first (status and data walk messages in codacon). The
ordering is based on a combination of static (hoard) and dynamic (file
size and last access) priority values, and I think this is the area that
may cause the client to temporarily become unresponsive with large
caches. As a result it may miss a message from the servers, which would
result in a forced disconnection.
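
To make the idea of combining static and dynamic priorities concrete,
here is a small Python sketch. The formula, the weight and the field
names are invented for illustration; the real computation in venus is
different and is not reproduced here.

```python
def priority(obj, now, hoard_weight=0.5):
    """Blend a static hoard priority with a dynamic score.

    The dynamic score favours recently used, small files. Both the
    formula and hoard_weight are hypothetical.
    """
    age = now - obj["last_access"]
    dynamic = 1000.0 / (1.0 + age) / (1.0 + obj["size"] / 1_000_000)
    return hoard_weight * obj["hoard_pri"] + (1 - hoard_weight) * dynamic


def eviction_order(objects, now):
    """Sort cached objects so the least useful ones come first."""
    return sorted(objects, key=lambda o: priority(o, now))


now = 1000
cached = [
    {"name": "a", "hoard_pri": 100, "last_access": 990, "size": 10_000},
    {"name": "b", "hoard_pri": 0, "last_access": 100, "size": 50_000_000},
    {"name": "c", "hoard_pri": 0, "last_access": 995, "size": 2_000},
]

names = [o["name"] for o in eviction_order(cached, now)]
```

In this model the large, long-unused, unhoarded object 'b' is thrown out
first and the hoarded object 'a' last. Any such pass has to touch every
cached object, so its cost grows with the number of objects in the cache.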

I haven't actually used a large enough cache yet, and I know of at least
one case that was mis-diagnosed as a large cache issue but was in fact
a signal handling problem, which caused the kernel module to spin in a
tight loop until venus returned the upcall response.

>    >   + codacon shows *no* server->client file motion. 
>    Were there one or more 'validateattr' or 'validatevols' calls after
>    reconnection?
> I don't understand what you are asking here.

Well, if you reconnect I would expect that the client would check if the
cached data has changed on the server during the disconnection. So it
would use validatevols to check on a volume granularity and if we
suspect we missed any changes we would see validateattr calls logged in
codacon.

> I also am having problems with hoard authentication. The man page says
> the hoard user must be either
>   - root
>   - the console user
>   - the "primary" user from /etc/venus.conf
> but I have a hard time making this work. I'm logged into my computer, with
> X running on the console (vt7, which is standard). But hoard loses:

Right, root is no longer as useful. There used to be a lot of
special-case code: root wasn't allowed to have tokens, and in some cases
we used the effective uid and in others the real uid. But on Linux we
always used fsuid, which mostly tracks euid except for nfsd or samba.
With all the
subtle changes the original setuid root hoard binary solution probably

The console user was used on Mach and early BSD systems, which didn't
have virtual consoles. So again Linux did things differently and simply
treated whoever is logged into tty1 as the console user. This makes
hoard work for people who log in on tty1 and then run startx.

But that doesn't work when the system uses xdm/gdm, so the final
addition was a 'primaryuser' option in the venus configuration. I use
the primaryuser setting myself.

> Then I su and try again:
>     # hoard add /coda/ 
>     canonicalize: chdir(shivers) failed (Permission denied)
> Then I note that I have no tokens while su'd as root, so I clog and try again:
>     # ctokens
>     Tokens held by the Cache Manager for root:
> 	    Not Authenticated
>     # clog
>     username:
>     Password: 
>     # hoard add /coda/ 
>     pioctl:Add(, ./user/shivers/text/3min.txt, 10, 0, 0): Permission denied

Now this is interesting. I think your client probably did end up with
a hoard profile owned by root. So during the next hoard walk it pulled
everything into the cache, but the cached access rights were for local
uid '0'.

Combined with occasional disconnections, I can imagine that the initial
connected tree walk may have skipped over some areas, especially because
it had to refetch at least the attributes for all objects to get access
rights for your local user id.

I would set the primary user value in /etc/coda/venus.conf to your local
uid, then run 'hoard clear' to get rid of the existing hoard profile and
redo the hoard add as your user. During the add it will try to walk the
tree which may get interrupted due to a disconnection and you would get
a permission denied type error.

That doesn't mean that the add didn't work; we just haven't completely
expanded the profile to the full tree. 'hoard list' should show the
entry. Forcing a server check (cfs cs) followed by a hoard walk will
continue the expansion.

If the hoard walk gets interrupted by yet another disconnection it will
exit with a list of objects we failed to get. Again checkservers and
hoard walk will let it continue where the previous walk failed.
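
The checkservers / hoard walk cycle above could be wrapped in a small
retry loop, sketched here in Python. It uses only the 'cfs cs' and
'hoard walk' commands already mentioned. One important assumption: the
sketch treats a non-zero exit status from 'hoard walk' as "the walk was
interrupted", which I have not verified. The function name, the 'run'
parameter and the attempt limit are made up for the example.

```python
import subprocess


def walk_until_complete(run=subprocess.run, max_attempts=5):
    """Repeat 'cfs cs' + 'hoard walk' until the walk succeeds.

    Returns the attempt number that succeeded, or None if every
    attempt failed. ASSUMPTION: 'hoard walk' exits non-zero when it
    could not fetch everything; check this on your own client.
    """
    for attempt in range(1, max_attempts + 1):
        # Force a server check so the client notices servers are back.
        run(["cfs", "cs"], check=False)
        # Continue expanding the hoard profile where the last walk stopped.
        result = run(["hoard", "walk"], check=False)
        if result.returncode == 0:
            return attempt
    return None
```

Calling walk_until_complete() as the user who owns the hoard profile
would rerun the pair of commands up to five times. The 'run' parameter
exists only so the loop can be exercised without a Coda client.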

Received on 2007-05-02 19:21:30