On Tue, May 01, 2007 at 12:50:29PM -0400, shivers_at_ccs.neu.edu wrote:
> - create a venus setup on an ubuntu x86_64 box with
>   + 10Gb of cache
>   + big DATA & LOG:
>     % ls -l DATA LOG
>     -rw------- 1 root root 1201865464 2007-05-01 12:07 DATA
>     -rw------- 1 root root 300468736 2007-05-01 12:09 LOG

Impressive, my caches tend to be up to 500Mb or so. I think you may
possibly be the first one to ever use a 10Gb client cache.

> - hoard it all with
>     hoard add /coda/lambda.csail.mit.edu d+

Do you use a single volume on the server, or are there multiple volumes?
If you run 'hoard list', you'll notice that hoard only gets files
relative to a volume root. So if /user/shivers is a separate volume this
hoard command would only get everything up to and including the
mountpoint information, but nothing from your user volume.

> This runs for a while, then terminates successfully.

As it runs for a while, I guess everything must be in a single volume.

> - Test it again by saying "cfs disconnect" and then redoing the find
>   tree-walk to read the whole subdir a second time.

First of all, cfs disconnect doesn't actually change any state in the
client, it simply inserts a 'filter' into the rpc2 layer so that all
outgoing and incoming packets get dropped.

> This runs along fine for a while, with a silent codacon, then codacon
> suddenly outputs
>
>   ValidateAttrsPlusSHA CVS(4.7f000000.baf.28f8) [0] ( 11:56:43 )
>   Probe ( 11:57:21 )

The client actually ran up until this point thinking that it was still
connected and it hasn't received any callbacks so it doesn't have to
check anything with the servers. At some point the client gets
suspicious because there hasn't been any communication with the server
for a while, and it triggers a probe. We send a ping to the server,
which then pings back so we get an end-to-end check of the connectivity.
However the outgoing ping packets are dropped on the floor, so we end up
disconnecting.
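To make that sequence concrete, here is a toy Python model of it. This is
only a sketch, not Venus or rpc2 code: the class, its method names and the
60 second probe interval are all made up for illustration. It shows a
client that keeps believing it is connected after the filter is installed,
and only changes its mind once a probe goes unanswered.

```python
class ToyClient:
    """Toy model of the behaviour described above; all names are hypothetical."""

    def __init__(self, probe_interval):
        self.probe_interval = probe_interval  # assumed seconds of silence before probing
        self.filter_on = False   # True after a simulated 'cfs disconnect'
        self.connected = True    # what the client *believes*
        self.last_contact = 0    # time of the last successful exchange

    def disconnect(self):
        # Only installs the packet filter; the believed state is untouched.
        self.filter_on = True

    def reconnect(self):
        # Removes the filter; the next successful probe restores the state.
        self.filter_on = False

    def _send_ping(self):
        # A dropped packet never reaches the server, so no reply comes back.
        return not self.filter_on

    def tick(self, now):
        """Called periodically; probes once the server has been silent too long."""
        if now - self.last_contact < self.probe_interval:
            return  # recent contact, nothing to check yet
        if self._send_ping():
            self.connected = True
            self.last_contact = now
        else:
            self.connected = False


client = ToyClient(probe_interval=60)
client.disconnect()
client.tick(now=30)
print(client.connected)  # still believes it is connected
client.tick(now=90)
print(client.connected)  # the probe went unanswered
```

In this model the first print shows True and the second False, which
mirrors the quiet period in codacon followed by the Probe line.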
I am not sure why it would want to revalidate the attributes of that CVS
directory. Something like that would indicate that we either don't have
cached access rights for the user who is trying to access it, or that
the object is considered 'demoted' (i.e. we got a callback or returned
from disconnection). Since there are no callbacks (incoming packets are
dropped) I assume it is trying to check if the current user has access.

> and the find tree walk hangs. After a minute or two, codacon says
>   unreachable lambda.csail.mit.edu ( 11:58:10 )
> and the find walk resumes with the following output:
>
>   find: ./research/mrlc/mrlc/spim/CVS: Permission denied
>   ...
>   find: ./research/mrlc/mrlc/confpaper: Permission denied

Interesting, when we are disconnected the permission check is extremely
permissive. Even if the cached rights would otherwise be considered
stale we still allow access, and if they are missing we fall back on the
System:AnyUser rights. I guess your volume is protected by an ACL, so
that anonymous users cannot read those files. And we don't actually
remove rights, but only mark them as stale (removal of cached rights
only happens when you explicitly drop tokens with cunlog).

So this would indicate that we never actually cached rights for those
objects, which is strange. It would mean both that the earlier connected
tree walk didn't actually walk the complete tree and that hoard didn't
cache rights. Hoard not caching rights may very well be possible; I may
only try to make sure the file attributes and data are 'fresh'.

I know GNU find has some optimizations that may make it skip
subdirectories when the directory linkcount is off. You can disable
those by using 'find -noleaf'. Maybe the workaround we try to use isn't
working.

> - Then I go poke around in the file system. I now have trouble accessing
>   the problem directories.
> For example:
>
>   % ls -ld research/mrlc/mrlc/spim/CVS
>   drwxr-xr-x 1 shivers nogroup 2048 2007-04-30 16:04 research/mrlc/mrlc/spim/CVS
>   % ls research/mrlc/mrlc/spim/CVS
>   ls: research/mrlc/mrlc/spim/CVS: Permission denied
>   % cfs la research/mrlc/mrlc/spim/CVS
>   research/mrlc/mrlc/spim/CVS: Connection timed out
>
> Timed out? Hey, I hoarded the file and *disconnected*. Why is venus
> even trying to connect at all?

cfs listacl and setacl only work while we are connected. This is because
the client doesn't really know anything about users and groups, so it is
not useful to locally cache ACL data. What the client caches is the set
of rights that the user who fetched the object had. And if we don't have
cached rights for another user on the system, he will have to refetch at
least the attributes.

> - Then I reconnect with
>     cfs reconnect
>   Now I can see the problem directories with no trouble. I redo the find
>   tree-walk a third time and it completes with no problems.
>   + codacon shows *no* server->client file motion.

Were there one or more 'validateattr' or 'validatevols' calls after
reconnection? The client will use those calls to check the validity of
several objects or a complete volume at a time.

On the other hand it is strange; I would have expected at least a
GetAttr for the object that previously returned permission denied. Maybe
there is an inverted test somewhere in the area where we check if the
current user has access to a file.

Jan

Received on 2007-05-01 14:43:10