On Thu, Mar 30, 2006 at 10:30:14AM +0300, Denis Chapligin wrote:
> 1) "psdev_write: msg (12,545) not found" message in the client's dmesg
> 2) "08:29:37 GetVolObj: Volume (1000001) already write locked
>    08:29:37 RS_LockAndFetch: Error 11 during GetVolObj for 1000001.1041.c066"
>    message in one server's log
> 3) And a very interesting log on the second server (attached)

I've been digging through the log file and found a couple of interesting things.

After your server came up, it started to resolve some things from the other replica. Very early on during the resolution it hits a remove/update conflict during rename resolution.

    16:53:05 ChkResRenSem: RUConflict on target 2b48.b34a
    16:53:05 Error 199 from CheckResolveRenameSemantics
    16:53:05 Incorrect Res Rename: src = conf_global.php (2b4c.b34b), tgt = conf_global-bak.php (2b48.b34a)s

My guess is that this was an operation like,

    mv conf_global.php conf_global-bak.php

Such a rename operation implicitly removes the target if it already exists, but the remove/update conflict tells me that there was a different version of conf_global-bak.php (the update) on the server that was restarted, and Coda doesn't want to lose that version, so it errors out on the rename.

    16:53:05 conf_global.php exists already N/N conflict, mark parent 0x203.ad8f inc
    16:53:05 InstallVV: Error 199 getting object 203.ad8f

Now this is strange, maybe just a logging error; my guess is that it prints the wrong filename and really wanted to say 'conf_global-bak.php' exists already. In any case, it marks the parent directory as a conflict. And because of the conflict we do not update the version vector.

Then we get a lot of

    16:55:02 VGetVnode: vnode 2000001.5672 is not allocated

This is probably because of clients revalidating their caches and hitting files that haven't been resolved yet. Files cannot get resolved before their parent directory, otherwise we'd end up with orphaned file objects that might never get removed. So the server simply returns errors in the hope that at some point the client hits the parent directory. The resolution of a parent will replay the create operations, which creates empty file objects, and then the file resolution can take over and propagate the data.
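As a side note (this is not in your log, just an illustration): since resolution is driven by client accesses, you can walk the affected subtree top-down yourself so that parent directories get hit, and resolved, before their children. Something along these lines should do, assuming /coda/yourrealm/some/subtree stands in for the affected part of the tree:

    # stat every directory and file, parents before children, to give
    # directory resolution a chance to run before file resolution;
    # errors on files that haven't been resolved yet are expected
    ls -lR /coda/yourrealm/some/subtree > /dev/null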
We see a couple of

    16:55:03 ComputeCompOps: fid(0x7f000000.1f05.a50e)

indicating that we are in fact hitting some directories once in a while. There actually don't seem to be that many missing objects; the client(s) just seem to be very persistent in trying to validate them over and over again. This would indicate that they are unable to recurse up to the parent directory; my guess is that the parent is possibly the directory that was marked as a conflict earlier on.

Around 17:15 we finally get to see a couple of FileResolve operations. This is almost 20 minutes after the server was started, so that seems like a pretty long time. Maybe this is caused by some client that is walking up the tree (possibly your client, as you are trying to figure out why things aren't working).

Another 20 minutes later, we start to see that one of the servers is returning EAGAIN errors; not sure what that is about.

    17:35:58 CheckRetCodes: server 10.0.100.3 returned error 11

The other interesting thing here is that some client seems to be repeatedly asking us to resolve the same object over and over again,

    17:36:13 Entering RecovDirResolve 7f000000.1055.c06b
    17:36:13 ComputeCompOps: fid(0x7f000000.1055.c06b)
    17:36:13 RS_ShipLogs - returning 0
    ...
    17:36:33 Entering RecovDirResolve 7f000000.1041.c066
    17:36:33 RegDirResolution: VECTORS ARE ALREADY EQUAL
             {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
    17:36:33 RecovDirResolve: RegDirResolution succeeded
    ...
    17:36:34 Entering RecovDirResolve 7f000000.1041.c066
    17:36:34 RegDirResolution: VECTORS ARE ALREADY EQUAL
             {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
             {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
    17:36:34 RecovDirResolve: RegDirResolution succeeded
    ...
    ... snipped a couple of dozen attempts ...
    17:48:28 Entering RecovDirResolve 7f000000.1055.c06b
    ...
    08:29:23 Entering RecovDirResolve 7f000000.1055.c06b

So according to the server/version vectors this directory is already resolved, but for some reason the clients think something is wrong. Maybe they are getting a VNOVNODE error from one of the directory's children and are trying to recurse up the tree so that directory resolution can create the file.

I also never saw a repair operation for the directory that was marked as inconsistent by the rename conflict (7f000000.203.ad8f). This could be related; maybe that is a sub- or parent directory of the one that we're repeatedly trying to resolve.

I would first try to find the parent directory of conf_global.php and make sure it is repaired. Then you can try to find the directory that we are repeatedly trying to resolve. You might be able to find it with,

    # list all directories, pass them to 'cfs getfid',
    # and grep for the fid we are looking for
    find /coda/realm -noleaf -type d -print0 | \
        xargs -0 cfs getfid | \
        grep 7f000000.1055.c06b

This might be kind of slow, since it will basically query every single directory. Another way is to hope that venus already has the object cached.

    coda-client-logrotate           # rotate the venus.log file
    vutil stats                     # dump detailed statistics and cache
                                    # contents to venus.log
    less /usr/coda/etc/venus.log    # then search inside less with: /1055\.c06b

If this finds the object, there is a lot of information that may help figure out what/where it is in the tree (comp = .. is the filename, parent = .. is the fid of the parent directory, etc.)

Jan
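P.S. Once a client notices the inconsistency, the conflicted directory should show up as a dangling symlink. From memory (so double-check against the repair documentation, and substitute the real conflict path for the made-up one below), a server/server repair session looks roughly like this:

    repair
    repair> beginrepair /coda/realm/path/to/conflict   # expand the conflict
    repair> comparedirs /tmp/fixfile                   # compare the replicas, write a fixfile
    repair> dorepair /tmp/fixfile                      # apply the fixfile to the replicas
    repair> endrepair
    repair> quit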