On Wed, Dec 11, 2002 at 09:31:35AM -0500, Greg Troxel wrote:
> I am having similar problems. It has seemed to me that coda
> reliability, specifically repair and conflicts arising when they
> shouldn't, has been worse recently (than say a year ago). It could
> also be that I'm just stressing it more.

Or your network is deteriorating under the load of spam email and
gnutella clients ;)

> found the normal directory-instead-of-object. local had the right
> contents, and global was a symlink to a volume id. I was able to do

If global is a symlink to the volumeid, the client hasn't been able to
mount the volume. Perhaps it timed out during the volume lookup or
something. Another reason would be a server-server conflict, but as you
only have a single server that probably isn't the case.

> down venus and tried to run norton, but it seems to be only a server
> thing. I then tried 'cfs fl .' in the directory, hoping to just

Yup, norton is only a server thing.

'cfs fl' is in many cases an evil operation: it flushes the data of
cached objects in the specified subtree. It shouldn't touch 'dirty'
objects (i.e. the ones with associated CML entries), but you never know.
cfs fl is mostly useful for debugging; it can be used to push an object
out of the cache, so that I can test or time the fetching of an object.

The CML is an ordered log, and you cannot simply kill an entry within
the log. The only options are stepping through the operations with
'repair', or 'cfs purgeml'. Maybe you could try 'removeinc'; I know
someone here tried to make that work for local-global conflicts, but I'm
not sure it ever worked.

> poblano gdt 247 ~/%co/HARDWARE/POBLANO > cfs gf wi0.dump
> VIOC_GETFID: Connection timed out
>
> [ W(19) : 0000 : 09:15:58 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
> [ W(19) : 0000 : 09:16:00 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
> [ W(19) : 0000 : 09:16:01 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
>
> [ L(18) : 0000 : 09:16:02 ] LocalInconsistentObj: objFid=7f000001.4830.988c

Ugh, these are local 'fake' fids, but during reintegration these should
have been replaced with correct global fids. Maybe your kernel is a bit
too aggressive caching the directory contents, or the objects got lost
as a result of the cfs fl or failed repairs.

> The cml file from 'cfs ck':
> Create ???/local
> Chown ???/local (owner = 0)
> Store ???/local (length = 2836)
> and the tar file is empty.

Ok, so we have a CML entry to create a file named 'local', but the
client is unable to find the associated container file. That's a pretty
bad state right there.

The name is typically the last name used to access the object. I guess
this got expanded as a file conflict and the 'local' directory entry was
the last name used to access wi0.dump.

It could be that the object still exists, but has a correct global fid.
However, given the way fids are translated from local fake fids to
global fids, I can't see how the CML references could have been missed
during the update.

> 08:56:56 Callback failed RPC2_DEAD (F) for ws [client-in-question]:65516
> 09:01:19 Callback failed RPC2_DEAD (F) for ws [client-in-question]:64967

These are a result of timeouts; for some reason the client is not
responding to (or not receiving) rpc2 callback messages. As a result the
server will kill all incoming connections from that client.

> Having to reinit venus is a real problem when I don't have a
> fast link (28.8kb/s right now). And that's when it breaks, usually.
> I realize my connectivity is lame for 2002 standards, but it seems
> supporting this situation is one of the design goals for coda.

Hey I've got 33k6, I'll throttle it for 28.8 for a while to see if I get
hit by similar problems.
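
To make the fid translation point above concrete, here is a toy sketch
in Python. It is not Venus code: the Entry type, the translate() and
stale() helpers, and the GLOBAL_FID value are all made up for
illustration. It only shows the idea that every CML entry referencing a
local fake fid has to be rewritten to the global fid, in log order, and
that any entry left behind would still point at an object the client can
no longer find.

```python
from dataclasses import dataclass, replace

# The locally created fid from the log above, written without the 0x prefixes.
LOCAL_FID = "7f000001.ffffffff.80001"
# Made-up placeholder; the real global fid is not known from the thread.
GLOBAL_FID = "example-global-fid"


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one CML record."""
    op: str    # operation name, e.g. "Create"
    fid: str   # fid of the object the operation applies to
    name: str  # last name used to access the object


def translate(log, old_fid, new_fid):
    """Rewrite every entry that references old_fid, keeping the log order."""
    return [replace(e, fid=new_fid) if e.fid == old_fid else e for e in log]


def stale(log, old_fid):
    """Return the entries that still reference old_fid."""
    return [e for e in log if e.fid == old_fid]


# The three operations shown in the 'cfs ck' output.
cml = [
    Entry("Create", LOCAL_FID, "local"),
    Entry("Chown", LOCAL_FID, "local"),
    Entry("Store", LOCAL_FID, "local"),
]

translated = translate(cml, LOCAL_FID, GLOBAL_FID)
```

After a complete translation, stale(translated, LOCAL_FID) is empty. The
state described in this thread would correspond to a log where that list
is not empty even though the object itself is gone from the cache.
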
> conflicts due to my work style; I only get pseudo-conflicts, I think
> due to a reintegration failing on the client but succeeding on the
> server. And the latter kind can never be fixed, in my experience.)

The server should deal with retried reintegrations; I'm not sure why it
doesn't seem to do that in your case.

> I am running code from 2002-10-14; I'll upgrade to -current CVS and
> see if things are better.

Changes that went into rpc2 and rvm that might help:

RPC2 _tries_ to deal with asymmetric connections. I don't have ADSL so
I haven't been able to test this myself, but the changes might have
improved the robustness over slow links as well. Your server might have
a relatively higher bandwidth because it can encrypt the ipsec traffic
faster than your client?

Added an extra scheduling point in RVM. This should improve the
responsiveness of the rpc2_listener thread, i.e. the one that tells the
other side that it got the message and is 'busy'.

But I haven't fixed anything that I can directly link to the problems
you are experiencing.

Jan

Received on 2002-12-11 12:35:21