Not much spam and no gnutella, but probably a web browser open on the weather page that updates occasionally. IPsec speed should be ok; the server is a PPro-200 and the client a Pentium IV 2GHz, which should be able to keep up (doing AES and HMAC-SHA1).

Seriously, it's not the cpu speed, although there could be hiccups during SA renegotiation (but I don't think so).

If 'global' is a symlink to the volumeid, the client hasn't been able to mount the volume. Perhaps it timed out during the volume lookup or something. Another reason would be a server-server conflict, but as you only have a single server that probably isn't the case.

This seemed to persist, even though I could do 'cfs lv'. My real problem isn't that something bad happened, it's that I could not recover.

Yup, norton is only a server thing.

'cfs fl' is in many cases an evil operation; it flushes the data of cached objects in the specified subtree. It shouldn't touch 'dirty' objects (i.e. the ones with associated CML entries), but you never know. So really 'cfs fl' should only discard cached data that is still in the 'read-only' state, and thus should be safe at any time. If not, it probably should be fixed. 'cfs fl' is mostly useful for debugging: it can be used to push an object out of the cache, so that I can test or time the fetching of an object.

Sure, so I was doing the wrong thing with it, but still it should not have caused trouble.

The CML is an ordered log, and you cannot simply kill an entry within the log. The only possible operations are stepping through the entries with 'repair', or purging the whole log with 'cfs purgeml'. Maybe you could try 'removeinc'; I know someone here tried to make that work for local-global conflicts, but I'm not sure it ever worked.

I should have tried purgeml, but given that I had a 'local inconsistent object' I bet it would not have worked. Perhaps there should be some way to drop the head entry off the CML, which seems to be analogous to 'discardlocal' in repair. So maybe there is, and the problem is that entering repair mode can fail. In my book, entering repair mode should succeed any time there is a conflict.

Ugh, these are local 'fake' fids, but during reintegration these should have been replaced with correct global fids. Maybe your kernel is a bit too aggressive in caching the directory contents, or the objects got lost as a result of the 'cfs fl' or the failed repairs.

This is on NetBSD 1.6, but I have had similar experiences on FreeBSD.

Ok, so we have a CML entry to create a file named 'local', but the client is unable to find the associated container file. That's a pretty bad state right there. The name is typically the last name used to access the object; I guess this got expanded as a file conflict and the 'local' directory entry was the last name used to access wi0.dump.

I never created anything called 'local'. I am pretty sure this is from the failed repair session.

> 08:56:56 Callback failed RPC2_DEAD (F) for ws [client-in-question]:65516
> 09:01:19 Callback failed RPC2_DEAD (F) for ws [client-in-question]:64967

These are a result of timeouts; for some reason the client is not responding to (or not receiving) rpc2 callback messages. As a result the server will kill all incoming connections from that client. This probably happened during congestion on the link. That's life, and shouldn't cause lasting trouble (I'd expect going from WD to disconnected, and then back when the server probe works, picking up reintegration again).

Hey, I've got 33k6; I'll throttle it to 28.8 for a while to see if I get hit by similar problems.
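By the way, the debugging use of 'cfs fl' I mentioned above looks roughly like the following; the path here is made up:

    # push one object out of venus's cache, then time a fresh fetch
    # from the server
    cfs fl /coda/playground/testfile
    time cat /coda/playground/testfile > /dev/null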
I don't think that will make the difference. Mine is nominally 33.6, but I get 28 or 26.

It could be that the BSD kernel support is buggy. That would be fair enough, after all the trouble I've seen with Linux kernels and coda over the years.

The server should deal with retried reintegrations; not sure why it doesn't seem to do that in your case. Do you mean

  modification on client while WD
  try to reintegrate
  Backfetch times out, causing Store to fail
  pause
  try to reintegrate the same Store
  Backfetch works this time
  [successful store with no conflict]

is what you think happens with the current code?

My issue isn't

  I get timeouts once in a while and go disconnected when I think I
  shouldn't, but venus reintegrates stuff later so it's only annoying.

This would be an annoyance. It's more like

  When I modify files in the client, and the client was WD to start
  with, and no one else is writing to that volume, I end up with
  conflicts that I cannot repair and I have to reinit venus.

which would cause me to stop using coda if I hadn't already integrated it into how I work.

What's the state of the realms branch and the future repair changes? It seems like repair (venus's representation of stuff) is bletcherous now.

But it may be that the problem is in the NetBSD kernel code. I wonder if putting some more aggressive cache flushing into venus/netbsd would help. I'd happily take not losing data over performance, and then we'd know where to fix it. I admit I have assumed that the problem is in venus, and that isn't necessarily clear.

Greg Troxel <gdt_at_ir.bbn.com>
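P.S. To be concrete, this is roughly how I exercise the write-disconnected/reintegrate cycle by hand (assuming 'cfs wd' and 'cfs wr' as the usual abbreviations for writedisconnect/writereconnect, and 'cfs cs' for checkservers; the volume path is made up):

    cfs wd /coda/example          # force the volume write-disconnected
    echo test > /coda/example/f   # local change, logged in the CML
    cfs wr /coda/example          # reconnect; venus should reintegrate
    cfs cs                        # kick the server probes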