On Thu, Sep 19, 2002 at 10:39:12AM -0400, Greg Troxel wrote:
> There is no intrinsic fragmentation problem, but IPsec does add a
> header and thus possibly fragment. But I have turned down the maximum
> piggybacked validations, and most ops use 1024 bytes of data.
> So this might cause timeouts/disconnection, but that's not my real
> complaint - it is the unresolvable conflicts.

Actually local-global conflicts are never automatically resolved;
resolution only occurs between servers. But I know what you mean.

The problem is probably related to the 'store-id'. Every operation is
tagged with a unique store-id so that the server can detect when an
operation is retried and simply return EALREADY.

What you are seeing seems to be caused by the client timing out on the
connected store operation, then switching back to write-disconnected
and logging the store in the CML. It looks like the CML assigns a
different store-id to the same operation, so by the time the client
tries to reintegrate the logged operations, they fail on the server
because it has already committed a change to that file with a different
id (i.e. really the same operation, but the server doesn't know that
and the client forgot it).

I'll look at where these store-ids are assigned and see whether they
can be made somewhat more persistent.

The second problem that I see is that the reply packets are obviously
getting lost somehow. Either that, or the server is not yielding to
other LWPs for more than 15-30 seconds, so that it is unable to send
back RPC2_BUSYs when the client sends its retries. The ACKs and BUSYs
should be pretty small, so this cannot be fragmentation related.
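To make the store-id mismatch described above concrete, here is a toy
model in Python. It is only a sketch of the suspected failure mode, not
Coda code: the `Server` class, the `replay` helper, the version counter
and the "CONFLICT" label are all made up for illustration. Only the
idea of tagging each operation with a store-id and answering a retry
with EALREADY comes from the description above.

```python
import itertools

_ids = itertools.count(1)


def new_store_id():
    """Hand out a fresh, unique store-id (hypothetical helper)."""
    return next(_ids)


class Server:
    """Toy server: remembers the store-id of the last committed store."""

    def __init__(self):
        self.version = {}        # path -> number of committed stores
        self.last_store_id = {}  # path -> store-id of the last commit

    def store(self, path, store_id, base_version):
        # A retry of an operation that was already committed is harmless.
        if self.last_store_id.get(path) == store_id:
            return "EALREADY"
        # Otherwise the client must have seen the current version.
        if self.version.get(path, 0) != base_version:
            return "CONFLICT"
        self.version[path] = base_version + 1
        self.last_store_id[path] = store_id
        return "OK"


def replay(reuse_id):
    """Connected store whose reply is lost, followed by reintegration."""
    server = Server()
    first_id = new_store_id()
    # The server commits the store, but the client never sees the reply
    # and times out, so the result of this call is deliberately ignored.
    server.store("path-c3", first_id, base_version=0)
    # The client falls back to write-disconnected and logs the same
    # store in its CML, either under the original id or under a new one.
    logged_id = first_id if reuse_id else new_store_id()
    return server.store("path-c3", logged_id, base_version=0)


if __name__ == "__main__":
    print(replay(reuse_id=False))  # CONFLICT
    print(replay(reuse_id=True))   # EALREADY
```

In this model the logged store is only recognised as a retry when it
carries the id of the original connected store, which is why making the
ids more persistent looks like the right place to start.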
Now I'll add some more explanations to your initial email,

> Create /coda/home/gdt/secret/path-a/path-b/path-c2
> Chmod /coda/home/gdt/secret/path-a/path-b/path-c2 (mode = 644)
> Remove /coda/home/gdt/secret/path-a/path-b/path-c3
> Rename /coda/home/gdt/secret/path-a/path-b/path-c2 (to: /coda/home/gdt/secret/path-a/path-b/path-c3)
> Remove /coda/home/gdt/secret/path-a/path-b/.pvect_path-c3
> Symlink /coda/home/gdt/secret/path-a/path-b/.pvect_path-c3 (--> e9c222dc)
>
> The .pvect symlink to nowhere stores the IV for the file.
>
> Store /coda/home/gdt/secret/path-a/path-b/path-c3 (length = 401)
>
> It seems odd how the remove of c3 is before the rename and store; I
> would expect rcs to create a new file and do an atomic rename at the
> end to avoid losing the ,v file.

The remove before the rename isn't that strange. It happens when an
object is renamed over an existing object. Any existing target is then
automatically removed. The remove is added whenever the client knows it
is renaming over an existing object, and the rename reintegration
explicitly does _not_ overwrite existing target objects. The trick is
that this is the only way a rename reintegration can actually get a
conflict when it renames over an object the client never knew about.

So it is the old version of c3 that is removed. It looks like rcs is
doing something like the following,

    fd = open("path-c2", O_CREAT | O_WRONLY, 0);    -> Create
    write(fd, data);
    fchmod(fd, 0644);                               -> Chmod
    rename("path-c2", "path-c3");                   -> Remove/Rename
    unlink(".pvect_path-c3");                       -> Remove
    symlink("e9c222dc", ".pvect_path-c3");          -> Symlink
    close(fd);                                      -> Store

(I guess it is doing the atomic rename at the end, but forgets to close
the file descriptor; maybe it is only closed when the application
exits?)

> I then did 15 more preservelocals without doing checklocal, and then
> typed 'end'. repair hung for a while (a minute?) and then exited with
> an ioctl error. Venus had then unmounted /coda and was wedged.

Probably a generic *BSD bug. I remember Peter Braam and Bob Baron
arguing about how to handle blocking on venus's reply messages in the
kernel. The problem is that emacs or xemacs is continually sending
SIGALRM to emulate non-blocking behaviour. This has the side effect
that Venus actually tries to abort the operation.

The Peter solution (Linux) was to block signals for a period of up to
30 seconds and then to continue waiting in an interruptible state.
Special exceptions here are upcalls like CODA_CLOSE, which shouldn't be
aborted at all, because if venus doesn't see these its internal
refcounts get screwed up.

The Bob solution (*BSD) was to block everything for a period of 1
minute and then automatically return to the user with an error.

This is only as far as I can remember (this whole discussion was over 4
years ago). And I don't see why venus would be unmounting /coda and
wedging as a result, so I could be completely off here.

Jan

Received on 2002-09-23 13:09:53