Coda File System

Re: rw access hangs forever

From: Jan Harkes <>
Date: Tue, 27 Mar 2007 21:53:02 -0400
On Wed, Mar 28, 2007 at 12:39:53AM +0200, Enrico Weigelt wrote:
> well, I just had several problems again: 
> First the writing hung, after restarting venus, is worked again, 
> and then a few mins later the coda directory was completely 
> inaccessible.
> But here's my venus.err log from this time. 
> Hope that helps a bit.

Yes it does,

> Date: Wed 03/28/2007
> 00:22:48 Coda Venus, version 6.0.15
> 00:22:49 	31 CML entries on free-list
> 00:22:49 starting FSDB scan (4166, 100000) (25, 75, 4)
> 00:22:50 	(history.dat, 50868448.7f000000.1e20.46be) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/history.dat-7)
> 00:22:50 	(.parentlock, 50868448.7f000000.2f6c.4c3c) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/.parentlock-8)
> 00:22:50 	(appreg, 50868448.7f000000.f84.c33) found owrite object, discarding
> 00:22:50 	(lost file data backed up to /usr/coda/spool/appreg-8)

Looks like the client crashed last time while several files were still
open for writing. Since we're not sure whether their contents are
consistent, we save a backup copy of each file in /usr/coda/spool.

This has clearly happened before; the -N suffix is used to avoid
overwriting previous backup copies. I did a quick locate, and both
appreg and history.dat are firefox/mozilla related files; I'm guessing
.parentlock is from the same application. There are also 31 changes that
have not yet been propagated back to the server.
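
To make the naming scheme concrete, here is a minimal sketch (invented
names, not the actual venus code) of how a recovery pass can pick a
non-clobbering backup name by appending -N:

    #include <string>
    #include <sys/stat.h>

    // Return the first "<spool>/<file>-N" that does not already exist,
    // so earlier backups of the same file are never overwritten.
    static std::string backup_name(const std::string &spool,
                                   const std::string &file)
    {
        struct stat sb;
        for (int n = 0; ; n++) {
            std::string candidate =
                spool + "/" + file + "-" + std::to_string(n);
            if (stat(candidate.c_str(), &sb) != 0)
                return candidate;   // name is free, use it
        }
    }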

> 00:29:48 fatal error -- Assertion failed: file "", line 269

This is during the same run, and this assertion is interesting: it
indicates that we received a close upcall for a file that we never knew
was opened. This should never happen, which is why the fatal assertion
triggers (i.e. we don't even know whether the file was opened for
reading or for writing).

Not sure how this could happen, unless venus was restarted with an
(intentionally) undocumented flag to rebind to an existing mountpoint.
If you started venus normally there must be some sequence of operations,
probably from firefox/mozilla that happens to trigger a bug where we
forget that a file was previously opened.
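
A simplified sketch (hypothetical names, not the venus source) of the
kind of bookkeeping that makes a close-without-open fatal:

    #include <cassert>

    struct open_counts {
        int readers = 0;    // opens for reading still outstanding
        int writers = 0;    // opens for writing still outstanding

        void open_upcall(bool for_write) {
            (for_write ? writers : readers)++;
        }

        void close_upcall(bool for_write) {
            // With no record of an open we can't even tell whether to
            // decrement readers or writers, so rather than guess we
            // give up with a fatal assertion.
            assert(readers + writers > 0);
            (for_write ? writers : readers)--;
        }
    };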

> Date: Wed 03/28/2007
> 00:30:58 Coda Venus, version 6.0.15

Ok, venus is restarted.

> 00:30:58 	32 CML entries on free-list

Ah, one additional change was added during the previous run before we
hit the fatal assertion.

> 00:30:58 starting FSDB scan (4166, 100000) (25, 75, 4)
> 00:30:58 	(key3.db, 50868448.7f000000.f9e.c40) found owrite object, discarding
> 00:30:58 	(lost file data backed up to /usr/coda/spool/key3.db-8)

And yet another mozilla/firefox file that was open for writing during
the crash. Maybe this was the one for which we got mixed up about the
open count, but it may also be that the new CML record was the
problematic one. That is, we may have thought we saw the final close for
an owrite file, added a CML store record, and then got an unexpected
additional close upcall. Clearly we've crashed before while this file
was still open for writing.
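
A sketch of that theory, again with invented names: the final close of
an owrite file logs a store record, and a spurious second close then
arrives with a zero open count and trips the assertion described above.

    #include <cassert>
    #include <string>
    #include <vector>

    struct cml {
        std::vector<std::string> records;
        int writers = 0;

        void close_owrite() {
            assert(writers > 0);    // the fatal check
            if (--writers == 0)     // final close: queue the new data
                records.push_back("CML_STORE");
        }
    };

    // Suspected sequence: close_owrite() runs once (writers 1 -> 0 and
    // a store record is logged, matching the 32nd free-list entry),
    // then a duplicate close arrives and fails the assertion.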

> 00:31:09 Resolved realm 'jupiter.local'
> 00:33:08 Reintegrate: /, 3/3 records, result = SUCCESS

About 2 minutes after we reconnect to the realm we manage to reintegrate
3 records. In a way this is good news: there was no conflict. On the
other hand, there is no clear indication why the client is unresponsive
at this point. There should still be at least 29 more records waiting to
be reintegrated, but that doesn't seem to be happening.

Now, 6.0.15 is an older release, so I'm looking through the commit logs
to see if any fixes were committed that may be related to these
problems. I don't see anything obvious; one that may affect you in some
cases is the following:

    commit 58ad8b0f0ce88a28c8a1f5a27f929b49deb1a5a9
    Author: Jan Harkes <>
    Date:   Wed Jun 7 16:56:04 2006 -0400

    Fix client crash in fsobj::Matriculate

    We were calling Matriculate outside of an RVM transaction.

This is an interesting one that would cause another assertion to
trigger. If we get two (almost) simultaneous requests for an uncached
object, the first one sends an RPC2 call to the server to get the
object's metadata. At this point the object is in an uninitialized
state. The second opener will find the uninitialized object and block
until the first thread has received the server's response and finalized
the object; this is done by Matriculate. Now, when we are reintegrating
changes, the server sends callbacks to ensure that the client
revalidates its cached copies against the objects present on the
servers. So shortly after a reintegration there is a good chance that we
encounter such a situation. This fix was released as part of
Coda-6.0.16.
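
Schematically the fix just moves the call inside a transaction. A
minimal sketch, with Recov_BeginTrans/Recov_EndTrans standing in for
venus's recoverable-memory transaction wrappers (stubbed out here so the
sketch is self-contained; this is the shape of the fix, not its actual
code):

    // Stand-ins for venus's RVM transaction wrappers.
    static void Recov_BeginTrans() { /* begin an RVM transaction */ }
    static void Recov_EndTrans()   { /* commit; survives a crash */ }

    struct fsobj {
        bool uninitialized = true;
        // Moves the object out of the uninitialized state; a second
        // opener blocked on this object can now proceed safely.
        void Matriculate() { uninitialized = false; }
    };

    static void on_fetch_reply(fsobj &f)
    {
        Recov_BeginTrans();  // before the fix, Matriculate ran outside
        f.Matriculate();     // the transaction and could assert
        Recov_EndTrans();
    }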

The only other change that may be relevant: if you are currently using
the Coda kernel module that came as part of Linux-2.6, you might want to
try building the experimental one from

One of the experimental changes in this version involves the way we
block signals while we are waiting for venus to read an upcall. I think
there is a chance that with the old method a signal may restart the
close() system call, which could cause venus to see a duplicated close
upcall.
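
As a userspace analogy of what the experimental module does (this is
not the kernel patch itself, and wait_for_venus is a made-up
placeholder): block signals for the duration of the wait so an arriving
signal cannot interrupt and restart the operation.

    #include <signal.h>

    // Wait for venus without letting a signal interrupt the wait; an
    // interrupted-and-restarted close() is what could otherwise
    // produce the duplicate close upcall described above.
    static void wait_with_signals_blocked(void (*wait_for_venus)())
    {
        sigset_t all, old;
        sigfillset(&all);
        sigprocmask(SIG_BLOCK, &all, &old);   // defer all signals
        wait_for_venus();                     // blocks until venus replies
        sigprocmask(SIG_SETMASK, &old, NULL); // restore caller's mask
    }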

Received on 2007-03-27 21:55:33