On Tue, May 31, 2005 at 09:09:35AM -0600, Patrick Walsh wrote:
> OK, over the weekend coda hung again. On two machines with two
> different problems (a signal 11 and an assert failure in realms.cc),

I've been working on the iterators over the weekend. The 'safe deletion' parts are removed from the core iterator functions, and I think I caught all the places where we were doing unsafe deletion. I also improved the conn_iterator and volent_iterator objects to keep references to the objects while we walk the list; those two looked like the most intensive users of the problematic sequence:

    for entry in list:
        call yielding function
        delete entry

One thing I couldn't figure out until just now is why you actually have conn entries, which are used for non-replicated or backup volumes and during weak reintegration, instead of mgrp entries, which are used for replicated volumes. That turned out to be quite simple: there is a special 'V_UID' that is used for some operations when we don't know the actual user who triggered the operation, or for certain background operations. Some of the obvious ones are when we get volume information and during the periodic server probes/backprobes. And V_UID is set to '0', which means that the root user has a bunch of connections associated with it.

Now most people probably aren't giving Coda tokens to root, so this explains why you seem to be seeing these problems a lot more than most people. I changed those places to use ANYUSER_UID, which means that it will try to use the first available (authenticated?) connection, or else allocate one that should belong to nobody.

I haven't been able to get my client to segfault yet, even when hitting it concurrently with disconnections, reauthentications, and read and write operations. But then again, your 'test environment' might be hitting it harder than I ever can.

> one of the machines we'll ignore for now since I haven't updated it with
> all the newest RPMs.
> Unfortunately, it seems that the debugging symbols
> are still not in the venus binary, though I don't have the slightest
> idea why not. I'll look into this yet again.

I think rpmbuild actually strips the binaries before it packs them into the RPM; the unstripped versions seem to be placed in an associated 'debuginfo' RPM, although that might only be for libraries.

> In the meantime, I've noticed a trend. These signal 11's seem to
> happen almost always at 00:30:01. I've tried running all our cronjobs ...
> somewhere outside of cron. There is nothing in the syslog or cronlog to
> indicate anything happening at that time. Is that a special time in
> Coda? Here's the latest batch of logs and backtraces:

Not really. There are periodic hoard walks and server probes, but those are relative to when venus starts, so unless you always start your client at exactly the same time I wouldn't expect them to coincide so nicely on 00:30 real-time intervals.

> 18:30:00 Coda token for user 0 has been discarded
> 19:00:00 Coda token for user 0 has been discarded
> 20:00:01 Coda token for user 0 has been discarded

If venus.log contains

    userent::Connect: Authenticated bind failure, uid = 0

then this is the result of a server rejecting the token; otherwise it is because of an explicit call to cunlog.

> Date: Sat 05/28/2005
>
> 00:30:01 Fatal Signal (11); pid 8041 becoming a zombie...
> 00:30:01 You may use gdb to attach to 8041
>
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
> [ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
> [ W(190) : 0000 : 00:30:01 ] ***** FATAL SIGNAL (11) *****

'D' is the probe-daemon thread, and 'W' is a worker thread. My guess is that 'D' is sending out the periodic server probes (using the V_UID = 0 connections) while, at the same time, the user calls 'cunlog' for the root user, which triggers the worker thread to destroy any connections owned by root. This to me indicates that I'm fixing the correct bugs in the iterators.

Jan

Received on 2005-05-31 14:16:10