On Tue, 2005-05-31 at 14:15 -0400, Jan Harkes wrote:
> I've been working on the iterators over the weekend. The 'safe deletion'
> parts are removed from the core iterator functions and I think I got all
> places where we were doing unsafe deletion.

Is there a patch I should apply to 6.0.10?
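(Side note for the archives: my rough mental model of the unsafe deletion
being fixed is sketched below. This is only a toy std::list stand-in with
made-up names, not Coda's actual olist/dlist iterator code.)

// Toy model of a per-user connection list (illustrative names only,
// not venus's real classes).
#include <cstdio>
#include <iterator>
#include <list>

struct conn {
    int uid;                                 // owner (0 == root / V_UID)
    explicit conn(int u) : uid(u) {}
};

int main() {
    std::list<conn*> conns;
    for (int i = 0; i < 4; i++)
        conns.push_back(new conn(i % 2));    // uids 0, 1, 0, 1

    // Unsafe deletion: erase() invalidates 'it', so the ++it in the loop
    // header advances through freed memory -- the sort of bug that only
    // blows up later as a SIGSEGV.
    //
    //   for (auto it = conns.begin(); it != conns.end(); ++it)
    //       if ((*it)->uid == 0) { delete *it; conns.erase(it); }

    // Safe deletion: remember the successor before the current entry is
    // destroyed (think: tearing down uid 0's connections on cunlog).
    for (auto it = conns.begin(); it != conns.end(); ) {
        auto next = std::next(it);
        if ((*it)->uid == 0) {
            delete *it;
            conns.erase(it);
        }
        it = next;
    }

    for (auto it = conns.begin(); it != conns.end(); ++it)
        printf("remaining uid=%d\n", (*it)->uid);
    return 0;
}

And if I understand correctly, even the saved 'next' can dangle when a
different thread is allowed to destroy entries while the iterating thread
is suspended, which sounds like exactly the race described further down.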
> One thing I couldn't figure out until just now is why you actually have
> conn entries, which are used for non-replicated or backup volumes and
> during weak-reintegration, instead of mgrp entries which are used for
> replicated volumes.
>
> That turned out to be quite simple, there is a special 'V_UID' that is
> used for some operation when we don't know the actual user who triggered
> the operation or for certain background operations. Some of the obvious
> ones are when we get volume information and during the periodic server
> probes/backprobes. And V_UID is set to '0', which means that the root
> user has a bunch of connections associated with it. Now most people
> probably aren't giving Coda tokens to root, so this explains why you
> seem to be seeing these problems a lot more than most people.

Hmmmm. It never occurred to me that giving tokens to root might cause a
problem...

> reauthentications and read and write operations. But then again, your
> 'test-environment' might be hitting it harder than I ever can.

At this point, the test environment isn't abusing the system. We plan
stress tests but haven't done them yet. The read/write load is actually
pretty light for the most part. We do download rpm, virus-signature, and
other updates to coda regularly, though.

> I think rpmbuild actually strips the binaries before it packs them in
> the RPM, the unstripped versions seem to be placed in an associated
> 'debuginfo' RPM, although that might only be for libraries.

I've solved this. I use a custom Makefile to build coda and its
dependencies. I had added a --debug flag to the build-coda-rpm.pl script
call, but I didn't realize that it produces a different spec file (with
-debug in its name), so I was still building the RPM off the non-debug
spec file. Anyway, I've got coda-debug clients running on each of the
client machines now. Hopefully any future crashes will be easier to
trace.

> > In the meantime, I've noticed a trend. These signal 11's seem to
> > happen almost always at 00:30:01. I've tried running all our cronjobs
>
> Not really, there are periodic hoard walks and server probes, but those
> are relative to when venus starts, so unless you always start your
> client at exactly the same time I wouldn't expect them to coincide so
> nicely on 00:30 real-time intervals.

Hmmmm. Very strange then. I'll keep looking to see what might happen at
12:30:01 a.m. It has been the exact time of the signal 11 on at least
two occasions.

> If the venus.log contains
>
> userent::Connect: Authenticated bind failure, uid = 0
>
> Then this is a result of a server rejecting the token, otherwise it is
> because of an explicit call to cunlog.

We are still getting some of these messages. The last crash, which took
place Friday night (Saturday morning) at 12:30, had an Authenticated bind
failure for uid 0 several times, the last one before the crash being at
8:00:01pm Friday night. Since I restarted venus this morning, there have
been two of these messages on the machine that had the signal 11. It
looks like all of our clients have these messages in their logs, though.
The only thing going on on some of these machines is a clog operation
once every 3 hours.

> 'D' is the probe-daemon thread, and 'W' is a worker thread.
> ...
> My guess is that 'D' is sending out periodic server probes (using the
> V_UID = 0 connections), while at the same time the user calls 'cunlog'
> for the root user which triggers the worker thread to destroy any
> connections owned by root, which to me indicates that I'm fixing the
> correct bugs in the iterators.

I hope so. Just for your information, I grepped through the log for just
the worker thread to see what it was doing before the crash, and this is
what I got:

[ W(190) : 0000 : 00:00:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:00:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:10:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:10:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:15:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:15:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:20:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:20:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ W(190) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:30:01 ] ***** FATAL SIGNAL (11) *****

Is it odd that, though the servers are supposed to be checked every five
minutes, there was no check at 00:05 or 00:25? That's the only thing I
can think of that might be unusual there.

Thanks, Jan, for all your help so far.

--
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/