On Thu, May 26, 2005 at 04:32:13PM -0600, Patrick Walsh wrote:
> 14:50:01 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:00 Coda token for user 0 has been discarded
> 15:00:01 Coda token for user 0 has been discarded
> 15:10:00 Coda token for user 0 has been discarded
> 15:15:00 Coda token for user 0 has been discarded
> 15:20:00 Coda token for user 0 has been discarded

I wonder why these tokens are being discarded; this message is only shown in 2 cases.

- A server believes the token is invalid (expired or unable to decrypt)
- A user has explicitly called cunlog

Since it seems to happen so regularly (every 5 minutes), my guess is that this is the result of a cron job that uses cunlog. But then why don't we see it at 14:55 or 15:05, while we see 4 calls at 15:00? Multiple cron jobs that are each using 'clog root ; do_work ; cunlog root'? And 4 happen to run simultaneously at 15:00?

> Assertion failed: nlink != (olink *)-1, file
> "/home/pwalsh/working/coda/BUILD/coda-6.0.10/coda-src/util/olist.cc",
> line 257
> Sleeping forever. You may use gdb to attach to process 8378.

Crash in a list iterator. We use olists in many places; it is a singly linked list.

> And gdb gives this (I'm sure it compiled with -g so I don't know why
> we're not getting symbols):

Are you still creating an RPM package? It could be that rpmbuild implicitly strips everything before it packages the binaries. If you still have the build tree around somewhere, you should be able to use the venus binary in the build tree even when the running venus is stripped.

> (gdb) bt
> #0  0xb747d761 in __libc_nanosleep () from /lib/libc.so.6
> #1  0xb747d6ae in __sleep (seconds=1)
>     at ../sysdeps/unix/sysv/linux/sleep.c:70
> #2  0x080cce1c in strcpy () at ../sysdeps/generic/strcpy.c:31
> #3  0x080c8ca5 in strcpy () at ../sysdeps/generic/strcpy.c:31
> #4  0x0804e55d in strcpy () at ../sysdeps/generic/strcpy.c:31
> #5  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31

This is definitely a different trace compared to the other one, since we don't have that 'callback function' style jump through a high address. It is also not a segfault, but a more normal assertion failure.

Argh, I think I know what is going on... The conn_iterator (which iterates over the list of connections) is derived from the olist_iterator, and internally it does the same 'trick' where it saves the 'next' pointer for the next iteration. But it doesn't know anything about locking down objects. So the pinning down that is done while we walk the list is completely useless, since the iterator itself doesn't really need the current object (except that it does, because it still tests 'current == last()'). For the actual iteration it uses the saved next pointer, which might be unlinked/destroyed because it was never pinned down with a refcount.

Soooo, now I have to go through the complete code and identify all places where we might be using these olist_iterators, either directly or indirectly, check whether they are already tracking the next pointer themselves (like we do when destroying connections) and whether the objects are locked when we yield. And then I can remove the useless next-ptr bit.

All these iterators that support 'safe deletion of the current object' never work right and cause some very nasty race conditions when multiple threads are involved. Luckily Coda has cooperative threading, so we typically don't yield, except in a few cases.
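To make that race concrete, here is a minimal sketch of the pattern I mean (simplified, illustrative names only, not the actual olist.cc code):

    struct olink {
        olink *next;
    };

    class olist_iterator {
        olink *clink;                 /* read-ahead pointer into the list */
    public:
        explicit olist_iterator(olink *head) : clink(head) {}

        olink *operator()() {
            olink *current = clink;
            if (current)
                clink = current->next; /* caching 'next' makes it safe to
                                          delete 'current', but nothing pins
                                          the cached element; if we yield and
                                          another thread unlinks/destroys it,
                                          the following call chases a dangling
                                          pointer */
            return current;
        }
    };

The read-ahead only protects against deleting the current element; it does nothing for the element the iterator will hand out next, which is what the refcounting in user.cc is trying to compensate for.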
One of those few cases where we do yield is of course when we destroy rpc2 connections.

Quick fix for you, in coda-src/venus/user.cc around line 372:

    tc = next();              /* read ahead */
    if (tc) tc->GetRef();     /* make sure we don't lose the next connection */
    (void)c->Suicide(1);
    if (tc) tc->PutRef();

Change that Suicide(1) to Suicide(0). That way the client won't tell the server it is disconnecting, so we won't make an RPC2 call and, as a result, will not yield. The problem with this fix is that the server will slowly build up a lot of old connections that will stick around until either the server is rebooted or the client is disconnected for a couple of minutes.

Jan

Received on 2005-05-27 13:46:13