Hello Patrick,

> We're going to have to abandon coda and start investigating
> commercial solutions soon if we can't resolve this.

It would be a pity indeed.

> The clients have hung again, but because we were in the middle of
> testing some other things, I couldn't take the time to gdb it. It seems
> somewhat coordinated since 3 out of 4 coda clients were all hung and
> needed to be restarted. I did not observe that before.

Was that on clients doing updates, or on clients that only access Coda
read-only? If some clients aggressively fetch files from the Coda servers
at the same time as other clients update files, either of them may see
big variations in server response time and possibly disconnect. That can
lead to unexpected conflicts, but I have not seen crashes or hangs caused
by it.

> Another issue: although we have a cron job that gets fresh tokens 3
> times per day, root (and possibly other users) sometimes lose their
> tokens. I suspect this is because we run several clog commands at the

That would mean clog fails three times in a row... Can you see the
authentication attempts/failures in the AuthLog? Which clog do you use -
the default one with Coda password authentication? You may want to try
the modular one, see if it helps, and otherwise motivate me badly to
fix it :)

> same time for different users (user nobody, user root, etc., all try to
> get tokens at the same time). Is it possible that this would cause a
> problem?

A good question - I have never tried that. My scripts do clog for several
accounts in a "for" loop, so that there are no simultaneous clogs from
cron (a rough sketch of such a loop is appended below). They should just
work, but you never know.

> > Twice in recent times the coda client has hung. Restarting venus fixed
> > the problem. When this happens next time I'll attach gdb to the process
> > to try to see what happened. In the meantime, all I have is the console
> > and venus log files.
> > The very end of venus.log looks like this:
> >
> > [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
> > [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> > [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> > [ W(823) : 0000 : 21:55:00 ] ***** FATAL SIGNAL (11) *****

I would not just restart venus after a crash, but reinit instead, as the
RVM state is probably corrupted and you can expect another crash or other
weird behaviour.

A kind of "universal cure" is to watch for venus crashing (recompile to
remove the zombying?), then set the reinit flag for venus and reboot the
machine (a sketch of such a watchdog is appended below). Hm, one more
reason not to put clients on the same machines as servers - though in
theory a server reboot might be acceptable, it does no good anyway. If
your services are redundant, you would be able to survive even such
drastic measures.

> > Finally, to my questions: 1) is there something I can do to prevent
> > future signal 11's? 2) If such a signal (whatever it means) happens,

Separating the clients that do writes from the read-only clients will
definitely make the "read-only" clients a lot more stable.

> > can coda just restart itself instead of going into a zombie state and
> > causing httpd and proftpd to hang?

Unfortunately that is hardly possible without rebooting the computer, and
then you might be better off reinitializing venus at the same time. It
will reduce performance and cause extra load on the servers, but it is at
least a certain way to recover a read-only client or a client with
local-global conflicts.

Hope it helps, somehow.
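A minimal sketch of the sequential token-refresh loop mentioned above, to
be run from cron. The user list, realm name, password file layout, the
script path, and the assumption that clog will read the password from
standard input when it is not attached to a terminal are all illustrative
and would need to be adapted to the authentication method actually in use
(Coda password, Kerberos, or the modular clog).

    #!/bin/sh
    # Refresh Coda tokens for several local users, one clog at a time,
    # so that no two clog invocations run simultaneously from cron.
    #
    # Assumptions (illustrative only):
    #   - tokens must be acquired under each local uid, hence the "su"
    #   - clog reads the password from stdin when not run on a terminal
    #   - per-user passwords are kept in /etc/coda-passwords/<user>
    #   - the Coda realm is called example.realm

    REALM=example.realm

    for user in root nobody wwwrun; do
        passfile=/etc/coda-passwords/$user
        [ -r "$passfile" ] || continue
        if su -s /bin/sh "$user" -c "clog $user@$REALM" < "$passfile"; then
            logger "coda token refreshed for $user"
        else
            logger "clog FAILED for $user (check AuthLog on the auth server)"
        fi
    done

Hooked into cron, e.g. "0 */8 * * * /usr/local/sbin/refresh-coda-tokens",
this refreshes the tokens three times a day while the clog calls run
strictly one after another, so simultaneous clogs can no longer interfere
with each other, and any failure is logged for correlation with the
AuthLog.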
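And a rough sketch of the "universal cure" watchdog, meant to run from
cron every few minutes. The detection via the process table, the marker
file, its path, and the boot script that is assumed to check for it and
start venus with its -init option are all hypothetical and only
illustrate the idea of "set the reinit flag and reboot".

    #!/bin/sh
    # Crude venus watchdog (illustrative), run from cron every few minutes.
    # If the venus process is gone or has turned into a zombie, request a
    # reinitialization and reboot the machine.
    #
    # Assumption: the boot-time startup script checks for the marker file
    # /var/run/venus-reinit and, if it exists, removes it and starts venus
    # with its -init option instead of doing a plain start.

    state=`ps -C venus -o stat= 2>/dev/null`

    if [ -z "$state" ] || echo "$state" | grep -q Z; then
        logger "venus watchdog: venus dead or zombie, requesting reinit and rebooting"
        touch /var/run/venus-reinit
        /sbin/reboot
    fi

Reinitializing throws away the client cache, so it costs performance and
puts extra load on the servers, but it avoids carrying possibly corrupted
RVM state into the next run - which is exactly the trade-off described
above.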
On the other hand, if you manage to keep your system running even while
being annoyed by the drawbacks, it will certainly help to get the bugs
fixed - the sooner and the more they are exposed, the better. In the
extreme, your company might consider contributing to the development, say
by setting a price tag on a fix for a certain problem... Of course, right
now it is only Jan who can - or can't - help. More time would give more
certainty in finding a qualified volunteer and fighting the problem.

Regards,
--
Ivan

Received on 2005-05-24 11:25:46