To recap: we are setting up a cluster of servers using Coda as the shared filesystem. The cluster uses Coda for HTML files, FTP files, etc. We have two dedicated Coda servers. These servers haven't moved into production yet because we want to make sure they are absolutely stable. Alas, it seems they are not: twice in recent times the Coda client has hung. Restarting venus fixed the problem. Next time it happens I'll attach gdb to the process to try to see what went wrong. In the meantime, all I have is the console and venus log files. (We're using client version 6.0.8.)

The venus log file is extremely long, with lots of messages like this:

  [ H(07) : 0207 : 04:42:20 ] Hoard Walk interrupted -- object missing! <606e1fc8.7f000001.964.56d1>
  [ H(07) : 0207 : 04:42:20 ] Number of interrupt failures = 131

and like this:

  [ W(823) : 0000 : 11:13:17 ] Cachefile::SetLength 552280
  [ W(823) : 0000 : 11:13:17 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ
  [ W(823) : 0000 : 11:13:19 ] Cachefile::SetLength 552933
  [ W(823) : 0000 : 11:13:20 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

changing to this:

  [ W(1783) : 0000 : 17:17:18 ] Cachefile::SetLength 3243845
  [ W(1783) : 0000 : 17:17:19 ] *** Long Running (Multi)Store: code = -2001, elapsed = 1252.4 ***
  [ W(1783) : 0000 : 17:17:19 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

and then eventually to this:

  [ W(1783) : 0000 : 21:54:50 ] Cachefile::SetLength 7015276
  [ W(1779) : 0000 : 21:54:53 ] WAITING(606e1fc8.7f000002.e.28): level = RD, readers = 0, writers = 1
  [ W(1783) : 0000 : 21:54:53 ] *** Long Running (Multi)Store: code = -2001, elapsed = 3633.3 ***
  [ W(1783) : 0000 : 21:54:53 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ
  [ W(1779) : 0000 : 21:54:53 ] WAIT OVER, elapsed = 361.2

The very end of venus.log looks like this:

  [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
  [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
  [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
  [ W(823) : 0000 : 21:55:00 ] ***** FATAL SIGNAL (11) *****

Most of the complaints are, I think, harmless, and seem to result from this file: 606e1fc8.7f000002.e.28, which I believe is the Apache log file.

Here's the end of console.log:

  12:55:00 root acquiring Coda tokens!
  12:55:01 root acquiring Coda tokens!
  12:55:01 Coda token for user 0 has been discarded
  15:55:00 root acquiring Coda tokens!
  15:55:00 root acquiring Coda tokens!
  18:55:00 root acquiring Coda tokens!
  18:55:00 root acquiring Coda tokens!
  21:55:00 root acquiring Coda tokens!
  21:55:00 root acquiring Coda tokens!
  21:55:00 Fatal Signal (11); pid 1708 becoming a zombie...
  21:55:00 You may use gdb to attach to 1708

Finally, to my questions:

1) Is there something I can do to prevent future signal 11s?

2) If such a signal (whatever it means) happens, can Coda just restart itself instead of going into a zombie state and causing httpd and proftpd to hang?

Thanks for your help.

--
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-20 12:45:49
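
[Editor's note: regarding question 2, until venus can restart itself, a small watchdog could serve as a stopgap: watch the console log for the "becoming a zombie" message and restart the client. The sketch below makes several assumptions that vary by installation: the console log path, the restart command, and the exact message wording (copied from the console.log excerpt above).]

```shell
#!/bin/sh
# Watchdog sketch for a zombified venus. Paths, the restart command, and the
# log message format are assumptions; adjust them for your installation.

CONSOLE_LOG=${CONSOLE_LOG:-/usr/coda/etc/console}   # assumed log location

# Print the pid from a "becoming a zombie" line; print nothing otherwise.
# Matches e.g.: "21:55:00 Fatal Signal (11); pid 1708 becoming a zombie..."
zombie_pid() {
    printf '%s\n' "$1" |
        sed -n 's/.*Fatal Signal ([0-9]*); pid \([0-9]*\) becoming a zombie.*/\1/p'
}

watch() {
    tail -F "$CONSOLE_LOG" | while read -r line; do
        pid=$(zombie_pid "$line")
        if [ -n "$pid" ]; then
            # The "zombie" here is Coda's own idle-wait state (it parks the
            # process so gdb can attach), so the process can simply be killed.
            kill -9 "$pid" 2>/dev/null
            /etc/init.d/venus restart           # assumed init script name
        fi
    done
}

# watch   # uncomment to run the watchdog loop
```

Before letting anything restart venus automatically, it may be worth capturing a backtrace first (e.g. `gdb venus 1708` with the right binary path, then `bt`), since preserving that state is exactly what the zombie behavior is for.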