To recap: we are setting up a cluster of servers using Coda as the shared filesystem. The cluster uses Coda for HTML files, FTP files, etc. We have two dedicated Coda servers. These servers haven't moved into production yet because we want to make sure they are absolutely stable. Alas, it seems they are not: twice in recent times the Coda client has hung. Restarting venus fixed the problem. Next time it happens I'll attach gdb to the process to try to see what went wrong. In the meantime, all I have is the console and venus log files. (We're using client version 6.0.8.)

The venus log file is extremely long, with lots of messages like this:

  [ H(07) : 0207 : 04:42:20 ] Hoard Walk interrupted -- object missing! <606e1fc8.7f000001.964.56d1>
  [ H(07) : 0207 : 04:42:20 ] Number of interrupt failures = 131

and like this:

  [ W(823) : 0000 : 11:13:17 ] Cachefile::SetLength 552280
  [ W(823) : 0000 : 11:13:17 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ
  [ W(823) : 0000 : 11:13:19 ] Cachefile::SetLength 552933
  [ W(823) : 0000 : 11:13:20 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

changing to this:

  [ W(1783) : 0000 : 17:17:18 ] Cachefile::SetLength 3243845
  [ W(1783) : 0000 : 17:17:19 ] *** Long Running (Multi)Store: code = -2001, elapsed = 1252.4 ***
  [ W(1783) : 0000 : 17:17:19 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

and then eventually to this:

  [ W(1783) : 0000 : 21:54:50 ] Cachefile::SetLength 7015276
  [ W(1779) : 0000 : 21:54:53 ] WAITING(606e1fc8.7f000002.e.28): level = RD, readers = 0, writers = 1
  [ W(1783) : 0000 : 21:54:53 ] *** Long Running (Multi)Store: code = -2001, elapsed = 3633.3 ***
  [ W(1783) : 0000 : 21:54:53 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ
  [ W(1779) : 0000 : 21:54:53 ] WAIT OVER, elapsed = 361.2

The very end of venus.log looks like this:

  [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
  [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
  [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
  [ W(823) : 0000 : 21:55:00 ] ***** FATAL SIGNAL (11) *****

Most of the complaints are, I think, harmless, and seem to result from this file: 606e1fc8.7f000002.e.28, which I believe is the Apache log file.

Here's the end of console.log:

  12:55:00 root acquiring Coda tokens!
  12:55:01 root acquiring Coda tokens!
  12:55:01 Coda token for user 0 has been discarded
  15:55:00 root acquiring Coda tokens!
  15:55:00 root acquiring Coda tokens!
  18:55:00 root acquiring Coda tokens!
  18:55:00 root acquiring Coda tokens!
  21:55:00 root acquiring Coda tokens!
  21:55:00 root acquiring Coda tokens!
  21:55:00 Fatal Signal (11); pid 1708 becoming a zombie...
  21:55:00 You may use gdb to attach to 1708

Finally, to my questions:

1) Is there something I can do to prevent future signal 11s?

2) If such a signal (whatever it means) happens, can Coda just restart itself instead of going into a zombie state and causing httpd and proftpd to hang?

Thanks for your help.

--
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-20 12:45:49
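
[Editor's note: regarding question 2, until venus can restart itself, a small watchdog could serve as a stopgap: watch the console log for the "becoming a zombie" message and restart the client. The sketch below makes several assumptions that vary by installation: the console log path, the restart command, and the exact message wording (copied from the console.log excerpt above).]

```shell
#!/bin/sh
# Watchdog sketch for a zombified venus. Paths, the restart command, and the
# log message format are assumptions; adjust them for your installation.

CONSOLE_LOG=${CONSOLE_LOG:-/usr/coda/etc/console}   # assumed log location

# Print the pid from a "becoming a zombie" line; print nothing otherwise.
# Matches e.g.: "21:55:00 Fatal Signal (11); pid 1708 becoming a zombie..."
zombie_pid() {
    printf '%s\n' "$1" |
        sed -n 's/.*Fatal Signal ([0-9]*); pid \([0-9]*\) becoming a zombie.*/\1/p'
}

watch() {
    tail -F "$CONSOLE_LOG" | while read -r line; do
        pid=$(zombie_pid "$line")
        if [ -n "$pid" ]; then
            # The "zombie" here is Coda's own idle-wait state (it parks the
            # process so gdb can attach), so the process can simply be killed.
            kill -9 "$pid" 2>/dev/null
            /etc/init.d/venus restart           # assumed init script name
        fi
    done
}

# watch   # uncomment to run the watchdog loop
```

Before letting anything restart venus automatically, it may be worth capturing a backtrace first (e.g. `gdb venus 1708` with the right binary path, then `bt`), since preserving that state is exactly what the zombie behavior is for.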