(Illustration by Gaich Muramatsu)
Hello, During the last years we had sometimes Coda servers becoming unresponsive and causing their clients to hang indefinitely. For a long time we believed that the problems were caused by locking races at updates. Despite all the efforts and the help of Jan to analyse and trace down the problems we did not find the reason. Recently we have made two observations which possibly sched a new light on this. 1. The logs showed that the phenomenon has faded away. The only plausible reason seemed to be the reduction of peak load previously caused by other processes on the server machine. 2. We observed also the same (or a very similar?) behaviour in another realm, in the absense of any updates. The triggering event seems to be flaky network connectivity. In the actual case there were several hundred client hosts but no write traffic. So the breakage seems to happen when the server either misses incoming packets or its sent packets get lost - both are likely to happen when the server process does not get enough CPU or/and when the network is temporarily cut off. The server process becomes mostly idle but it sends BUSY to the clients. This should mean that it waits for some condition which never happens. I wonder if there is a missing/unreliable timeout check in some codepath. The server logs in that state look like --------------------------------------------------------------------- ... 16:04:45 There are currently 921 connections in use 16:04:45 There are 540 workstations and 5 are active (req in < 15 mins) ... 16:05:20 AuthLWP-0 received new connection 753826942 from NN.NN.NN.NN:42411 16:05:20 client_GetVenusId: got new host NN.NN.NN.NN:42411 16:05:20 Building callback conn. 16:09:34 New Data Base received [the network failure (there was a failure around that time)] <=========== 16:14:28 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:35910 16:14:28 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.35910 rpcid 1472989626 16:18:16 AuthLWP-4 received new connection 81613927 from NN.NN.NN.NN:47196 16:18:16 AuthLWP-3 received new connection 335328897 from NN.NN.NN.NN:50819 ... [> 500 "new connection" here] <========================================== ... 16:21:00 AuthLWP-0 received new connection 1256344367 from NN.NN.NN.NN:54369 16:21:00 client_GetVenusId: got new host NN.NN.NN.NN:54369 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 Building callback conn. 16:22:31 AuthLWP-2 received new connection 1127630056 from NN.NN.NN.NN:46101 16:22:31 client_GetVenusId: got new host NN.NN.NN.NN:46101 [the new connections (apparently reconnections) end here] <================ [ there are much fewer "Building callback conn" than "received new connection" between 16:18:16 and 16:22:31 above ] 16:45:49 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:37758 ... 19:22:41 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.58732 rpcid 1573500488 19:22:41 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.35548 rpcid 1800508411 19:22:41 Building callback conn. 19:22:41 AuthLWP-4 received new connection 80554604 from NN.NN.NN.NN:59457 19:22:42 client_GetVenusId: got new host NN.NN.NN.NN:59457 19:23:43 AuthLWP-3 received new connection 743361068 from NN.NN.NN.NN:58732 19:23:43 client_GetVenusId: got new host NN.NN.NN.NN:58732 19:29:55 AuthLWP-0 received new connection 1672927627 from NN.NN.NN.NN:38592 19:29:55 client_GetVenusId: got new host NN.NN.NN.NN:38592 19:43:48 AuthLWP-2 received new connection 1433944362 from NN.NN.NN.NN:49381 19:43:48 client_GetVenusId: got new host NN.NN.NN.NN:49381 19:52:32 SmonDaemon timer expired 19:52:32 Entered CheckRVMResStat 19:52:32 Starting SmonDaemon timer 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 Building callback conn. 20:22:35 AuthLWP-1 received new connection 173156798 from NN.NN.NN.NN:49932 20:22:35 client_GetVenusId: got new host NN.NN.NN.NN:49932 20:40:00 RPC2_Bind to NN.NN.NN.NN port 52289 for callback failed RPC2_NOBINDING (F) 20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 1412666908 20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 624175938 20:40:00 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.52289 rpcid 2023617723 20:40:01 Worker9: Unbinding RPC connection 29215551 20:40:01 Deleting client entry for user System:AnyUser at 0.0.0.0.0 rpcid 29215551 20:40:01 Building callback conn. 20:40:01 Callback failed RPC2_DEAD (F) for ws NN.NN.NN.NN:41861 20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1653698504 20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 174909523 20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1135798093 20:40:01 Deleting client entry for user System:AnyUser at NN.NN.NN.NN.41861 rpcid 1323397277 20:40:01 Building callback conn. 20:40:01 AuthLWP-4 received new connection 1963123312 from NN.NN.NN.NN:51701 20:40:01 client_GetVenusId: got new host NN.NN.NN.NN:51701 20:40:45 AuthLWP-3 received new connection 1810381824 from NN.NN.NN.NN:41861 20:40:46 client_GetVenusId: got new host NN.NN.NN.NN:41861 20:41:01 Shutdown received <=========================================== ... 20:41:01 There are currently 1456 connections in use 20:41:01 There are 528 workstations and 3 are active (req in < 15 mins) ... --------------------------------------------------------------------- It looks like there is a chance for a deadlock when a client reconnects while the server is "too slow" to answer. Somehow they wait on each other (?), sending BUSY and keeping each other waiting. Can the timeouts for old connections be reset/kept from expiring by new packets coming from the same client? Note that the server got over 500 new connections while apparently keeping most of the old ones (which should be non-functional at that time). Hope Jan might take a quick look at this. Thanks for Coda, it is otherwise serving us well and is extremely useful. Regards, RuneReceived on 2011-11-25 07:01:03