On Sun, Mar 05, 2006 at 09:12:00PM +0100, Martin Ginkel wrote:
> Dear Jan,
>
> I found the place, where my server hangs in Lock (???)
> for the problem I described in the last problem report.
> All packets sent by client and server get through without
> problem. (I flood-pinged the link for 30min with no loss)
>
> IMHO the symptom comes from srvproc2.cc:615
>
>     SLog(1, "ViceGetTime for user %s at %s:%d on conn %d.",
>          client->UserName, inet_ntoa(client->VenusId->host),
>          ntohs(client->VenusId->port), RPCid);
>
>     if (!errorCode && client) {
>         SLog(1, "ViceGetTime before lock");
>         /* we need a lock, because we cannot do concurrent RPC2 calls on
>          * the same connection */
>         ObtainWriteLock(&client->VenusId->lock);
>         SLog(1, "ViceGetTime after lock");

Ah, nice find.

What happens is that the client detects whether a server is up or down
based on the existence of a callback connection. So when the client
sends a probe, the server pings back on the callback connection. However
backfetches are using the same connection, and your backfetch is taking
very long. So the server is unable to send the ping back to the client.

This shouldn't be a problem because the server should be responding with
RPC2_BUSY which will make the client wait an extra 15 seconds or so. I
guess at some point the client did give up, returned ETIMEDOUT and
disconnected.

Now there are 2 problems here. First of all, if the store takes that
long, it should have been broken up by the client. This is actually a
known bug (introduced somewhere between 6.0.9 and 6.0.12), the fix is
fairly simple, just removing an unnecessary test. I've attached the
patch.

The second problem is that the client clearly must have timed out on an
RPC2 call that was still in progress. This is probably some deeper issue
with the timeout/retransmit handling in RPC2, which will take some time
to figure out.

And arguably, the client shouldn't even have to probe the server because
clearly there is still traffic between the two.
But that is more of an optimization and not really a correctness issue.

> As long as my vice/venus are playing reintegration with backfetch,
> the ViceGetTime call from venus via cfs cs will not get through this
> point. Venus will believe in a disconnected server then.
> Do you have any clue for me, what puts the long duration
> WriteLock during reintegrate?

This is actually not a reintegration write lock. It is caused by the fact
that there is only a single RPC2 connection from the server to the
client, so it can only do one thing at a time: fetch a file, or send a
callback probe.

Jan

=================================================================

ReadyToReintegrate is never true when reintegration is already in
progress, so checking this is incorrect in ClientModifyLog::GetFatHead.

--- a/coda-src/venus/vol_cml.cc
+++ b/coda-src/venus/vol_cml.cc
@@ -360,9 +360,6 @@ cmlent *ClientModifyLog::GetFatHead(int
     cml_iterator next(*this, CommitOrder);
     unsigned long bw; /* bandwidth in bytes/sec */
 
-    if (!vol->ReadyToReintegrate())
-        return NULL;
-
     /* Get the first entry in the CML */
     m = next();

Received on 2006-03-06 11:37:41
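
To make the "one thing at a time" behaviour of the single callback connection concrete, here is a minimal standalone sketch. It is not Coda code: CallbackConn, probe and probe_from_other_thread are hypothetical names, and a std::timed_mutex with std::async stands in for the LWP lock and server threads. The timeout is also only an illustration aid, since ObtainWriteLock in the real server simply blocks and it is the client that gives up.

```cpp
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical stand-in for the single server-to-client callback connection.
// The timed mutex plays the role of client->VenusId->lock.
struct CallbackConn {
    std::timed_mutex lock;
};

// Try to take the connection for a callback ping, waiting at most `patience`.
// Returns true if the ping could have been sent, false if the caller gave up.
bool probe(CallbackConn &conn, std::chrono::milliseconds patience) {
    std::unique_lock<std::timed_mutex> guard(conn.lock, patience);
    return guard.owns_lock();
}

// Run the probe on its own thread, the way a separate server thread would
// handle ViceGetTime while another one is busy with a backfetch.
bool probe_from_other_thread(CallbackConn &conn,
                             std::chrono::milliseconds patience) {
    auto result = std::async(std::launch::async,
                             [&conn, patience] { return probe(conn, patience); });
    return result.get();
}
```

If one thread holds conn.lock for the duration of a long backfetch, a probe with a short patience comes back false, which is roughly what the client sees as a dead server. Once the lock is released the same probe succeeds.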