Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Mon, 6 Mar 2006 11:32:50 -0500

On Sun, Mar 05, 2006 at 09:12:00PM +0100, Martin Ginkel wrote:
> Dear Jan,
> 
> I found the place where my server hangs in Lock (???)
> for the problem I described in my last problem report.
> All packets sent by client and server get through without
> problems. (I flood-pinged the link for 30 min with no loss.)
> 
> IMHO the symptom comes from srvproc2.cc:615
> 
>    SLog(1, "ViceGetTime for user %s at %s:%d on conn %d.",
>         client->UserName, inet_ntoa(client->VenusId->host),
>         ntohs(client->VenusId->port), RPCid);
>    
>    if (!errorCode && client) {
>      SLog(1, "ViceGetTime before lock");
>      /* we need a lock, because we cannot do concurrent RPC2 calls on
>       * the same connection */
>      ObtainWriteLock(&client->VenusId->lock);
>      SLog(1, "ViceGetTime after lock");

Ah, nice find.

What happens is that the client detects whether a server is up or down
based on the existence of a callback connection. So when the client
sends a probe, the server pings back on the callback connection.

However, backfetches use the same connection, and your backfetch is
taking a very long time, so the server is unable to send the ping back
to the client. This shouldn't be a problem, because the server should be
responding with RPC2_BUSY, which makes the client wait an extra 15
seconds or so. I guess at some point the client did give up, returned
ETIMEDOUT and disconnected.
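
To make the BUSY behaviour concrete, here is a minimal sketch of a client
whose deadline is pushed back each time a busy reply arrives. It is not RPC2
code: the names Reply and ClientGetsResult, the event list, and the timeout
values are all made up for illustration. The only part taken from the text
above is the idea that a busy reply buys roughly 15 more seconds.

```cpp
#include <utility>
#include <vector>

// Hypothetical reply kinds; Busy stands in for RPC2_BUSY.
enum class Reply { Busy, Result };

// arrivals: (arrival time in seconds, reply kind), sorted by time.
// Returns true if the result arrives before the client gives up,
// false if the client would have timed out first (the ETIMEDOUT case).
bool ClientGetsResult(const std::vector<std::pair<int, Reply>> &arrivals,
                      int initial_timeout_s, int busy_extension_s) {
    int deadline = initial_timeout_s;
    for (const auto &event : arrivals) {
        if (event.first > deadline)
            return false;                 // client gave up before this packet
        if (event.second == Reply::Result)
            return true;                  // call completed in time
        deadline = event.first + busy_extension_s;  // busy: wait a bit longer
    }
    return false;                         // no result ever arrived
}
```

With busy replies at 10 s and 20 s and the result at 30 s, a client using
15 s for both values keeps waiting and gets its answer. If the busy replies
stop after 10 s and the result only arrives at 40 s, the same client gives
up, which is the kind of failure described above.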

Now there are two problems here. First of all, if the store takes that
long, it should have been broken up by the client. This is actually a
known bug (introduced somewhere between 6.0.9 and 6.0.12). The fix is
fairly simple: remove an unnecessary test. I've attached the patch.

The second problem is that the client clearly must have timed out on an
RPC2 call that was still in progress. This is probably some deeper issue
with the timeout/retransmit handling in RPC2, which will take some time
to figure out.

And arguably, the client shouldn't even have to probe the server because
clearly there is still traffic between the two. But that is more of an
optimization and not really a correctness issue.

> As long as my vice/venus are playing reintegration with backfetch,
> the ViceGetTime call from venus via cfs cs will not get past this
> point. Venus will then believe the server is disconnected.
> Do you have any clue as to what holds the WriteLock for so long
> during reintegration?

This is actually not a reintegration write lock. It is caused by the
fact that there is only a single RPC2 connection from the server to the
client, so the server can only do one thing at a time on it: fetch a
file, or send a callback probe.
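
A rough sketch of that constraint, with the single callback connection
modelled as one write-locked resource. All names here (CallbackConn,
TryAcquire, Release) are hypothetical and not taken from the Coda sources,
and where the real server waits in ObtainWriteLock this sketch just reports
the conflict so the effect is easy to see.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the one server-to-client callback connection.
struct CallbackConn {
    bool write_locked = false;     // plays the role of client->VenusId->lock
    std::vector<std::string> log;  // what happened on the connection, in order
};

// Try to use the connection. Returns false while another user holds it.
bool TryAcquire(CallbackConn &conn, const std::string &who) {
    if (conn.write_locked) {
        conn.log.push_back(who + " blocked");
        return false;
    }
    conn.write_locked = true;
    conn.log.push_back(who + " running");
    return true;
}

// Give the connection back once the current user is done.
void Release(CallbackConn &conn) {
    assert(conn.write_locked);
    conn.write_locked = false;
}
```

If "backfetch" acquires the connection first, a "probe" attempt is refused
until Release is called, however long the fetch takes. That is the
ViceGetTime stall seen at srvproc2.cc:615.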

Jan

=================================================================
ReadyToReintegrate is never true when reintegration is already in
progress, so checking this is incorrect in ClientModifyLog::GetFatHead.

--- a/coda-src/venus/vol_cml.cc
+++ b/coda-src/venus/vol_cml.cc
@@ -360,9 +360,6 @@ cmlent *ClientModifyLog::GetFatHead(int 
     cml_iterator next(*this, CommitOrder);
     unsigned long bw; /* bandwidth in bytes/sec */
 
-    if (!vol->ReadyToReintegrate())
-        return NULL;
-
     /* Get the first entry in the CML */
     m = next();
 
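
The effect of the removed test can be illustrated with a heavily simplified
model. This is not the Venus code: Volume, LogEntry and both functions
below are hypothetical, ReadyToReintegrate is reduced to the one property
stated in the patch description, and the bandwidth-based selection that the
real GetFatHead performs is left out.

```cpp
#include <vector>

// Hypothetical, simplified volume state.
struct Volume {
    bool reintegrating = false;
    // Assumption from the patch description: never true while a
    // reintegration is already in progress.
    bool ReadyToReintegrate() const { return !reintegrating; }
};

// Hypothetical stand-in for a client modify log entry.
struct LogEntry { long bytes; };

// Before the patch: the guard hides the head entry whenever the caller
// is already in the middle of a reintegration.
const LogEntry *GetFatHeadWithGuard(const Volume &vol,
                                    const std::vector<LogEntry> &cml) {
    if (!vol.ReadyToReintegrate())
        return nullptr;
    return cml.empty() ? nullptr : &cml.front();
}

// After the patch: the head entry is returned regardless of that state.
const LogEntry *GetFatHeadPatched(const Volume &,
                                  const std::vector<LogEntry> &cml) {
    return cml.empty() ? nullptr : &cml.front();
}
```

With reintegrating set to true and a non-empty log, the guarded version
returns nullptr while the patched version returns the first entry, so only
the latter gives the client a chance to break up a large store.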



Re: Behaviour of coda with large files