Coda File System

Re: Unable to do beginrepair...

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 11 Dec 2002 12:33:54 -0500
On Wed, Dec 11, 2002 at 09:31:35AM -0500, Greg Troxel wrote:
> I am having similar problems.  It has seemed to me that coda
> reliability, specifically repair and conflicts arising when they
> shouldn't, has been worse recently (than say a year ago).  It could
> also be that I'm just stressing it more.

Or your network is deteriorating under the load of spam email and
gnutella clients ;)

> found the normal directory-instead-of-object.  local had the right
> contents, and global was a symlink to a volume id.  I was able to do

If global is a symlink to the volumeid, the client hasn't been able to
mount the volume. Perhaps it timed out during the volume lookup or
something. Another reason would be a server-server conflict, but as you
only have a single server that probably isn't the case.

> down venus and tried to run norton, but it seems to be only a server
> thing.  I then tried 'cfs fl .' in the directory, hoping to just

Yup, norton is only a server thing. 'cfs fl' is in many cases an evil
operation: it flushes the data of cached objects in the specified
subtree. It shouldn't touch 'dirty' objects (i.e. the ones with
associated CML entries), but you never know.

'cfs fl' is mostly useful for debugging. It pushes an object out of the
cache, so that I can test or time the fetching of that object.

The CML is an ordered log, and you cannot simply kill an entry within
the log. The only options are stepping through the operations with
'repair', or purging the log with 'cfs purgeml'. Maybe you could try
'removeinc'; I know someone here tried to make that work for
local-global conflicts, but I'm not sure it ever worked.
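
To illustrate why an entry cannot simply be removed from the middle of an
ordered log, here is a minimal toy model. It is not venus's actual CML
data structure; the names Record and replay are hypothetical, and the
three operations are just the ones listed by 'cfs ck' in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    op: str      # e.g. "Create", "Chown", "Store"
    target: str  # name of the object the operation applies to


def replay(log):
    """Apply records in order; fail if a record refers to an object
    that no earlier record has created."""
    existing = set()
    for rec in log:
        if rec.op == "Create":
            existing.add(rec.target)
        elif rec.target not in existing:
            raise ValueError(f"{rec.op} {rec.target}: object does not exist")
    return existing


log = [
    Record("Create", "local"),
    Record("Chown", "local"),
    Record("Store", "local"),
]

replay(log)  # succeeds: every record follows the Create it depends on

# Dropping only the first record leaves the later ones dangling.
broken = log[1:]
try:
    replay(broken)
    dangling = False
except ValueError:
    dangling = True
```

In this model the later records depend on the earlier one, so the only
safe choices are to handle the records in order or to drop the whole log.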

> poblano gdt 247 ~/%co/HARDWARE/POBLANO > cfs gf wi0.dump 
> VIOC_GETFID: Connection timed out
> 
> [ W(19) : 0000 : 09:15:58 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
> [ W(19) : 0000 : 09:16:00 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
> [ W(19) : 0000 : 09:16:01 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
> 
> [ L(18) : 0000 : 09:16:02 ] LocalInconsistentObj: objFid=7f000001.4830.988c

Ugh, these are local 'fake' fids, but during reintegration these should
have been replaced with correct global fids. Maybe your kernel is a bit
too aggressive in caching the directory contents, or the objects got
lost as a result of the cfs fl or failed repairs.
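
When reading a longer venus log, it can help to pull out the fids that
were reported as missing. The following is a small sketch that assumes
a fid is printed as three dot-separated hex fields, with or without a
"0x" prefix, as in the lines quoted above. The helper name parse_fid is
hypothetical.

```python
import re

# Assumption: three dot-separated hex fields, each optionally "0x"-prefixed.
FID_RE = re.compile(
    r"((?:0x)?[0-9a-f]+)\.((?:0x)?[0-9a-f]+)\.((?:0x)?[0-9a-f]+)",
    re.IGNORECASE,
)


def parse_fid(text):
    """Return the first fid found in text as a tuple of three ints, or None."""
    m = FID_RE.search(text)
    if m is None:
        return None
    return tuple(int(field, 16) for field in m.groups())


lines = [
    "[ W(19) : 0000 : 09:15:58 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!",
    "[ W(19) : 0000 : 09:16:00 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!",
    "[ W(19) : 0000 : 09:16:01 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!",
    "[ L(18) : 0000 : 09:16:02 ] LocalInconsistentObj: objFid=7f000001.4830.988c",
]

# Distinct fids that venus reported as locally created but not found.
missing = {parse_fid(line) for line in lines if "Locally created fid" in line}
```

For the excerpt above, all three warnings refer to the same fid, so
missing contains a single entry.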

> The cml file from 'cfs ck':
> Create  ???/local
> Chown   ???/local (owner = 0)
> Store   ???/local (length = 2836)
> and the tar file is empty.

Ok, so we have a CML entry to create a file named 'local', but the
client is unable to find the associated container file. That's a pretty
bad state right there. The name is typically the last name used to
access the object. I guess this got expanded as a file conflict and the
'local' directory entry was the last name used to access wi0.dump.

It could be that the object still exists, but has a correct global fid.
However, given the way fids are translated from local fake fids to
global fids, I can't see how the CML references could have been missed
during the update.
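
As a rough illustration of why a missed reference would be surprising:
if the translation is one pass that rewrites every occurrence of the
local fid in the log, no record can be left behind. This is a toy model
under that assumption, not the venus code. The record layout, the
function translate_fid, and the GLOBAL value are all hypothetical.

```python
def translate_fid(records, local_fid, global_fid):
    """Return a copy of the log with every reference to local_fid
    replaced by global_fid. Other fids are left untouched."""
    return [
        {**rec, "fids": [global_fid if f == local_fid else f for f in rec["fids"]]}
        for rec in records
    ]


LOCAL = (0x7F000001, 0xFFFFFFFF, 0x80001)  # fid from the quoted log lines
GLOBAL = (0x1000, 0x2, 0x3)                # made-up value, for illustration only

cml = [
    {"op": "Create", "fids": [LOCAL]},
    {"op": "Chown", "fids": [LOCAL]},
    {"op": "Store", "fids": [LOCAL]},
]

translated = translate_fid(cml, LOCAL, GLOBAL)

# After the pass, no record should still mention the local fid.
leftover = [rec for rec in translated if LOCAL in rec["fids"]]
```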

> 08:56:56 Callback failed RPC2_DEAD (F) for ws [client-in-question]:65516
> 09:01:19 Callback failed RPC2_DEAD (F) for ws [client-in-question]:64967

These are a result of timeouts. For some reason the client is not
responding to (or not receiving) rpc2 callback messages. As a result the
server will kill all incoming connections from that client.

> Having to reinit venus is a real problem when I don't have a
> fast link (28.8kb/s right now).  And that's when it breaks, usually.
> I realize my connectivity is lame for 2002 standards, but it seems
> supporting this situation is one of the design goals for coda.

Hey, I've got 33k6. I'll throttle it to 28.8 for a while to see if I get
hit by similar problems.

> conflicts due to my work style; I only get pseudo-conflicts, I think
> due to a reintegration failing on the client but succeeding on the
> server.   And the latter kind can never be fixed, in my experience.)

The server should deal with retried reintegrations; I'm not sure why it
doesn't seem to do that in your case.
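
For readers unfamiliar with the idea, one common way to make a retried
request harmless is to remember what has already been applied for each
client and skip it on a retry. The sketch below shows that general
technique only. It is an assumption for illustration, not a description
of how the Coda server handles it; ToyServer and the sequence numbers
are hypothetical.

```python
class ToyServer:
    def __init__(self):
        self.applied = {}  # client id -> highest sequence number applied
        self.store = []    # operations actually applied, in order

    def reintegrate(self, client, records):
        """records is a list of (seqno, op) pairs in log order.
        Returns the number of records newly applied."""
        last = self.applied.get(client, 0)
        fresh = [(n, op) for n, op in records if n > last]
        for n, op in fresh:
            self.store.append(op)
            last = n
        self.applied[client] = last
        return len(fresh)


server = ToyServer()
batch = [(1, "Create local"), (2, "Chown local"), (3, "Store local")]

first = server.reintegrate("client-a", batch)  # applied for the first time
retry = server.reintegrate("client-a", batch)  # reply was lost, client retries
```

In this model the retry applies nothing new, so a reintegration that
succeeded on the server but appeared to fail on the client does not
turn into a conflict.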

> I am running code from 2002-10-14; I'll upgrade to -current CVS and
> see if things are better.

Changes that went into rpc2 and rvm that might help:

RPC2 _tries_ to deal with asymmetric connections. I don't have ADSL so
I haven't been able to test this myself, but the changes might have
improved the robustness over slow links as well. Your server might have
relatively more bandwidth because it can encrypt the ipsec traffic
faster than your client?

I added an extra scheduling point in RVM. This should improve the
responsiveness of the rpc2_listener thread, i.e. the one that tells the
other side that it got the message and is 'busy'.

But I haven't fixed anything that I can directly link to the problems
you are experiencing.

Jan
Received on 2002-12-11 12:35:21