Coda File System

Re: Problems with replication on two servers.

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 23 Apr 2009 16:52:48 -0400
On Thu, Apr 23, 2009 at 11:42:22AM +0200, Marc SCHLINGER wrote:
> The root volume is created at the end of the scm installation so I guess  
> it's not replicated on the replica.

Right, that is done to simplify the common case of a new user setting
up just one server.

You would have to remove the root volume, create a new replicated
volume to replace it, and then probably reinitialize the clients so
that they actually forget about the old single-replica root.
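
Roughly along these lines on the SCM (just a sketch; 'rootvol',
'server1' and 'server2' are placeholders, and the exact createvol_rep
arguments vary between Coda releases, so check the man pages for your
version):

    # on the SCM: drop the old root volume and recreate it replicated
    $ purgevol_rep rootvol
    $ createvol_rep rootvol server1 server2

    # on each client: restart venus with a freshly initialized cache
    $ venus -init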

> root_at_client# cfs mkmount /coda/myrealm.yeh/test test
> root_at_client# ls /coda/myrealm.yeh/test
>
> Until this step it's okay. I can create files in volume test.

That is a good start.

> It becomes complicated when on the scm, I block all traffic using iptables.
> I see the client starting sending messages to the replica(via tcpdump).  
> But when I unblock the traffic on the scm I always get the same error.
> On the scm:
> 18:25:10 GetVolObj: Volume (1000002) already write locked
> 18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
> 18:25:46 LockQueue Manager: found entry for volume 0x1000002

The 'Volume ... already write locked' message sounds very ominous, but
it is really just a debugging message that was added to help track down
Rune's issues. It is logged whenever we get a read operation for a
volume that is write locked. We used to silently wait for the write to
complete, which ties up a server thread; but Rune was describing some
sort of deadlock issue, so instead of sleeping we now loudly complain,
return an error, and leave it up to the client to retry the operation.

He is running a non-replicated server, so his testing never hit the
resolution case, and in any case the change doesn't seem to have solved
his issues, so I'll probably revert it, especially since there is now
no queueing on these locks and readers are in some cases unable to
obtain the lock at all.

> On the replica:
> 18:34:36 Going to spool log entry for phase3
> 18:34:38 CheckRetCodes: server 132.227.168.169 returned error 11
> 18:34:38 ViceResolve: Couldnt lock volume 7f000001 at all accessible servers

Ok, so when it fails to take the lock on the SCM it continues resolving
with only the remaining server (the replica), which of course doesn't
really help much. It looks like resolution doesn't handle being bounced
back just because the lock happened to be taken.
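
Once the write lock on the SCM is dropped again, it may be worth
retrying from the client before reaching for repair. A rough sketch,
assuming the iptables rules have been removed:

    $ cfs checkservers            # let venus notice the SCM is reachable again
    $ ls /coda/myrealm.yeh/test   # accessing the object triggers another resolution attempt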

> On the client I got a dangling symlink for volume test.

Right, resolution failed to get all replicas in sync, so the client is
still seeing different copies on different sites and shows the dangling
symlink to indicate that the user should 'repair' the problem.

In this case repair would probably be something like,

    $ repair
    repair> beginrepair /coda/myrealm.yeh
    repair> comparedirs /tmp/fix
    repair> dorepair
    repair> end

Of course it would have been nice if resolution had succeeded.
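
And once a repair has gone through, the quickest sanity check is to
look at the mountpoint again; it should be an ordinary directory rather
than a dangling symlink:

    $ ls -l /coda/myrealm.yeh/test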

Jan
Received on 2009-04-23 16:53:36