On Thu, Apr 23, 2009 at 11:42:22AM +0200, Marc SCHLINGER wrote:
> The root volume is created at the end of the scm installation so I guess
> it's not replicated on the replica.

Right, that is done to simplify the common case of a new user setting up
just one server. You would have to remove the root volume, create a new
replicated volume to replace it, and then probably reinitialize the clients
so that they actually forget about the old single-replica root (a rough
sketch of the commands is appended at the end of this message).

> root_at_client# cfs mkmount /coda/myrealm.yeh/test test
> root_at_client# ls /coda/myrealm.yeh/test
>
> Until this step it's okay. I can create files in volume test.

That is a good start.

> It becomes complicated when, on the scm, I block all traffic using
> iptables. I see the client start sending messages to the replica (via
> tcpdump). But when I unblock the traffic on the scm I always get the
> same error.
> On the scm:
> 18:25:10 GetVolObj: Volume (1000002) already write locked
> 18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
> 18:25:46 LockQueue Manager: found entry for volume 0x1000002

The "volume ... already write locked" message sounds very ominous, but it
is really just a debugging message that was added to help track down Rune's
issues. It is logged whenever we get a read operation for a volume that is
write locked. At that point we used to block until the write completed,
which ties up a server thread. However, Rune was describing some sort of
deadlock, so instead of silently sleeping we now loudly complain, return an
error, and leave it up to the client to retry the operation. He is running
a non-replicated server, so his testing never hit the resolution case, and
either way it doesn't seem to have solved his problem, so I'll probably
revert this change. Especially as there is now no queueing on these locks,
so readers are in some cases unable to obtain the lock at all.

> On the replica:
> 18:34:36 Going to spool log entry for phase3
> 18:34:38 CheckRetCodes: server 132.227.168.169 returned error 11
> 18:34:38 ViceResolve: Couldnt lock volume 7f000001 at all accessible servers

Ok, so resolution fails to lock the volume on the SCM and continues with
only the remaining server (the replica), which of course doesn't really
help much. It looks like resolution doesn't cope well with being bounced
back just because the lock happened to be taken.

> On the client I got a dangling symlink for volume test.

Right, resolution failed to get all replicas in sync, so the client is
still seeing different copies on different sites and shows the dangling
symlink to indicate that the user should 'repair' the problem. In this
case repair would probably be something like,

  $ repair
  repair> beginrepair /coda/myrealm.yeh
  repair> comparedirs /tmp/fix
  repair> dorepair
  repair> end

Of course it would have been nice if resolution had succeeded.

Jan
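
P.S. A minimal sketch of the replicated-root procedure mentioned above, run
on the SCM. The volume name coda.root, the /vicepa partitions, and the
server names scm.myrealm.yeh and replica.myrealm.yeh are only placeholders
for this setup; substitute whatever your installation actually uses.

  # on the SCM: remove the old single-replica root volume and recreate it
  # with a replica on both servers (volume/server/partition names are
  # placeholders)
  purgevol_rep coda.root
  createvol_rep coda.root scm.myrealm.yeh/vicepa replica.myrealm.yeh/vicepa

  # on each client: reinitialize venus so it forgets the old root volume
  # (this throws away the local cache, including any unreintegrated changes)
  venus -init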