Coda File System

Re: Unresponsive repair operation lets CML grow

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 01 Apr 2011 16:19:42 -0400
On 04/01/2011 03:15 PM, Simon de Hartog wrote:
> for a few weeks now, we have repeatedly had problems with one Coda client
> that doesn't seem to push its updates to the server. We have monitoring
> on every client and get a call when the CML entries go over 25. I've
> found what I think is a local/global conflict. I'll just post some
> info, not sure what you need to be able to point me in the right direction.
> 
> We have two servers and currently about 8 clients. The problem client is
> called cmp06. The volume with the conflict is named cmpprod. This has
> happened before; the last two times we resorted to stopping all apps
> using files in /coda, stopping venus, de-installing venus, running
> "rm -rf /var/log/coda /var/lib/coda /var/cache/coda", and then
> reinstalling venus from scratch. This worked for a while:
> modifications were correctly pushed to the servers and showed up on
> other clients.

The 'easy way' to drop the pending writes from the CML is to run
	cfs purgeml /coda/..../path/to/volume

This of course will make you lose any pending writes that have not been
sent to the server, but if you are stopping venus and reinstalling it
from scratch you clearly aren't too worried about keeping that data, and
the purgeml operation doesn't require you to kill any locally running
processes.
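
For this particular volume that would look something along these lines,
with a second cfs lv afterwards just to verify that the CML entry count
dropped back to zero:

	cfs purgeml /coda/nkh.spup.net/cmpprod
	cfs lv /coda/nkh.spup.net/cmpprod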

> root_at_cmp06:/# cfs lv /coda/nkh.spup.net/cmpprod
>   Status of volume 7f000004 (2130706436) named "cmpprod"
>   Volume type is ReadWrite
>   Connection State is Reachable
>   Reintegration age: 0 sec, time 15.000 sec
>   Minimum quota is 0, maximum quota is unlimited
>   Current blocks used are 2965098
>   The partition has 7823104 blocks available out of 11756312
>   *** There are pending conflicts in this volume ***
>   There are 30 CML entries pending for reintegration (3617288 bytes)
> 
> The command cfs listlocal /coda/nkh.spup.net/cmpprod never returns and
> gives no output at all (waited for a little over 30 minutes)

That is sort of unusual and seems to indicate something else is wrong.
It may mean that purgeml wouldn't work either, because the volume has
an active write lock for some reason. Listlocal is really essential to
figure out which operation is marked as a conflict to begin with.

In some cases a simple 'cfs checkservers ; cfs forcereintegrate' is
enough to force a retry of a failed operation. This can happen when a
new file is created on only a single server and we are disconnected from
that server before the create has resolved to the second server. In that
case there is no valid destination when we try to store the file data,
so the client declares a local-global conflict and blocks
reintegration. The checkservers makes both servers visible again and
the forcereintegrate retries the store operation, which can then
succeed on the server where the file was created. This is the case when
the head of the CML is a STORE operation.
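
In your case that would be roughly the following; I believe
forcereintegrate takes the path into the volume like the other cfs
volume commands, but cfs help will confirm that on your version:

	cfs checkservers
	cfs forcereintegrate /coda/nkh.spup.net/cmpprod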

A similar case happens when a file is created but we disconnect before
the operation has been acknowledged, even though it did successfully
resolve. When we then retry the create on the other server the operation
id doesn't match and the client declares a conflict because the file
already exists. In this case you would want to run cfs discardlocal to
drop the unnecessary CREATE operation that is at the head of the CML.
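
Again roughly, once listlocal actually responds and shows the CREATE at
the head of the CML (the exact argument form may differ between
versions, cfs help will tell you):

	cfs listlocal /coda/nkh.spup.net/cmpprod
	cfs discardlocal /coda/nkh.spup.net/cmpprod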

> The /var/log/coda/venus.log is filled with entries like these:
> 
> [ W(177) : 0000 : 21:08:54 ] WAIT OVER, elapsed = 5005.9
> [ W(177) : 0000 : 21:08:54 ] WAITING(VOL): cmpprod, state = Reachable,
> [0, 0], counts = [0 0 5 0]

Yeah, those counts indicate that there are currently 5 threads waiting
to enter the volume, so any new operation is going to queue up behind
those waiters.

> [ W(177) : 0000 : 21:08:54 ] CML= [30, 103], Res = 0
> [ W(177) : 0000 : 21:08:54 ] WAITING(VOL): shrd_count = 0, excl_count =
> 0, excl_pgid = 0

But this seems to indicate that nobody actually holds a read or write
lock, so nothing will ever wake up the waiting threads. I wonder how the
client got into that state. It must be an infrequently used code path,
otherwise we'd see this all the time. Maybe when we hit the conflict we
aren't properly waking up the waiting threads.

> I'm not sure whether this is too much, too little or "sufficient" debug
> info. If anyone needs more info, please let me know so I can provide it.

Good info, especially those 4 lines of the log are very interesting.
You could stop and restart the client without reinstalling; you do need
to kill the processes using /coda, though. Since the volume locks are not
persistent, a restart will make it possible to run cfs listlocal and see
which operation we hit the conflict on.
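
Roughly, and assuming the usual init script location for venus on your
distribution (adjust to however venus is started and stopped on your
systems), the sequence would look something like:

	lsof /coda                  # see which processes still hold files open under /coda
	kill <pids from lsof>       # stop or kill those processes
	/etc/init.d/venus stop      # assumption: substitute your distro's venus start/stop method
	/etc/init.d/venus start
	cfs listlocal /coda/nkh.spup.net/cmpprod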

Jan
Received on 2011-04-01 16:19:53