On 04/01/2011 03:15 PM, Simon de Hartog wrote:
> Since a few weeks, we have repeatedly had problems with one Coda client
> that doesn't seem to push its updates to the server. We have monitoring
> on every client and get a call when the CML entries go over 25. I've
> found what I think is a local/global conflict. I'll just post some
> info; I'm not sure what you need to be able to point me in the right
> direction.
>
> We have two servers and currently about 8 clients. The problem client is
> called cmp06. The volume with the conflict is named cmpprod. This has
> already happened before. The last two times we resorted to stopping all
> apps using files in /coda, stopping venus, de-installing venus,
> running "rm -rf /var/log/coda /var/lib/coda /var/cache/coda", and then
> reinstalling venus from scratch. This worked for a while:
> modifications were correctly pushed to the servers and showed up on
> other clients.

The easy way to drop the pending writes from the CML is to run

    cfs purgeml /coda/..../path/to/volume

This will of course make you lose any pending writes that have not been
sent to the server, but if you are stopping venus and reinstalling it
from scratch you clearly aren't too worried about keeping that data, and
the purgeml operation doesn't require you to kill any locally running
processes.
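As a minimal dry-run sketch of that step (the volume path is the one from the report; drop the dry-run wrapper to actually execute on a live client):

```shell
#!/bin/sh
# Dry-run wrapper: prints each command instead of executing it.
# On a real client, change this to:  run() { "$@"; }
run() { echo "+ $*"; }

# Drop all pending (unreintegrated) CML entries for the volume.
# WARNING: on a live client this discards local writes that never
# reached the servers.
run cfs purgeml /coda/nkh.spup.net/cmpprod

# Afterwards the volume status should show an empty CML.
run cfs lv /coda/nkh.spup.net/cmpprod
```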
> root_at_cmp06:/# cfs lv /coda/nkh.spup.net/cmpprod
> Status of volume 7f000004 (2130706436) named "cmpprod"
> Volume type is ReadWrite
> Connection State is Reachable
> Reintegration age: 0 sec, time 15.000 sec
> Minimum quota is 0, maximum quota is unlimited
> Current blocks used are 2965098
> The partition has 7823104 blocks available out of 11756312
> *** There are pending conflicts in this volume ***
> There are 30 CML entries pending for reintegration (3617288 bytes)
>
> The command cfs listlocal /coda/nkh.spup.net/cmpprod never returns and
> gives no output at all (waited for a little over 30 minutes).

That is rather unusual and seems to indicate something else is wrong. It
may mean that purgeml wouldn't work either, because the volume has an
active write lock for some reason. Listlocal is really essential to
figure out which operation is marked as a conflict in the first place.

In some cases a simple 'cfs checkservers ; cfs forcereintegrate' is
enough to force a retry of a failed operation. This can happen when a
new file is created on only a single server and we are disconnected from
that server before the create has resolved to the second server. In that
case, when we try to store the file data there is no valid destination,
so the client declares a local/global conflict and blocks reintegration.
The checkservers makes both servers visible again, and the
forcereintegrate retries the store operation, which can then succeed on
the server where the file was created. This is the case when the head of
the CML is a STORE operation.

A similar case happens when a file is created, but we disconnect before
the operation has been acknowledged, even though it did successfully
resolve. When we then retry the create on the other server, the
operation id doesn't match and it declares a conflict because the file
already exists. In this case you would want to run cfs discardlocal to
drop the unnecessary CREATE operation at the head of the CML.
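Put together, the two recovery paths could be sketched roughly as below (dry-run form; the volume path is from the report, and which branch applies depends on what cfs listlocal shows at the head of the CML):

```shell
#!/bin/sh
# Dry-run wrapper: prints commands instead of executing them.
# On a real client, change this to:  run() { "$@"; }
run() { echo "+ $*"; }

VOL=/coda/nkh.spup.net/cmpprod

# First, find out which operation is stuck at the head of the CML.
run cfs listlocal "$VOL"

# Case 1: head of the CML is a STORE with no valid destination server.
# Make both servers visible again and retry reintegration.
run cfs checkservers
run cfs forcereintegrate "$VOL"

# Case 2: head of the CML is a CREATE that already resolved server-side.
# Drop just that redundant head entry. (The exact argument form is an
# assumption here; check "cfs help" on your client.)
run cfs discardlocal "$VOL"
```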
> The /var/log/coda/venus.log is filled with entries like these:
>
> [ W(177) : 0000 : 21:08:54 ] WAIT OVER, elapsed = 5005.9
> [ W(177) : 0000 : 21:08:54 ] WAITING(VOL): cmpprod, state = Reachable,
> [0, 0], counts = [0 0 5 0]

Yes, those counts indicate that there are currently 5 threads waiting to
enter the volume, so any new operation is going to queue up along with
those waiters.

> [ W(177) : 0000 : 21:08:54 ] CML= [30, 103], Res = 0
> [ W(177) : 0000 : 21:08:54 ] WAITING(VOL): shrd_count = 0, excl_count =
> 0, excl_pgid = 0

But this seems to indicate that nobody actually holds a read or write
lock, so nothing will ever wake up the waiting threads. I wonder how the
client got into that state. It must be an infrequently used code path,
otherwise we'd see this all the time. Maybe when it hit the conflict we
aren't properly waking up the waiting threads.

> I'm not sure whether this is too much, too little or "sufficient" debug
> info. If anyone needs more info, please let me know so I can provide it.

Good info; especially those 4 lines of the log are very interesting. You
could stop and restart the client without reinstalling, though you need
to kill the processes using /coda. As the volume locks are not
persistent, this will make it possible to run cfs listlocal and see
which operation we hit the conflict on.

Jan

Received on 2011-04-01 16:19:53
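[Editor's note: the restart-without-reinstall path suggested in the reply could be sketched roughly as below. Dry-run form; the venus stop/start commands are assumptions and vary by distribution, and the volume path is the one from the report.]

```shell
#!/bin/sh
# Dry-run wrapper: prints commands instead of executing them.
# On a real client, change this to:  run() { "$@"; }
run() { echo "+ $*"; }

# Kill every process that still has files open under /coda.
run fuser -k -m /coda

# Restart the client WITHOUT wiping /var/lib/coda etc., so the CML is
# preserved. (Init-script name is an assumption; adjust for your system.)
run /etc/init.d/venus stop
run /etc/init.d/venus start

# Volume locks are not persistent, so after the restart listlocal should
# return and show which operation hit the conflict.
run cfs listlocal /coda/nkh.spup.net/cmpprod
```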