On 04/01/11 22:19, Jan Harkes wrote:
> The 'easy way' to drop the pending writes from the CML is to run
>   cfs purgeml /coda/..../path/to/volume
> This of course will make you lose any pending writes that have not been
> sent to the server, but if you are stopping venus and reinstalling it
> from scratch you clearly aren't too worried about keeping that data, and
> the purgeml operation doesn't require you to kill any locally running
> processes.

Hi Jan,

We are quite worried about dropping data, but if the current situation
only gets worse, we would indeed choose to discard some of it.

> That is sort of unusual and seems to indicate something else is wrong.
> This may mean that purgeml wouldn't work either, because the volume has
> an active writelock for some reason. Listlocal is really essential to
> figure out what operation is marked as a conflict to begin with.

I checked whether any processes keep files open with "lsof | grep coda",
but the output showed that only venus has files open, and only under
/var/lib/coda and /var/log/coda, not /coda. So there are no open files
under /coda/*.

> In some cases a simple 'cfs checkservers ; cfs forcereintegrate' is
> enough to force a retry on a failed operation. This can happen when a
> new file is created on only a single server and we are disconnected from
> that server before the create has resolved to the second server. In that
> case, when we try to store the file data there is no valid destination,
> and the client will declare a local-global conflict and block
> reintegration. The checkservers would make both servers visible again,
> and the forcereintegrate will retry the store operation, which can then
> succeed on the server where the file was created. This is when the head
> of the CML is a STORE operation.

Ok, I didn't think it would be that simple! I tried "cfs cs; cfs fr" and
that simply solved the problem. All changes have been reintegrated into
the servers, thank you very much. I immediately added that command to our
"Emergency procedures" Wiki page ;) (the commands are spelled out in full
in the P.S. at the end of this mail).

> Yeah, those counts indicate that there currently are 5 threads waiting
> to enter the volume, so any operation is going to queue up along with
> those waiters. But this seems to indicate that nobody actually holds a
> read or write lock, so nothing will ever wake up the waiting threads.
> I wonder how the client got into that state. It must be a not frequently
> used code path, otherwise we'd see this all the time. Maybe when it hit
> the conflict we aren't properly waking up waiting threads.

Is there any info I can provide about the code paths that may have been
followed? We currently have a software telephony application (PBX)
running on cmp06 (and on cmp01 through cmp08) which stores voicemail
files (.wav) on Coda, so they are available on all systems for playing
back voicemails. A file is created to hold the .wav data, and right after
it has been closed it is renamed so that the filename contains the length
of the voicemail in seconds. Could creating a file and then renaming it
before it has been reintegrated on the servers be something that does not
always go smoothly? I can't really imagine it being a problem, but I'm
not familiar with the Coda sources.
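To make the access pattern concrete, it is roughly the following (the
voicemail directory, the file names and the recording command are made-up
placeholders here; the real application does this through its own code,
not a shell script):

  VMDIR=/coda/nkh.spup.net/cmpprod/voicemail   # example path, not the real one
  TMP="$VMDIR/msg-in-progress.wav"

  record_call > "$TMP"                 # 1. file is created and written
                                       #    (placeholder command); when it is
                                       #    closed, venus - as far as I
                                       #    understand - appends a STORE
                                       #    record to the CML
  DUR=23                               # 2. length of the recording in seconds
  mv "$TMP" "$VMDIR/msg-${DUR}s.wav"   # 3. renamed right away, usually before
                                       #    the STORE has been reintegrated to
                                       #    the servers

So from Coda's point of view this is a STORE followed almost immediately
by a RENAME of the same, not-yet-reintegrated object.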
> Good info, especially those 4 lines of the log are very interesting. You
> could stop and restart the client without reinstalling, you need to kill
> the processes using /coda though. As the volume locks are not
> persistent, this will make it possible to run cfs listlocal and see what
> operation we hit a conflict on.

As I said, "cfs listlocal" didn't work, so I used a find command to locate
the object in conflict (conflicts show up as dangling symlinks):

  find -L /coda/nkh.spup.net/cmpprod -type l

So in short, "cfs cs; cfs fr" worked fine, and if I can help debug this
issue further, please let me know what you need from me. I'm available on
IRCnet as St5.

> Jan

Thanks again for your help.

Kind regards,

Simon de Hartog
SpeakUp BV
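P.S. For anyone who finds this thread in the archives later: the
abbreviated "cfs cs; cfs fr" above stands for cfs checkservers followed
by cfs forcereintegrate, roughly as below (I believe forcereintegrate can
be given the directory of the affected volume; the path is just the one I
used with find above):

  cfs checkservers                                  # re-probe all known servers
  cfs forcereintegrate /coda/nkh.spup.net/cmpprod   # retry the pending CML
                                                    # entries for that volume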