On Tue, May 27, 2008 at 08:21:51AM -0400, Janusz Krzysztofik wrote:
> After I have tried to resolve a local conflict, my server has crashed.
> Now it keeps crashing just after the client reconnects. My SrvLog ends
> with:

Clearly something like this should not happen. If some operation that
was sent by the client doesn't apply correctly to the internal state, we
should be sending an error back to the client instead of crashing.

So I am really interested in the full server log as well as the list of
operations the client is trying to reintegrate. As the server logs are
pretty big, please send them to me off-list. To get the list of
operations the client is trying to reintegrate, run
'cfs listlocal /path/to/problematicvolume' on the client.

> 13:09:21 --PO: 1000003.8cfd0.c04d2
> 13:09:21 Entering VFlushVnode for vnode 8cfd0
> 13:09:21 Entering ObjectExists(volindex= 2, (467e7.c04d2)
> 13:09:21 ObjectExists: NO object 467e7.c04d2
> 13:09:21 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******
> 13:09:21 ****** Aborting outstanding transactions, stand by...
> 13:09:21 Uncommitted transactions: 1
> 13:09:21 Uncommitted transactions: 1
> 13:09:21 Committing suicide now ........
>
> I had similar problems several times before and managed to restore the
> operation by reinitializing the client cache. This time I would prefer
> keeping all the changes waiting for reintegration. Is there a way to
> skip the error provoking operation without purging the client cache?

I guess we would first have to identify which operation is the
problematic one, and then we can discard successive operations from the
beginning of the reintegration log with 'cfs discardlocal' until we get
past the problematic ones.

The problematic object seems to be the one that has the file identifier
'1000003.8cfd0.c04d2'. Now the 1000003 part is the non-replicated
volume-id, so the object probably has a different volumeid value on the
client.
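To make the identification step a bit more concrete, here is a rough Python sketch that pulls the last file identifier logged on a '--PO:' line before the SIGNAL 11 line and splits it into its volume, vnode and unique parts. The function names are hypothetical, the log format is inferred only from the excerpt quoted above, and treating all three fields as hexadecimal is an assumption. It is not part of Coda.

```python
import re

# Matches the fid on a "--PO:" line, as seen in the quoted SrvLog excerpt.
# Assumption: all three dot-separated fields are hexadecimal.
FID_RE = re.compile(r"--PO:\s*([0-9a-fA-F]+)\.([0-9a-fA-F]+)\.([0-9a-fA-F]+)")


def split_fid(fid):
    """Split 'volume.vnode.unique' into three integers (hypothetical helper)."""
    volume, vnode, unique = (int(part, 16) for part in fid.split("."))
    return volume, vnode, unique


def last_fid_before_crash(log_lines):
    """Return the last fid seen on a '--PO:' line before the SIGNAL 11 line.

    Returns None if the log contains no crash line.
    """
    last = None
    for line in log_lines:
        match = FID_RE.search(line)
        if match:
            last = ".".join(match.groups())
        elif "INTERRUPTED BY SIGNAL 11" in line:
            return last
    return None
```

Feeding it the lines of a SrvLog (for example via open(path).readlines()) gives the server-side fid. Only the vnode and unique parts would be expected to match what the client lists, since the volume part is the non-replicated volume-id.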
The failure seems to indicate that another object is missing, most
likely a directory. So maybe we are trying to create or move an object
in a directory that no longer exists, and I wonder if there is a
rename/removedir or create/removedir pair of operations in your CML.

If the problem is caused by such a combination, it may be possible to
artificially force the client to reintegrate in really small batches so
that both operations end up in different reintegration attempts. There
is no easy way to do this; it would probably involve both choking the
client's available bandwidth (Lua script for rpc2) and modifying the
reintegration parameters (cfs wd -time 0.001) to the point that we only
push one record at a time. In theory it should be possible, but I've
never tried to actually do this.

Jan

Received on 2008-05-27 11:54:45