On Fri, Aug 26, 2005 at 03:23:45PM -0600, Patrick Walsh wrote:
> SrvErr on both servers was empty. Restarting clients didn't fix the
> problem. The log messages in SrvLog were fairly usual except for a few
> that I haven't seen before that look like this:
>
> ****** WARNING entry at 0x812ac80 already has deqing set!

I've seen these messages before and figured they were mostly harmless.
I think it happens when we find an object that is about to be removed;
in this case it looks like a client is calling getattr or validateattrs
for a previously removed object. But maybe it is a problem in this case,
where the removed object is still sticking around.

I've been hammering reintegration lately and have already found a couple
of cases where we modify flags on an object but 'forget' to mark the
object as modified. As a result it is not written back to RVM and you
could see situations like these, or removed objects that magically
reappear. But I don't think any of the cases I looked at had to do with
symlinks, so I'll keep digging a bit.

> 14:31:07 LockQueue Manager: found entry for volume 0x1000004
...
> 14:35:07 LockQueue Manager: found entry for volume 0x1000004
> LQMan: Unlocking 1000004

This is a different problem. The first step of server-server resolution
grabs a global volume lock on all replicas. This lock is released
whenever resolution completes. If the server that was chosen as the
master during the resolution dies, these locks stick around. So all
servers have a 'lockqueue manager' which walks through the list of
active locks and removes the ones that have been around for a while.
During the locked period it is impossible to perform any write
operations on the volume. But I'm not seeing 'incomplete COP2' messages,
which would be normal when a server actually dies.

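To make the lockqueue manager behaviour described above a bit more
concrete, here is a rough sketch in Python. It is not the actual server
code: the class name LockQueue, its methods, and the 300 second timeout
are all made up for illustration, and the real timeout may well differ.
It only shows the idea of a periodic sweep that drops volume locks that
have been held too long, and of writes being refused while a lock is
held.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LockQueue:
    """Hypothetical stand-in for a server's table of volume locks."""

    # Assumed timeout in seconds; not the value the real server uses.
    max_age: float = 300.0
    # Maps volume id -> time at which the lock was taken.
    locks: dict = field(default_factory=dict)

    def lock(self, volume_id, now=None):
        # First step of resolution: take the volume lock.
        self.locks[volume_id] = time.time() if now is None else now

    def unlock(self, volume_id):
        # Normal path: resolution completed and released the lock.
        self.locks.pop(volume_id, None)

    def can_write(self, volume_id):
        # Write operations are refused while the volume is locked.
        return volume_id not in self.locks

    def sweep(self, now=None):
        """Drop locks older than max_age and return the unlocked volume ids."""
        now = time.time() if now is None else now
        stale = [v for v, taken in self.locks.items() if now - taken > self.max_age]
        for v in stale:
            del self.locks[v]
        return stale
```

In this sketch a lock whose holder went away without calling unlock()
stays in the table, blocking writes, until a later sweep() finds it
older than max_age and removes it.
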
> 14:30:00 Entering RecovDirResolve 7f000003.14d.1407
> 14:30:00 RegDirResolution: WEAKLY EQUAL DIRECTORIES
> 14:30:00 RecovDirResolve: RegDirResolution succeeded

This must be the resolve operation that grabbed the lock. I wonder why
it didn't manage to unlock the other replica when resolution succeeded.

Jan

Received on 2005-08-30 12:26:38