Coda File System

Re: Strange hang and deqing log message

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 30 Aug 2005 12:24:39 -0400
On Fri, Aug 26, 2005 at 03:23:45PM -0600, Patrick Walsh wrote:
> 	SrvErr on both servers was empty.  Restarting clients didn't fix the
> problem.  The log messages in SrvLog were fairly usual except for a few
> that I haven't seen before that look like this:
> 
> ****** WARNING entry at 0x812ac80 already has deqing set!

I've seen these messages before and figured they were mostly harmless. I
think it happens when we find an object that is about to be removed. In
this case it looks like a client is calling getattr or validateattrs for
a previously removed object.

But maybe this is a problem in this case, where the removed object is
still sticking around. I've been hammering reintegration lately and have
already found a couple of cases where we modify flags on an object but
'forget' to mark the object as modified. As a result it is not written
back to RVM and you could see situations like these, or removed objects
that magically reappear. But I don't think any of the cases I looked at
had to do with symlinks, so I'll keep digging a bit.
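
To illustrate the failure mode described above, here is a minimal sketch
in Python. It is not Coda's server code: the Vnode and Store classes, the
'deleted' and 'dirty' fields, and the commit() function are all
hypothetical stand-ins, and the dictionary only plays the role of RVM.
The sketch assumes that write-back is driven purely by a per-object
modified flag.

```python
class Vnode:
    """Hypothetical in-memory object, not Coda's real vnode structure."""
    def __init__(self, fid):
        self.fid = fid
        self.deleted = False  # a flag that operations may change
        self.dirty = False    # must be set for the change to be written back


class Store:
    """Stand-in for persistent storage (RVM in the mail above)."""
    def __init__(self, fids):
        # Persistent copy of each object's 'deleted' flag.
        self.persistent = {fid: False for fid in fids}

    def commit(self, vnodes):
        # Only objects marked as modified are written back.
        for v in vnodes:
            if v.dirty:
                self.persistent[v.fid] = v.deleted
                v.dirty = False


def remove_buggy(v):
    v.deleted = True  # flag changed, but the object is never marked dirty


def remove_fixed(v):
    v.deleted = True
    v.dirty = True    # now commit() will write the change back


def demo():
    a, b = Vnode("a"), Vnode("b")
    store = Store(["a", "b"])
    remove_buggy(a)
    remove_fixed(b)
    store.commit([a, b])
    # What a restarted server would load from persistent storage.
    return store.persistent
```

In this sketch the in-memory copy of "a" looks removed until a restart,
after which the persistent state says it still exists, which is one way
a removed object could appear to come back.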

> 14:31:07 LockQueue Manager: found entry for volume 0x1000004
...
> 14:35:07 LockQueue Manager: found entry for volume 0x1000004
> LQMan: Unlocking 1000004

This is a different problem. The first step of server-server resolution
grabs a global volume lock on all replicas. This lock is released
when resolution completes. If the server that was chosen as the
master during the resolution dies, these locks stick around. So all
servers have a 'lockqueue manager' which walks through the list of
active locks and removes the ones that have been around for a while.
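
The idea behind the lockqueue manager can be sketched roughly as
follows. This is a Python illustration only, not the server's actual
code: the LockQueue class, its method names, and the 300-second maximum
age are assumptions made for the example, and the real timeout may well
differ.

```python
class LockQueue:
    """Hypothetical table of volume locks with an age-based sweep."""

    def __init__(self, max_age):
        self.max_age = max_age  # seconds a lock may live (assumed value)
        self.locks = {}         # volume id -> time the lock was taken

    def lock(self, volume, now):
        self.locks[volume] = now

    def unlock(self, volume):
        # Normal path: resolution completed and released the lock.
        self.locks.pop(volume, None)

    def is_locked(self, volume):
        return volume in self.locks

    def sweep(self, now):
        # Walk the active locks and drop the ones that are too old,
        # e.g. because the resolution master died before unlocking.
        expired = [v for v, t in self.locks.items()
                   if now - t >= self.max_age]
        for v in expired:
            del self.locks[v]
        return expired
```

For example, with q = LockQueue(max_age=300) and q.lock(0x1000004,
now=0), a call to q.sweep(now=60) finds the entry but leaves it alone,
while q.sweep(now=300) removes it and the volume is unlocked again.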

During the locked period it is impossible to perform any write
operations on the volume. But I'm not seeing 'incomplete COP2' messages,
which would be normal when a server actually dies.


> 14:30:00 Entering RecovDirResolve 7f000003.14d.1407
> 14:30:00 RegDirResolution: WEAKLY EQUAL DIRECTORIES
> 14:30:00 RecovDirResolve: RegDirResolution succeeded

This must be the resolve operation that grabbed the lock. I wonder why
it didn't manage to unlock the other replica when resolution succeeded.

Jan
Received on 2005-08-30 12:26:38