Coda File System

Re: MaxRetries, EWOULDBLOCK,

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 28 May 2003 11:44:45 -0400
On Wed, May 28, 2003 at 10:09:56AM +0200, Steffen Neumann wrote:
>       [ W(21) : 0000 : 09:47:25 ] vproc::Begin_VFS(Remove): vid = 7f00002a, u.u_vol = 0, mode = -1
>       [ W(21) : 0000 : 09:47:25 ] vproc::End_VFS(Remove): code = 157
>       [ W(21) : 0000 : 09:47:25 ] Remove : returns Resource temporarily unavailable, elapsed = 23.9 msec

Hmm, we're clearly not even trying to talk to the server. In fact, it
looks like we're failing in the Begin_VFS call, because at loglevel 100
we should at least see the log message from fsdb::Get.

Hmm, looking at the code, the client seems to think that an
application-specific resolver (ASR) is currently active for that volume.
It will kick out any process that is not in the 'correct' process group.
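
To make that gate concrete, here is a toy model of the check as I
described it. It is not the Venus source: the Volume class, the asr_pgid
field and the begin_vfs name are made up for illustration, and returning
EWOULDBLOCK is an assumption based on the "Resource temporarily
unavailable" message in your log.

```python
import errno


class Volume:
    """Toy stand-in for a volume that may have a resolver running (hypothetical)."""

    def __init__(self):
        # Process group of the active resolver, or None when no resolver runs.
        self.asr_pgid = None

    def begin_vfs(self, caller_pgid):
        """Return 0 to admit the caller, or EWOULDBLOCK to turn it away."""
        if self.asr_pgid is not None and caller_pgid != self.asr_pgid:
            # A resolver is active and the caller is outside its process group.
            return errno.EWOULDBLOCK
        return 0
```

In this model every process outside the resolver's group keeps getting
the error for as long as asr_pgid stays set, which would match an
operation failing without any server traffic.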

>       [ W(21) : 0000 : 09:47:25 ] vproc::lookup: fid = (0x7f00002a.0xd1.0xd54), name = flexibility.tex, nc = 0
>       [ W(21) : 0000 : 09:47:25 ] vproc::Begin_VFS(Lookup): vid = 7f00002a, u.u_vol = 0, mode = -1
> 	[ W(21) : 0000 : 09:47:25 ] vproc::End_VFS(Lookup): code = 0

Here we're clearly not doing anything, so we must still be failing in
Begin_VFS. I'm not sure why it doesn't show an error code.
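
If you want to pull the failing operations out of a longer venus.log, a
small filter along these lines might help. It is only a sketch: the
regular expression is inferred from the End_VFS lines quoted in this
thread, so other loglevels or Venus versions may format them differently,
and failed_vfs_calls is a name I made up.

```python
import re

# Pattern inferred from the quoted lines, e.g.
# [ W(21) : 0000 : 09:47:25 ] vproc::End_VFS(Remove): code = 157
END_VFS = re.compile(
    r"\[\s*\w+\(\d+\)\s*:\s*\d+\s*:\s*(?P<time>[\d:]+)\s*\]\s*"
    r"vproc::End_VFS\((?P<op>\w+)\):\s*code\s*=\s*(?P<code>-?\d+)"
)


def failed_vfs_calls(lines):
    """Return (time, operation, code) for every End_VFS line with a non-zero code."""
    failures = []
    for line in lines:
        match = END_VFS.search(line)
        if match and int(match.group("code")) != 0:
            failures.append(
                (match.group("time"), match.group("op"), int(match.group("code")))
            )
    return failures
```

Something like failed_vfs_calls(open("venus.log")) would then list only
the calls that came back with an error. It cannot catch cases like the
Lookup above, where the call does nothing but still logs code = 0.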

> Another thing I see is that venus.log claims 
> the volume 0x7f00002a would be under repair,
> which it isn't:
...
>         [ W(21) : 0000 : 09:45:44 ] mgrpent::PutHostSet: 0x8145500
>         [ W(21) : 0000 : 09:45:44 ] fsdb::Get-mre: key = (7f00002a.23b6.47b3), uid = 17154, rights = 3, comp = .#flexibility.tex
>         [ W(21) : 0000 : 09:45:44 ] repvol::IsUnderRepair: vol = 7f00002a, vuid = -1

That's just calling the test that checks whether the volume is under
repair, but it doesn't show the true or false result. Interestingly, we
did manage to 'enter' the problematic volume here and are attempting to
get objects in the volume.

Ok, it looks like the client believes objects still exist and is trying
to remove them on the server. However, they are already gone on the
server. When the server returns an error, the client tries to start an
application-specific resolver, and while the RPC to trigger the resolver
is timing out, the volume is locked, probably to avoid repeatedly trying
to invoke the ASR.

So there is some process trying to remove an object (or maybe even a
whole subtree, recursively) that was already removed on the server. But
because that process is holding a reference on a (possibly removed)
directory, it will not or cannot refetch the updated information from
the server showing that the directory is in fact nonexistent.

Kill all current processes with references in /coda?
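
To find those processes first, something like the following could list
the candidates. This is a sketch that assumes a Linux client with /proc
mounted and Coda mounted at /coda; pids_referencing is a made-up name,
and without root it only sees your own processes. It checks each
process's working directory and open file descriptors.

```python
import os


def pids_referencing(mount="/coda", proc_root="/proc"):
    """Return sorted PIDs whose cwd or an open fd points at or below `mount`."""
    mount = mount.rstrip("/")
    hits = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process already exited, or we lack permission
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # unreadable or vanished link
            if target == mount or target.startswith(mount + "/"):
                hits.add(int(entry))
                break
    return sorted(hits)
```

The resulting PIDs can be inspected before deciding which ones to kill.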

Jan
Received on 2003-05-28 11:47:26