Coda File System

Greg Troxel: [solution] vnode refcount panic, perhaps due to kern/vfs_lookup.c:lookup()

From: Greg Troxel <gdt_at_ir.bbn.com>
Date: Sun, 16 Mar 2003 20:28:41 -0500
I am 99% sure I found the problem in the NetBSD kernel coda code.  The
problem was coda_symlink calling vput (VOP_UNLOCK, vrele, which drops
the reference count) prior to calling lookup.  lookup expects an
unlocked vnode, but drops the reference, since it takes the vnode as
an arg and does not return it.  Hence the combination of vput and
lookup was causing the refcount to drop by 2.  [now on thin ice]
Eventually the refcount dropped so that even when the vnode should not
have been reclaimed it was, and that ended up it being in state VBAD,
since that's what vgone set it to.

Patch enclosed.

------- Forwarded Message

>From tech-kern-owner-gdt=ir.bbn.com_at_netbsd.org  Sun Mar 16 20:23:29 2003
Return-Path: <tech-kern-owner-gdt=ir.bbn.com_at_netbsd.org>
Delivered-To: gdt_at_ir.bbn.com
Received: from mail.netbsd.org (mail.netbsd.org [155.53.1.253])
	by fnord.ir.bbn.com (Postfix) with SMTP id 9A9EF867
	for <gdt_at_ir.bbn.com>; Sun, 16 Mar 2003 20:23:28 -0500 (EST)
Received: (qmail 27313 invoked by uid 605); 17 Mar 2003 01:23:18 -0000
Delivered-To: tech-kern_at_netbsd.org
Received: (qmail 27306 invoked from network); 17 Mar 2003 01:23:16 -0000
Received: from fnord.ir.bbn.com (192.1.100.210)
  by mail.netbsd.org with SMTP; 17 Mar 2003 01:23:16 -0000
Received: from fnord.ir.bbn.com (localhost [127.0.0.1])
	by fnord.ir.bbn.com (Postfix) with ESMTP id 392307A3
	for <tech-kern_at_netbsd.org>; Sun, 16 Mar 2003 20:23:01 -0500 (EST)
From: Greg Troxel <gdt_at_ir.bbn.com>
To: tech-kern_at_netbsd.org
Subject: [solution] vnode refcount panic, perhaps due to kern/vfs_lookup.c:lookup() 
In-Reply-To: Message from Jaromir Dolecek <jdolecek_at_netbsd.org> 
   of "Sun, 16 Mar 2003 21:50:45 +0100." <200303162050.h2GKoji22007_at_s102-n054.tele2.cz> 
Date: Sun, 16 Mar 2003 20:23:01 -0500
Message-Id: <20030317012301.392307A3_at_fnord.ir.bbn.com>
Sender: tech-kern-owner_at_netbsd.org
Precedence: list
X-Spam-Status: No, hits=-1.5 required=5.0
	tests=AWL,IN_REP_TO,PATCH_UNIFIED_DIFF,SPAM_PHRASE_00_01
	version=2.43
X-Spam-Level: 

I think I have found the problem, but I'm not 100% sure since the
crashes were not repeatable by a fixed number of operations.  It used
to crash after 3-5 'modify buffer, ^X u' operations, and now I've gone
20 or so.  (duh this is emacs so off goes a keyboard macro to do '
^X^S' 200 times, causing massive coda lock printouts
(coda_lockdebug=1).  After, repeating with ' ^Xu' 200 times, I'm
pretty convinced that at the very least the code is closer to correct

In coda/coda_vnops.c:coda_symlink, vput is called before lookup on the
parent vnode.  This seemed wrong to me - dropping a reference and then
using the vnode.  I took out the vput and got 'locking against
myself'.  After reading the code more, I came to the conclusion:

  lookup expects a referenced vnode, but not one that's locked

So, I replaced vput with VOP_UNLOCK, so that the vn_lock in lookup
would be happy, but without dropping the reference.  I think this lost
reference was casuing the vnode to get reclaimed at a surprising time,
causing the VBAD symptom.

I'd appreciate comments from those who are more vfs-clueful than I am
- - is my interpretation of lookup's calling conventions and the
referencing rules correct?

Here is a patch against netbsd-1-6 as of 20030206 (from my own repo).
The first and third hunks are attempts to notice the VBAD problem
earlier rather than later; the second hunk is the fix - just dropping
vput in favor of VOP_UNLOCK.

I'm happy to send a clean patch and file a PR if that's helpful.

An aside: I suppose that since lookup(9) and the comments in the
sources don't say that the vnode is to be locked on entry, therefore
the caller should hold a reference and the vnode should be unlocked.
This is arguably implied by vnode(9), but it took me a long time to
figure this out.  The notion of holding a reference is pretty
straightforward, but the notion that one passes that reference into a
function for functions that don't "return" their directory arguments
was a bit confusing - some explicit language about when functions
consume references and that returned values are ref'd might be
helpful.  Now that I understand it it seems obvious, so it could just
be me being confused.

Thanks to Bill and Jaromir for helpful hints in understanding vnodes!

- --- coda_vnops.c.~1.1.1.2.~	Wed Dec  5 23:27:40 2001
+++ coda_vnops.c	Sun Mar 16 19:50:58 2003
@@ -1640,8 +1640,15 @@
     error = venus_symlink(vtomi(tdvp), &tdcp->c_fid, path, plen, nm, len, tva, cred, p);
 
     /* Invalidate the parent's attr cache, the modification time has changed */
+    /* XXX what does this do to the vnode? */
     tdcp->c_flags &= ~C_VATTR;
 
+    if (tdvp->v_type == VBAD) {
+	    error = ENODEV;	/* not really right, but want unique */
+	    vprint("venus_symlink VBAD: ", tdvp);
+	    return error;	/* XXX CODADEBUG */
+    }
+
     if (!error)
     {
 	struct nameidata nd;
@@ -1652,9 +1659,16 @@
 	nd.ni_cnd.cn_pnbuf = (char *)nm;
 	nd.ni_cnd.cn_nameptr = nd.ni_cnd.cn_pnbuf;
 	nd.ni_pathlen = len;
- -	vput(tdvp);
+
+	/*
+	 * tdvp is refed and VOP_LOCKed.  lookup seems to require it to
+	 * be referenced (as do all calls ?) but not locked - it locks it.
+	 * So, drop the lock but not the reference.
+	 */
+	VOP_UNLOCK(tdvp, 0);
 	error = lookup(&nd);
 	*ap->a_vpp = nd.ni_vp;
+
     }
 
     /* 
@@ -1980,6 +1994,10 @@
 {
     struct cnode *cp;
     int          err;
+
+    if (type == VBAD) {
+	    panic("make_coda_node: VBAD");
+    }
 
     if ((cp = coda_find(fid)) == NULL) {
 	struct vnode *vp;


------- End of Forwarded Message
Received on 2003-03-16 20:31:34