(Illustration by Gaich Muramatsu)
I am 99% sure I found the problem in the NetBSD kernel coda code. The problem was coda_symlink calling vput (VOP_UNLOCK, vrele, which drops the reference count) prior to calling lookup. lookup expects an unlocked vnode, but drops the reference, since it takes the vnode as an arg and does not return it. Hence the combination of vput and lookup was causing the refcount to drop by 2. [now on thin ice] Eventually the refcount dropped so that even when the vnode should not have been reclaimed it was, and that ended up it being in state VBAD, since that's what vgone set it to. Patch enclosed. ------- Forwarded Message >From tech-kern-owner-gdt=ir.bbn.com_at_netbsd.org Sun Mar 16 20:23:29 2003 Return-Path: <tech-kern-owner-gdt=ir.bbn.com_at_netbsd.org> Delivered-To: gdt_at_ir.bbn.com Received: from mail.netbsd.org (mail.netbsd.org [155.53.1.253]) by fnord.ir.bbn.com (Postfix) with SMTP id 9A9EF867 for <gdt_at_ir.bbn.com>; Sun, 16 Mar 2003 20:23:28 -0500 (EST) Received: (qmail 27313 invoked by uid 605); 17 Mar 2003 01:23:18 -0000 Delivered-To: tech-kern_at_netbsd.org Received: (qmail 27306 invoked from network); 17 Mar 2003 01:23:16 -0000 Received: from fnord.ir.bbn.com (192.1.100.210) by mail.netbsd.org with SMTP; 17 Mar 2003 01:23:16 -0000 Received: from fnord.ir.bbn.com (localhost [127.0.0.1]) by fnord.ir.bbn.com (Postfix) with ESMTP id 392307A3 for <tech-kern_at_netbsd.org>; Sun, 16 Mar 2003 20:23:01 -0500 (EST) From: Greg Troxel <gdt_at_ir.bbn.com> To: tech-kern_at_netbsd.org Subject: [solution] vnode refcount panic, perhaps due to kern/vfs_lookup.c:lookup() In-Reply-To: Message from Jaromir Dolecek <jdolecek_at_netbsd.org> of "Sun, 16 Mar 2003 21:50:45 +0100." <200303162050.h2GKoji22007_at_s102-n054.tele2.cz> Date: Sun, 16 Mar 2003 20:23:01 -0500 Message-Id: <20030317012301.392307A3_at_fnord.ir.bbn.com> Sender: tech-kern-owner_at_netbsd.org Precedence: list X-Spam-Status: No, hits=-1.5 required=5.0 tests=AWL,IN_REP_TO,PATCH_UNIFIED_DIFF,SPAM_PHRASE_00_01 version=2.43 X-Spam-Level: I think I have found the problem, but I'm not 100% sure since the crashes were not repeatable by a fixed number of operations. It used to crash after 3-5 'modify buffer, ^X u' operations, and now I've gone 20 or so. (duh this is emacs so off goes a keyboard macro to do ' ^X^S' 200 times, causing massive coda lock printouts (coda_lockdebug=1). After, repeating with ' ^Xu' 200 times, I'm pretty convinced that at the very least the code is closer to correct In coda/coda_vnops.c:coda_symlink, vput is called before lookup on the parent vnode. This seemed wrong to me - dropping a reference and then using the vnode. I took out the vput and got 'locking against myself'. After reading the code more, I came to the conclusion: lookup expects a referenced vnode, but not one that's locked So, I replaced vput with VOP_UNLOCK, so that the vn_lock in lookup would be happy, but without dropping the reference. I think this lost reference was casuing the vnode to get reclaimed at a surprising time, causing the VBAD symptom. I'd appreciate comments from those who are more vfs-clueful than I am - - is my interpretation of lookup's calling conventions and the referencing rules correct? Here is a patch against netbsd-1-6 as of 20030206 (from my own repo). The first and third hunks are attempts to notice the VBAD problem earlier rather than later; the second hunk is the fix - just dropping vput in favor of VOP_UNLOCK. I'm happy to send a clean patch and file a PR if that's helpful. An aside: I suppose that since lookup(9) and the comments in the sources don't say that the vnode is to be locked on entry, therefore the caller should hold a reference and the vnode should be unlocked. This is arguably implied by vnode(9), but it took me a long time to figure this out. The notion of holding a reference is pretty straightforward, but the notion that one passes that reference into a function for functions that don't "return" their directory arguments was a bit confusing - some explicit language about when functions consume references and that returned values are ref'd might be helpful. Now that I understand it it seems obvious, so it could just be me being confused. Thanks to Bill and Jaromir for helpful hints in understanding vnodes! - --- coda_vnops.c.~1.1.1.2.~ Wed Dec 5 23:27:40 2001 +++ coda_vnops.c Sun Mar 16 19:50:58 2003 @@ -1640,8 +1640,15 @@ error = venus_symlink(vtomi(tdvp), &tdcp->c_fid, path, plen, nm, len, tva, cred, p); /* Invalidate the parent's attr cache, the modification time has changed */ + /* XXX what does this do to the vnode? */ tdcp->c_flags &= ~C_VATTR; + if (tdvp->v_type == VBAD) { + error = ENODEV; /* not really right, but want unique */ + vprint("venus_symlink VBAD: ", tdvp); + return error; /* XXX CODADEBUG */ + } + if (!error) { struct nameidata nd; @@ -1652,9 +1659,16 @@ nd.ni_cnd.cn_pnbuf = (char *)nm; nd.ni_cnd.cn_nameptr = nd.ni_cnd.cn_pnbuf; nd.ni_pathlen = len; - - vput(tdvp); + + /* + * tdvp is refed and VOP_LOCKed. lookup seems to require it to + * be referenced (as do all calls ?) but not locked - it locks it. + * So, drop the lock but not the reference. + */ + VOP_UNLOCK(tdvp, 0); error = lookup(&nd); *ap->a_vpp = nd.ni_vp; + } /* @@ -1980,6 +1994,10 @@ { struct cnode *cp; int err; + + if (type == VBAD) { + panic("make_coda_node: VBAD"); + } if ((cp = coda_find(fid)) == NULL) { struct vnode *vp; ------- End of Forwarded MessageReceived on 2003-03-16 20:31:34