On Wed, Apr 20, 2005 at 12:27:04AM +0200, Gunnar Wrobel wrote:
> I have been using Coda for a short while now and since yesterday I
> have a problem with a volume replicated over two servers.
>
> I am unable to restart one of the servers. I see the following message
> in the SrvLog file:
>
> 15:16:37 Salvaging file system partition /var/lib/vicemnt/Distfiles
> 15:16:37 Force salvage of all volumes on this partition
> 15:16:37 Scanning inodes in directory /var/lib/vicemnt/Distfiles...
> 15:16:41 Entering DCC(0x1000005)
>
> It fails after entering DCC, and this is the error in the SrvErr file:
>
> could not open key 2 file: No such file or directory
> Assertion failed: vnode->inodeNumber != 0, file "vol-salvage.cc", line 927
> EXITING! Bye!
>
> Does anybody know how I could resolve that problem? I did not do any
> special operations on the "Distfiles" volume, just the usual reading
> and writing.

The comment that goes with this assert is,

    /* for directory vnodes inode should never be zero */
    /* if the inode number is NEWVNODEINODE blow away vnode */
    CODA_ASSERT(vnode->inodeNumber != 0);

I don't know how this happened, but from the error it looks like we
have a directory that is missing its actual data. I have to look at
the salvaging code a bit longer to figure out where this vnode object
came from (it could be RVM, or maybe the inode summary that is built
from the /vicepa/FTREEDB file).

If the volume is replicated, it should be possible to recover by
destroying this replica and then resolving the surviving data from the
other server. A backup of your RVM data and /vicepa might be useful so
that I can see whether I can make the salvager fix this problem. (I
guess /vicepa is /var/lib/vicemnt/Distfiles in your case.)

With 'norton' we can mark the replica for deletion; once that is done,
the server will remove the replica when it is started. Then we can
recreate the replica and trigger resolution by doing an 'ls -lR' from
the client.

We do need some additional information. We can already see that the
replica-id is 0x1000005. You can find the volume name and the
replicated group/volume id in /vice/db/VRList (grep for 1000005, see
the first example below). If this is the first replica of the
replicated volume, its unique replica name will be <volname>.0; if it
is the second replica, the name will be <volname>.1, etc. The
replicated volume id will be the value that starts with 7f0000..

Norton has to be told where your RVM log and RVM data are and how
large the RVM data is; those values are listed in /etc/coda/server.conf
(see the second example below).

The steps should be something like the following,

    $ killall -9 codasrv        # make sure the server is really not running
    $ norton -rvm RVMlog RVMdata RVMdatasize
    norton> show volume 0x1000005
    ...   # double check that the replicated volume id and replica name
    ...   # are what you expect
    norton> delete volume 0x1000005
    norton> quit

    $ startserver &
    $ tail /vice/srv/SrvLog     # check if the server is actually starting now
    ...   # once the server has started and clients start connecting, it
    ...   # will give error/warning messages about the missing replica

    $ volutil create_rep /vicepa replicaname replicated-volume-id replica-id
    # in your case...      ^ /var/lib/vicemnt/Distfiles
    #                                ^???        ^0x7f00000X      ^0x1000005

    # Now, instead of error messages about a missing volume replica,
    # SrvLog should show errors about missing objects.
    # On a client, go to the root of the volume and run 'cfs cs ; ls -lR'.
    # Have a 'codacon' running in an xterm and look at all the Resolving
    # messages. If at some point the client becomes disconnected, stop
    # the recursive ls and rerun 'cfs cs ; ls -lR'.
Jan