On Wed, Apr 20, 2005 at 12:27:04AM +0200, Gunnar Wrobel wrote:
> I have been using Coda for a short while now and since yesterday I
> have a problem with a volume replicated over two servers.
>
> I am unable to restart one of the servers. I see the following message
> in the SrvLog file:
>
> 15:16:37 Salvaging file system partition /var/lib/vicemnt/Distfiles
> 15:16:37 Force salvage of all volumes on this partition
> 15:16:37 Scanning inodes in directory /var/lib/vicemnt/Distfiles...
> 15:16:41 Entering DCC(0x1000005)
>
> It fails after entering DCC, and this is the error in the SrvErr file:
>
> could not open key 2 file: No such file or directory
> Assertion failed: vnode->inodeNumber != 0, file "vol-salvage.cc", line 927
> EXITING! Bye!
>
> Does anybody know how I could resolve that problem? I did not do any
> special operations on the "Distfiles" volume, just the usual reading
> and writing.

The comment that goes with this assert is,

    /* for directory vnodes inode should never be zero */
    /* if the inode number is NEWVNODEINODE blow away vnode */
    CODA_ASSERT(vnode->inodeNumber != 0);

I don't know how this happened, but from the error it looks like we
have a directory that is missing its actual data. I have to look at
the salvaging code a bit longer to figure out where this vnode object
came from (it could be RVM, or maybe the inode summary that is built
from the /vicepa/FTREEDB file).

If the volume is replicated, it should be possible to recover by
destroying this replica and then resolving the surviving data from the
other server. A backup of your RVM data and /vicepa might be useful so
that I can see whether I can make the salvager fix this problem. (I
guess /vicepa is /var/lib/vicemnt/Distfiles in your case.)

With 'norton' we can mark the replica for deletion; once that is done,
the server will remove the replica when it is started. Then we can
recreate the replica and trigger resolution by doing an 'ls -lR' from
the client.

We do need some additional information. We can already see that the
replica-id is 0x1000005. You can find the volume name and the
replicated group/volume id in /vice/db/VRList (grep for 1000005, see
the first example below). If this is the first replica of the
replicated volume, its unique replica name will be <volname>.0; if it
is the second replica, the name will be <volname>.1, etc. The
replicated volume id will be the value that starts with 7f0000..

Norton has to be told where your RVM log and RVM data are and how
large the RVM data is; those values are listed in /etc/coda/server.conf
(see the second example below).

The steps should be something like the following,

    $ killall -9 codasrv        # make sure the server is really not running
    $ norton -rvm RVMlog RVMdata RVMdatasize
    norton> show volume 0x1000005
    ...   # double check that the replicated volume id and replica name
    ...   # are what you expect
    norton> delete volume 0x1000005
    norton> quit

    $ startserver &
    $ tail /vice/srv/SrvLog     # check if the server is actually starting now
    ...   # once the server has started and clients start connecting, it
    ...   # will give error/warning messages about the missing replica

    $ volutil create_rep /vicepa replicaname replicated-volume-id replica-id
    # in your case...      ^ /var/lib/vicemnt/Distfiles
    #                                ^???        ^0x7f00000X      ^0x1000005

    # Now, instead of error messages about a missing volume replica,
    # SrvLog should show errors about missing objects.
    # On a client, go to the root of the volume and run 'cfs cs ; ls -lR'.
    # Have a 'codacon' running in an xterm and look at all the Resolving
    # messages. If at some point the client becomes disconnected, stop
    # the recursive ls and rerun 'cfs cs ; ls -lR'.
Jan