On Wed, Sep 12, 2001 at 12:28:19PM -0700, Jacob S. Barrett wrote:
> So no one has seen this or knows how to fix it? Should I just restore
> that volume? Is there a way to force the complete reintegration of the
> volume from its replica rather than rolling back to the last backup?
>
> -Jake

Sorry for the delay,

> Jacob S. Barrett wrote:
> > 13:26:20 DCC: Going to check Directory (0x100000a.d.4)
> > 13:26:20 JE: parent = 0x100000a.d.4 ; child thinks parent is 0x34f.1328; Shouldnt Happen

Ouch, I haven't seen this one before. This must be the result of some
half-completed rename, which should never be allowed to happen. Maybe the
salvager could be modified to fix it up, which would definitely be the
nicest long-term solution; I'll have a look at it today.

In any case, if the volume was replicated and the replica on the other
server is still doing fine, the steps to recover this 'lost' volume are
as follows.

Kill off the dead server, and load RVM using 'norton', pointing it at the
RVM log and data segments or partitions of the server:

    # norton /rvm/LOG /rvm/DATA <rvm data segment size>

The right info used to be in /vice/srv.conf, but with a newer server you
might have to pull the paths and numbers out of /etc/coda/server.conf.

In norton, get some of the volume information:

    norton> show volume 0xe60000f4
    Id: 0xe60000f4
    Name: e:jaharkes.rep.0
    Parent: 0xe60000f4
    GoupId: 0x7f0004c5
    Partition: /vicepa
    Version Vector: {[ 119877 115314 0 0 0 0 0 0 ] [ 0 0 ] [ 0x0 ]}
            Number vnodes   Number Lists   Lists
            -------------   ------------   ----------
    small   196             6144           0x26eeaf2c
    large   18              512            0x26f3a7ec

Get at least: the partition path, the non-replicated volume name (the name
with the .0/.1 extension), the replicated volume id (GoupId ;), and the
non-replicated volume id (0x100000a).

Then we use norton to mark this broken volume for removal:

    norton> delete volume 0x100000a
    norton> quit

Now we can start up the server and keep our fingers crossed that it will
come up. Once it is up, the underlying replica is missing, and because
clients will still be referencing it, some VLDB lookup errors are to be
expected.

When we see "FileServer Started" show up in /vice/srv/SrvLog, we can
create an empty rw-replica to replace the broken one we just removed. If I
were to recreate the volume I listed earlier:

    # volutil create_rep /vicepa e:u.jaharkes.rep.0 0x7f0004c5 0xe60000f4

Once that is successful, the only thing left to do is to resolve the data
from the surviving replica to this new one. So on a client:

    $ cfs cs      # make sure we're connected to all servers
    $ cfs strong  # don't want to get too many surprises
    $ cd /coda/path/to/volume
    $ ls -lR      # or /usr/sbin/volmunge -a `pwd`

Sit back and be patient, and redo these steps once or twice to make sure
everything got resolved.

Jan

Received on 2001-09-17 14:04:33
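
For reference, a minimal sketch of that client-side resolution pass as a
small shell script. It only wraps the cfs and ls commands shown above;
the mountpoint /coda/path/to/volume is a placeholder and the number of
passes is an assumption, so adjust both for your own setup.

    #!/bin/sh
    # Placeholder: path where the repaired volume is mounted under /coda.
    VOL=/coda/path/to/volume

    cfs cs        # make sure we are connected to all servers
    cfs strong    # stay strongly connected, avoid write-disconnected surprises

    cd "$VOL" || exit 1

    # Recursively listing the tree makes the client touch every directory,
    # which triggers resolution against the surviving replica. Repeat a
    # few passes until nothing is left to resolve.
    for pass in 1 2 3; do
        ls -lR > /dev/null
    done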