Coda File System

Re: JE: parent = 0x100000a.d.4 ; child thinks parent is 0x34f.1328; Shouldnt Happen

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 17 Sep 2001 14:01:54 -0400
On Wed, Sep 12, 2001 at 12:28:19PM -0700, Jacob S. Barrett wrote:
> So no one has seen this or knows how to fix it?  Should I just restore 
> that volume?  Is there a way to force the complete reintegration of the 
> volume from its replica rather than rolling back to the last backup?
> 
> -Jake

Sorry for the delay,

> Jacob S. Barrett wrote:
> >13:26:20 DCC: Going to check Directory (0x100000a.d.4)
> >13:26:20 JE: parent = 0x100000a.d.4 ; child thinks parent is 0x34f.1328; 
> >Shouldnt Happen

Ouch, I haven't seen this one before. It must be the result of some
half completed rename, which should never be allowed to happen. Modifying
the salvager to fix it up would definitely be the nicest long-term
solution; I'll have a look at it today.

In any case, if the volume was replicated and the other server replica
is still doing fine, the steps to recover this 'lost' volume are as
follows,

Kill off the dead server, and load RVM using 'norton', pointing it to
the RVM log and data segments or partitions of the server,

    # norton /rvm/LOG /rvm/DATA <rvm data segment size>

The right info used to be in /vice/srv.conf, but with a newer server you
might have to pull the paths and numbers out of /etc/coda/server.conf.
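
As a rough sketch of pulling those values out of a key=value style config
file: the helper name conf_value is made up for this example, and the key
names rvm_log, rvm_data and rvm_data_length are my assumption about what
server.conf uses, so check them against your own file first.

```shell
# Print the value of KEY from a key=value config file.
# The default path is the one mentioned above; the key names you pass in
# are assumptions to verify against your own server.conf.
conf_value() {
    key=$1
    file=${2:-/etc/coda/server.conf}
    # Commented-out lines do not match because the key must start the line.
    # If the key appears more than once, the last setting wins.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" |
        tail -n 1 | tr -d '"'
}

# Possible use (not run here):
#   norton "$(conf_value rvm_log)" "$(conf_value rvm_data)" \
#          "$(conf_value rvm_data_length)"
```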

In norton, get some of the volume information,

    norton> show volume 0xe60000f4
    Id: 0xe60000f4      Name: e:jaharkes.rep.0  Parent: 0xe60000f4
    GoupId: 0x7f0004c5  Partition: /vicepa
    Version Vector: {[ 119877 115314 0 0 0 0 0 0 ] [ 0 0 ] [ 0x0 ]}

                Number vnodes   Number Lists    Lists
                -------------   ------------    ----------
    small                 196           6144    0x26eeaf2c
    large                  18            512    0x26f3a7ec

Note down at least the partition path, the non-replicated volume name
(the name with the .0/.1 extension), the replicated volume id (GoupId ;)
and the non-replicated volume id (Id, which is 0x100000a in your case).
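
If you save the 'show volume' output to a file, something like the
following could pick out those four values. It is only a sketch: the
function name volume_fields is made up, and it assumes the output looks
like the sample above (including the GoupId spelling).

```shell
# Read saved "show volume" output on stdin and print, on one line:
#   partition  volume-name  replicated-id  non-replicated-id
volume_fields() {
    awk '
    {
        # Each label is followed by its value as the next field.
        for (i = 1; i < NF; i++) {
            if ($i == "Id:")        id    = $(i + 1)
            if ($i == "Name:")      name  = $(i + 1)
            if ($i == "GoupId:")    group = $(i + 1)
            if ($i == "Partition:") part  = $(i + 1)
        }
    }
    END { print part, name, group, id }
    '
}

# Possible use: volume_fields < saved-norton-output.txt
```

The fields come out in the same order that the create_rep command further
down takes its arguments.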

Then we use norton to mark this broken volume for removal,

    norton> delete volume 0x100000a
    norton> quit

Now we can start up the server and keep our fingers crossed that it
will come up. Once it is up, the underlying replica is missing, and since
clients will still be referencing it, some VLDB lookup errors are
expected.

Once we see 'FileServer Started' show up in /vice/srv/SrvLog, we can
create an empty rw-replica to replace the broken one we just removed.
If I were to recreate the volume I listed earlier,

    # volutil create_rep /vicepa e:u.jaharkes.rep.0 0x7f0004c5 0xe60000f4
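
If you would rather not watch the log by hand before running the
create_rep command above, a small polling loop can do the waiting. This
is a sketch only: wait_for_log is a made-up helper, and it assumes the
log file does not still contain a 'FileServer Started' line from an
earlier run.

```shell
# Wait until PATTERN shows up in LOGFILE; give up after TIMEOUT seconds.
# Returns 0 when the pattern was found, 1 on timeout.
wait_for_log() {
    logfile=$1
    pattern=$2
    timeout=${3:-300}
    waited=0
    # -F treats the pattern as a fixed string, -q suppresses output.
    until grep -qF "$pattern" "$logfile" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use (not run here):
#   wait_for_log /vice/srv/SrvLog "FileServer Started" && echo "server is up"
```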


Once that is successful, the only thing left to do is to resolve the
data from the surviving replica to this one. So on a client,

    $ cfs cs     # make sure we're connected to all servers
    $ cfs strong # don't want to get too many surprises
    $ cd /coda/path/to/volume
    $ ls -lR     # or /usr/sbin/volmunge -a `pwd`

Sit back and be patient. Redo these steps once or twice to make sure
everything got resolved.

Jan
Received on 2001-09-17 14:04:33