Coda File System

Adding a replicating server

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Wed, 20 Apr 2005 17:35:59 -0600
	I need some tips on how to debug problems with coda server replication.

	For our task, it's required that we be able to wipe a machine clean and
reinstall its OS.  It's also required that we be able to add a new
machine to the coda cluster at any time.  Since createvol_rep doesn't
support adding a replica machine after a volume has already been
created, I've written scripts myself using info from the mailing list
archives.

	I believe I have volumes set up properly and servers set up properly, but
for some reason, data isn't replicating.

	Here are the steps we take to add a replicating server followed by my
tests and the results.  I'd really appreciate any pointers.

------------------
Preface1: we don't have a dns server on this network
Preface2: network=192.168.5.0/24
Preface3: names look like: pc123 where pc123 has ip 192.168.5.123
Preface4: the scm is already up and running with happy clients
Preface5: there are no firewalls
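
Since there is no DNS server, every machine has to resolve its peers through /etc/hosts. A minimal sketch of the naming convention from Preface3 follows; `host_entry` is a hypothetical helper written for illustration and is not part of Coda. It assumes names of the form pcN map to 192.168.5.N.

```shell
#!/bin/sh
# host_entry: hypothetical helper (not a Coda tool).
# Prints an /etc/hosts line for a host named pcN on 192.168.5.0/24,
# following the convention "pc123 has ip 192.168.5.123".
host_entry() {
    name=$1
    case $name in
        pc[0-9]*) num=${name#pc} ;;          # strip the "pc" prefix
        *) echo "bad host name: $name" >&2; return 1 ;;
    esac
    case $num in
        *[!0-9]*) echo "bad host name: $name" >&2; return 1 ;;
    esac
    if [ "$num" -lt 1 ] || [ "$num" -gt 254 ]; then
        echo "host number out of range: $num" >&2
        return 1
    fi
    printf '192.168.5.%s\t%s\n' "$num" "$name"
}

host_entry pc123
```

The output could be appended to /etc/hosts on the scm, the replica, and the clients so that all of them agree on the address of a new server.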

On replica:

1) Set up config files
files: /etc/coda/server.conf, /etc/coda/realms, /etc/hosts, /vice/hostname, /vice/db/scm, /vice/db/servers, etc.

server.conf has an ipaddress=192.168.5.x line
hosts file is configured properly (name pcx not listed on 127.0.0.1
line)

2) Set up RVM log and RVM data files

3) Start server

On scm:

1) Add new server name to servers list

2) For each volume listed in VRList:

volutil -h "$newserver" create_rep /vicepa $volname.$entries $repid

3) Using output from volutil, edit VRList to add the new volid so that a
sample VRList entry looks like this:

/ 7f000000 2 1000001 2000001 0 0 0 0 0 0 0

4) bldvldb.sh $newserver

5) volutil -h $newserver makevrdb /vice/db/VRList 

(not sure if this one is necessary)
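
The VRList edit in step 3 can be sketched as a small text transformation. This is only an illustration: `vrlist_add_replica` is a hypothetical helper, and the field layout (name, group id, replica count, eight replica id slots, one trailing field) is inferred from the sample entry above rather than taken from Coda documentation. The "before" line used in the example is likewise an assumption about what the entry looked like with a single replica.

```shell
#!/bin/sh
# vrlist_add_replica: hypothetical helper (not a Coda tool).
# Assumed VRList layout, inferred from the sample entry:
#   name groupid count id1 id2 id3 id4 id5 id6 id7 id8 trailing
# $1 = existing VRList line, $2 = new replica volume id as printed by volutil
vrlist_add_replica() {
    printf '%s\n' "$1" | awk -v newid="$2" '
        NF != 12 {
            print "unexpected field count: " NF > "/dev/stderr"
            exit 1
        }
        {
            count = $3 + 0
            if (count >= 8) {
                print "no free replica slot" > "/dev/stderr"
                exit 1
            }
            $(4 + count) = newid   # first unused slot follows the existing ids
            $3 = count + 1         # bump the replica count
            print
        }'
}

# Assumed single-replica entry before the new server was added.
vrlist_add_replica "/ 7f000000 1 1000001 0 0 0 0 0 0 0 0" 2000001
# prints: / 7f000000 2 1000001 2000001 0 0 0 0 0 0 0
```

Writing the result to a temporary file and comparing it with the old VRList before running bldvldb.sh and makevrdb makes it easier to spot a malformed entry.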

On client:

1) cfs strong

2) ls -lR /coda/myrealm

----------------

	And everything looks great.  But, `cfs whereis /coda/myrealm` returns
only the scm.  

	Here's some more output (from the scm -- I've renamed the hosts to make
it clearer):

# cfs cs
Contacting servers .....
All servers up

# cfs lv /coda/myrealm
  Status of volume 0x7f000000 (2130706432) named "/"
  Volume type is ReadWrite
  Connection State is Connected
  Minimum quota is 0, maximum quota is unlimited
  Current blocks used are 8
  The partition has 38233260 blocks available out of 38266112

# rpc2ping replica
RPC2 connection to replica:2432 successful.

# getvolinfo scm /
RPC2 connection to scm:2432 successful.
Returned volume information for /
        VolumeId 7f000000
        Replicated volume (type 3)

        Type0 id 0
        Type1 id 0
        Type2 id 0
        Type3 id 7f000000
        Type4 id 0

        ServerCount 1
        Replica0 id 1000001, Server0 192.168.5.129
        Replica1 id 0, Server1 0.0.0.0
        Replica2 id 0, Server2 0.0.0.0
        Replica3 id 0, Server3 0.0.0.0
        Replica4 id 0, Server4 0.0.0.0
        Replica5 id 0, Server5 0.0.0.0
        Replica6 id 0, Server6 0.0.0.0
        Replica7 id 0, Server7 0.0.0.0

        VSGAddr 0

# getvolinfo replica /
RPC2 connection to replica:2432 successful.
Returned volume information for /
        VolumeId 7f000000
        Replicated volume (type 3)

        Type0 id 0
        Type1 id 0
        Type2 id 0
        Type3 id 7f000000
        Type4 id 0

        ServerCount 1
        Replica0 id 1000001, Server0 192.168.5.129
        Replica1 id 0, Server1 0.0.0.0
        Replica2 id 0, Server2 0.0.0.0
        Replica3 id 0, Server3 0.0.0.0
        Replica4 id 0, Server4 0.0.0.0
        Replica5 id 0, Server5 0.0.0.0
        Replica6 id 0, Server6 0.0.0.0
        Replica7 id 0, Server7 0.0.0.0

        VSGAddr 0
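
Both servers report ServerCount 1 and only replica 1000001, so neither has picked up the second replica. A rough way to check this automatically is to parse the getvolinfo output. The sketch below assumes the output format shown above; `check_volinfo` is a hypothetical helper, not a Coda command.

```shell
#!/bin/sh
# check_volinfo: hypothetical helper (not a Coda tool).
# Reads getvolinfo output on stdin, lists the non-zero replica ids with
# their server addresses, then prints the declared ServerCount next to
# the number of non-zero replicas actually listed.
check_volinfo() {
    awk '
        $1 == "ServerCount" { declared = $2 }
        $1 ~ /^Replica[0-7]$/ && $2 == "id" {
            id = $3
            sub(/,$/, "", id)          # drop the trailing comma
            if (id != "0") {
                found++
                print "replica " id " on " $5
            }
        }
        END { print "ServerCount=" declared " nonzero_replicas=" (found + 0) }
    '
}

# Trimmed copy of the output shown above.
check_volinfo <<'EOF'
        ServerCount 1
        Replica0 id 1000001, Server0 192.168.5.129
        Replica1 id 0, Server1 0.0.0.0
EOF
```

In practice the input would come from something like `getvolinfo scm / | check_volinfo`, run once per server and compared against the replica count in VRList.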

# volutil -h scm getvolumelist
V_BindToServer: binding to host scm
P/vicepa Hscm T247e500 F24764ac
W/.0 I1000001 H1 P/vicepa m0 M0 U8 W1000001 C42668944 D42668944 B0 A0
GetVolumeList finished successfully

# volutil -h replica getvolumelist
V_BindToServer: binding to host replica
P/vicepa Hreplica T247e500 F24764c4
W/.1 I2000001 H2 P/vicepa m0 M0 U2 W2000001 C4266c1b1 D4266c1b1 B0 A0
GetVolumeList finished successfully
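
To compare what each server actually holds, the getvolumelist output can be reduced to name and id pairs. This sketch assumes, based only on the samples above, that volume lines start with W followed by the volume name and that the second field is I followed by the volume id. `list_volumes` is a hypothetical helper.

```shell
#!/bin/sh
# list_volumes: hypothetical helper (not a Coda tool).
# Reads `volutil getvolumelist` output on stdin and prints "name id"
# for each volume line. Field meanings are inferred from the sample output.
list_volumes() {
    awk '
        /^W/ && $2 ~ /^I/ {
            name = substr($1, 2)   # strip the leading "W"
            id = substr($2, 2)     # strip the leading "I"
            print name, id
        }'
}

# Sample taken from the replica output above.
list_volumes <<'EOF'
V_BindToServer: binding to host replica
P/vicepa Hreplica T247e500 F24764c4
W/.1 I2000001 H2 P/vicepa m0 M0 U2 W2000001 C4266c1b1 D4266c1b1 B0 A0
GetVolumeList finished successfully
EOF
```

Running `volutil -h scm getvolumelist | list_volumes` and the same for the replica gives the ids that should appear together in the VRList entry for the volume.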

# volutil info /.0
Recoverable volume log
version: 1 malloced
...
Res. stats for volume 0x1000001:
...
Volume header for volume 1000001 (/.0)
stamp.magic = 78a1b2c5, stamp.version = 1
partition = (/vicepa)
inUse = 1, inService = 1, blessed = 1, needsSalvaged = 0, dontSalvage = 229
type = 0 (read/write), uniquifier = 234, needsCallback = 0, destroyMe = 0
id = 1000001, parentId = 1000001, cloneId = 0, backupId = 0, restoredFromId = 0
maxquota = 0, minquota = 0, maxfiles = 0, filecount = 7, diskused = 8
creationDate = 1114016068 (2005/04/20.10:54:28), copyDate = 1114016068 (2005/04/20.10:54:28)
backupDate = 0 (1969/12/31.17:00:00), expirationDate = 0 (1969/12/31.17:00:00)
accessDate = 0 (1969/12/31.17:00:00), updateDate = 1114029186 (2005/04/20.14:33:06)
owner = 0, accountNumber = 0
dayUse = 87; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 1113976800 (2005/04/20.00:00:00)
replicated groupId = 7f000000
{[ 15 0 0 0 0 0 0 0 ] [ 0 0 ] [ 0 ]}

# volutil -h replica info /.1
V_BindToServer: binding to host replica
Recoverable volume log
version: 1 malloced
...
Res. stats for volume 0x2000001:
...
Volume header for volume 2000001 (/.1)
stamp.magic = 78a1b2c5, stamp.version = 1
partition = (/vicepa)
inUse = 1, inService = 1, blessed = 1, needsSalvaged = 0, dontSalvage = 229
type = 0 (read/write), uniquifier = 2, needsCallback = 0, destroyMe = 0
id = 2000001, parentId = 2000001, cloneId = 0, backupId = 0, restoredFromId = 0
maxquota = 0, minquota = 0, maxfiles = 0, filecount = 0, diskused = 2
creationDate = 1114030513 (2005/04/20.14:55:13), copyDate = 1114030513 (2005/04/20.14:55:13)
backupDate = 0 (1969/12/31.17:00:00), expirationDate = 0 (1969/12/31.17:00:00)
accessDate = 0 (1969/12/31.17:00:00), updateDate = 0 (1969/12/31.17:00:00)
owner = 0, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 1113976800 (2005/04/20.00:00:00)
replicated groupId = 7f000000
{[ 0 0 0 0 0 0 0 0 ] [ 0 0 ] [ 0 ]}
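
The two headers differ where it matters: the scm copy has filecount = 7 while the replica has filecount = 0. A small sketch for pulling one field out of `volutil info` output, so the two replicas can be compared in a script, is shown below. `volinfo_field` is a hypothetical helper and relies on the "key = value" layout seen above.

```shell
#!/bin/sh
# volinfo_field: hypothetical helper (not a Coda tool).
# Reads `volutil info` output on stdin and prints the value of the first
# "key = value" pair whose key matches $1.
volinfo_field() {
    awk -v key="$1" '
        {
            for (i = 1; i + 2 <= NF; i++)
                if ($i == key && $(i + 1) == "=") {
                    val = $(i + 2)
                    sub(/,$/, "", val)   # drop the trailing comma
                    print val
                    exit
                }
        }'
}

# Line taken from the scm output above.
echo 'maxquota = 0, minquota = 0, maxfiles = 0, filecount = 7, diskused = 8' |
    volinfo_field filecount
```

For example, `volutil info /.0 | volinfo_field filecount` and `volutil -h replica info /.1 | volinfo_field filecount` could be compared after each test to see whether any data has reached the replica.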

-----------

	Yet the FTREEDB file on the replica is zero bytes long and no amount of
ls -lR's changes that.  

	SrvLog on the replica has this information:

16:52:34 Scanning inodes in directory /vicepa...
16:52:34 SFS: There are some volumes without any inodes in them
16:52:34 SalvageFileSys:  unclaimed volume header file or no Inodes in volume 2000001
16:52:34 SalvageFileSys: Therefore only resetting inUse flag
16:52:34 SalvageFileSys completed on /vicepa
16:52:34 VAttachVolumeById: vol 2000001 (/.1) attached and online
16:52:34 Attached 1 volumes; 0 volumes not attached

------------

	That's everything I can think of.  If there's any more info that would
be helpful, like a tcpdump, I'll be happy to provide it.  I'm just
stumped.  I'm hoping there's a command that forces a server update or
something that will fix it.  Or else maybe I missed a step and need to
somehow initialize the volumes on the replica?  

	Thanks for your help.

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-04-20 19:39:34