I need some tips on how to debug problems with coda server replication. For our task, it's required that we be able to wipe a machine clean and reinstall its OS. It's also required that we be able to add a new machine to the coda cluster at any time. Since createvol_rep doesn't support adding a replica machine after a volume has already been created, I've written scripts myself using info from the mailing list archives. I believe I have the volumes and servers set up properly, but for some reason, data isn't replicating. Here are the steps we take to add a replicating server, followed by my tests and the results. I'd really appreciate any pointers.

------------------

Preface1: we don't have a dns server on this network
Preface2: network=192.168.5.0/24
Preface3: names look like: pc123 where pc123 has ip 192.168.5.123
Preface4: the scm is already up and running with happy clients
Preface5: there are no firewalls

On replica:
1) Set up config files: /etc/coda/server.conf, /etc/coda/realms, /etc/hosts, /vice/hostname, /vice/db/scm, /vice/db/servers, etc.
   server.conf has an ipaddress=192.168.5.x line
   hosts file is configured properly (name pcx not listed on 127.0.0.1 line)
2) Set up RVM log and RVM data files
3) Start server

On scm:
1) Add new server name to servers list
2) For each volume listed in VRList:
   volutil -h "$newserver" create_rep /vicepa $volname.$entries $repid
3) Using output from volutil, edit VRList to add the new volid so that a sample VRList entry looks like this:
   / 7f000000 2 1000001 2000001 0 0 0 0 0 0 0
4) bldvldb.sh $newserver
5) volutil -h $newserver makevrdb /vice/db/VRList (not sure if this one is necessary)

On client:
1) cfs strong
2) ls -lR /vice/myrealm

----------------

And everything looks great. But `cfs whereis /coda/myrealm` returns only the scm. Here's some more output (from the scm -- I've renamed the hosts to make it clearer):

# cfs cs
Contacting servers .....
All servers up

# cfs lv /coda/myrealm
Status of volume 0x7f000000 (2130706432) named "/"
Volume type is ReadWrite
Connection State is Connected
Minimum quota is 0, maximum quota is unlimited
Current blocks used are 8
The partition has 38233260 blocks available out of 38266112

# rpc2ping replica
RPC2 connection to replica:2432 successful.

# getvolinfo scm /
RPC2 connection to scm:2432 successful.
Returned volume information for /
VolumeId 7f000000
Replicated volume (type 3)
Type0 id 0
Type1 id 0
Type2 id 0
Type3 id 7f000000
Type4 id 0
ServerCount 1
Replica0 id 1000001, Server0 192.168.5.129
Replica1 id 0, Server1 0.0.0.0
Replica2 id 0, Server2 0.0.0.0
Replica3 id 0, Server3 0.0.0.0
Replica4 id 0, Server4 0.0.0.0
Replica5 id 0, Server5 0.0.0.0
Replica6 id 0, Server6 0.0.0.0
Replica7 id 0, Server7 0.0.0.0
VSGAddr 0

# getvolinfo replica /
RPC2 connection to replica:2432 successful.
Returned volume information for /
VolumeId 7f000000
Replicated volume (type 3)
Type0 id 0
Type1 id 0
Type2 id 0
Type3 id 7f000000
Type4 id 0
ServerCount 1
Replica0 id 1000001, Server0 192.168.5.129
Replica1 id 0, Server1 0.0.0.0
Replica2 id 0, Server2 0.0.0.0
Replica3 id 0, Server3 0.0.0.0
Replica4 id 0, Server4 0.0.0.0
Replica5 id 0, Server5 0.0.0.0
Replica6 id 0, Server6 0.0.0.0
Replica7 id 0, Server7 0.0.0.0
VSGAddr 0

# volutil -h scm getvolumelist
V_BindToServer: binding to host scm
P/vicepa Hscm T247e500 F24764ac
W/.0 I1000001 H1 P/vicepa m0 M0 U8 W1000001 C42668944 D42668944 B0 A0
GetVolumeList finished successfully

# volutil -h replica getvolumelist
V_BindToServer: binding to host replica
P/vicepa Hreplica T247e500 F24764c4
W/.1 I2000001 H2 P/vicepa m0 M0 U2 W2000001 C4266c1b1 D4266c1b1 B0 A0
GetVolumeList finished successfully

# volutil info /.0
Recoverable volume log version: 1
malloced ...
Res. stats for volume 0x1000001: ...
Volume header for volume 1000001 (/.0)
stamp.magic = 78a1b2c5, stamp.version = 1
partition = (/vicepa)
inUse = 1, inService = 1, blessed = 1, needsSalvaged = 0, dontSalvage = 229
type = 0 (read/write), uniquifier = 234, needsCallback = 0, destroyMe = 0
id = 1000001, parentId = 1000001, cloneId = 0, backupId = 0, restoredFromId = 0
maxquota = 0, minquota = 0, maxfiles = 0, filecount = 7, diskused = 8
creationDate = 1114016068 (2005/04/20.10:54:28), copyDate = 1114016068 (2005/04/20.10:54:28)
backupDate = 0 (1969/12/31.17:00:00), expirationDate = 0 (1969/12/31.17:00:00)
accessDate = 0 (1969/12/31.17:00:00), updateDate = 1114029186 (2005/04/20.14:33:06)
owner = 0, accountNumber = 0
dayUse = 87; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 1113976800 (2005/04/20.00:00:00)
replicated groupId = 7f000000 {[ 15 0 0 0 0 0 0 0 ] [ 0 0 ] [ 0 ]}

# volutil -h replica info /.1
V_BindToServer: binding to host replica
Recoverable volume log version: 1
malloced ...
Res. stats for volume 0x2000001: ...
Volume header for volume 2000001 (/.1)
stamp.magic = 78a1b2c5, stamp.version = 1
partition = (/vicepa)
inUse = 1, inService = 1, blessed = 1, needsSalvaged = 0, dontSalvage = 229
type = 0 (read/write), uniquifier = 2, needsCallback = 0, destroyMe = 0
id = 2000001, parentId = 2000001, cloneId = 0, backupId = 0, restoredFromId = 0
maxquota = 0, minquota = 0, maxfiles = 0, filecount = 0, diskused = 2
creationDate = 1114030513 (2005/04/20.14:55:13), copyDate = 1114030513 (2005/04/20.14:55:13)
backupDate = 0 (1969/12/31.17:00:00), expirationDate = 0 (1969/12/31.17:00:00)
accessDate = 0 (1969/12/31.17:00:00), updateDate = 0 (1969/12/31.17:00:00)
owner = 0, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 1113976800 (2005/04/20.00:00:00)
replicated groupId = 7f000000 {[ 0 0 0 0 0 0 0 0 ] [ 0 0 ] [ 0 ]}

-----------

Yet the FTREEDB file on the replica is zero bytes long and no amount of ls -lR's changes that.
SrvLog on the replica has this information:

16:52:34 Scanning inodes in directory /vicepa...
16:52:34 SFS: There are some volumes without any inodes in them
16:52:34 SalvageFileSys: unclaimed volume header file or no Inodes in volume 2000001
16:52:34 SalvageFileSys: Therefore only resetting inUse flag
16:52:34 SalvageFileSys completed on /vicepa
16:52:34 VAttachVolumeById: vol 2000001 (/.1) attached and online
16:52:34 Attached 1 volumes; 0 volumes not attached

------------

That's everything I can think of. If there's any more info that would be helpful, like a tcpdump, I'll be happy to provide it. I'm just stumped. I'm hoping there's a command that forces a server update or something that will fix it. Or else maybe I missed a step and need to somehow initialize the volumes on the replica?

Thanks for your help.

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-04-20 19:39:34
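
One comparison the output above allows is between the edited VRList entry and what getvolinfo returns: the sample VRList entry lists two replicas, while both getvolinfo answers report ServerCount 1. The Python sketch below makes that comparison mechanical. It is only an illustration under assumptions: the helper names (parse_vrlist_entry, parse_getvolinfo) are hypothetical, the VRList field layout (name, group id, replica count, eight replica slots, one trailing field) is inferred from the single sample entry in this post, and the getvolinfo text is an abbreviated copy of the output quoted above. It does not diagnose why the two differ.

```python
# Sample data copied from the post; the getvolinfo text is abbreviated.
VRLIST_ENTRY = "/ 7f000000 2 1000001 2000001 0 0 0 0 0 0 0"

GETVOLINFO_OUTPUT = (
    "VolumeId 7f000000 Replicated volume (type 3) "
    "ServerCount 1 "
    "Replica0 id 1000001, Server0 192.168.5.129 "
    "Replica1 id 0, Server1 0.0.0.0"
)


def parse_vrlist_entry(line):
    """Split one VRList line into name, group id, replica count, replica ids.

    Assumed layout (inferred from the sample entry, not from Coda docs):
    name, group id, count, eight replica slots, one trailing field.
    """
    fields = line.split()
    name, group_id, count = fields[0], fields[1], int(fields[2])
    # Unused replica slots are written as "0" in the sample entry.
    replica_ids = [f for f in fields[3:11] if f != "0"]
    return name, group_id, count, replica_ids


def parse_getvolinfo(text):
    """Return (ServerCount, non-zero replica ids) from getvolinfo output."""
    tokens = text.replace(",", " ").split()
    server_count = None
    replica_ids = []
    for i, tok in enumerate(tokens):
        if tok == "ServerCount" and i + 1 < len(tokens):
            server_count = int(tokens[i + 1])
        elif (tok.startswith("Replica") and i + 2 < len(tokens)
              and tokens[i + 1] == "id" and tokens[i + 2] != "0"):
            replica_ids.append(tokens[i + 2])
    return server_count, replica_ids


name, group_id, vrlist_count, vrlist_ids = parse_vrlist_entry(VRLIST_ENTRY)
server_count, served_ids = parse_getvolinfo(GETVOLINFO_OUTPUT)

# Replica ids present in the edited VRList but absent from the server answer.
missing = [r for r in vrlist_ids if r not in served_ids]

print(f"VRList lists {vrlist_count} replicas, "
      f"getvolinfo reports ServerCount {server_count}")
print("replica ids in VRList but absent from getvolinfo:", missing)
```

Running the same check against the full output of `getvolinfo scm /` and `getvolinfo replica /` after each rebuild step would show whether a given step changes what the servers report.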