Coda File System

Re: How To Repopulate A Server

From: Patrick J. Walsh <pwalsh_at_esoft.com>
Date: Mon, 11 Sep 2006 14:11:34 -0600
Jan,

	We'll give that a try in the next day or two and report back on how  
it went.  Thanks very, very much for your help.

--
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/


On Sep 11, 2006, at 1:09 PM, Jan Harkes wrote:

> On Mon, Sep 11, 2006 at 08:36:02AM -0600, Patrick J. Walsh wrote:
>> 	Looking at the source code, I *think* we hit the limit for the
>> number of files we can have in a directory.  Luckily, and for some
>
> Looks like it; the error seems to be EFBIG (file too large) when it
> tries to add a new entry to the directory.
>
>> odd reason, our other coda server was still running without
>> problems.  So we turned off the problematic coda server and pruned
>> out the directories.
>
> That's fortunate; servers have an annoying habit of dying at the same
> time in these cases. I guess your client was weakly-connected and tried
> to reintegrate with only this replica. Actually the server log you
> attached seems to indicate that the server starts up fine, but then dies
> during a resolution attempt. So the problem may actually be in the
> server that is still running and is being propagated to the crashing
> server during log-based resolution.
>
> The safest thing right now would be to create a backup tarball of
> anything in that volume that you care about. Destroying/re-resolving
> the replica on the crashing server will use a different resolution
> mechanism (runt-resolution), which may work and solve the problem
> (successful resolution truncates the resolution logs so the bad create
> won't get sent anymore), but it may also cause the still running server
> to realize something is wrong and die.
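>
> (For illustration, a minimal sketch of such a backup taken from a
> connected client; the path and tarball name are hypothetical
> placeholders:)
>
>     # archive the volume's contents from the client side before
>     # touching either replica
>     tar -czf /root/volume-backup.tar.gz /coda/path/to/volume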
>
>> 	Now the question is, how can we get the problematic coda server
>> started back up?  Assuming there isn't some other problem, is there a
>> way to start up the coda server and have it wipe out its existing
>> knowledge of what files are on what volumes and then rebuild that
>> knowledge from the working server, similar to how we set it up in the
>> first place (with an ls -lR or something)?
>
> If your server really crashes during the salvage phase, we can
> temporarily disable salvaging and make sure there are no other volumes
> with problems.
>
>     cat > /vice/vol/skipsalvage << EOF
> 1
> 2000004
> EOF
>
> Then start the server and see if it comes up. Because the volume will
> not be attached there are going to be errors in the logs about VLDB
> lookup failures when clients attach and try to revalidate the missing
> replica.
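>
> (If you want to confirm those are the only complaints, something along
> these lines could help; it assumes the server log is in the default
> /vice/srv/SrvLog location:)
>
>     # look for the expected VLDB lookup failures for the detached replica
>     grep -i vldb /vice/srv/SrvLog | tail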
>
> If this works we can shut the server back down and use 'norton' to mark
> the volume so that it will get deleted during startup, before it tries
> to fsck everything. Then the server should be able to start despite the
> missing replica. Finally we have to recreate the underlying volume
> replica that was marked for destruction and purged during startup.
>
> You'll need to gather some information, which is probably easier to get
> now before we start blowing away replicas and such. Besides, it is good
> information to have so we can double-check that we're actually blowing
> away the right volume.
>
> It looks like the broken replica is 2000004; you need to find which
> replicated volume it belongs to.
>
>     grep -i 2000004 /vice/db/VRList
>
> The replicated volume number is the one in the second column, starting
> with 7f. Also note which position this replica has in the list.
>
> e.g.
>     vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000 ...
>
>     replicated volume id = 7f000604
>     replica index for d1000129 = 0
>     replica index for c80000df = 1
>
> Knowing the index is useful because the replicas are named based on the
> replicated volume name + index. So in my example volume d1000129 has
> the name vm:u.jaharkes.0 and volume c80000df has the name
> vm:u.jaharkes.1.
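>
> (As a convenience, a rough shell sketch along these lines could pull
> those values out of VRList automatically; it assumes the layout shown
> above, i.e. name, replicated volume id, replica count, then the
> replica ids:)
>
>     replica=2000004
>     line=$(grep -i "$replica" /vice/db/VRList)
>     # column 1 is the replicated volume name, column 2 the 7f... id
>     name=$(echo "$line" | awk '{print $1}')
>     repvol=$(echo "$line" | awk '{print $2}')
>     # replica ids start in column 4, so the index is the field number - 4
>     index=$(echo "$line" | awk -v r="$replica" \
>         '{ for (i = 4; i <= NF; i++) if ($i == r) print i - 4 }')
>     echo "replica name: $name.$index, replicated volume id: $repvol"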
>
> You also need to get the rvm log and data parameters from
> /etc/coda/server.conf.
>
>     grep ^rvm /etc/coda/server.conf
>
> It should also be possible to have bash parse that file. So now we'll
> shut down the server.
>
>     volutil shutdown
> ... check the log to see if the server is completely shut down.
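>
> (For example, assuming the server log is in its default location,
> something like:)
>
>     tail -f /vice/srv/SrvLog
>
> and wait until the log shows the server has finished shutting down.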
>
>     . /etc/coda/server.conf
>     norton -rvm $rvm_log $rvm_data $rvm_data_length
>
> Then with norton we can double check the values we have,
>
>     norton> show volume 0x2000004
>
> This should show the name and replicated volume id (groupid?). If
> everything seems to match up correctly we can mark the volume for
> deletion,
>
>     norton> delete volume 0x2000004
>     norton> quit
>
> Now we can remove the skipsalvage file; the volume will be completely
> purged, so there is no reason to skip it during salvage,
>
>     rm /vice/vol/skipsalvage
>
> Then we restart the server. It will take a while because it is going to
> delete everything related to that volume.
>
>     startserver &
>
> Starting it in the background so we can keep an eye on the server log.
> Once the server is back we can recreate the volume replica.
>
>     volutil create_rep /vicepa <volume replica name> <replicated volume id> \
>         0x2000004
>
> (with my example the <volume replica name> is something like
> vm:u.jaharkes.1 and <replicated volume id> is 0x7f000604)
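>
> (Purely as an illustration, with those example values plugged in the
> command would look something like:)
>
>     volutil create_rep /vicepa vm:u.jaharkes.1 0x7f000604 0x2000004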
>
> At this point running 'cfs checkservers' and 'ls -lR /coda/path/to/volume'
> should trigger runt resolution and rebuild the contents of the newly
> created replica.
>
> Jan


Received on 2006-09-11 16:13:37