Jan,

We'll give that a try in the next day or two and report back on how it went.
Thanks very, very much for your help.

--
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

On Sep 11, 2006, at 1:09 PM, Jan Harkes wrote:

> On Mon, Sep 11, 2006 at 08:36:02AM -0600, Patrick J. Walsh wrote:
>> Looking at the source code, I *think* we hit the limit for the
>> number of files we can have in a directory. Luckily, and for some
>
> Looks like it, the error seems to be EFBIG (file too large) when it
> tries to add a new entry to the directory.
>
>> odd reason, our other coda server was still running without
>> problems. So we turned off the problematic coda server and pruned
>> out the directories.
>
> That's nice, servers have an annoying habit of dying at the same time in
> these cases. I guess your client was weakly-connected and tried to
> reintegrate with only this replica. Actually the server log you attached
> seems to indicate that the server starts up fine, but then dies during a
> resolution attempt. So the problem may actually be in the server that is
> still running and is being propagated to the crashing server during
> log-based resolution.
>
> The safest thing right now would be to create a backup tarball of
> anything in that volume that you care about. Destroying/re-resolving the
> replica on the crashing server will use a different resolution mechanism
> (runt-resolution), which may work and solve the problem (successful
> resolution truncates the resolution logs so the bad create won't get
> sent anymore), but it may also cause the still running server to realize
> something is wrong and die.
>
>> Now the question is, how can we get the problematic coda server
>> started back up? Assuming there isn't some other problem, is there a
>> way to start up the coda server and have it wipe out its existing
>> knowledge of what files are on what volumes and then rebuild that
>> knowledge from the working server, similar to how we set it up in the
>> first place (with an ls -lR or something)?
>
> If your server really crashes during the salvage phase, we can
> temporarily disable salvaging and make sure there are no other volumes
> with problems.
>
> cat > /vice/vol/skipsalvage << EOF
> 1
> 2000004
> EOF
>
> Then start the server and see if it comes up. Because the volume will
> not be attached there are going to be errors in the logs about VLDB
> lookup failures when clients attach and try to revalidate the missing
> replica.
>
> If this worked we can shut the server back down and use 'norton' to mark
> the volume so that it will get deleted during startup before it tries to
> fsck everything. Then the server should be able to start with the
> missing replica. Finally we have to recreate the underlying volume
> replica that was marked for destruction and purged during startup.
>
> You'll need to gather some information which is probably easier to get
> now before we start blowing away replicas and such, besides it is good
> information to know so we can double check we're actually blowing away
> the right volume.
>
> It looks like the broken replica is 2000004, you need to find which
> replicated volume it belongs to.
>
> grep -i 2000004 /vice/db/VRList
>
> The replicated volume number is the one in the second column starting
> with 7f. Also note which replica this one has in the list
>
> e.g.
> vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000 ...
>
> replicated volume id = 7f000604
> replica index for d1000129 = 0
> replica index for c80000df = 1
>
> Knowing the index is useful because the replicas are named based on the
> replicated volume name + index. So in my example volume d1000129 has the
> name vm:u.jaharkes.0 and volume c80000df has the name vm:u.jaharkes.1.
>
> You also need to get the rvm log and data parameters from
> /etc/coda/server.conf.
>
> grep ^rvm /etc/coda/server.conf
>
> It should also be possible to have bash parse that file. So now we'll
> shut down the server.
>
> volutil shutdown
> ... check the log to see if the server is completely shut down.
>
> . /etc/coda/server.conf
> norton -rvm $rvm_log $rvm_data $rvm_data_length
>
> Then with norton we can double check the values we have,
>
> norton> show volume 0x2000004
>
> This should show the name and replicated volume id (groupid?). If
> everything seems to match up correctly we can mark the volume for
> deletion,
>
> norton> delete volume 0x2000004
> norton> quit
>
> Now we can remove the skipsalvage file, the volume will be completely
> purged so there is no reason to skip it during salvage,
>
> rm /vice/vol/skipsalvage
>
> Then we restart the server, it will take a while because it is going to
> delete everything related to that volume.
>
> startserver &
>
> Starting it in the background so we can keep an eye on the server log.
> Once the server is back we can recreate the volume replica.
>
> volutil create_rep /vicepa <volume replica name> <replicated volume id> \
>    0x2000004
>
> (with my example the <volume replica name> is something like
> vm:u.jaharkes.1 and <replicated volume id> is 0x7f000604)
>
> At this point running 'cfs checkservers' and 'ls -lR /coda/path/to/volume'
> should trigger runt resolution and rebuild the contents of the newly
> created replica.
>
> Jan
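
For later readers: Jan recommends making a backup tarball before touching any
replicas, but doesn't spell out a command. A minimal version, run on a client
that still has the volume mounted (the destination path is only a
placeholder), would be something like:

  tar -czf /some/safe/place/volume-backup.tar.gz /coda/path/to/volume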
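
The replica name and index can also be pulled out of /vice/db/VRList with a
short awk one-liner instead of reading the columns by hand. This is only a
sketch and assumes an entry shaped like Jan's example (volume name,
replicated volume id, replica count, then the replica ids):

  # prints "<replicated volume name>.<index> <replicated volume id>" for the
  # VRList entry containing the broken replica 2000004
  awk '/2000004/ { for (i = 4; i <= NF; i++) if ($i ~ /2000004/) print $1 "." (i - 4), $2 }' /vice/db/VRList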
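
And the whole replica-recreation sequence from the mail, collected in one
place. Again only an untested sketch: 0x2000004, 0x7f000604, vm:u.jaharkes.1
and /vicepa are the example values from above and have to be replaced with
the ids, replica name and partition for the broken replica, and the rvm_*
variables are the ones reported by 'grep ^rvm /etc/coda/server.conf'.

  # 1. Skip salvaging the broken replica so the server can come up at all
  #    (skipsalvage holds a count followed by volume ids, as in Jan's example).
  printf '1\n2000004\n' > /vice/vol/skipsalvage
  startserver &
  # ... wait for startup to finish and check the server log, then ...
  volutil shutdown

  # 2. Mark the replica for deletion with norton (interactive session):
  #      norton> show volume 0x2000004    (double-check name and groupid)
  #      norton> delete volume 0x2000004
  #      norton> quit
  . /etc/coda/server.conf
  norton -rvm $rvm_log $rvm_data $rvm_data_length

  # 3. Remove the skipsalvage file and restart; salvage will now purge the
  #    marked replica, which can take a while.
  rm /vice/vol/skipsalvage
  startserver &

  # 4. Once the server is back up, recreate the empty volume replica.
  volutil create_rep /vicepa vm:u.jaharkes.1 0x7f000604 0x2000004

  # 5. Finally, on a client, force a full tree walk to trigger runt
  #    resolution and repopulate the new replica:
  #      cfs checkservers
  #      ls -lR /coda/path/to/volume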