On Mon, Sep 11, 2006 at 08:36:02AM -0600, Patrick J. Walsh wrote:
> Looking at the source code, I *think* we hit the limit for the
> number of files we can have in a directory.

Looks like it, the error seems to be EFBIG (file too large) when it
tries to add a new entry to the directory.

> Luckily, and for some odd reason, our other coda server was still
> running without problems.  So we turned off the problematic coda
> server and pruned out the directories.

That's nice; servers have an annoying habit of dying at the same time
in these cases. I guess your client was weakly-connected and tried to
reintegrate with only this replica.

Actually, the server log you attached seems to indicate that the
server starts up fine, but then dies during a resolution attempt. So
the problem may actually be in the server that is still running and
is being propagated to the crashing server during log-based
resolution.

The safest thing right now would be to create a backup tarball of
anything in that volume that you care about. Destroying and
re-resolving the replica on the crashing server will use a different
resolution mechanism (runt resolution), which may work and solve the
problem (a successful resolution truncates the resolution logs, so
the bad create won't get sent anymore), but it may also cause the
still running server to realize something is wrong and die.

> Now the question is, how can we get the problematic coda server
> started back up?  Assuming there isn't some other problem, is there
> a way to start up the coda server and have it wipe out its existing
> knowledge of what files are on what volumes and then rebuild that
> knowledge from the working server, similar to how we set it up in
> the first place (with an ls -lR or something)?

If your server really crashes during the salvage phase, we can
temporarily disable salvaging and make sure there are no other
volumes with problems.

    cat > /vice/vol/skipsalvage << EOF
    1
    2000004
    EOF

Then start the server and see if it comes up. Because the volume will
not be attached, there are going to be errors in the logs about VLDB
lookup failures when clients attach and try to revalidate the missing
replica.

If this worked, we can shut the server back down and use 'norton' to
mark the volume so that it will get deleted during startup, before it
tries to fsck everything. Then the server should be able to start
with the missing replica. Finally, we have to recreate the underlying
volume replica that was marked for destruction and purged during
startup.

You'll need to gather some information, which is probably easier to
get now, before we start blowing away replicas and such. Besides, it
is good information to know so we can double-check that we are
actually blowing away the right volume.

It looks like the broken replica is 2000004; you need to find which
replicated volume it belongs to.

    grep -i 2000004 /vice/db/VRList

The replicated volume number is the one in the second column,
starting with 7f. Also note which index this replica has in the list,
e.g.

    vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000 ...

    replicated volume id = 7f000604
    replica index for d1000129 = 0
    replica index for c80000df = 1

Knowing the index is useful because the replicas are named based on
the replicated volume name + index. So in my example, volume d1000129
has the name vm:u.jaharkes.0 and volume c80000df has the name
vm:u.jaharkes.1.
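If you want to pull the replica name and replicated volume id out in
one go, something along these lines should work (just a sketch,
assuming the VRList layout of the example line above and that the
broken replica id 2000004 only shows up once):

    # print replica name and replicated volume id for the broken replica
    # VRList fields: name, replicated volume id, replica count, replica ids...
    awk 'tolower($0) ~ /2000004/ {
        for (i = 4; i < 4 + $3; i++)
            if (tolower($i) ~ /2000004$/)
                print "replica name: " $1 "." (i - 4) "   replicated volume id: " $2
    }' /vice/db/VRList

Either way, note down the replica name (name.index) and the 7f... id;
both are needed for the create_rep step at the end.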
You also need to get the rvm log and data parameters from
/etc/coda/server.conf.

    grep ^rvm /etc/coda/server.conf

It should also be possible to have bash parse that file.

So now we'll shut down the server.

    volutil shutdown

... check the log to see if the server is completely shut down.

    . /etc/coda/server.conf
    norton -rvm $rvm_log $rvm_data $rvm_data_length

Then with norton we can double-check the values we have,

    norton> show volume 0x2000004

This should show the name and replicated volume id (groupid?). If
everything seems to match up correctly, we can mark the volume for
deletion,

    norton> delete volume 0x2000004
    norton> quit

Now we can remove the skipsalvage file; the volume will be completely
purged, so there is no reason to skip it during salvage.

    rm /vice/vol/skipsalvage

Then we restart the server. It will take a while because it is going
to delete everything related to that volume.

    startserver &

Starting it in the background so we can keep an eye on the server
log. Once the server is back we can recreate the volume replica.

    volutil create_rep /vicepa <volume replica name> <replicated volume id> \
        0x2000004

(with my example the <volume replica name> is something like
vm:u.jaharkes.1 and <replicated volume id> is 0x7f000604)

At this point running 'cfs checkservers' and 'ls -lR
/coda/path/to/volume' should trigger runt resolution and rebuild the
contents of the newly created replica.

Jan
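To make that last step concrete with the example ids used above (the
broken replica 2000004, the replicated volume 7f000604 and the
replica name vm:u.jaharkes.1; substitute the values you found in your
own VRList), the recreate-and-resolve sequence would look something
like:

    volutil create_rep /vicepa vm:u.jaharkes.1 0x7f000604 0x2000004
    cfs checkservers
    ls -lR /coda/path/to/volume

The 'cfs checkservers' makes the client notice that the server is
reachable again, and the recursive ls touches every object in the
volume so the servers get a chance to resolve each one into the
freshly created, empty replica.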