(Illustration by Gaich Muramatsu)
On Tue, May 27, 2003 at 02:11:54PM +0100, Malte Grosche wrote: > Hello, > > after having run coda pretty well for several years, I suddenly started > getting this message on starting up the coda server: (SrvLog file) ... > log_recover failed. > do_rvm_options failed > 13:47:27 rvm_init failed RVM_EIO Normally, seeing RVM_EIO could be one out of a thousand things. I'm seeing it on machines that really do not have a lot of memory and when a server is restarted quickly after it died they crash with EIO in log_recover. Simply waiting for 30 seconds before restarting a server totally fixes things for me. However, thanks to your strace we might be able to dig a bit deeper. I'm guessing that we're opening the RVM log here. > open("/dev/hda7", O_RDWR) = 3 ... > lseek(3, 2615296, SEEK_SET) = 2615296 > read(3, "\0\0\0\0\0ile server was shut down\0\0\0"..., 512) = 512 Interestingly, part of a SrvLog message ended up in the RVM log. And this could be a reason that RVM gets pretty confused and fails during log recovery. > lseek(3, 2614784, SEEK_SET) = 2614784 > read(3, "\0\0\0\0\35\0\0\0\350\2\0\0\\{\317>6_\t\0\366[\10\0\34"..., 262144) = 262144 > lseek(3, 2615296, SEEK_SET) = 2615296 > read(3, "\0\0\0\0\0ile server was shut down\0\0\0"..., 262144) = 262144 > lseek(3, 8192, SEEK_SET) = 8192 > write(3, "\30\0\0\0\177\223Y\225RVM Interface Version 1."..., 1536) = 1536 > fdatasync(3) = 0 ... > open(0, O_RDWR) = -1 EFAULT (Bad address) This open() is very curious, would this be RVM's attempt to open the data file? Or is this something that goes wrong when we're trying to clean up state in venus. I'm assuming this is still RVM, trying to get at the data partition. Some information I'd like to see is the current log status, # rvmutl * o /dev/hda7 * status ... * quit Some of the interesting information would be the following, if we're lucky the log is empty or very recently truncated. And most of the time with our servers, there is hardly anything in the log. We tend to truncate the RVM log as soon as possible. status last written: Tue May 27 2003 5:56:26.828922 last truncation: Tue May 27 2003 5:50:50.121629 log head offset: 13666704 log tail offset: 13929264 log empty: false (There is a manual page for rvmutl, http://www.coda.cs.cmu.edu/doc/install/html/rvm_manual-8.html) Before completely destroying our existing state it is probably wise to back up the current log and data. If they are files, a simple copy will work, otherwise 'dd if=/dev/hda7 of=rvmlog bs=1024 count=<logsize in KB>'. Next step is to try to force a log truncation, which should apply any pending updates to RVM data, # rvmutl * o /dev/hda7 * recover * status * quit If that worked without anything crashing, great. If it didn't, well hopefully the logged but not applied updates weren't all that critical after all because now we are going to wipe the existing log and lose all pending changes, # rvmutl * i /dev/hda7 20M * quit At this point the log is guaranteed to be empty and the server will be able to load the existing RVM data without crashing. During startup the server will try to salvage the volumes and if RVM worked the way it was intended, most of the internal structures should still be consistent and we're back in business. JanReceived on 2003-05-27 11:43:30