On Fri, Sep 14, 2007 at 02:47:48PM +0200, Martin van Es wrote:
> # dir
> total 73824
>     4 drwxr-xr-x 3 root root     4096 Sep  3 16:14 .
>     4 drwxr-xr-x 5 root root     4096 Sep  3 16:10 ..
> 65604 -rw-r--r-- 1 root root 67108864 Sep 13 21:22 data
>  8208 -rw-r--r-- 1 root root  8390144 Sep 14 00:00 log
>     4 drwxr-xr-x 3 root root     4096 Sep  3 17:20 vicepa
>
> # norton log data 67108864
> About to call RVM_INIT
> open_log failed.
> do_rvm_options failed
> rvm_init failed RVM_EIO
>
> Guess I'm totally out of luck here?

Invocation would be,

    norton -rvm log data 67108864

The '-rvm' part is historical, it mirrors the arguments we used to pass
to the Coda server process.

> laptops that most of the time are simultaneously online. Under normal
> circumstances this works flawless (thumbs up!), except when I forget to
> authenticate on 1 on of them (reintegration issues that I am still not able
> to repair) and now this (accidental) server crash.

Repair is definitely an area that needs a lot of work before it gets
usable. Some things repair easily while other types of conflicts are
(almost) impossible to repair.

Server crashes in theory should never be a problem; even when an
operation is active we should just return to the last known consistent
state. However, RVM was originally designed with raw disk/partition
access in mind and I'm not sure if the write-ordering assumptions remain
valid when we use files as the backing store. In a way, recovery
behaviour becomes file system specific if we unexpectedly lose power
during a transaction.

Most of the time a crash is a result of hitting some limit in the Coda
client or server. For instance a directory that exceeds 256KB (a hard
limit that needs a lot of changes to fix), or when we try to use more
than 4096 resolution log entries (but this is a soft limit which can be
changed with volutil setlogparms).

I've had a couple of cases where the rvm log did get corrupted. The
first thing to try is often 'rvmutl'.
This tool can be used to open the rvm log and replay pending operations.
If that fails, I typically make a backup copy of both the log and data
files and then use rvmutl to re-initialize/re-create an empty log file.

During startup the server runs various internal consistency checks to
see if every directory has content, all objects are referenced, and all
meta-data in RVM has corresponding data in /vicepa, etc. Most of the
time the server comes up fine even when we had to forcibly reinitialize
the log.

It can also be that the server fails to start because some corruption
exists in a specific volume. For instance, we find a directory
descriptor without directory content. This seems to be more common when
a server crashes during or right after backups (still not sure if the
crash occurs because of the corruption or the corruption is caused by
the crash). Because these cases can prevent the server from starting, I
list such volumes in the 'skipsalvage' file. This allows the server to
complete its startup, but it will not try to activate the problematic
volumes.

If the server manages to start up with some volumes disabled, I bring
the server back down, use norton to mark those volumes for destruction,
remove the skipsalvage file, restart the server (which then destroys the
marked volumes), and recreate the replicas when the server is back up.
The server setup here has all volumes replicated across at least 2
servers, so running a recursive ls from a client triggers resolution and
the newly created empty replica is repopulated with the data from the
other server.

Jan

Received on 2007-09-20 14:29:21
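
The backup step mentioned above (copying both the log and data files
before re-initializing the log with rvmutl) could be sketched roughly as
follows. This is only an illustration, not part of Coda: the function
name backup_rvm_files and the backup directory layout are made up here,
and it assumes the RVM log and data are regular files (as in the listing
quoted above) and that the server has already been stopped.

```python
import shutil
import time
from pathlib import Path


def backup_rvm_files(log_path, data_path, backup_dir):
    """Copy the RVM log and data files into a fresh timestamped directory.

    Hypothetical helper, not a Coda tool. Stop the server first, otherwise
    the copies may capture a half-written transaction.
    Returns the list of backup paths (log first, then data).
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_dir) / f"rvm-backup-{stamp}"
    # Fail rather than overwrite an earlier backup with the same timestamp.
    target.mkdir(parents=True, exist_ok=False)

    copies = []
    for src in (Path(log_path), Path(data_path)):
        dst = target / src.name
        shutil.copy2(src, dst)  # copy2 also preserves timestamps and mode
        # Refuse to continue if the copy is not the same size as the original.
        if dst.stat().st_size != src.stat().st_size:
            raise IOError(f"short copy of {src}")
        copies.append(dst)
    return copies
```

Run from the directory holding the two files, something like
backup_rvm_files("log", "data", "/root/rvm-backups") would leave
untouched copies to fall back on if re-creating the log makes things
worse (the /root/rvm-backups path is just an example).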