Coda File System

Re: Crashed server fails to restart

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 20 Sep 2007 14:28:11 -0400
On Fri, Sep 14, 2007 at 02:47:48PM +0200, Martin van Es wrote:
> # dir
> total 73824
>     4 drwxr-xr-x 3 root root     4096 Sep  3 16:14 .
>     4 drwxr-xr-x 5 root root     4096 Sep  3 16:10 ..
> 65604 -rw-r--r-- 1 root root 67108864 Sep 13 21:22 data
>  8208 -rw-r--r-- 1 root root  8390144 Sep 14 00:00 log
>     4 drwxr-xr-x 3 root root     4096 Sep  3 17:20 vicepa
> 
> # norton log data 67108864
> About to call RVM_INIT
> open_log failed.
> do_rvm_options failed
> rvm_init failed RVM_EIO
> 
> Guess I'm totally out of luck here?

The invocation should be:

    norton -rvm log data 67108864

The '-rvm' part is historical, it mirrors the arguments we used to pass
to the Coda server process.
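
In the listing quoted above, the length argument matches the size of the
data file in bytes. As a rough sketch, assuming that holds for your setup
as well, the command line can be built from the file itself instead of
typing the number by hand. The helper name norton_cmd is made up for
illustration, and it only prints the command rather than running it.

```shell
# Hypothetical helper: print the norton invocation for an RVM log and data file.
# Assumption: the length argument equals the data file's size in bytes,
# as it does in the directory listing quoted above.
norton_cmd() {
    log_file=$1
    data_file=$2
    # wc -c counts bytes; tr strips the padding some wc implementations add.
    data_len=$(wc -c < "$data_file" | tr -d ' ')
    echo "norton -rvm $log_file $data_file $data_len"
}
```

Run as `norton_cmd log data` in the directory that holds both files, then
check the printed length against what the server was configured with
before executing the command.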

> laptops that most of the time are simultaneously online. Under normal 
> circumstances this works flawlessly (thumbs up!), except when I forget to 
> authenticate on one of them (reintegration issues that I am still not able 
> to repair) and now this (accidental) server crash.

Repair is definitely an area that needs a lot of work before it becomes
usable. Some things repair easily while other types of conflicts are
(almost) impossible to repair.

In theory, server crashes should never be a problem: even when an
operation is active, we should just return to the last known consistent
state.

However, RVM was originally designed with raw disk/partition access in
mind and I'm not sure if the write-ordering assumptions remain valid
when we use files as the backing store. In a way, recovery behaviour
becomes file system specific if we unexpectedly lose power during a
transaction.

Most of the time a crash is a result of some limit in the Coda client or
server. For instance a directory that exceeds 256KB (hard limit that
needs a lot of changes to fix), or when we try to use more than 4096
resolution log entries (but this is a soft limit which can be changed
with volutil setlogparms).

I've had a couple of cases where the rvm log became corrupted. The first
thing to try is often 'rvmutl'. This tool can be used to open the rvm
log and replay pending operations. If that fails, I typically make a
backup copy of both the log and data files and then use rvmutl to
re-initialize/re-create an empty log file. During startup the server
will run various internal consistency checks to see if every directory
has content, all objects are referenced and that all meta-data in RVM
has corresponding data in /vicepa etc. Most of the time the server comes
up fine even when we had to forcibly reinitialize the log.
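
The backup step before touching the log with rvmutl can be sketched as
below. This is only an illustration: the function name backup_rvm and the
backup directory layout are made up, and it assumes the log and data
files are named 'log' and 'data' and live in one directory, as in the
listing quoted above.

```shell
# Hypothetical helper: copy the RVM log and data files into a timestamped
# backup directory before attempting any recovery with rvmutl.
# Run it only while the Coda server is stopped, so the files are not changing.
backup_rvm() {
    rvm_dir=$1      # directory holding the 'log' and 'data' files
    backup_root=$2  # where backups should be stored
    dest="$backup_root/rvm-backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # -p preserves timestamps and permissions on the copies.
    cp -p "$rvm_dir/log" "$rvm_dir/data" "$dest/" || return 1
    # Print the backup location so the caller can record it.
    echo "$dest"
}
```

Keeping both files together matters because the log only makes sense
relative to the data file it was written against.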

It can also be that the server fails to start because some corruption
exists in a specific volume. For instance, we may find a directory
descriptor without directory content. This seems to be more common when
a server crashes during or right after backups (I'm still not sure
whether the crash occurs because of the corruption or the corruption is
caused by the crash). Because these cases can prevent the server from
starting, I list such volumes in the 'skipsalvage' file. This allows the
server to complete its startup, but it will not try to activate the
problematic volumes.

If the server manages to start up with some volumes disabled, I bring
the server back down, and then use norton to mark them for destruction,
remove the skipsalvage file, restart the server (which then destroys the
marked volumes) and recreate the replicas when the server is back up.
The server setup here has all volumes replicated across at least 2
servers, so running a recursive ls from a client triggers resolution and
the newly created empty replica is repopulated with the data from the
other server.
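
The recursive ls itself is nothing special. A minimal sketch, with a
made-up helper name and with the path left for you to fill in (the volume
mount point under /coda on the client):

```shell
# Hypothetical helper: walk a directory tree so that every object in it
# is looked up once. On a Coda client, pass the mount point of the volume
# whose replica was recreated.
walk_tree() {
    # The listing itself is not interesting, only the lookups are,
    # so the output is discarded.
    ls -lR "$1" > /dev/null
}
```

The exit status is that of ls, so a non-zero result means some part of
the tree could not be listed and may need a closer look.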

Jan
Received on 2007-09-20 14:29:21