Coda File System

replicated operation during a server's downtime

From: <u-codalist-rcma_at_aetey.se>
Date: Thu, 29 May 2014 14:21:47 +0200

Hello,

I have been using Coda server replication for many years and it has
seemed to work as expected. Nevertheless, I recently encountered a
situation in which the behaviour looks odd.

Running Coda 6.9.5 on ia32.
All volumes in the realm are replicated across the realm's two servers.
All concerned computers have reliable hardware and are on the same LAN.
The realm seems to be "properly set up in every way", including DNS SRV
records pointing to the two servers.

When I take down one of the servers, some parts of the data become
unavailable on clients. An attempt to access the data yields either an
I/O error (observed on directories) or 'Connection timed out'.

I expect the client to be able to fetch the data from the remaining
server, but this does not happen.

A "cfs flush" or "cfs flushvolume" also makes the flushed parts of the
data unavailable until the missing server is brought back online.

Remarkably, a computer with a freshly initialized Coda client
(memory cache, no rvm), started when one of the servers is already down,
can access the data, fetching from the remaining server just fine.

Its twin, which was online and ran find over the data while talking to
both servers, gets 'Input/output error' on many directories when running
the same find command with only a single server available, until the
missing server is back.

From my observations, it looks like the data that suffers when one of
the servers goes away is the data that was "stat()-ed" while both
servers were available but is not present in the cache.

Does the cache/rvm become "poisoned" by references to the specific
server from which the getattr data was fetched?

Jan, would you comment on this issue and suggest what might be wrong
with my expectations or my setup?

Regards,
Rune
Received on 2014-05-29 08:27:23