Coda File System

Re: volume connection timeouts

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 2 May 2005 16:32:45 -0400
On Mon, May 02, 2005 at 10:47:32AM -0600, Patrick Walsh wrote:
> The apache user, www-data, gets tokens that give it read-only access to
> the httpd directory and write access to the logs directory.  There's

Logs won't get written back until they are closed. Also, if the client is
running (write-)disconnected, the pending store is optimized away when
the logfile is reopened, which makes it pointless to periodically close
them to flush the data back to the server. Basically you won't see a log
on the server until it gets rotated.
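
If it helps to picture it, the optimization is basically this (a minimal
sketch with made-up names, not the actual venus CML code):

    #include <algorithm>
    #include <vector>

    enum Opcode { STORE /* , other CML opcodes ... */ };
    struct FileId {
        unsigned volume, vnode, unique;
        bool operator==(const FileId &o) const {
            return volume == o.volume && vnode == o.vnode &&
                   unique == o.unique;
        }
    };
    struct CmlEntry { Opcode op; FileId fid; };

    // Logging a new STORE cancels any pending STORE for the same file:
    // the new record describes the file's complete current contents, so
    // the older one would never have reached the server anyway. This is
    // why closing and reopening a logfile while write-disconnected never
    // pushes intermediate data to the server.
    void LogStore(std::vector<CmlEntry> &cml, const FileId &fid)
    {
        cml.erase(std::remove_if(cml.begin(), cml.end(),
                                 [&](const CmlEntry &e) {
                                     return e.op == STORE && e.fid == fid;
                                 }),
                  cml.end());
        cml.push_back(CmlEntry{STORE, fid});
    }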

> nothing fancy in the setup.  For testing purposes, the coda client and
> server are on the same vmware machine.  There's only a single server.
> We're strongly connected and checkservers shows all up.  This is Coda
> 6.0.8.
> 
> 	Try to `ls httpd` and get "permission denied".  Try to `cfs la httpd`
> and get "httpd: Connection timed out".  Try `cfs flushobject httpd` and
> get "can't flush active file".  Stopping apache and then checking with

You may have lost tokens. With 6.0.8, losing tokens means losing access
even to parts that are accessible to anonymous users; this is 'fixed' in
6.0.9 by dropping the established user identity when tokens expire. It
could also be that you have a reintegration conflict, which makes the
volume switch to disconnected state. This is because a reintegration
(local-global) conflict actually moves everything in the volume to a
temporary local repair volume so that we can show both the local and the
global versions later on when the conflict is expanded. The problem is
that the temporary volume doesn't actually exist on the server, so any
uncached object cannot be fetched. The active file error could very
well mean that the object is considered important for repair purposes;
if it were thrown out we wouldn't be able to reach the location of the
conflict, or maybe there is a not-yet-reintegrated CML entry associated
with it.
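
To make the 6.0.8 vs 6.0.9 token behaviour concrete (a hypothetical
sketch, names made up, not the real venus access check):

    enum { READ = 0x1 /* other rights bits omitted */ };

    struct UserIdentity {
        bool established;    /* we have seen tokens for this uid */
        bool tokens_valid;   /* and they haven't expired */
    };

    int EffectiveRights(const UserIdentity &u, int user_rights,
                        int anyuser_rights)
    {
        if (u.established && !u.tokens_valid) {
            /* 6.0.8: the expired identity sticks around and fails every
             * check, even on objects System:AnyUser could read. */
            return 0;
            /* 6.0.9 'fix': drop the identity instead, i.e. fall through
             * to the anonymous rights below. */
        }
        return u.tokens_valid ? user_rights : anyuser_rights;
    }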

> venus.log:
> 	CacheFile::ValidContainer: 00/00/00/2B invalid

This was probably a file that was open for writing; during recovery its
size didn't match the size recorded in RVM. We only get to see the new
size when the file is closed. I guess it got filed away in
/usr/coda/spool (or /var/lib/coda/spool).
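
Conceptually the recovery check boils down to this (a sketch; the real
CacheFile::ValidContainer does a bit more):

    #include <sys/stat.h>

    // The container file on disk must still match the length recorded in
    // RVM. A file that was open for writing when venus died fails this,
    // because the new length is only recorded on close.
    bool ValidContainer(const char *path, off_t rvm_length)
    {
        struct stat sb;
        if (stat(path, &sb) < 0)
            return false;
        return S_ISREG(sb.st_mode) && sb.st_size == rvm_length;
    }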

> console:
> 	8 volume replicas
> 	6 replicated volumes  (is this a strange discrepancy?  there should
> only be 6 volumes)

Not necessarily; there are also the local root (/coda) and repair volumes.

> 	Local inconsistent object at ???/html/blah/blah, please check (this is
> on the httpd volume)
> 	worker::main got bogus opcode 35

That's probably why httpd couldn't get flushed. (Mountpoints are not
covered by default, so during startup we don't yet know where the
volume is mounted and show ???.)

> 	We can reproduce this reliably on a clean install and a reboot doesn't
> fix it.  The files in httpd are put there using a standard `cp -dR`

Well, considering that you're running both a client and a server on the
same machine, it is to be expected that they end up either weakly
connected (write-disconnected) or completely disconnected at times.
Running both of them in a vmware session probably only makes it happen
faster.

On Mon, 02 May 2005 13:33:02 -0400, Patrick Walsh wrote:
> with only one client and one server and both strongly connected, how
> does a conflict happen and how can it be avoided?

We can't guarantee strong connectivity, not when there is a network
between machines, and not even when both processes are running on the
same machine.

The whole strong/weak connectivity is an estimate, and the estimate is
based on observed delays and losses when sending RPC2 and SFTP packets.
Since we use UDP, losses are pretty normal, even when both processes are
on the same machine. In fact, the kernel is perfectly allowed to drop a
packet without even passing it to the receiving process, just because it
can't immediately free up memory for a packet buffer, or when the
transmit/receive buffer gets full. So packet loss is unavoidable, which
is why we have retransmission. Once we get a successful reply to a
request, we divide the delay into several components: server processing
time (which is already subtracted on the server side), outgoing latency
and bandwidth, and incoming latency and bandwidth. The latency and
bandwidth components are derived from previous estimates: given the
sizes of the request and reply packets and the expected latency and
bandwidth in either direction, new estimates are made. But really we're
taking one moderately reliable number (the time between sending a
request and receiving a reply) and attempting to derive five different
values from it based on historical measurements.
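
A simplified sketch of that estimator (made-up names and smoothing
constant, not the actual rpc2 code):

    // One measured number, four locally derived quantities (the server's
    // processing time was already subtracted on the server side).
    struct NetEstimates {
        double tx_lat, rx_lat;   /* latency, seconds */
        double tx_bw, rx_bw;     /* bandwidth, bytes/second */
    };

    void UpdateEstimates(NetEstimates &e, double observed_rtt,
                         double req_size, double reply_size)
    {
        /* observed_rtt = tx_lat + req_size/tx_bw +
         *                rx_lat + reply_size/rx_bw */
        double expected = e.tx_lat + req_size / e.tx_bw +
                          e.rx_lat + reply_size / e.rx_bw;

        /* spread the surprise proportionally over the components, then
         * smooth against the historical estimates */
        double scale = observed_rtt / expected;
        const double alpha = 0.25;    /* smoothing weight, made up */
        e.tx_lat = (1 - alpha) * e.tx_lat + alpha * e.tx_lat * scale;
        e.rx_lat = (1 - alpha) * e.rx_lat + alpha * e.rx_lat * scale;
        e.tx_bw  = (1 - alpha) * e.tx_bw  + alpha * e.tx_bw / scale;
        e.rx_bw  = (1 - alpha) * e.rx_bw  + alpha * e.rx_bw / scale;
    }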

This works most of the time when the client and server are on different
machines, because we can assume that almost all of the measured delay
can be attributed to the network, as the server already subtracted its
internal processing time. However, when both the client and the server
are on the same machine, we get additional scheduling/context-switching
delays and resource contention. Both the client and the server have a
big chunk of mmapped data that they need to access (and modify), the RVM
data. The client tends to work optimistically and flushes dirty data as
little as possible, while the server is very conservative and flushes
pretty much as soon as anything important happens. The way fsync is
implemented in the Linux kernel, the process that calls it is blocked
until all dirty data for the same device is flushed to disk.
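
The server's conservative pattern is essentially this (a trivial sketch,
not the actual server code):

    #include <unistd.h>

    // Make the write durable immediately. fsync() won't return until
    // the kernel has written out the dirty pages, which on a shared,
    // busy device can include pages dirtied by the other (client)
    // process, so the caller may stall for a long time.
    ssize_t durable_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n >= 0 && fsync(fd) < 0)
            return -1;
        return n;
    }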

So the client ends up dirtying a lot of memory, both RVM and the data
associated with the container file, but doesn't flush anything to disk.
Then it sends an RPC to the server, which fetches the data, most of
which is likely still in memory since the client didn't flush. But now
the server is hit with a double whammy: not only is it writing its own
dirtied data, but also the client's.

And because we're single threaded (userspace threading), the server
process is blocked until all the data has hit the disk. In the meantime
the client is eager for a response; earlier, when it was fetching data,
the server was really quick to respond, so the RTT estimate probably
ended up being near zero. So it gets impatient, assumes the request got
dropped, and retransmits. At the same time, the poor server is still
stuck waiting for the disk, and can't even dash off a quick ack telling
the client that it did get the request and is working on it.

So the client tries another couple of times and either gives up and
disconnects (not so likely with recent rpc2), or by the time it gets an
answer from the server it decides that the previous latency and
bandwidth estimates were far too optimistic and switches to
write-disconnected mode.
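
The client side of that dilemma, sketched (hypothetical names and
constants, stubbed-out transport):

    #include <algorithm>

    struct Packet { /* ... */ };
    const int MAX_RETRIES = 5;               /* made-up constant */
    void Transmit(const Packet &req);        /* stub */
    bool AwaitReply(double timeout_seconds); /* stub */

    // With a near-zero RTT estimate the initial timeout is tiny, so a
    // server stuck in fsync looks dead: we retransmit, back off, and
    // eventually either get a (very late) reply that wrecks the
    // estimates, or give up and go (write-)disconnected.
    bool SendWithRetries(const Packet &req, double rtt_estimate)
    {
        double timeout = std::max(2 * rtt_estimate, 0.3 /* floor, sec */);
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            Transmit(req);
            if (AwaitReply(timeout))
                return true;
            timeout *= 2;   /* exponential backoff */
        }
        return false;
    }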

Now all of this is most likely to happen when we're performing a STORE
operation. And the fallback for a store during disconnection is a very
special case. Most other operations actually return ETIMEDOUT to
userspace, but since stores happen on close, and most applications don't
retry when close returns a failure, the store is logged in the CML.
This is done by internally retrying the operation, but we end up logging
it with a different operation identifier than the timed-out store
operation. Not really a problem if we really timed out, but in this case
the server probably ended up completing the operation; we just missed
the final reply. And when we reintegrate we automatically hit a
conflict, because the locally cached version vector is different from
the server's version _and_ there is a different store identifier
associated with the operation. So we assume we got an update/update
conflict, i.e. another client wrote to the file while we were
disconnected.
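
Roughly (a sketch with made-up names, not the actual reintegration
code):

    // Simplified: a version stamp plus the identifier of the store
    // operation that produced it.
    struct Version {
        long vv[8];      /* version vector, one slot per replica */
        long store_id;
    };

    static bool SameVV(const Version &a, const Version &b)
    {
        for (int i = 0; i < 8; i++)
            if (a.vv[i] != b.vv[i])
                return false;
        return true;
    }

    // The server already applied the original store, so the version
    // vectors no longer match; and the internal retry was logged under a
    // fresh store identifier, so we can't recognize the server's copy as
    // our own update. Result: a (false) update/update conflict.
    bool LooksLikeUpdateUpdateConflict(const Version &local,
                                       const Version &server)
    {
        return !SameVV(local, server) &&
               local.store_id != server.store_id;
    }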

Jan
Received on 2005-05-02 16:34:02