Coda File System

Re: Coda connectivity lossage

From: Ivan Popov <pin_at_medic.chalmers.se>
Date: Wed, 21 Jul 2004 09:26:08 +0200
Hello Olin,

Nice to see a detailed and structured report.
(Though I could not find how big your client caches were.)

> filesys *on the machine where the coda server runs*. This worked, thought it
> was a little weird to see "red zone -- stalling blah blah" messages on such
> net-less operation.

It is not the network but the server that limits the speed.
When the client runs on the same machine, you put even more load on the
server by adding the client's load.
As for performance, I cannot saturate a 100Mbit connection even when copying
big files, with my 466MHz server and its disks. YMMV.

>     4b0fb02ca5944804cc403b6ff1f3797a  ./affection/audio_01.inf

>     md5sum: ./affection/track05.cdda.flac: Connection timed out
>     find: ./affection/audio_08.inf: Connection timed out

I think this has been discussed before.
There are situations where the server takes a long time to answer.
Combined with some packet loss, this may lead the client to conclude
that the server is unreachable.

It is fundamental to Coda that the client decides by itself
when to go disconnected. We do not want the client to stall for too long.
Maybe it is possible to improve the protocol and the algorithm, but
the current behavior is not at all "that bad", and it is hard to change.

There is room for improvement and Jan is well aware of the problem,
but I think it is rather low on the priority list.

> Again, note that this lossage occurred on a system with no cable modem, and
> presumably symmetric bandwidth to the server.

Cable modems aggravate the problem, but it is present otherwise too.

>     Red zone, stalling writer ( 00:33:35 )
> 
> messages, and then the client went write-disconnected.

In http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2004/6115.html
Jan wrote

| I've said this many times before, there is no such thing as guaranteed
| connected operation in Coda. If anything goes wrong during a write/store
| operation the client will silently switch to write-disconnected
| operation (logging state). If the server is slow to respond we switch to
| a logging state. And reversely, when the client can't be reached by the
| server, the server triggers the disconnect were are likely to switch to
| a logging state.
|
| The only thing that cfs strong does is prevent the client from listening
| to the often incorrect 'bandwidth estimates' from the RPC2 communication
| layer, so that transitions only happen in error cases and not based on
| incorrect estimates. In fact, if you were already write-disconnected
| before calling cfs strong, the client will never discover that the
| network actually has good bandwidth and will never transition to the
| connected state.

>   Write-back is VIOC_STATUSWB: Invalid argument
> 
> Note the weirdo final line -- "VIOC_STATUSWB: Invalid argument"? What's that? 

It is an artifact of write-back caching, which was once implemented and
tested but did not meet expectations. It is no longer present except for
some traces like this one.

> So the real-world operation of coda here is that if you start writing a lot of
> data, you disconnect, and then your writes just fail. So you can't ever count
> on some operation actually working; it could very easily fail mid-stream.

It depends on the operation and the circumstances.
If you start the operation during good connectivity and then your
mobile phone connection goes down, then both reading (obviously) and writing
(say, when you do not have enough space in the client cache or in the CML)
can fail.

Of course we do not want the connection to be treated as unavailable
while the net and the server are still there. It will get better over
time, but for the moment you have to take precautions for bulk copies.
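
One possible precaution, as a rough sketch only: copy files one at a time and
retry a failed copy after probing the servers with "cfs cs". The names
copy_with_retry, CFS, RETRIES and PAUSE are made up for this illustration and
are not part of Coda, and the retry count and pause are arbitrary. It assumes
that a failed write shows up as a non-zero exit status from cp.

```shell
#!/bin/sh
# Hypothetical helper, not part of Coda: retry a single-file copy.
CFS="${CFS:-cfs}"          # command used to probe the servers
RETRIES="${RETRIES:-5}"    # arbitrary number of attempts
PAUSE="${PAUSE:-10}"       # arbitrary pause (seconds) between attempts

# copy_with_retry SRC DST: returns 0 on success, 1 if every attempt failed.
copy_with_retry() {
    src="$1"; dst="$2"; attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        if cp -p "$src" "$dst"; then
            return 0
        fi
        # Ask the client to re-check the servers before trying again.
        "$CFS" cs >/dev/null 2>&1 || true
        sleep "$PAUSE"
        attempt=$((attempt + 1))
    done
    return 1
}
```

Calling it once per file, for example from a "find ... | while read" loop,
keeps a single timeout from aborting the whole bulk copy.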

> that access my coda files sometimes win and sometimes seem to drive the
> system into disconnected state, and then I must go through a
>     cfs wr
>     cfs cs
>     cfs lv .
> dance to reconnect. This happens when I am on a client with a completely
> stable connection to the ethernet. We are not talking phone lines here.
> This essentially renders coda unusable.

I am familiar with the problem, yet I still find Coda usable.
One workaround I used when my servers or network were slow
was to run "cfs cs" in a loop, which helps against disconnections.
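
A minimal sketch of such a loop, in case it helps. The function name
keep_connected and the variables CFS and INTERVAL are only illustrative, and
the 30-second interval is an arbitrary guess, not a recommended value.

```shell
#!/bin/sh
# Hypothetical wrapper around the "cfs cs" workaround; names are illustrative.
CFS="${CFS:-cfs}"              # the Coda cfs command
INTERVAL="${INTERVAL:-30}"     # seconds between probes (arbitrary choice)

# keep_connected [COUNT]: run "cfs cs" COUNT times; 0 or no argument = forever.
keep_connected() {
    count="${1:-0}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Ignore failures: the point is only to keep probing the servers.
        "$CFS" cs >/dev/null 2>&1 || true
        i=$((i + 1))
        sleep "$INTERVAL"
    done
}
```

One way to use it is to start "keep_connected &" before a large copy and
kill the background job once the copy has finished.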

> Some questions:
> 1. Am I doing something wrong?

Nothing obviously wrong.

> 2. Do other people lose in this way? / Are other people winning?

It is a known problem ("unnecessary disconnections" where a retry or a longer
wait would help). A lot of complaints may raise the priority of a fix...
There is probably one sure way to get the fixes done: just fund the work :)
I'd rather accept these inconveniences in exchange for more important fixes
and improvements.

> 3. Is coda not ready for really big repositories (800Gb filesys, 1Gb rvm
>    metadata)?

I am running with 768MB of RVM, but as my files are small - "typical Unix" :) -
it maps to only about 30GB of data at most.
It should not be a problem to fill more space with bigger files.

> 4. Any advice at all?

Coda offers unique possibilities - at a price. The usage pattern
has to be "Coda friendly" - and will probably have to remain so, even after
the eventual fixes and improvements.

My 2c,
--
Ivan
Received on 2004-07-21 03:30:56