Coda File System

Re: codasrv crashes, won't come back up, production server down :(

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 13 Aug 2004 18:41:47 -0400
On Fri, Aug 13, 2004 at 02:03:56PM -0700, Steve Simitzis wrote:
> the reliability of what aspect?
> 
> On 08/13/04, Jan Harkes <jaharkes_at_cs.cmu.edu> wrote: 
> 
> > I'm wondering if there are some flags to relax the rpc2 timeouts to a
> > minute or more (instead of the current 15 seconds). That should add a
> > bit to the overall reliability.

The "connection timed out" problems you mentioned earlier. If we increase
the timeout, RPC2 will be more patient and not give up when it doesn't
receive a response from the server within 15 seconds.

Of course this also means that it takes longer before we realize that a
request or reply packet was lost and we need to retransmit. This can be
compensated for by increasing the number of retries over the timeout
period. However, increasing the number of retries won't help when the
packet was simply delayed because the server was busy, and if it was
really lost due to network congestion we're likely to only make the
congestion worse and end up losing even more packets.

So it is really a two-edged sword: increasing the timeout will make us
more resilient, but increases the delay observed by users when there is
packet loss or when a server has crashed. Increasing the number of
retries will help if we lost packets, but if the loss was due to
congestion we're only adding to the problem.
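
To put rough numbers on the tradeoff, here is a minimal sketch. It
assumes retries are spread evenly over the timeout period, which is a
simplification and not necessarily how RPC2 schedules its
retransmissions. The function names and the retry counts are made up for
illustration; only the 15 second and one minute timeouts come from the
discussion above.

```python
def retransmit_schedule(timeout, retries):
    """Seconds after the first send at which the packet is resent.

    Assumption: retries are spaced evenly across the timeout period.
    """
    interval = timeout / (retries + 1)
    return [interval * i for i in range(1, retries + 1)]


def worst_case_packets(retries):
    """Packets sent for one request if no reply ever arrives."""
    return 1 + retries  # the original send plus every retry


# (timeout in seconds, number of retries); retry counts are hypothetical
for timeout, retries in [(15, 4), (60, 4), (60, 19)]:
    first_resend = retransmit_schedule(timeout, retries)[0]
    print(f"timeout={timeout}s retries={retries}: "
          f"loss noticed after {first_resend}s, "
          f"up to {worst_case_packets(retries)} packets per request")
```

Under these assumptions, a 15 second timeout with 4 retries notices a
lost packet after 3 seconds. Stretching the timeout to 60 seconds with
the same 4 retries pushes that to 12 seconds, and getting back to 3
seconds takes 19 retries, i.e. up to 20 packets per request on a network
that may already be congested.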

Jan
Received on 2004-08-13 18:43:28