On Fri, May 14, 2004 at 12:18:09PM -0400, shivers_at_cc.gatech.edu wrote:
> Things go wrong when I just say on ClientW
>     cp <12-big-files> /coda/Server/shivers/.
> The system writes about three files, then things get screwy & disconnected.
> What happens, I *think*, is that the write goes in two stages: it's written
> into the venus cache quickly, then dribbles out over my cable modem slowly.

Ahh, cable modem, asymmetric network...

I don't have DSL or cable myself, and Coda used to work reliably only on networks that had identical upload and download speeds. What happens is that during the fetches we see an amazingly fast network, but we time out and get disconnected as soon as we try to write even a little bit of data, because the acks are taking far too long.

RPC2 'thinks' we have a 3MB/s symmetric network, so when it sends several KB and doesn't see the ack within a couple of milliseconds, it believes the packet got lost and retransmits. This only makes the congestion on the uplink even worse. Once we hit about 5 retransmissions and still haven't seen the ACK message, the client gives up and disconnects from the server.

A couple of the CMU grad students who are using Coda here got DSL (384Kb down/128Kb up?) and complained enough for me to try to (blindly) fix it. RPC2 now _tries_ to estimate uplink and downlink speeds independently. It mostly solved the problems for them, but as I really don't have anything to test it with, I'm pretty sure this wasn't a perfect solution.

Btw, TCP probably has similar issues if you use a persistent connection, first fetch a lot of data for a while, and then try to push data back. It is just that typical use is either one-directional, or is started by the client behind the DSL sending a request on the slow uplink and then getting a response on the downlink. Ramping up in speed is no problem; it is the sudden degradation that bites. TCP is also a bit more tenacious: it backs off further instead of giving up entirely after 5 retries.

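To make the retransmission problem above a bit more concrete, here is a rough Python sketch of a sender that sizes its retransmit timer from a (wrong) bandwidth estimate. This is not RPC2's code. The fixed timer, the `slack` factor, and the function name `simulate_send` are assumptions made up for illustration; only the "about 5 retransmissions" limit and the 3MB/s and 128Kb/s figures come from this message. Latency and queueing of the duplicate copies are ignored.

```python
def simulate_send(packet_bytes, estimated_bps, actual_uplink_bps,
                  max_retries=5, slack=2.0):
    """Toy model: return (acked, retransmissions) for one packet.

    The sender expects the packet to cross the link at `estimated_bps`
    and retransmits every `timeout` seconds until the first copy has
    really crossed the uplink at `actual_uplink_bps`.
    The timer formula is hypothetical, not taken from RPC2.
    """
    bits = packet_bytes * 8
    timeout = slack * bits / estimated_bps   # what the sender believes
    real_time = bits / actual_uplink_bps     # what the uplink delivers

    retries = 0
    deadline = timeout
    while real_time > deadline:
        if retries == max_retries:
            return False, retries            # give up: "disconnected"
        retries += 1                         # another copy hits the uplink
        deadline += timeout
    return True, retries


if __name__ == "__main__":
    # 4 KB packet, estimate of 3 MB/s (24 Mbit/s), real uplink 128 Kbit/s.
    print(simulate_send(4096, 24_000_000, 128_000))
    # Same packet on a link that really is as fast as the estimate.
    print(simulate_send(4096, 24_000_000, 24_000_000))
```

With these numbers the packet needs about a quarter of a second to cross the uplink, while the sender has already exhausted its retries after roughly 16 ms, so the asymmetric case ends in a disconnect even though nothing was actually lost.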
> When this bottleneck causes enough reintegration data to build up, blammo.
> The lossage is as I described in my last message: cfs lv shows the system
> in some kind of disconnected state, and cfs wr won't make it reconnect.
>
> So the message seems to be that if I don't press the system hard, it works.
> Under pressure, it falls over. For me, that's progress. Now I want to
> understand the current hosage. Can anyone help?

Well, one thing is that your connection really is 'weak' in Coda's terms. The uplink speed is probably on the order of 64 or 128Kb/s, so the client prefers to work write-disconnected. You can tell it not to adapt to the estimated network bandwidth by using 'cfs strong'. This should prevent the (connected -> write-disconnected) transition.

However, you can still become write-disconnected through the (connected -> disconnected -> write-disconnected) transition. In other words, if RPC2 misses a beat and times out, you end up logging the change, and you won't automatically return to the connected state when we notice that the server hasn't really gone away.

One possible reason your client isn't reintegrating is that the pending changes haven't 'aged' long enough. Statistically, any file that hasn't been removed within 5 or 10 minutes after creation is likely to be around for several months. So a lot of bandwidth is saved by delaying reintegration long enough that short-lived (temporary) files can be optimized away locally.

The other reason could be that the estimated bandwidth is so incredibly low that the client thinks it can't even reintegrate a single record without blocking the user for a significant amount of time. I believe the formula was something like: size of reintegration / bandwidth has to be less than 15 seconds.

The low bandwidth estimate would be caused by RPC2's own insistence on retransmitting 'lost' packets: if every packet is sent 4 or 5 times, the copies all eat up the available link bandwidth.
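
As a back-of-the-envelope illustration of those last two points, the following Python sketch combines the "less than 15 seconds" rule with the effect of sending every packet several times. The function names and the exact form of the check are assumptions; the message itself only says the formula was "something like" size / bandwidth < 15 seconds, and venus may well compute it differently.

```python
REINTEGRATION_LIMIT_S = 15.0  # "something like" 15 seconds, per the message


def effective_bandwidth(link_bps, copies_per_packet):
    """Useful throughput when every packet is sent `copies_per_packet` times."""
    return link_bps / copies_per_packet


def can_reintegrate(record_bytes, estimated_bps, limit_s=REINTEGRATION_LIMIT_S):
    """Hypothetical check: would shipping this record take under `limit_s`?"""
    return record_bytes * 8 / estimated_bps < limit_s


if __name__ == "__main__":
    uplink = 128_000                           # 128 Kbit/s uplink (a guess)
    degraded = effective_bandwidth(uplink, 4)  # every packet sent 4 times
    for size in (50_000, 100_000, 500_000):    # record sizes in bytes
        print(size,
              can_reintegrate(size, uplink),
              can_reintegrate(size, degraded))
```

Under these assumptions a 100,000-byte record passes the check at the full 128 Kbit/s but fails once retransmissions have dragged the estimate down by a factor of four, which would be enough to keep the client from reintegrating at all.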
128Kb/s would end up looking more like 32Kb/s (4KB/s), which really is a trickle.

Jan

Received on 2004-05-14 14:05:13