On Mon, Mar 06, 2006 at 12:00:51PM -0500, Martin Ginkel wrote:
> Jan Harkes wrote:
> >However backfetches are using the same connection, and your backfetch is
> >taking very long. So the server is unable to send the ping back to the
> >client. This shouldn't be a problem because the server should be
> >responding with RPC2_BUSY which will make the client wait an extra 15
> >seconds or so. I guess at some point the client did give up, returned
> >ETIMEDOUT and disconnected.
>
> Hmm. How is this RPC2_BUSY supposed to work.

After the client sends a request, it starts to wait for a reply. If we
haven't seen the reply after the estimated round trip time, we assume
that either the request or the reply packet was lost and the request is
retransmitted (the wait time is increased exponentially with each
attempt). This is done up to 5 times.

If the request was lost, the server will see one of the
retransmissions. If the reply was lost, the server will notice that the
connection already handled that request and it will retransmit the
reply. Up to this point everything is pretty predictable.

But if the server is still processing the request, we don't actually
have anything to retransmit, so the server sends back an RPC2_BUSY
response to let the client know that it did get the request and is
still working on it. All of the retransmit/busy handling happens at the
lowest RPC2 layer on the server, the 'socketlistener' thread. So as
long as the server is handling incoming packets, it should be able to
respond.

When the client receives an RPC2_BUSY response, it immediately bumps
the retransmission period to the maximum RPC2 timeout value, as we know
the request got there and the reply isn't ready yet. If that times out
we resend the request, and get either a retransmitted response or
another BUSY.

> Should this ObtainWriteLock somehow timeout? Which thread should
> send the BUSY reply?
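
To make the client side of this concrete, here is a minimal sketch in
Python. It is a model of the behaviour described above, not code from
RPC2. The names (run_client, INITIAL_RTT, MAX_TIMEOUT), the doubling
factor, and the 0.5 and 15 second values are assumptions for
illustration. Whether retransmissions sent after a BUSY count against
the limit of 5 is not stated above; this sketch counts them.

```python
# Hypothetical model of the client-side wait logic; not RPC2 source code.
INITIAL_RTT = 0.5      # assumed estimated round trip time, in seconds
MAX_TIMEOUT = 15.0     # assumed maximum RPC2 timeout, in seconds
MAX_RETRANSMITS = 5    # "This is done up to 5 times"


def run_client(events, rtt=INITIAL_RTT):
    """Replay a list of events and return (outcome, log).

    Each event is what ends the current wait: "timeout" (nothing
    arrived), "busy" (RPC2_BUSY arrived) or "reply" (the real reply).
    The log records (wait in effect, event) pairs.
    """
    timeout = rtt
    retransmits = 0
    log = []
    for event in events:
        log.append((timeout, event))
        if event == "reply":
            return "done", log
        if event == "busy":
            # The server has the request and is still working on it,
            # so wait the maximum period before asking again.
            timeout = MAX_TIMEOUT
            continue
        # Nothing arrived: give up, or retransmit with a longer wait.
        if retransmits == MAX_RETRANSMITS:
            return "ETIMEDOUT", log
        retransmits += 1
        timeout = min(timeout * 2, MAX_TIMEOUT)
    return "waiting", log
```

With only timeouts, run_client(["timeout"] * 6) walks through waits of
0.5, 1, 2, 4, 8 and 15 seconds and then gives up with ETIMEDOUT. With
["timeout", "busy", "reply"] the wait jumps straight to the maximum
after the BUSY, and the call completes normally.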
> As far as I can see, the client sends the Probe several times on
> the RPC2 level, without getting *any* reply.

Well, the first probe request probably got stuck on that lock, but at
that point the connection state should be S_PROCESS, and any new
packets received by the server on the same connection should
automatically get an RPC2_BUSY response. The only thing I can think of
is that it is sending the busy to a bad address, but then I would think
that the response would have been sent to the wrong place as well.

> >This is actually not a reintegration write lock, this is caused by the
> >fact that there is only a single RPC2 connection from the server to the
> >client, so it can only do one thing at a time. Fetch a file, or send a
> >callback probe.
>
> You mentioned, that clients should break up their store OPs.
> Can they break up the transfer-size below file-size?
> Or do they have to transmit at least one file completely.
> I think this matters for the ISO-Images.

It happens right before reintegration. If the first CML entry is a
large store, it will be sent in smaller chunks. Then when we
reintegrate we add a handle to the store log entry, which makes the
server avoid the backfetch and use the previously sent data.

Jan

Received on 2006-03-06 12:33:12
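
The server-side decisions described in this message (dispatch a new
request, answer BUSY while the connection is in S_PROCESS, retransmit
a reply that was already sent) can be sketched roughly as follows. This
is a Python model, not the RPC2 socketlistener. Only S_PROCESS is a
name taken from the message; S_IDLE, the Connection fields, and the use
of a sequence number to recognise a repeated request are assumptions
made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

S_IDLE = "S_IDLE"        # hypothetical name: no request in progress
S_PROCESS = "S_PROCESS"  # state named in the message


@dataclass
class Connection:
    state: str = S_IDLE
    current_seq: Optional[int] = None  # request being processed, if any
    last_seq: Optional[int] = None     # last request that was answered
    last_reply: Optional[str] = None   # saved copy of that answer


def on_packet(conn: Connection, seq: int) -> str:
    """Return the action taken for an incoming request packet."""
    if conn.state == S_PROCESS:
        # Still working: there is nothing to retransmit yet.
        return "send BUSY"
    if seq == conn.last_seq and conn.last_reply is not None:
        # The reply was lost on its way back: send the saved copy.
        return "retransmit reply"
    # A request we have not seen before: hand it to a worker.
    conn.state = S_PROCESS
    conn.current_seq = seq
    return "dispatch to worker"


def on_reply_ready(conn: Connection, reply: str) -> None:
    """Record the finished reply and leave S_PROCESS."""
    conn.last_seq, conn.last_reply = conn.current_seq, reply
    conn.state, conn.current_seq = S_IDLE, None
```

In this model a worker stuck on a lock never calls on_reply_ready, so
every later packet on that connection gets "send BUSY", which is the
behaviour expected above.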