On Tue, Dec 02, 2003 at 08:25:05AM -0800, Steve Simitzis wrote:
> the problem is that codasrv will freeze, apparently unbind all its
> connections, and refuse to do much of anything. the only way to get it
> running again is to kill -9 codasrv, and restart everything.

I've seen similar freezes on our test server and attributed them to clients that connect from behind a masquerading firewall without lowering the server-probe timeout.

The problem is that netfilter/iptables UDP connection tracking forgets about forwarded ports within 3 minutes, but the normal server probe happens only about once every 5 minutes. So each probe sets up a bunch of new connections from a new port when it revalidates the local cache.

The server isn't very smart yet and tracks a client based on its IP address. So over time it builds up more and more RPC2 connection endpoints, but because some of these connections have always been used recently, it never expires them. After a couple of days (or weeks) it spends so much time looking for a matching connection endpoint for each incoming packet that the server seems to freeze. This disconnects any clients with pending operations, and they reconnect, only making the problem worse.

This is my current 'theory' about what is causing this. A server restart clearly fixes it for a while, because that way we get rid of all those 'dead' endpoints. Another solution is to pull the network wire for about 10 minutes :)

I'm not yet sure where to 'attack' this problem. For one, the server should become a little smarter about tracking clients and which connections belong to them or are still active. But maybe rpc2 has an exponential growth problem in the lookup path where it matches incoming packets.

Jan

Received on 2003-12-05 17:07:47
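
To make the timing argument in the theory above concrete, here is a toy simulation. It is only a sketch and is not Coda or RPC2 code: the function name `simulate`, the address `10.0.0.1`, the starting port `40000`, and the one-connection-per-probe model are all assumptions made for illustration. It also assumes the server never expires an endpoint while the client's IP address stays active, as the theory suggests.

```python
def simulate(probe_interval, nat_timeout, hours):
    """Count server-side endpoints left after `hours` of periodic probes.

    Toy model (assumptions, not Coda/RPC2 behaviour):
      - one client behind a NAT box, one connection per probe
      - the NAT forgets a UDP mapping after `nat_timeout` idle seconds
      - the server never expires endpoints of a recently active client
    """
    endpoints = set()   # (ip, port) pairs the server still remembers
    next_port = 40000   # next source port the hypothetical NAT hands out
    port = None         # source port of the current NAT mapping
    last_seen = None    # time of the previous probe, in seconds
    t = 0
    while t <= hours * 3600:
        if last_seen is None or t - last_seen > nat_timeout:
            # The mapping was forgotten, so the NAT picks a fresh port.
            port = next_port
            next_port += 1
        endpoints.add(("10.0.0.1", port))
        last_seen = t
        t += probe_interval
    return len(endpoints)


if __name__ == "__main__":
    # Probe every 5 minutes, NAT forgets after 3 minutes: one new endpoint per probe.
    print(simulate(probe_interval=300, nat_timeout=180, hours=24))  # 289
    # Probe every 2.5 minutes: the mapping stays alive, a single endpoint.
    print(simulate(probe_interval=150, nat_timeout=180, hours=24))  # 1
```

In this model the endpoint count grows with every probe as long as the probe interval exceeds the NAT timeout, and stays at one once the probe interval is lowered below it, which matches the suggestion to lower the server-probe timeout for clients behind a masquerading firewall.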