On Wed, Apr 23, 2003 at 06:06:33PM -0400, Steve Simitzis wrote:
> i'm running venus on two production machines, both linux 2.4.20.
> today i woke up to find venus frozen on both machines.
>
> by "frozen" i mean:
>
> (1) cfs listvol showed that the volumes were connected.
> (2) there were no reports of any crashes in any log.
> (3) venus.log activity appeared normal (nothing but the occasional
>     BeginRvmTruncate and EndRvmTruncate message).
> (4) no file access could take place, to the point where a simple ls
>     of a volume or any file access would hang indefinitely.

That sounds like venus ran out of worker threads to deal with new upcalls, possibly caused by something like a lock-ordering problem, or some thread not releasing a critical resource. I've not seen anything like that lately. The only unfixed case that I know of is when a user doesn't have tokens and there is a reintegration log. An incoming write from another user then blocks to wait for the reintegration to 'complete', but it never does because the CML owner doesn't have tokens. The second user kills the write (^C), but the thread is still waiting, and when the user retries the operation he simply 'locks up' another thread.

> i, at first, suspected the coda server, since both venus clients had
> stopped responding at the same time. but restarting it fixed

Well, from the tcpdumps you sent me related to the 15-30 second stalls during small file fetches, it looks like your network is dropping bursts of packets every couple of minutes. Something I really wouldn't expect on a switched 100base-T-FD local area network. Perhaps both clients were affected in some way by some period of sustained packet loss around that time?

> once i restarted each venus client, however, everything was fine, as
> if nothing had ever happened.

Sure sounds like some kind of worker thread starvation. Venus only has about 20 worker threads. When files or attributes are fetched, the worker typically holds a lock on the object to avoid concurrent fetches for the same object. Other upcalls that try to access the same object have to wait for that lock to clear.

Now if the network drops a packet we have to retransmit the request, and that can take anything between 300 msec and 15 seconds (pathetic worst case behaviour). So if we have a bunch of apache processes (>20) that receive about 50-100 requests per second, we get about 15-30 new upcalls during the 300 msec timeout (and 750-1500 during a 15 second stall). Typical web accesses are pretty much focussed on the index.html files, so we can very quickly run out of available worker threads.

Reducing the number of apache processes, or increasing the number of available worker threads (-maxworkers 50), could very well help a lot in keeping the system running.

Jan
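[Editor's note: as a rough illustration of the arithmetic in the last two paragraphs, the sketch below (not part of the original mail) models how quickly queued upcalls can outgrow the worker pool during a retransmission stall. The request rates, the 300 msec and 15 second timeouts, the ~20-thread default, and the -maxworkers 50 suggestion are the figures quoted above; everything else is illustrative.]

```python
# Back-of-envelope model of the worker-thread starvation described in the
# mail: how many new upcalls pile up while one fetch is stalled waiting
# for a retransmission, compared to the number of venus worker threads.

def upcalls_during_stall(requests_per_sec: float, stall_sec: float) -> float:
    """Upcalls that queue up while the fetched object stays locked."""
    return requests_per_sec * stall_sec

for rate in (50, 100):            # apache request rates from the mail
    for stall in (0.3, 15.0):     # retransmit timeout: 300 msec .. 15 sec
        queued = upcalls_during_stall(rate, stall)
        for workers in (20, 50):  # ~20 default threads vs. -maxworkers 50
            state = "starved" if queued > workers else "ok"
            print(f"{rate:>4} req/s, {stall:>5}s stall, {workers} workers: "
                  f"{queued:>6.0f} queued upcalls ({state})")
```

Even with -maxworkers 50, a single 15 second stall under this load would leave hundreds of upcalls waiting, which is consistent with the advice to also reduce the number of apache processes.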