Morning all,

I believe I have a similar issue to that described by Steve in the post http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5878.html - which I don't think was ever resolved, so I'd like to take up the torch.

I am trying to use coda as the basis of a replicating, highly available file store shared between 2 mail servers in a cluster.

To summarise - if I send fewer than 10 mails per sec to the cluster, it seems to be able to handle this load pretty much indefinitely. With more than 10 per sec, after a couple of thousand mails or so I get warnings in SrvLog and the MTA freezes while accessing files via the /coda/<realm> mountpoint. Restarting codasrv seems to fix it, but it just goes wrong again the same way when I restart the test. This cluster should be able to handle 40-50 messages per sec at least.

o I am running coda (codasrv, venus et al) 6.0.3 (built by me with gcc 3.2.2), slackware linux 9.0, kernel 2.6.3 (with whatever standard coda sources come with the kernel).
o The servers are Dell 1750 dual Xeons, 4GB RAM.
o I have set up a realm which includes the 2 servers, and I authenticate using a cron script that calls clog. Both machines are running both codasrv and venus.
o Both clients are connected read-write and both servers are apparently up.
o The MTA accesses the shared directories via /coda/<realm>/
o I set up coda and venus using default paths, and the largest default options available in the setup scripts.
o It is extremely unlikely (I don't think possible actually) that both clients will attempt to access the same file simultaneously, though it's entirely possible that one system may attempt to delete a non-empty dir that contains files open on the other system (clearly I anticipate failure here!)

The failure mode is always the same. Here is a typical entry in the SrvLog:

07:34:34 ****** WARNING entry at 0x8122320 already has deqing set!
Here is where codasrv is at:

(gdb) where
#0  0x4024b44e in select () from /lib/libc.so.6
#1  0x400812fc in __JCR_LIST__ () from /usr/lib/liblwp.so.2
#2  0x4007d130 in IOMGR (dummy=0x0) at iomgr.c:354
#3  0x4007ef16 in Create_Process_Part2 () at lwp.c:796

The MTA is stuck in an open() call.

I am pretty new to coda, so I'm not too sure where to go with this beyond trawling the coda ML and google and trying anything that seems remotely related - which I have done. I have copies of all logs, and I will post anything that anyone thinks would be useful.

I have tried (clutching at straws) setting serverprobe=120, and 60, with no difference. I have iptables loaded on both servers but not doing any masquerading (not doing anything actually beyond the defaults) - so I'm inclined to think that Jan's idea that it is related to socket routing and masq does not apply in my case.

This is eminently reproducible. I can reproduce it within 2 minutes on demand.

This is a showstopper for us regarding our use of coda. I have some limited time, however (a couple of days before I am forced to abandon coda for some less satisfactory alternative), and I am happy and keen to assist in debugging in any way I can in that period. I have not delved into the coda source yet, but I'm open to suggestions and I am a reasonably competent programmer.

I'm hoping to entice someone from the coda core team to help out here, as it seems like there is a serious fundamental bug which, if fixed, would greatly benefit the coda community, especially those wanting to load coda up a bit. It happens so quickly and regularly in my case that I can't believe others aren't in the same boat.

Trying not to sound too desperate ... :)

Cheers
Jim Page

Received on 2004-06-01 05:55:35
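
For anyone wanting to try reproducing the load pattern described above without an MTA, it could be approximated with a sketch like the following. This is not the test harness used in the report; it only creates and closes small files at a roughly steady rate in one directory. The function name run_load, the file naming and the payload size are all hypothetical choices for illustration.

```python
import os
import time


def run_load(target_dir, rate_per_sec, total, payload=b"x" * 2048, sleep=time.sleep):
    """Create `total` small files in target_dir at roughly rate_per_sec files/sec.

    Returns the number of files written. All names here are illustrative
    assumptions, not part of coda or of the original test setup.
    """
    interval = 1.0 / rate_per_sec
    written = 0
    next_due = time.monotonic()
    for i in range(total):
        # Include the pid so two generators on two hosts do not share file names.
        path = os.path.join(target_dir, "msg-%d-%06d" % (os.getpid(), i))
        # open() is the call the MTA was reported to be stuck in.
        with open(path, "wb") as f:
            f.write(payload)
        written += 1
        # Pace against a fixed schedule so slow writes do not lower the target rate.
        next_due += interval
        delay = next_due - time.monotonic()
        if delay > 0:
            sleep(delay)
    return written
```

Calling run_load with a directory under the /coda/<realm> mountpoint, a rate of 20 and a total of a few thousand would roughly correspond to the load at which the freeze was seen; whether it triggers the same warning is untested.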