Hi Jan,

Please see embedded comments.

> More likely a 64-bit issue.
>
> How did you build the servers on the 64-bit machines? Were they compiled
> as 32-bit applications?

The sparc64 systems are actually running in 32-bit mode: NetBSD 3.0 with
a 32-bit userland and a 32-bit kernel. It is probably a bad habit of mine
to refer to this as "sparc64"; I use the term to differentiate between
NetBSD running on sun4[c|m] versus sun4u, rather than between
NetBSD/sparc running in 32-bit mode versus 64-bit mode.

I built coda and all its dependencies by hand -- no packages -- and they
are all 32-bit. The trouble strikes native 32-bit sun4m machines and
64-bit-in-32-bit-mode sun4u machines all the same.

> > sonnet% cfs lv /coda/diablonet.net/tmp
> > (just sits forever)
> > ^C/coda/diablonet.net/tmp: Interrupted system call
> > sonnet%
> >
> > (this happens both against my own servers, and against
> > testserver.coda.cs.cmu.edu so I know it's not anything wrong with my
> > server configuration)
>
> So the 'cfs lv' hangs on a 32-bit sparc client, even when talking to
> testserver? If this is the first time you access the realm it could be a
> DNS resolver problem.

Yes, that's correct. This is after doing something like
cd /coda/testserver.coda.cs.cmu.edu;ls though, so it should already have
resolved the name without any trouble.

> > After this point, venus is completely hosed -- if you go to CODA-space
> > and try to do anything like list a directory, read/write a file,
> > whatever, it is just hung up and will sit, until you kill it, always
> > with the interrupted system call error. You have to kill venus off and
> > re-invoke with venus -init to make it work again.
>
> I've noticed that when the realm lookup fails, some things are not
> cleaned up correctly and venus crashes. When venus dies it hangs around
> waiting for a debugger, and commands tend to get 'stuck' until venus is
> killed.

I've tried some bogus realms, e.g.
    cd /coda/some.bogus.realm.com

and it takes a while to time out but seems to be pretty graceful about
it; not too much worse than afs. I did a cfs lv on my macppc system on a
bogus realm, and it tried for maybe 30 seconds, then errored out
gracefully, and venus worked fine afterwards, for whatever it's worth.

> > If I ktruss it, I see it is hanging up on the system call,
> >
> > 803 cfs open("/coda/.CONTROL", 0, 0) = 3
> >
> > and cranking venus up with debuglevel -d 100, I see this in the logs:
> >
> > [ W(15) : 0000 : 21:21:03 ] fsobj::Lookup: (diablonet.net/tmp), uid = 0
> > [ W(15) : 0000 : 21:21:03 ] fsobj::Access : (diablonet.net, 8, 0), uid = 0
> > [ W(15) : 0000 : 21:21:03 ] Realm::GetUser local uid '0' for realm 'diablonet.net'
>
> Ah, the realm lookup did complete otherwise we wouldn't see lookups and
> access calls for subdirectories.
>
> > [ W(15) : 0000 : 21:21:03 ] srvent::GetConn: host = blossom.diablnet.net, uid = -1, force = 0
> > [ W(15) : 0000 : 21:21:03 ] PutConn: host = blossom.diablnet.net, uid = -1, cid = 391227550, auth = 0
> > [ W(15) : 0000 : 21:21:03 ] PutServer: blossom.diablnet.net
>
> This is kind of strange, there seems to be an 'o' missing, typo in
> /etc/hosts?

Very good eye. Big thanks for catching a typo in my reverse zone file.
It helps to have a second pair of eyes look at these log files :)

I corrected this of course, but it shouldn't be the factor that trips
venus up, because it works on the macppc client, which uses the same DNS
(in fact, the mac is one of the DNS servers for my domain,
diablonet.net).

> > [ W(15) : 0000 : 21:21:03 ] volent::volent: (7f00000b, diablonet.tmp)
> > [ W(15) : 0000 : 21:21:03 ] repvol::repvol 5043a4c8 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [ W(15) : 0000 : 21:21:03 ] vsgdb::GetVSG 451276c8 451276c0 451276b8 451276b0 451276a8 451276a0 45127698 45127690
>
> This is strange, it seems to think that the volume with only a single
> replica is replicated on 8 servers.
>
> Normally you would see the repvol::repvol and the vsgdb::GetVSG lines
> match up.
>
> [ W(12073) : 0000 : 15:59:41 ] repvol::repvol 5108d908 51000148 50fe9688 00000000 00000000 00000000 00000000 00000000
> [ W(12073) : 0000 : 15:59:41 ] vsgdb::GetVSG c7d10280 6fde0280 c0bf0280 00000000 00000000 00000000 00000000 00000000
>
> Also the host values in the VSG array are very suspicious. I would
> expect the GetVSG line to look like,
> vsgdb::GetVSG 97a00740 00000000 00000000 00000000 00000000 00000000 ...
>
> And finally, we end up getting a bus error probably when we try to
> release this strangely initialized structure.
>
> > [ W(15) : 0000 : 21:21:03 ] mgrpent::CheckNonMutating: acode = -2001
> > hosts = [0x45120ed0 0x45120ec8 0x45120ec0 0x45120eb8
> > 0x45120eb0 0x45120ea8 0x45120ea0 0x45120e98],
> > retcodes = [0 -2002 -2002 -2002 -2002 -2002 -2002 -2002]
> > [ W(15) : 0000 : 21:21:03 ] mgrpent::Put 0x10dc00, uid = 0, mid = 1, auth = 1, refcount = 2, detached = 0
> > [ W(15) : 0000 : 21:21:03 ] mgrpent::PutHostSet: 0x10dc00
> > [ W(15) : 0000 : 21:21:03 ] ***** FATAL SIGNAL (10) *****
>
> I'll look at the logs a bit more, it is almost like your struct in_addr
> contains a pointer instead of a 32-bit integer.
>
> Jan

Thanks for taking the time out to peek at it. Please let me know if you
want me to try something and produce some more data.

Regards,

Sean
scaron_at_umich.edu

Received on 2006-07-27 17:54:37
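
A note on the suspicious GetVSG values quoted above: the eight entries
in the failing log (451276c8, 451276c0, ... 45127690) each differ from
the previous one by exactly 8, which is what addresses of adjacent
objects in memory tend to look like, whereas the healthy line shows
unrelated host values followed by zeros. The C sketch below only
illustrates that check. looks_like_pointers is a hypothetical helper,
not a Coda function, and the stride of 8 is simply read off the log, so
treat it as a rough heuristic rather than a diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical heuristic (not part of Coda): return 1 if every slot is
 * exactly `stride` less than the slot before it, as the values in the
 * failing vsgdb::GetVSG log line are. Real IPv4 host values would not
 * normally form such a regular descending sequence.
 */
static int looks_like_pointers(const uint32_t *vals, size_t n, uint32_t stride)
{
    assert(vals != NULL);
    if (n < 2)
        return 0;               /* too few values to see a pattern */
    for (size_t i = 1; i < n; i++) {
        /* uint32_t arithmetic keeps the difference well defined */
        uint32_t diff = (uint32_t)(vals[i - 1] - vals[i]);
        if (diff != stride)
            return 0;           /* irregular spacing: not the pattern */
    }
    return 1;
}
```

Called on the eight values from the failing GetVSG line with a stride
of 8 it returns 1; on the healthy line (c7d10280 6fde0280 c0bf0280
followed by zeros) it returns 0.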