On Wed, Jul 26, 2006 at 09:39:52PM -0400, Sean Caron wrote:
> Hi all,
>
> Running coda 6.0.14 on NetBSD/sparc64 back-end servers and a mix of
> NetBSD/sparc, NetBSD/sparc64, and NetBSD/macppc clients (well
> described in the list archives). I have recently taken note of the
> fact that the 'cfs lv' command does not work on any of the SPARC
> systems -- perhaps an endianness bug?

More likely a 64-bit issue. How did you build the servers on the 64-bit
machines? Were they compiled as 32-bit applications?

> sonnet% cfs lv /coda/diablonet.net/tmp
> (just sits forever)
> ^C/coda/diablonet.net/tmp: Interrupted system call
> sonnet%
>
> (this happens both against my own servers, and against
> testserver.coda.cs.cmu.edu so I know its not anything wrong with my
> server configuration)

So the 'cfs lv' hangs on a 32-bit sparc client, even when talking to
testserver? If this is the first time you access the realm, it could be
a DNS resolver problem.

> After this point, venus is completely hosed -- if you go to CODA-space
> and try to do anything like list a directory, read/write a file,
> whatever, it is just hung up and will sit, until you kill it, always
> with the interrupted system call error. You have to kill venus off and
> re-invoke with venus -init to make it work again.

I've noticed that when the realm lookup fails, some things are not
cleaned up correctly and venus crashes. When venus dies it hangs around
waiting for a debugger, and commands tend to get 'stuck' until venus is
killed.

> If I ktruss it, I see it is hanging up on the system call,
>
> 803 cfs open("/coda/.CONTROL", 0, 0) = 3
>
> and cranking venus up with debuglevel -d 100, I see this in the logs:
>
> [ W(15) : 0000 : 21:21:03 ] fsobj::Lookup: (diablonet.net/tmp), uid = 0
> [ W(15) : 0000 : 21:21:03 ] fsobj::Access : (diablonet.net, 8, 0), uid = 0
> [ W(15) : 0000 : 21:21:03 ] Realm::GetUser local uid '0' for realm 'diablonet.net'

Ah, the realm lookup did complete, otherwise we wouldn't see lookups and
access calls for subdirectories.

> [ W(15) : 0000 : 21:21:03 ] srvent::GetConn: host = blossom.diablnet.net, uid = -1, force = 0
> [ W(15) : 0000 : 21:21:03 ] PutConn: host = blossom.diablnet.net, uid = -1, cid = 391227550, auth = 0
> [ W(15) : 0000 : 21:21:03 ] PutServer: blossom.diablnet.net

This is kind of strange, there seems to be an 'o' missing. A typo in
/etc/hosts?

> [ W(15) : 0000 : 21:21:03 ] volent::volent: (7f00000b, diablonet.tmp)
> [ W(15) : 0000 : 21:21:03 ] repvol::repvol 5043a4c8 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [ W(15) : 0000 : 21:21:03 ] vsgdb::GetVSG 451276c8 451276c0 451276b8 451276b0 451276a8 451276a0 45127698 45127690

This is strange, it seems to think that the volume with only a single
replica is replicated on 8 servers. Normally you would see the
repvol::repvol and the vsgdb::GetVSG lines match up.

    [ W(12073) : 0000 : 15:59:41 ] repvol::repvol 5108d908 51000148 50fe9688 00000000 00000000 00000000 00000000 00000000
    [ W(12073) : 0000 : 15:59:41 ] vsgdb::GetVSG c7d10280 6fde0280 c0bf0280 00000000 00000000 00000000 00000000 00000000

Also the host values in the VSG array are very suspicious. I would expect
the GetVSG line to look like,

    vsgdb::GetVSG 97a00740 00000000 00000000 00000000 00000000 00000000 ...

And finally, we end up getting a bus error, probably when we try to
release this strangely initialized structure.

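One way to make the suspicion about the host values concrete is to look at the spacing between them. The sketch below is not Coda code, and `constant_stride` is a hypothetical helper name. It reports whether the entries of an array step down by one fixed amount. The eight values in the failing GetVSG line (451276c8 down to 45127690) each sit exactly 8 below the previous one, which would be consistent with addresses of consecutive 8-byte slots rather than with server addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns the common difference values[i] - values[i + 1] when that
 * difference is the same for every adjacent pair, and 0 when the
 * entries are not evenly spaced (or when there are fewer than two).
 */
int64_t constant_stride(const uint32_t *values, size_t n)
{
    assert(values != NULL);
    if (n < 2)
        return 0;

    /* Widen to 64 bits so the subtraction cannot wrap around. */
    int64_t stride = (int64_t)values[0] - (int64_t)values[1];

    for (size_t i = 1; i + 1 < n; i++) {
        int64_t step = (int64_t)values[i] - (int64_t)values[i + 1];
        if (step != stride)
            return 0;
    }
    return stride;
}
```

Given the eight values from the failing log, the function returns 8. Given the three addresses plus zero padding from the working log, it returns 0. The eight `hosts` values in the mgrpent::CheckNonMutating line show the same spacing of 8.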
> [ W(15) : 0000 : 21:21:03 ] mgrpent::CheckNonMutating: acode = -2001
>         hosts = [0x45120ed0 0x45120ec8 0x45120ec0 0x45120eb8
>                  0x45120eb0 0x45120ea8 0x45120ea0 0x45120e98],
>         retcodes = [0 -2002 -2002 -2002 -2002 -2002 -2002 -2002]
> [ W(15) : 0000 : 21:21:03 ] mgrpent::Put 0x10dc00, uid = 0, mid = 1, auth = 1, refcount = 2, detached = 0
> [ W(15) : 0000 : 21:21:03 ] mgrpent::PutHostSet: 0x10dc00
> [ W(15) : 0000 : 21:21:03 ] ***** FATAL SIGNAL (10) *****

I'll look at the logs a bit more, it is almost like your struct in_addr
contains a pointer instead of a 32-bit integer.

Jan

Received on 2006-07-27 16:16:19
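
A quick way to test the struct in_addr theory on an affected machine is to look at the sizes the compiler actually uses. The following is a minimal sketch and not part of Coda; `in_addr_is_32bit` and `print_sizes` are hypothetical helper names. POSIX defines `in_addr_t` as equivalent to `uint32_t`, so any other result on the sparc64 build would point at a header or type mix-up.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 when both struct in_addr and its s_addr member are exactly
 * 32 bits wide, 0 otherwise.
 */
int in_addr_is_32bit(void)
{
    return sizeof(struct in_addr) == sizeof(uint32_t)
        && sizeof(((struct in_addr *)0)->s_addr) == sizeof(uint32_t);
}

/*
 * Hypothetical helper: prints the sizes relevant to the theory.
 * On an LP64 platform, pointers and long are 8 bytes while
 * struct in_addr should still be 4.
 */
void print_sizes(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "sizeof(struct in_addr) = %zu\n", sizeof(struct in_addr));
    fprintf(out, "sizeof(void *)         = %zu\n", sizeof(void *));
    fprintf(out, "sizeof(long)           = %zu\n", sizeof(long));
}
```

Calling `print_sizes(stdout)` from a small test program built with the same compiler and flags as venus would show whether the two sizes differ on that client.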