Greetings,

I've been trying to come to terms with the problem detailed below for the better part of two days now, and would be extremely grateful for any hint, however small.

The situation: I'm trying to replicate a Coda volume over four identical Intel-based servers running Linux 2.2.12 (patched for Beowulf ethernet channel bonding) and glibc 2.1, based on a vanilla SuSE 6.2 installation. This doesn't work, as shown below.

First try: Coda 5.3.1, set up according to the instructions in the HowTo. I was unable to start a non-SCM server, while the SCM worked fine. This problem was mentioned several times before on this mailing list, for example on February 2nd and 3rd by Bernd Markgraf and on September 10th by Alex Fomin. After toying around with this for a while, I scrapped the installation and moved on to the...

Second try: Coda 5.2.7. Now this was slightly better. I was able to set up a replicated volume, spread across two servers, by following the "exploring replication" instructions in the HowTo. However, as soon as I try to add another, more widely replicated volume, all non-SCM servers crash: the newly added ones on startup, the already running one upon receipt of the new configuration files.

Here's the end of the log and the resulting backtrace from a server crashing on startup:

    lqman: Creating LockQueue Manager.....LockQueue Manager starting .....
    11:02:13 LockQueue Manager just did a rvmlib_set_thread_data()
    done
    11:02:13 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******
    11:02:13 ****** Aborting outstanding transactions, stand by...
    11:02:13 Uncommitted transactions: 0
    11:02:13 Uncommitted transactions: 0
    11:02:13 You may use gdb to attach to 388

    (gdb) bt
    #0  0x40125b71 in __libc_nanosleep () from /lib/libc.so.6
    #1  0x40125aed in __sleep (seconds=1) at ../sysdeps/unix/sysv/linux/sleep.c:78
    #2  0x8115973 in coda_assert (pred=0x8116047 "0", file=0x8116040 "srv.cc", line=314) at coda_assert.c:46
    #3  0x804a8dd in zombie (sig=11) at srv.cc:314
    #4  0x400af9b8 in __restore () at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
    #5  0x400b1141 in _quicksort (pbase=0xbffff220, total_elems=4294967294, size=4, cmp=0x80d1500 <cmpHost(long *, long *)>) at qsort.c:121
    #6  0x400b17bb in qsort (b=0xbffff220, n=4294967294, s=4, cmp=0x80d1500 <cmpHost(long *, long *)>) at msort.c:114
    #7  0x80d157a in vsgent::vsgent (this=0x8207d28, vsgaddr=3758096644, hosts=0xbffff220, nh=-2) at vsg.cc:67
    #8  0x80d1bc2 in InitVSGDB () at vsg.cc:213
    #9  0x80b0f0c in ResCommInit () at rescomm.cc:98
    #10 0x804afe4 in main (argc=12, argv=0xbffff814) at srv.cc:510

I didn't save the backtrace from the already running server, but it, too, was crashing within qsort while building the VSGDB. This occurs regardless of whether I set up a two-server replicated volume first, or whether I go for the full four-server setup immediately.

My servers file looks like this:

    dogma-1 1
    dogma-2 2
    dogma-3 3
    dogma-4 4

...and my VSGDB looks like this:

    E0000100 dogma-1
    E0000101 dogma-2
    E0000102 dogma-3
    E0000103 dogma-4
    E0000104 dogma-1 dogma-2 dogma-3 dogma-4

OK, this is about as much information as I can supply. Does that ring a bell with anyone?

Thanks a whole lot in advance,
Daniel.

-- 
daniel schmitt - lead system architect - kidata ag, koenigswinter, germany

Received on 1999-10-13 05:49:13
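
A note on reading the numbers in the backtrace: the small Python sketch below only decodes values already shown in frames #6 and #7, and does not explain why the host count is negative. It assumes a 32-bit size_t (as on i386) and uses two hypothetical helper names, as_unsigned_32 and parse_vsgdb, which are not part of Coda. The VSGDB parser assumes nothing beyond the layout shown in the post: a hex address followed by host names.

```python
import ctypes


def as_unsigned_32(value):
    """Reinterpret a signed integer as a 32-bit unsigned one (size_t on i386)."""
    return ctypes.c_uint32(value).value


def parse_vsgdb(text):
    """Map each VSG address (hex in the file) to its list of host names."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            entries[int(fields[0], 16)] = fields[1:]
    return entries


# VSGDB contents exactly as quoted in the post.
VSGDB = """\
E0000100 dogma-1
E0000101 dogma-2
E0000102 dogma-3
E0000103 dogma-4
E0000104 dogma-1 dogma-2 dogma-3 dogma-4
"""

if __name__ == "__main__":
    # nh=-2 in frame #7 is the same bit pattern as n=4294967294 in frame #6.
    print(as_unsigned_32(-2))            # 4294967294

    # vsgaddr=3758096644 in frame #7, written in hex.
    print(format(3758096644, "X"))       # E0000104

    # The VSGDB line that this address corresponds to.
    print(parse_vsgdb(VSGDB)[3758096644])
```

Read this way, the crash happens while the server builds the entry for E0000104, the four-host group, and qsort is asked to sort 4294967294 elements because the count -2 is converted to an unsigned type on the way in.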