I have now tried experimenting with the servers. Restarting codasrv on
both servers didn't affect the problem. Stopping the server on dir225
and then accessing the problem directory still triggers the familiar
issue. (I should mention that this directory once worked fine apart
from one problem file in it, until I tried to delete that file.)

Stopping the server on dir224 and then running ls in the directory
gives a different error, and one that doesn't show up in the logs:

# ls /coda/director/snapin/pool_scm/r
ls: /coda/director/snapin/pool_scm/r/readline-4.2-2.i386.rpm: Connection timed out

And I've hit upon what I think must be the problem:

# cfs whereis /coda/director/snapin/pool_scm
  dir224  dir225  dir225

A quick look at VRList on the server shows:

/snapin 7f000003 3 1000004 2000004 200000a 0 0 0 0 0 0

So it appears we are triply replicating a volume to two servers. I have
no idea how this happened: we've automated the setup of Coda and that
code hasn't changed for some time. So I'll look into this and try to
figure out what's going on.

Sorry to waste your time with a bad setup. I just can't figure out how
it got set up wrong.

..Patrick

On Thu, 2005-07-07 at 14:22 -0400, Jan Harkes wrote:
> On Thu, Jul 07, 2005 at 09:41:03AM -0600, Patrick Walsh wrote:
> > # ls /root/pool_scm/r
> > readline2.2.1-2.2.1-4.i386.rpm
> > rpm-4.0.4-7x.20.i386.rpm
> > readline-4.2-2.i386.rpm
> > rsh-0.17-18.AS21.2.i386.rpm
> > restore-default-system-1.0-20031001.i386.rpm
> > rsh-0.17-18.AS21.4.i686.rpm
> > rootfiles-7.2-1.noarch.rpm
> > # du -s -h /root/pool_scm/r
> > 2.6M /root/pool_scm/r
> > # ls r
> > ls: r/readline-4.2-2.i386.rpm: No such device
>
> Ok, 7 directory entries wouldn't be enough to fill a directory.
>
> > At this point, venus has crashed.
> > The console.log file has the erroneous seeming errors that I pasted
> > before, but to show again:
> >
> > ***LWP (0x810ec50): Select returns error: 4
> >
> > 09:28:28 worker::main Got a bogus opcode 36
> > 09:29:30 readline-4.2-2.i386.rpm (606e1fc8.7f000003.1018.4de)
> > inconsistent!
> > 09:29:30 fatal error -- fsobj::dir_Create: (dir225,
> > 606e1fc8.7f000003.fffffffc.80002) Create failed!
>
> This is very strange. I looked at the source: we are trying to add a
> directory entry to some unknown directory (the name or fid of the
> parent in which we are trying to create is not logged). We do know that
> the new entry has the name "dir225" and that it is pointing at a fake
> object in the same volume as the inconsistent rpm file.
>
> However, server-server conflicts do not in any way try to create names
> or anything. The lookup or getattr operation returns EINCONS and this
> is mapped to faked stat data right before we send the reply back to the
> kernel. As far as I know there isn't even an actual filesystem object
> associated with the inconsistent object, since the servers disagree
> about its contents. Only reintegration-related expansion changes
> directory contents, since in that case we do have a locally cached copy
> of the object and it has to be modified before we can show the global
> version.
>
> I also don't see how anything in that volume would even have a name
> like 'dir225'; there are the [a-z] directories, and a bunch of *.rpm
> files.
>
> But somehow these two must be related, since they seem to happen so
> reliably right after each other.
>
> > I should have mentioned that I already tried this. And as you can see
> > from the above terminal transcript, it had little effect.
> >
> > Any other thoughts?
>
> No idea, it just doesn't make sense. I don't see how a server-server
> conflict could possibly get into the expansion code that is used when a
> reintegration fails, if you are simply doing an 'ls'.
> I also don't understand why it is trying to create a directory named
> 'dir225' when all the names in the volume are either a single character
> 'a-z' or '*.rpm'.
>
> Maybe start venus with loglevel 100 (venus -init -d 100) and repeat the
> same thing. At that point the log might show how we're getting to this
> point and if those two events (the inconsistency and the crash) are
> really related or not.
>
> Jan
>
-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-07-07 15:48:05