On Thu, Feb 10, 2005 at 06:31:37AM +0800, Alan Tam wrote:
> Jan Harkes wrote:
> >So I have a pretty good idea where it crashed, but no idea how it
> >managed to crash there.
>
> Maybe it is caused by my manual editing of these files to [1] correct
> the wrongly detected machine names. Probably I should remove everything
> else and install again. I've got a lot of such experience anyway.

Changing stuff like that shouldn't crash a client; at worst it might leave the client unable to reach a server. I'll try to mess around a bit with stuff to see if I can reproduce it.

> But still sometimes I do have no way to discover where the problems are.
> Process can be frozen, my not knowing what it is waiting for [2]. And in

That looks like a normal 60 second rpc2 timeout. It used to be 15 seconds, but too many people had problems with unexpected disconnections over weak links (or links with asymmetric bandwidth like ADSL or cable modems), so we bumped up the delays to more conservative values. If you had 2 or more root servers for your realm, the delay would probably have been a multiple of this.

This is a clear indication that we really should teach RPC2 to look at ICMP error responses (NETUNREACH/HOSTUNREACH) so that we can abort useless retries to unreachable servers more quickly.

> most cases, the messages logged are simply not enough to track down what
> is configured wrong.

Are you running 'codacon'? I tend to run it permanently in a separate xterm on my desktop, and it often gives some more feedback about the transient stuff that is going on.

> sltam_at_beta:/coda$ date; ls -l; date
> Thu Feb 10 06:23:35 HKT 2005
> total 9
> dr-xr-xr-x 2 root guest 2048 Dec 25 02:57 ./
> drwxr-xr-x 25 root root 4096 Feb 5 19:12 ../
> lrw-r--r-- 1 root guest 9 Feb 10 03:29 delta.mydomain.com ->
> #@delta.mydomain.com
> Thu Feb 10 06:24:26 HKT 2005

Ok, so we know that 'delta.mydomain.com' is already known by venus, either because you accessed it earlier or because you have obtained tokens for that realm.
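To put a number on the stall in the quoted session, here is a minimal Python sketch that computes the time between the two `date` lines. The helper names `parse_date_output` and `elapsed_seconds` are made up for this illustration and are not part of Coda; the sketch also assumes both lines come from the same machine and timezone, so the timezone abbreviation can simply be dropped.

```python
from datetime import datetime


def parse_date_output(line: str) -> datetime:
    """Parse a line such as 'Thu Feb 10 06:23:35 HKT 2005' as printed by date."""
    fields = line.split()
    # Drop the timezone abbreviation (fifth field); strptime's %Z does not
    # reliably recognise names such as HKT. Assumes both timestamps share
    # the same timezone, so the difference is unaffected.
    without_tz = fields[:4] + fields[5:]
    return datetime.strptime(" ".join(without_tz), "%a %b %d %H:%M:%S %Y")


def elapsed_seconds(before: str, after: str) -> int:
    """Whole seconds between two date(1) output lines."""
    delta = parse_date_output(after) - parse_date_output(before)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # The two timestamps from the quoted 'date; ls -l; date' session.
    print(elapsed_seconds("Thu Feb 10 06:23:35 HKT 2005",
                          "Thu Feb 10 06:24:26 HKT 2005"))
```

For the session above this prints 51, i.e. the `ls -l` blocked for a little under a minute, which is in the range of a single timeout rather than a multiple of it.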
Now the question is where we waited for that 60 second timeout. Name resolution must have succeeded, because I don't get any delay if I do 'ls /coda/foo.bar'. So there are two RPC operations between here and successfully mounting the volume. One is the volume location query, and the second is where we try to get the attributes of the root directory of the volume. If the server is not running we probably timed out on the first, and if the IP address in the location information is wrong we probably timed out on the second.

What you could try is the 'getvolinfo' command. I'm not sure whether it is installed with the client or the server, but it should be in /usr/sbin. Do something like 'getvolinfo delta.mydomain.com ""', and that should return the volume location information of the root volume. The result will contain information like,

    Replica0 id c7000085, Server0 128.2.191.192

Check if that IP is actually valid and reachable. My guess is that either that address is 127.0.0.1, or that your server is bound to a different address.

Jan

Received on 2005-02-10 22:04:59
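As a rough illustration of the address check suggested above, the following Python sketch pulls the server addresses out of text shaped like the quoted sample ("Replica0 id c7000085, Server0 128.2.191.192") and flags loopback addresses. The full output format of 'getvolinfo' is not shown in this message, so the regular expression is an assumption based only on that one sample, and the function names are hypothetical.

```python
import ipaddress
import re

# Assumption: server entries look like "Server0 128.2.191.192", as in the
# sample quoted in the message. The real output may contain more fields.
SERVER_RE = re.compile(r"Server(\d+)\s+(\d{1,3}(?:\.\d{1,3}){3})")


def server_addresses(text: str) -> list:
    """Return the IPv4 addresses listed after 'ServerN' in the given text."""
    return [ipaddress.ip_address(m.group(2)) for m in SERVER_RE.finditer(text)]


def looks_misconfigured(addr) -> bool:
    """True for addresses a remote client can never reach (loopback, 0.0.0.0)."""
    return addr.is_loopback or addr.is_unspecified


if __name__ == "__main__":
    sample = "Replica0 id c7000085, Server0 128.2.191.192"
    for addr in server_addresses(sample):
        status = "suspicious" if looks_misconfigured(addr) else "plausible"
        print(addr, status)
```

This only catches the 127.0.0.1 case. Whether a plausible-looking address is actually reachable, or whether the server is bound to a different address, still has to be checked separately from the client machine.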