On Fri, Apr 18, 2003 at 07:49:46PM -0400, Steve Simitzis wrote:
> it started at 6am, when venus died randomly:
>
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
> Apr 18 06:40:10 sg2 last message repeated 25 times
> Apr 18 06:40:10 sg2 kernel: coda_upcall: Venus dead on (op,un) (7.1564091) flags 8
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0

These are all messages from the kernel module when upcalls are no longer accepted by venus (i.e. venus had died around this time).

> (sorry, i don't have the venus log from that time, as this is a
> production machine, causing me to act hastily to bring it back up.)

I know the feeling, I hardly ever use our production servers or webserver for debugging purposes.

> out of desperation, i tried running venus-setup again to get
> everything into a known state. but it seems that everytime the volumes
> are disconnected, i get something like this in the logs:

An equivalent way to totally reset a client's state is to recursively remove all the container files in /usr/coda/venus.cache (or /var/lib/coda/cache), and then create an empty 'INIT' file which tells venus to reinitialize RVM on the next startup. That is pretty much identical to installing a client from scratch, and because we remove the container files it does a bit more than just rerunning venus-setup.

    rm -rf /usr/coda/venus.cache/*
    touch /usr/coda/venus.cache/INIT
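If you want the whole sequence in one place, a rough sketch would look something like the lines below. The init script name and cache path are assumptions and differ between installations, so adjust them to however venus is started and configured on your system:

    /etc/init.d/venus stop            # venus should not be running while the cache is wiped
    rm -rf /usr/coda/venus.cache/*    # throw away all container files
    touch /usr/coda/venus.cache/INIT  # tell venus to reinitialize RVM on the next start
    /etc/init.d/venus start           # comes back up like a freshly installed client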
> [ W(23) : 0000 : 15:46:11 ] *** Long Running (Multi)Fetch: code = -2001, elapsed = 30761.0 ***
> [ W(23) : 0000 : 15:46:11 ] Cachefile::SetValidData 52224

Ouch, it took 31 seconds to fetch a 52KB file.

> [ W(21) : 0000 : 15:46:11 ] WAIT OVER, elapsed = 14896.6

And another thread was blocked for about 15 seconds because of this.

> [ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message

This is pretty serious: all worker threads are blocked and we have to queue a couple of requests coming from the kernel.

> [ W(22) : 0000 : 15:46:27 ] *** Long Running (Multi)Fetch: code = -2001, elapsed = 30232.0 ***
> [ W(22) : 0000 : 15:46:27 ] Cachefile::SetValidData 2270

The worker threads were probably blocked because of this fetch; we took 30 seconds to fetch a lousy 2KB.

> [ W(27) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15025.2
> [ W(25) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.8
> [ W(24) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.6

And here 3 of the blocked threads start running again.

> what i've assumed is happening is that venus is trying to rebuild a
> rather large cache of files, but is tripping over itself in the
> process.

Possibly, but because the waits are 30 seconds on a fetch, and the typical rpc2 timeout is about 15 seconds, this looks like some strange double timeout before the operation succeeds.

I'm assuming that you have multiple servers and one of the servers has trouble with the sftp transfer connecting back to the client. I'm not entirely sure how we get a 30 second stall, but once the sftp connection setup fails (15 seconds) an error is returned and the client connects to another server, which actually does the data transfer.

Gettimeofday is used a lot in rpc2; it is used to timestamp outgoing packets and to check timeouts for retransmissions.

Hmm, perhaps you could grab a tcpdump of the traffic. I would prefer a raw dump because I can then decode the rpc2 and sftp headers to see what is going over the wire.

    tcpdump -s 512 -w tcpdump.out port codasrv or port codasrv-se

Don't worry about 'Hoarding'; it is the internal state name for fully-connected mode, as we 'hoard' data before a disconnection.

Jan
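As a quick sanity check before mailing off the capture mentioned above, the file can be read back with tcpdump's standard -r option; -nn keeps the ports numeric and -c limits the output, and the filename simply matches the capture command:

    tcpdump -nn -c 20 -r tcpdump.out   # show the first 20 captured packets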