On Wed, Feb 12, 2003 at 12:00:13PM -0500, Greg Troxel wrote:
> 11:49:10 worker::Return: message write error 3 (op = 20, seq = 92135),
> wrote -1 of 12 bytes
> 11:49:10 DispatchWorker: signal received (seq = 92135)

Error 3 (ESRCH) is returned when the upcall was cancelled because the userspace process received an interrupt. In this case we tried to reply after the signal was handled, but before venus had seen it.

> 11:49:24 DispatchWorker: signal received (seq = 92151)
> 11:49:25 worker::Return: message write error 3 (op = 20, seq = 92151),
> wrote -1 of 12 bytes

And here we saw the signal about a second before we tried to send the reply back. Venus tries to interrupt the worker thread that is handling the upcall, but generally won't be able to actually stop it.

> I do not have this problem on NetBSD/{i386,sparc} 1.5.4ish or
> FreeBSD/i386 4.7ish.
>
> I wonder if emacs is doing some sort of asynchronous IO that isn't
> handled correctly in the NetBSD kernel.

We block some signals in the kernel module while we're waiting for a reply from venus; perhaps the signal handling has changed and they are getting through anyway.

In the Linux kernel module we never allow any interrupts for CODA_CLOSE. This is very important: if that upcall is aborted before venus picks it off the queue, venus is left with a non-zero 'pending writers' count and will never propagate the changes back to the server. The file will also be considered 'dirty' during startup and moved aside to /usr/coda/spool. Besides that, we ignore all signals other than SIGKILL and SIGINT for at least 30 seconds; I believe it was xemacs that was sending various signals to itself during system calls. I thought Bob Baron implemented something similar in the *BSD kernel modules, except that he didn't even let SIGINT and SIGKILL through, and automatically aborted the upcall after 60 seconds or so.

> I also get emacs into a state where it is nonresponsive and stuck in
> 'R' according to ps.

Does NetBSD have the 'D' state (i.e. blocked in a system call)? Perhaps emacs is repeatedly calling the same system call, interrupting it and, because the call 'failed', trying again.

Interestingly enough, the upcall that is aborted is CODA_FSYNC. When the file is open for writing, venus calls sync(2) and then flushes any pending RVM operations to the log (which calls fsync). That seems rather useless, because it doesn't 'commit' venus to anything and doesn't really guarantee that venus will see the updates after a crash. With all those syncs I can believe venus blocks for a while, especially if that makes an impatient emacs loop on fsync(fd).

The simplest fix is probably to avoid the useless syncs in venus (patch attached).

What I really want at some point is for fsync to trigger a store operation, i.e. to make it a synchronization point at which any updates made up to that moment are propagated to the server. That would allow an application to commit updates without closing and reopening the file, which right now won't do much in write-disconnected mode, because pending stores are optimized away when the file is opened for writing.

Jan

--- coda/coda-src/venus/worker.cc.orig 2003-01-31 22:22:53.000000000 -0500
+++ coda/coda-src/venus/worker.cc 2003-02-12 13:50:56.000000000 -0500
@@ -1179,7 +1179,7 @@
 {
     LOG(100, ("CODA_FSYNC: u.u_pid = %d u.u_pgid = %d\n", u.u_pid, u.u_pgid));
     MAKE_CNODE(vtarget, in->coda_fsync.Fid, 0);
-    fsync(&vtarget);
+    //fsync(&vtarget);
     break;
 }

Received on 2003-02-12 13:59:46
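
The retry behaviour suspected in the message above (emacs calls fsync(fd), the call is interrupted, and it simply tries again) can be sketched as follows. This is a minimal illustration, not code from emacs or Coda: `fsync_retry` and its `max_tries` bound are hypothetical, and it assumes an aborted CODA_FSYNC upcall reaches the application as `EINTR`. The only real interfaces used are POSIX `fsync()` and `errno`.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Hypothetical helper (not from emacs or Coda): call fsync() and retry
// whenever it fails with EINTR. Assumption: an aborted CODA_FSYNC upcall is
// reported to the caller as EINTR. Without the max_tries bound, a loop like
// this never returns if every attempt is interrupted, and the process keeps
// running without making progress.
int fsync_retry(int fd, int max_tries, int *attempts)
{
    assert(max_tries > 0);
    int rc = -1;
    *attempts = 0;
    while (*attempts < max_tries) {
        ++*attempts;
        rc = fsync(fd);
        if (rc == 0 || errno != EINTR)
            break;  // success, or an error that retrying will not fix
    }
    return rc;      // -1 with errno == EINTR if every attempt was interrupted
}
```

The unbounded form, `while (fsync(fd) == -1 && errno == EINTR) ;`, is the common idiom for restarting interrupted system calls and is only safe when the interruption is transient. If venus is slow enough that each CODA_FSYNC upcall is aborted before it answers, every retry fails the same way, which would be consistent with a process that ps reports as runnable while it never gets anywhere.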