On Sun, May 16, 2004 at 11:54:37PM -0400, shivers_at_cc.gatech.edu wrote:
> Suppose on my coda client, I say
>     cp *.c /coda/myserver/src
> and copy a lot of files to my coda filesystem. Both client & server are on the
> net, the client via cable modem. If, in the middle of the cp, I suspend
> the process by typing ^z in unix, two bad things happen:
>
> - It takes 3 minutes to suspend. (Which basically renders a large portion
>   of the reason to suspend moot.)

The close(2) syscall is the only syscall that we absolutely can't
interrupt in the kernel, because if we do we don't know whether the
kernel still has an active reference to the file. So if you had hit ^Z
anywhere during the open or write calls, it would have suspended right
away.

If your client was operating write-disconnected, the close upcall would
have simply triggered a log update and returned quickly, while the
actual store to the server would have happened in the background. You
probably told the client to (try to) stay in fully connected mode, and
in that case the semantics are such that close(2) won't return until we
know for sure that the update has been committed on the server. This
clearly took about 3 minutes.

> - The current copy operation aborts with a "syscall interrupted" error and
>   so when I later resume the cp job, I will find that one of my files failed
>   to get copied.

Once we get a response from the server that the operation completed, we
re-enable interrupts. At that time we notice the SIGTSTP signal (the one
^Z sends) and process it, which triggers the syscall interrupted
message. However, the operation must have been completed and committed
on the server, as we got a reply for the upcall.

I don't know for sure why a file would not be copied, but maybe the
signal isn't seen until the next operation starts (the open call for the
next file), and we return EINTR or something similar to the cp
application, which cp then treats as a fatal failure instead of a
retryable one.

> Is this standard behavior for coda? Am I doing something wrong?

I'm guessing it is an application behaviour of cp, normally quite hard
to trigger. You would have to time the suspend signal so that it arrives
during the open call and while the kernel happens to be in an
interruptible sleep, maybe while performing disk IO. It is simply
trivial to trigger this case on Coda because we go into such an
interruptible sleep every time we need to inform the userspace cache
manager, and for longer periods of time than a typical disk IO
operation.

Jan

Received on 2004-05-18 23:43:36
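
The guess above is that an interrupted open comes back as EINTR and the application treats it as fatal rather than retryable. A minimal sketch of the retryable handling follows, in Python for brevity. It is an illustration only, not how cp or the Coda client is implemented. The names `retry_on_eintr`, `FlakyOpen`, and `max_attempts` are hypothetical, and because Python 3.5 and later already retries most interrupted system calls on its own, the interruption is simulated by a stand-in callable instead of a real open on /coda.

```python
import errno


def retry_on_eintr(call, *args, max_attempts=5):
    """Run call(*args), retrying only when it fails with EINTR.

    Any other OSError is re-raised immediately, and EINTR is re-raised
    once max_attempts (a hypothetical limit) has been reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR or attempt == max_attempts:
                raise


class FlakyOpen:
    """Hypothetical stand-in for an open() that gets interrupted a few times."""

    def __init__(self, interruptions):
        self.remaining = interruptions  # EINTR failures still to simulate
        self.calls = 0                  # total number of attempts seen

    def __call__(self, path):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return f"handle:{path}"


# Two simulated interruptions, then success on the third attempt.
flaky = FlakyOpen(interruptions=2)
result = retry_on_eintr(flaky, "a.c")
```

An application that wraps its open calls this way would carry on with the next file after being resumed instead of reporting a failed copy. Whether retrying is appropriate for a given call is a separate question; close(2) in particular is handled differently, as described above.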