On Fri, May 20, 2005 at 10:43:40AM -0600, Patrick Walsh wrote:
> [ H(07) : 0207 : 04:42:20 ] Hoard Walk interrupted -- object missing! <606e1fc8.7f000001.964.56d1>
> [ H(07) : 0207 : 04:42:20 ] Number of interrupt failures = 131

This could mean that your client cache is too small, or that the file was
deleted on the server between the status and data validation phases. As
there were 131 interrupts, it would indicate that this is happening quite
often.

> [ W(823) : 0000 : 11:13:17 ] Cachefile::SetLength 552280
> [ W(823) : 0000 : 11:13:17 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

Cachefile::SetLength is called whenever a file is fetched, and possibly
when it is closed for writing. Because of the StatusEq message I guess
this is after a fetch. The problem here is that we got the attributes,
then fetched the data, and by the time the fetch completed the version
vector on the server had already changed.

> [ W(823) : 0000 : 11:13:19 ] Cachefile::SetLength 552933
> [ W(823) : 0000 : 11:13:20 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

So we fetch the new version (which is 653 bytes larger), and again it
already changed on the server before we were done. I guess some process
is opening/writing/closing the file a lot, and every close results in a
store.

> changing to this:
>
> [ W(1783) : 0000 : 17:17:18 ] Cachefile::SetLength 3243845
> [ W(1783) : 0000 : 17:17:19 ] *** Long Running (Multi)Store: code = -2001, elapsed = 1252.4 ***
> [ W(1783) : 0000 : 17:17:19 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ

Interesting, this makes it look like it isn't fetches, but stores. I
don't see how or why the VVs could differ without causing a conflict.

> and then eventually to this:
>
> [ W(1783) : 0000 : 21:54:50 ] Cachefile::SetLength 7015276
> [ W(1779) : 0000 : 21:54:53 ] WAITING(606e1fc8.7f000002.e.28): level = RD, readers = 0, writers = 1
> [ W(1783) : 0000 : 21:54:53 ] *** Long Running (Multi)Store: code = -2001, elapsed = 3633.3 ***
> [ W(1783) : 0000 : 21:54:53 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28), VVs differ
> [ W(1779) : 0000 : 21:54:53 ] WAIT OVER, elapsed = 361.2

Argh, the file is getting really big here, so each store takes longer and
longer. By now other threads are getting blocked waiting for the store to
complete (the store is taking 3.6 seconds).

> The very end of venus.log looks like this:
>
> [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
> [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> [ W(823) : 0000 : 21:55:00 ] ***** FATAL SIGNAL (11) *****

Thread 1783 is trying to store the file, which probably takes about 4
seconds by now. In the meantime other threads want connections to the
server, but there are none available, so threads 1804 and 821 are waiting
for an available RPC2 connection. Thread 823 hits a NULL ptr dereference,
no idea why.

> 12:55:00 root acquiring Coda tokens!
> 12:55:01 root acquiring Coda tokens!
> 12:55:01 Coda token for user 0 has been discarded
> 15:55:00 root acquiring Coda tokens!
> 15:55:00 root acquiring Coda tokens!
> 18:55:00 root acquiring Coda tokens!
> 18:55:00 root acquiring Coda tokens!
> 21:55:00 root acquiring Coda tokens!
> 21:55:00 root acquiring Coda tokens!
> 21:55:00 Fatal Signal (11); pid 1708 becoming a zombie...

Ah, when tokens are refreshed all old connections are first destroyed.
I've seen crashes in this area; maybe not everyone correctly increments
the reference count on the connection.
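To make that suspicion concrete, here is a minimal sketch of the
reference-counting pattern I mean. The names (Get/Put,
DestroyOnTokenRefresh, LongRunningStore) are made up for the example;
this is not the actual venus code, just an illustration of how a
forgotten Get() turns a token refresh into a use-after-free.

/* conn_refcount.cc -- illustrative sketch only, NOT the actual venus code. */

#include <cstdio>

struct Connection {
    int  refcount = 1;   /* the connection list itself holds one reference */
    bool dying    = false;
};

void Get(Connection *c) { c->refcount++; }

void Put(Connection *c) {
    if (--c->refcount == 0) {
        std::printf("connection destroyed\n");
        delete c;
    }
}

/* Token refresh: drop the list's reference.  If a worker forgot to call
 * Get() before starting its RPC, this delete happens while the RPC is
 * still using the connection -- the NULL dereference / use-after-free
 * pattern seen in the log above. */
void DestroyOnTokenRefresh(Connection *c) {
    c->dying = true;
    Put(c);
}

void LongRunningStore(Connection *c) {
    Get(c);                    /* the step that must not be forgotten */
    /* ...(Multi)Store RPC runs for seconds; tokens may be refreshed here... */
    DestroyOnTokenRefresh(c);  /* simulate the refresh racing with the store */
    std::printf("store finished, connection still valid (dying=%d)\n",
                (int)c->dying);
    Put(c);                    /* last reference gone, now it is really freed */
}

int main() {
    LongRunningStore(new Connection);
    return 0;
}

As long as every code path that hands the connection to an RPC pairs a
Get() with a Put(), the token refresh only marks the connection for
destruction and the object survives until the in-flight store returns.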
So the token refresh kills the existing connection, and when the RPC
returns the thread gets a NULL ptr dereference as soon as it tries to
use the connection. Or, if we don't touch the connection anymore
ourselves, the destroyed connection gets linked back into the list of
available connections and some other thread hits the segfault.

> Finally, to my questions:
> 1) is there something I can do to prevent future signal 11's?

Fix all remaining bugs and race conditions in Coda.

> 2) If such a signal (whatever it means) happens, can coda just restart
> itself instead of going into a zombie state and causing httpd and
> proftpd to hang?

Not really. Even if we quit at this point, or exec a new venus process,
it is likely that we hit the bug while servicing a request from one of
the httpd or ftpd processes. So there will be an active reference to
some file in /coda where the process is waiting for a response, and as
a result we cannot unmount the tree. It is possible to reattach venus to
an existing mount, but in 99% of the cases that is a very bad idea:
venus doesn't know whether a process still holds an indirect reference
to a container file, and we might see a close upcall for a file that we
didn't even know was open.

Jan
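P.S. For anyone wondering what the "VVs differ" message is actually
checking: StatusEq compares the version vector we have cached with the
one the server just returned. Very roughly (this is a simplified sketch
with made-up names and sizes, not the real ViceVersionVector code), the
comparison looks like this:

/* vv_compare.cc -- simplified sketch, names and layout are illustrative. */

#include <array>
#include <cstdio>

constexpr int NREPLICAS = 8;
using VersionVector = std::array<long, NREPLICAS>;

enum class VVRel { Equal, Dominates, Submits, Inconsistent };

/* Compare the locally cached vector against the one the server returned. */
VVRel Compare(const VersionVector &ours, const VersionVector &theirs)
{
    bool greater = false, less = false;
    for (int i = 0; i < NREPLICAS; i++) {
        if (ours[i] > theirs[i]) greater = true;
        if (ours[i] < theirs[i]) less    = true;
    }
    if (greater && less) return VVRel::Inconsistent;  /* a real conflict */
    if (greater)         return VVRel::Dominates;
    if (less)            return VVRel::Submits;       /* server is newer */
    return VVRel::Equal;
}

int main()
{
    VersionVector cached = {3, 3, 3};  /* what we validated before the fetch */
    VersionVector fresh  = {4, 4, 4};  /* another store completed meanwhile  */

    if (Compare(cached, fresh) != VVRel::Equal)
        std::printf("VVs differ -- refetch attributes and try again\n");
    return 0;
}

In the fetch traces above the server's vector keeps coming back strictly
newer than the cached one, so the vectors differ without there being a
real conflict; venus just has to revalidate and fetch again.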