On Sat, May 15, 2004 at 09:59:08PM -0700, Steve Simitzis wrote:
> On 05/11/04, Jan Harkes <jaharkes_at_cs.cmu.edu> wrote:
> > This is strange, the reference counts (refs) are tracking the internal
> > references [readers writers refcnt] and the open counts (openers) are
> > trying to follow the kernel references [reading writing executing]. It
> > seems strange that we have a readers refcount, but no filehandles open
> > for reading. I'll have to check the source to see where these counters
> > are manipulated.
>
> strange indeed.

It actually looks like there was no problem with those reference counts.

> i think i may have finally tracked down what may be causing the problem.
>
> i noticed that i hadn't seen this type of crash until around the same time
> that we launched a new feature that relied on imagemagick. (imagemagick
> is a commonly used set of command line tools that converts and manipulates
> images to different formats and sizes.)
>
> the crashes seem to take place after convert (part of imagemagick) does
> its work, and venus completes a reintegration of the files. unfortunately,
> it's not consistent. perhaps convert is writing the file out in an
> unpleasant way, and venus is reintegrating before the file is actually
> ready. (?)

I've straced convert and there are only a couple of noticeable things:

- The output file is opened with the O_LARGEFILE flag. This really
  shouldn't be a problem; as far as I can remember, the kernel even strips
  off this flag for us because we don't have the large file compatibility
  flag set in our superblock descriptor.

- There are 2 stat calls (which translate to getattr operations in venus).
  One is right after the file is created/opened. The other is before the
  final (partial?) write is done. Imagemagick seems to write the file in
  8KB blocks until there is less than 8KB left to write, calls a stat and
  then writes the last bits to disk.

That last one is unusual, although I don't immediately see how it could
affect anything.
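The write pattern described above could be reproduced roughly as follows. This is only an illustrative sketch, not code from the thread: the function name `write_like_convert` and the returned call log are hypothetical, and it assumes the two stat calls are path-based `stat` calls (the strace output is not quoted, so they might equally be `fstat`). It also assumes the second stat happens even when no partial block is left over.

```python
import os

BLOCK = 8 * 1024  # 8KB, the block size reported from the strace


def _write_all(fd, chunk):
    """os.write may write fewer bytes than asked, so loop until done."""
    view = memoryview(chunk)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_like_convert(path, data):
    """Write data as: open, stat, full 8KB blocks, stat, final partial block.

    Returns a list describing the calls made, in order.
    """
    # O_LARGEFILE is not defined on every platform; use 0 where it is absent.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_LARGEFILE", 0)
    fd = os.open(path, flags, 0o644)
    calls = []
    try:
        os.stat(path)  # first stat, right after the file is created/opened
        calls.append("stat")

        offset = 0
        # Full 8KB blocks for as long as at least 8KB remains.
        while len(data) - offset >= BLOCK:
            _write_all(fd, data[offset:offset + BLOCK])
            calls.append("write:%d" % BLOCK)
            offset += BLOCK

        os.stat(path)  # second stat, before the final (partial) write
        calls.append("stat")

        tail = data[offset:]
        if tail:
            _write_all(fd, tail)
            calls.append("write:%d" % len(tail))
    finally:
        os.close(fd)
    return calls
```

Pointing `path` at a file inside the Coda mount while venus is write-disconnected would be one way to see whether this sequence alone is enough to trigger the problem.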
I'd like to know more about the reintegrations: for instance, is convert
running for a long time, and is the initial CREATE operation already
reintegrated before we're done writing the file?

> so now i have convert writing the file to /tmp, and then i'm renaming
> the file into coda. since making this change to our web application a
> few days ago, i haven't had a single crash. (still crossing my
> fingers!)

Well, if that works, definitely keep it like that for now. I'll try to
simulate the convert behaviour with a small test program and see if I can
trigger the problem.

Jan

Received on 2004-05-17 15:19:00
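For reference, the workaround Steve describes (produce the file outside Coda, then move the finished file in) might look something like the sketch below. The function name `write_then_move` and its parameters are hypothetical, and the data is written directly here rather than by running convert. Note that when the temporary directory and the destination are on different filesystems, which is the usual case for /tmp and a Coda mount, the "rename" is really a copy followed by a delete. `shutil.move` handles both cases.

```python
import os
import shutil
import tempfile


def write_then_move(data, dest_path, tmp_dir=None):
    """Write data to a temporary file first, then move the finished file
    to dest_path, so the destination only ever sees the complete file."""
    # tmp_dir=None lets tempfile pick the system default (usually /tmp).
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Uses rename when source and destination share a filesystem,
        # otherwise falls back to copying the file and removing the source.
        shutil.move(tmp_path, dest_path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return dest_path
```

`mkstemp` creates the file with mode 0600, so a web application may need an `os.chmod` on the result before other processes can read it.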