On Mon, Oct 04, 2010 at 09:19:17AM +0200, u-codalist-wk5r_at_aetey.se wrote:
> Linux 2.6 has always been flaky with regard to concurrent file system
> access races. Coda can easily trigger this. I guess it is "optimization"
> shortcuts in the kernel (not in Coda) who cause this.

It is related to the revalidation of a newly cached path component in
real_lookup. I guess when the validation fails it is assumed that the
dentry creation raced with a removal of the same object and ENOENT is
returned.

> A move from 2.6.26 to 2.6.32 happens to make the races much more
> pronounced. My environment is run from Coda and I notice e.g. that
> running a loop with about 800 ls|tail commands produces around a dozen
> "sh: /coda/..../ls: No such file or directory" messages.

2.6.26 was released Jul 13th, 2008
2.6.32 was released Dec 2nd, 2009

Between the two versions there were 70208 commits, 32170 files were
changed, more than 5 million lines were added and over 2 million lines
were removed.

The only advantage here is that it is most likely some change in
fs/namei.c, which has only seen 73 commits (~1000 lines changed). There
is some reworking of the path walk, but I don't see any
revalidation-related changes.

What I did see was the following commit [1], which I believe may fix the
problem either way, as it removes/replaces the test where ENOENT is
returned when revalidation fails. It looks like that patch went into
2.6.36-rc2.

> It is in the first hand sh pipelines which exhibit the problem,
> in other words parallel lookups on Coda due to execve() and open()
> on the libraries.

Actually the deep recursive symlink paths in your tree are triggering a
lot of uncached lookups. I don't know why d_revalidate is failing; maybe
some things are getting dropped from the (kernel or venus) cache when
memory is allocated during the path lookup.

> Anybody here who can guess where in the kernel the problem lies, to talk
> to the developers?
>
> Today Linux 2.6 moves from being not totally reliable
> to quite unusable with Coda in certain setups. That's really disturbing.

I have not reliably reproduced your problem and for some reason am
unable to reproduce it on any machine with >1GB of main memory.

It may be unusable for you, but it is a quite hard-to-trigger race
condition that doesn't affect most people. It seems to require storing
binaries and shared libraries in Coda which are accessed through a
recursive and deep symlink forest.

Other file system developers have occasionally hit on the same problem
[2][3]; Nick's patch seems to be the first one that has actually been
accepted.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2e2e88ea8c3bd9e1bd6e42faf047a4ac3fbb3b2f
[2] http://marc.info/?l=linux-fsdevel&m=121936252707440&w=2
[3] http://marc.info/?l=linux-fsdevel&m=125378110215043&w=2

Jan

Received on 2010-10-08 01:35:13
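
For anyone who wants to check whether their own setup hits the race, the loop of about 800 ls|tail pipelines quoted above could be approximated roughly as follows. This is only a sketch and not the script that was actually used: the function name count_failed_pipelines and the rule "any output on stderr counts as a failure" are assumptions made for illustration, and it only says something about Coda if ls and its shared libraries are resolved through /coda on the machine where it runs.

```python
import subprocess


def count_failed_pipelines(pipeline: str, runs: int = 800) -> int:
    """Run `pipeline` under sh `runs` times and count the runs that wrote to stderr.

    Assumption: a failed lookup shows up as a shell message on stderr
    (for example "sh: .../ls: No such file or directory"). The exit
    status is not used, because a pipeline reports the status of its
    last command (tail), which succeeds even when ls could not be run.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            ["sh", "-c", pipeline],
            stdout=subprocess.DEVNULL,  # the listing itself is not interesting
            stderr=subprocess.PIPE,     # keep only the error messages
            text=True,
        )
        if result.stderr.strip():
            failures += 1
    return failures


# Hypothetical usage, run from an environment whose binaries live in Coda:
#     print(count_failed_pipelines("ls | tail"), "of 800 pipelines failed")
```

On a system that is not affected the count should be 0. Keep in mind that anything else ls prints on stderr (permission errors, for instance) is counted as well, so a nonzero result needs a look at the actual messages before blaming the lookup race.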