Coda File System

Re: Linux 2.6.32 seems to exaggerate the race bug(s) with Coda

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 8 Oct 2010 01:34:51 -0400

On Mon, Oct 04, 2010 at 09:19:17AM +0200, u-codalist-wk5r_at_aetey.se wrote:
> Linux 2.6 has always been flaky with regard to concurrent file system
> access races. Coda can easily trigger this. I guess it is "optimization"
> shortcuts in the kernel (not in Coda) which cause this.

It is related to the revalidation of a newly cached path component in
real_lookup.  I guess that when the revalidation fails, the kernel
assumes the dentry creation raced with a removal of the same object and
returns ENOENT.

> A move from 2.6.26 to 2.6.32 happens to make the races much more
> pronounced. My environment is run from Coda and I notice e.g. that
> running a loop with about 800 ls|tail commands produces around a dozen
> "sh: /coda/..../ls: No such file or directory" messages.

	2.6.26 was released Jul 13th, 2008
	2.6.32 was released Dec 2nd, 2009

Between the two versions there were 70208 commits, 32170 files were
changed, more than 5 million lines were added and over 2 million lines
were removed.  The only advantage here is that the cause is most likely
some change in fs/namei.c, which has only seen 73 commits (~1000 lines
changed).  There is some reworking of the path walk, but I don't see any
revalidation-related changes.

What I did see was the following commit [1] which I believe may fix the
problem either way, as it removes/replaces the test where ENOENT is
returned when revalidation fails.

It looks like that patch went into 2.6.36-rc2.
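
To make the failure mode concrete, here is a toy model in Python. It is
not the kernel code: the function name, the dict-based "cache" and
"directory", the "inode-42" label and the retry_on_stale switch are all
made up for illustration, and treating the fix as "retry with a fresh
lookup" is my assumption about what replacing the ENOENT test amounts to.

```python
import errno
import os


def lookup_component(directory, cache, name, revalidate, retry_on_stale):
    """Toy model of looking up one path component through a cache.

    directory      -- dict standing in for what the file system really holds
    cache          -- dict standing in for the dentry cache
    revalidate     -- callable deciding whether a cached entry is still good
    retry_on_stale -- False models "ENOENT when revalidation fails",
                      True models retrying with a fresh lookup (assumed fix)
    """
    entry = cache.get(name)
    if entry is not None:
        if revalidate(entry):
            return entry
        # Revalidation failed: drop the stale cache entry.
        del cache[name]
        if not retry_on_stale:
            # Treat the failure as "the object was removed under us".
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), name)

    # Cache miss, or stale entry with retry enabled: ask the file system.
    if name not in directory:
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), name)
    cache[name] = directory[name]
    return cache[name]


# The object exists, but the cached entry never passes revalidation.
directory = {"ls": "inode-42"}
always_stale = lambda entry: False

try:
    lookup_component(directory, {"ls": "inode-42"}, "ls", always_stale,
                     retry_on_stale=False)
    old_result = "found"
except FileNotFoundError:
    old_result = "ENOENT"

new_result = lookup_component(directory, {"ls": "inode-42"}, "ls", always_stale,
                              retry_on_stale=True)
```

In this model the first call reports ENOENT for an object that still
exists, while the second call finds it again through the fresh lookup.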

> It is primarily sh pipelines which exhibit the problem,
> in other words parallel lookups on Coda due to execve() and open()
> on the libraries.

Actually the deep recursive symlink paths in your tree are triggering a
lot of uncached lookups. I don't know why d_revalidate is failing; maybe
some things are getting dropped from the (kernel or venus) cache when
memory is allocated during the path lookup.

> Anybody here who can guess where in the kernel the problem lies, to talk
> to the developers? Today Linux 2.6 moves from being not totally reliable
> to quite unusable with Coda in certain setups. That's really disturbing.

I have not reliably reproduced your problem and for some reason am
unable to reproduce it on any machine with >1GB of main memory. It may
be unusable for you, but it is a race condition that is quite hard to
trigger and doesn't affect most people. It seems to require storing
binaries and shared libraries in Coda which are accessed through a
recursive and deep symlink forest.
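
For anyone who wants to measure how often this shows up, a rough sketch
of the kind of loop described in the report follows. The function name is
made up, and counting "any output on stderr" as a failure is an
assumption; the exact wording of the shell's error message differs
between shells, so the sketch does not try to match it.

```python
import subprocess


def count_failures(pipeline, iterations):
    """Run a shell pipeline repeatedly; return how many runs wrote to stderr."""
    failures = 0
    for _ in range(iterations):
        result = subprocess.run(
            pipeline,
            shell=True,                 # let sh build the pipeline, as in the report
            stdout=subprocess.DEVNULL,  # the listing itself is not interesting
            stderr=subprocess.PIPE,     # capture messages such as "No such file..."
            text=True,
        )
        if result.stderr.strip():
            failures += 1
    return failures
```

Calling count_failures("ls | tail", 800) in an environment where ls and
its libraries are resolved from Coda would approximate the loop from the
report; on a setup without the race it should simply return 0.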

Other file system developers have occasionally hit the same problem
[2][3]; Nick's patch seems to be the first one that has actually been
accepted.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2e2e88ea8c3bd9e1bd6e42faf047a4ac3fbb3b2f
[2] http://marc.info/?l=linux-fsdevel&m=121936252707440&w=2
[3] http://marc.info/?l=linux-fsdevel&m=125378110215043&w=2

Jan
Received on 2010-10-08 01:35:13