Coda File System

Re: Unable to do beginrepair...

From: Greg Troxel <gdt_at_ir.bbn.com>
Date: 12 Dec 2002 09:32:09 -0500
Not much spam and no gnutella, but probably a web browser open on
the weather page that updates occasionally.

IPsec speed should be ok; server is a PPro-200 and client a Pentium IV
2GHz, which should be able to keep up (doing AES and HMAC-SHA1).
Seriously, it's not the CPU speed, although there could be hiccups
during SA renegotiation (but I don't think so).

  If global is a symlink to the volumeid, the client hasn't been able to
  mount the volume. Perhaps it timed out during the volume lookup or
  something. Another reason would be a server-server conflict, but as you
  only have a single server that probably isn't the case.

This seemed to persist, even though I could do 'cfs lv'.  My real
problem isn't that something bad happened, it's that I could not
recover.
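
For the record, the failure the quoted paragraph describes is easy to
spot from the shell.  The path below is made up and the exact symlink
target format may differ by Coda version, but the expanded conflict's
'global' entry shows up as a dangling symlink to the volume while the
parent volume still answers 'cfs lv':

  $ ls -l /coda/usr/gdt/global
  lrwxr-xr-x  1 root  wheel  9 Dec 12 09:00 global -> #7f000123
  $ cfs lv /coda/usr/gdt        # the volume itself is still visible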

  Yup, norton is only a server thing. 'cfs fl' is in many cases an evil
  operation, it flushes the data of cached objects in the specified
  subtree. It shouldn't touch 'dirty' objects (i.e. the ones with
  associated CML entries), but you never know.

So really 'cfs fl' should only discard cached data that is still in
the 'read-only' state, and thus should be safe at any time.  If not,
it probably should be fixed.

  cfs fl is mostly useful for debugging, it can be used to push an object
  out of the cache, so that I can test or time the fetching of an object.

Sure, so I was doing the wrong thing with it, but it still should not
have caused trouble.
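
For concreteness, the timing use mentioned above is roughly the
following (the path is made up):

  # flush a clean object out of the cache, then time the re-fetch
  cfs fl /coda/usr/gdt/somebigfile
  time cat /coda/usr/gdt/somebigfile > /dev/null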

  The CML is an ordered log, and you cannot simply kill an entry within
  the log. The only possible operations are by stepping through the
  operations with 'repair', or 'cfs purgeml'. Maybe you could try
  'removeinc', I know someone here tried to make that work for
  local-global conflicts, but I'm not sure it ever worked.

I should have tried purgeml, but given that I had a 'local
inconsistent object' I bet it would not have worked.  Perhaps there
should be some way to drop the head entry off the CML, analogous to
'discardlocal' in repair.  Maybe that way already exists, and the real
problem is that entering repair mode can fail.  In my book, entering
repair mode should succeed any time there is a conflict.
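
To be concrete, the kind of session I was trying for looks roughly
like the sketch below (the conflict path is made up, and I may be
misremembering the exact prompt); this is what fell over for me at
the beginrepair step:

  $ repair
  repair > beginrepair /coda/usr/gdt/wi0.dump
  repair > checklocal            # look at the head CML entry
  repair > discardlocal          # drop just that entry
  repair > endrepair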

  Ugh, these are local 'fake' fids, but during reintegration these should
  have been replaced with correct global fids. Maybe your kernel is a bit
  too aggressive in caching the directory contents, or the objects got lost as
  a result of the cfs fl or failed repairs.

This is on NetBSD 1.6, but I have had similar experiences on FreeBSD.

  Ok, so we have a CML entry to create a file named 'local', but the
  client is unable to find the associated container file. That's a pretty
  bad state right there. The name is typically the last name used to
  access the object, I guess this got expanded as a file conflict and the
  'local' directory entry was the last name used to access wi0.dump.

I never created anything called 'local'.  I am pretty sure this is from
the failed repair session.

  > 08:56:56 Callback failed RPC2_DEAD (F) for ws [client-in-question]:65516
  > 09:01:19 Callback failed RPC2_DEAD (F) for ws [client-in-question]:64967

  These are a result of timeouts; for some reason the client is not
  responding to (or not receiving) rpc2 callback messages. As a result the
  server will kill all incoming connections from that client.

This probably happened during congestion on the link.  That's life,
and shouldn't cause lasting trouble (I'd expect a transition from WD
to disconnected, and then back again once the server probe succeeds,
picking up reintegration where it left off).

  Hey I've got 33k6, I'll throttle it to 28.8 for a while to see if I get
  hit by similar problems.

I don't think that will make the difference.  Mine is nominally 33.6,
but I get 28 or 26.  It could be that the BSD kernel support is
buggy.  That would be fair enough after all the trouble I've seen with
Linux kernels and Coda over the years.

  The server should deal with retried reintegrations, not sure why it
  doesn't seem to do that in your case.

Do you mean

  modification on client while WD
  try to reintegrate
  Backfetch times out, causing Store to fail
  pause
  try to reintegrate the same Store
  Backfetch works this time
  [successful store with no conflict]

is what you think happens with the current code?

My issue isn't

  I get timeouts once in a while and go disconnected when I think I
  shouldn't, but venus reintegrates stuff later so it's only annoying.

This would be an annoyance.  It's more like

  When I modify files in the client, and the client was WD to start with,
  and no one else is writing to that volume, I end up with conflicts that
  I cannot repair and I have to reinit venus.

which would cause me to stop using Coda if I hadn't already integrated
it into how I work.

What's the state of the realms branch and the future repair changes?
It seems like repair (venus's representation of stuff) is bletcherous
now.  But it may be that the problem is in the NetBSD kernel code.

I wonder if putting some more aggressive cache flushing into
venus/netbsd would help.  I'd happily trade performance for not losing
data, and then we'd know where to fix things.  I admit I have assumed
that the problem is in venus, and that isn't necessarily clear.


        Greg Troxel <gdt_at_ir.bbn.com>
Received on 2002-12-12 09:39:20