On Wed, May 25, 2005 at 02:51:04PM -0600, Patrick Walsh wrote:
> > This could mean that your client cache is too small, or that the file
> > was deleted on the server between the status and data validation phases.
> > As there were 131 interrupts, this would indicate that this is happening
> > quite often.
>
> 	These log entries were referring to a log file, so I'm sure it was
> being opened and closed pretty frequently. As a workaround we've moved
> the log files to /tmp and just use cron to copy them to coda.

That's where write-disconnected operation would be really nice, because the
writes would happen in the background. Of course it also increases the
chances of getting a conflict.

> > So we fetch the new version (which is 653 bytes larger), and again it
> > already changed on the server before we were done. I guess some process
> > is opening/writing/closing the file a lot, and every close results in
> > a store.
>
> 	True to the second part, but there is only one client writing to the
> file and no conflicts ever occurred, so I'm not sure the version vectors
> differed.

Interesting, I wonder why the client would bother about the version vectors
if it is the one that performed the store in the first place.

I found out why we see the VV difference errors. UpdateStatus used to be
split into a mutating and a non-mutating version, depending on whether we
had results from a mutating operation. The two versions were merged, but
the comparison function was still called as if we were always performing an
attribute revalidation. As a result you actually _want_ to see the VV
difference error, because otherwise the local object wouldn't get refreshed.

> > Ah, when tokens are refreshed all old connections are first destroyed.
>
> 	Why is this? We have cron jobs that refresh tokens three times a day,
> plus for various operations, like our cron jobs that download updates,
> the tokens are refreshed at the beginning of the operation to ensure
> successful writes. So are you saying that if something is trying to
> write a file and something else refreshes the tokens for that user (even
> though the coda user <-> local user mapping is the same) that the write
> operation will be broken off? (And potentially cause a crash?) If so,
> can this behavior be changed?

It shouldn't affect pending operations. All connections have an 'inuse'
flag (refcount). If a connection is still active we simply flag it as dying
and it gets reclaimed when the operation completes. All other connections
send a Disconnect message to the server first and are then reclaimed.

I have read the code up and down and found only one place where we could
possibly go wrong if the refcount is > 1, but that should never happen,
since we shouldn't grab a reference to any connection object that is
already in use...

Got the culprit. In userent::Reset() we have:

    /* Delete the user's connections. */
    {
        struct ConnKey Key;
        Key.host.s_addr = INADDR_ANY;
        Key.uid = uid;
        conn_iterator next(&Key);
        connent *c = 0;
        connent *tc = 0;
        for (c = next(); c != 0; c = tc) {
            tc = next();            /* read ahead */
            if (tc) tc->GetRef();   /* make sure we don't lose the next connent */
            (void)c->Suicide(1);
            if (tc) tc->PutRef();
        }
    }

Combined with, in PutConn(connent **cpp):

    if (!c->HasRef())
        { c->print(logFile); CHOKE("PutConn: conn not in use"); }

    if (c->dying) {
        connent::conntab->remove(&c->tblhandle);
        delete c;
    }
    else
        c->PutRef();

If a connection is already in use, the userent::Reset iterator bumps the
refcount on the 'next connent' to 2. Suicide makes an RPC2 call to the
server and yields waiting for a reply. If the in-use connection then
returns first, and it was already marked as 'dying', the call to PutConn
takes the entry off the list and destroys it. When Suicide returns and
tries to iterate to the next entry, it segfaults.
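To make the interleaving concrete, here is a minimal stand-alone sketch of
the race (not actual Venus code; the names Conn, put_conn_old and
put_conn_new, and the single-threaded simulation of the LWP ordering, are
made up for illustration). It shows why destroying a dying connection
without first dropping the caller's reference leaves userent::Reset holding
a dangling 'next connent', and why decrementing the reference count before
checking 'dying' avoids that:

    #include <cstdio>

    struct Conn {
        int  refcount = 0;     /* the 'inuse' count */
        bool dying    = false;
        bool deleted  = false; /* stands in for 'delete c' */
    };

    /* Old PutConn-style logic: a dying connection is destroyed right away,
     * even if someone else (e.g. the Reset read-ahead) still holds a
     * reference to it. */
    void put_conn_old(Conn *c) {
        if (c->dying) c->deleted = true;
        else          c->refcount--;
    }

    /* Patched logic: drop our reference first, and only destroy the object
     * once nobody references it anymore. */
    void put_conn_new(Conn *c) {
        c->refcount--;
        if (c->refcount == 0 && c->dying) c->deleted = true;
    }

    void race(void (*put_conn)(Conn *), const char *label) {
        Conn next;             /* the 'next connent' in userent::Reset() */
        next.dying    = true;  /* already marked dying (the open puzzle) */
        next.refcount = 1;     /* in use by a long-running RPC2 operation */

        next.refcount++;       /* Reset(): tc->GetRef() read-ahead reference */
        /* ... Reset() blocks in c->Suicide(1) and yields to other LWPs ... */

        put_conn(&next);       /* the in-use operation completes first */

        /* ... Suicide() returns, Reset() touches 'next' to iterate ... */
        printf("%s: next connent %s while Reset still references it\n",
               label, next.deleted ? "was destroyed" : "is still alive");
    }

    int main() {
        race(put_conn_old, "old PutConn"); /* destroyed -> segfault in Venus */
        race(put_conn_new, "new PutConn"); /* alive     -> loop continues    */
        return 0;
    }

The patched ordering is what the diff below implements: the reference is
always dropped first, and deletion is deferred until the last reference to
a dying connection goes away.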
This also explains why we don't see this all too often: it depends on
having a long-running RPC2 operation in flight while connections are being
destroyed. The only puzzle left is why the next connection was already
marked as dying; that would indicate we had already gone through this loop
once before, and I haven't figured out how that could have happened. The
attached patch should fix the problem, though.

Jan

Index: comm.cc
===================================================================
RCS file: /afs/cs/project/coda-src/cvs/coda/coda-src/venus/comm.cc,v
retrieving revision 4.90
diff -u -u -r4.90 comm.cc
--- comm.cc	7 Mar 2005 21:53:43 -0000	4.90
+++ comm.cc	26 May 2005 15:54:48 -0000
@@ -256,12 +256,12 @@
     if (!c->HasRef())
 	{ c->print(logFile); CHOKE("PutConn: conn not in use"); }

-    if (c->dying) {
+    c->PutRef();
+
+    if (!c->HasRef() && c->dying) {
 	connent::conntab->remove(&c->tblhandle);
 	delete c;
     }
-    else
-	c->PutRef();
 }

Received on 2005-05-26 12:11:40