Hello everybody!

I have a similar problem in my Coda realm. Venus crashes when I try to copy (overwrite) or delete many files. I use Coda in a production environment for a web hosting solution. During the testing period this situation never happened. My guess is that the problem occurs because the high number of files accessed per second triggers false conflicts.

I am running coda-server and coda-client version 6.0.16 on Debian Sarge (stable) boxes and so far I haven't encountered any other problems. It works very smoothly.

Do you think that installing version 6.9.0 would avoid the problem with false conflicts? Are there any other suggestions for this situation?

Regards,
Florin

-----Original Message-----
From: Jan Harkes [mailto:jaharkes_at_cs.cmu.edu]
Sent: Monday, March 12, 2007 8:09 PM
To: codalist_at_coda.cs.cmu.edu
Subject: Re: venus crash

On Mon, Mar 12, 2007 at 05:45:28AM +0100, S. Cance wrote:
> I missed something in the console log file :
>
> 05:38:55 RecovTerminate: dirty shutdown (1 uncommitted transactions)
> Assertion failed: 0, file "fso_dir.cc", line 96
> Sleeping forever. You may use gdb to attach to process 1890.

This is an assertion that triggers when we try to create a new filename entry in a directory, but the name already exists. This situation should never occur: before we try to create the new directory entry we should have checked whether it exists.

So we may be forgetting (in some code path) to check if the name already exists. It is also possible that we do check for existence, but somehow combine that test with a check for the validity of the object the name refers to, and if that object is missing (which is definitely possible during disconnected mode) we assume that the directory entry can be safely created.

Either way, it is a bug. It may be hard to track down. Clearly we're adding some name.
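
The second possibility described above can be illustrated with a small sketch. This is not Venus code: the `Directory` class, the `cached_objects` set, and both `name_is_free*` helpers are hypothetical names invented for illustration, and the toy model assumes a directory is simply a mapping from names to object ids.

```python
class Directory:
    """Toy directory mapping names to object ids (hypothetical, not Venus code)."""

    def __init__(self):
        self.entries = {}

    def create(self, name, obj_id):
        # Mirrors the spirit of the assertion: a create must never hit an existing name.
        assert name not in self.entries, f"entry {name!r} already exists"
        self.entries[name] = obj_id


def name_is_free_buggy(directory, cached_objects, name):
    # Suspected flaw: existence and object validity are folded into one test,
    # so a name whose object is missing from the cache looks free.
    return not (name in directory.entries
                and directory.entries[name] in cached_objects)


def name_is_free(directory, name):
    # Existence is decided by the directory entry alone.
    return name not in directory.entries


d = Directory()
d.create("index.html", obj_id=1)
cached_objects = set()  # object 1 is not cached, e.g. while disconnected

print(name_is_free(d, "index.html"))                       # False
print(name_is_free_buggy(d, cached_objects, "index.html"))  # True
```

In this toy model, a caller that trusts `name_is_free_buggy` would go on to call `d.create("index.html", ...)` and hit the assertion, which is the kind of sequence the debug log would need to confirm.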
If you are using vim, it has several different ways it could possibly handle how files are written (create backup, overwrite original, or move original to backup, create new version, etc.).

Is the problem reliably reproducible? In that case you could rotate the log, bump the debug level on your client and try to trigger the problem.

    (vutil -swap ; vutil -d 100) # rotate logfiles and set debug level 100

Then if the problem occurs again the log hopefully will contain enough detail to figure out what sequence of operations caused the problem. If the problem didn't occur it is probably best to set debugging back to 0 to avoid filling up your local disk.

    (vutil -d 0) # turn off excessive debug logging.

> I lost the modifications on the file, but what is surprising is that I
> get CML conflicts on vim's swp's files.
>
> is it normal behaviour ?

There is a case where false conflicts are detected even when it only involves write operations, or more specifically the store of new file data when a file is closed, from a single client.

If your client is older than 6.9.0, it can switch between 2 different modes of operation. In the 'connected' mode it sends individual operations to the server (create/store/chown/chmod). The other mode is typically called 'write-disconnected', but the same mechanism is used after disconnections, so a better name is probably 'reintegrating': it logs the operations and reintegrates them several at a time.

Now when a connected mode store operation fails (e.g. network timeout) we don't know if it ever reached the server. So to make sure we don't lose any data we switch from connected mode to 'reintegrating' and perform the store operation again, this time logging it in the CML so that we will resend the operation when we get connected again.

Now the false conflict is triggered when the server did in fact see the connected store operation and committed it locally, but the reply was lost because of the disconnection/network timeout.
In that case the retried store from the CML is trying to update a file that was already updated on the server and it is flagged as a conflict. Reintegration does know how to detect repeated operations because it assumes we have bad connectivity, but it only works when the operation was previously sent by reintegration.

6.9.0 never uses connected mode, so everything is logged and reintegrated and this type of false conflict doesn't happen.

Jan

Received on 2007-03-13 09:14:21
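
The false-conflict sequence Jan describes (connected store committed, reply lost, same store replayed from the CML) can be simulated in a few lines. This is a rough sketch under loud assumptions: the `Server` class, the single per-file version counter, and `connected_store` are all hypothetical simplifications, and the real Coda server and Venus do not work this simply.

```python
class Server:
    """Toy server keeping one version counter per file (hypothetical)."""

    def __init__(self):
        self.version = {}
        self.data = {}

    def store(self, path, data, base_version):
        # A store is accepted only if the client based it on the current version.
        current = self.version.get(path, 0)
        if base_version != current:
            return "conflict"
        self.data[path] = data
        self.version[path] = current + 1
        return "ok"


def connected_store(server, path, data, base_version, reply_lost):
    result = server.store(path, data, base_version)
    # If the reply is lost, the client cannot tell whether the store was committed.
    return None if reply_lost else result


server = Server()
cml = []  # client modification log of operations awaiting reintegration

# Connected-mode store: the server commits it, but the reply never arrives.
reply = connected_store(server, "a.txt", b"new", base_version=0, reply_lost=True)
if reply is None:
    # The client switches to 'reintegrating' and logs the same store again.
    cml.append(("a.txt", b"new", 0))

# Reintegration replays the logged store against the already-updated file.
results = [server.store(path, data, base) for path, data, base in cml]
print(results)  # ['conflict']
```

The server already holds exactly the data the client wanted to write, yet the replay is rejected, which is why the conflict is "false". In this toy model, logging every store from the start (as 6.9.0 is described as doing) would let the replay be recognised as a repeat instead.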