And the logs ....

> -----Original Message-----
> From: Florin Muresan [mailto:Florin.Muresan_at_atc.ro]
> Sent: Thursday, March 15, 2007 5:03 PM
> To: Jan Harkes
> Cc: codalist_at_coda.cs.cmu.edu
> Subject: RE: venus crash
>
> Hello!
>
> Thank you Jan for your answer.
>
> > -----Original Message-----
> > From: Jan Harkes [mailto:jaharkes_at_cs.cmu.edu]
> > Sent: Tuesday, March 13, 2007 4:57 PM
> > To: Florin Muresan
> > Subject: Re: venus crash
> >
> > On Tue, Mar 13, 2007 at 12:45:32PM +0200, Florin Muresan wrote:
> > > Hello everybody!
> >
> > I guess this was intended to be sent to codalist,
>
> Yes, it was, but I missed the reply-all button. Sorry for that.
>
> > > I have a similar problem in my coda realm. Venus crashes when I try
> > > to copy(overwrite)/delete many files.
> > > I use coda in a production environment for a web hosting solution.
> > > During the testing period this situation never happened. My guess is
> > > that the problem occurs because of the high number of accessed files
> > > per second that triggers false conflicts.
> >
> > High rate of accesses isn't really something that would trigger false
> > conflicts. Also, false conflicts should not cause a client to crash.
>
> I must say that you are right and I made a mistake in saying that the
> Coda client crashes. In fact it just hangs. I have read some of the
> emails from the codalist posted last year and I think I understand
> better how Coda works.
>
> > There are some problems that I have observed with our own web/ftp
> > servers, which get pretty much everything out of /coda.
> >
> > - The Coda client is a userspace process, so all file open and close
> >   requests are forwarded to this single process where they are queued.
> >   There is a set of about 20 worker threads that pick requests off the
> >   queue and handle them.
> >   This is in a way similar to how a web-server has one incoming
> >   socket where requests are queued, which are then accepted by a
> >   web-server instance (thread or process).
> >
> >   In-kernel filesystems have it a bit easier, when the webserver makes
> >   a system call the kernel simply uses the application context, so
> >   there is no queueing and it can handle as many requests in parallel
> >   as there are server instances.
> >
> >   Now we only really see open and close requests; all individual read
> >   and write calls are handled by the kernel, so if the client has most
> >   or all files cached the worker threads don't have to do much and are
> >   pretty efficient. However if some (or a lot) of files are updated on
> >   the server, most of the locally cached data is invalidated and almost
> >   every request ends up having to fetch data from the Coda servers. So
> >   each request takes longer, and may have some locks on a volume
> >   because it is updating the local cache state. So in the worst case we
> >   only process a single request at a time and the queue becomes very
> >   long, blocking all web-server processes.
>
> I think this is exactly what happened when I tried to delete about 900
> files at one time and the result was that the apache webserver got
> blocked, slowing down the whole system.
>
> My Coda setup consists of one Coda server and three Coda clients (for
> the moment). It is very important to have the same web documents on all
> three clients because I implemented a load-balancing solution and all
> clients must serve the same content to visitors. Coda solves this very
> elegantly.
>
> Back to the point. Trying to solve the problem I terminated the apache
> processes and then restarted venus. After the restart the whole /coda
> volume was in disconnected state and I feared that I had lost all the
> data.
> I had to move quickly because all of my websites were down and the only
> quick solution that I could think of was to purge and reinstall the
> coda-client package on the system where I deleted the files. I thought
> this way I would avoid any conflicts that could appear.
> The curious thing is that the other two Coda clients were hanging after
> this problem occurred.
>
> For the sake of readability I attached the relevant logs.
>
> After I reinstalled the coda-client and restarted the Coda server,
> /coda became accessible on all the clients but I had to wait a while
> for it to reintegrate.
>
> I believe now that it wasn't necessary to reinstall the coda-client
> because all the conflicts would have been resolved automatically and
> that a restart of the Coda server would have been enough.
>
> > - Another thing that can happen is that when one client is updating a
> >   file, another client sees the update before the final commit. At
> >   this point the versions are still skewed. A write first updates each
> >   replica and then uses the success/failure status to synchronize the
> >   versions. So if we see a changed object before the versions are
> >   synchronized the reading client believes there is a conflict and
> >   triggers server-server resolution. As a result the servers lock down
> >   the volume, exchange their copies of the file or directory, compare
> >   the differences and decide which one would be considered the most
> >   up-to-date version.
> >
> >   We detect the missed version synchronization because the contents
> >   are identical, this is a 'weak-equality' type resolution and so the
> >   servers reset the versions to be correct again. Then when the writing
> >   client finalizes the operation, the versions end up getting bumped
> >   for a second time, skewing them again, requiring the reading client
> >   to refresh its cache and triggering another resolution.
> >   There is not a correctness issue here, but the additional 2
> >   resolution phases definitely slow down everything because they add
> >   an additional 10 roundtrips and take an exclusive lock on the volume,
> >   preventing all readers and writers even for unrelated objects within
> >   the volume.
> >
> > Neither of these would introduce crashes or conflicts though, mostly a
> > temporary performance degradation where all web servers are blocked
> > until the system catches up again with all the queued requests.
>
> It's clear to me now why I get this performance degradation when trying
> to copy many files, but this is not at all desirable in a production
> environment.
>
> > > Do you think that if I install the version 6.9.0 the problem with
> > > false conflicts will be avoided?
> >
> > Not sure if false conflicts are your problem. A crash is a bug, even
> > when there are conflicts we shouldn't crash. With 6.9 we basically end
> > up using an existing code-path that was normally only used during
> > disconnections or when connectivity was poor. That code has been
> > around for a long time, but really hadn't been tested all that well
> > because it was the fallback path. So I do actually expect it to be
> > somewhat less reliable. However as it is the same code that is used by
> > older clients when things went wrong it isn't really a step back. Any
> > bugs that could happen in unusual situations have become bugs that
> > will happen.
> >
> > > Any other suggestions for this situation?
> >
> > What does your setup look like? Are there replicated volumes or is
> > everything backed by a single Coda server (i.e. a single server would
> > never have resolution issues)? How many Coda clients are there? Are
> > updates being written by one client or by several clients? Which
> > client crashes, the one that is writing or another that is only
> > reading?
>
> I have described the setup above but I must make some remarks.
> Typically, all the Coda clients will write from time to time. One of
> them is mainly used to write files, but only at specific times, and it
> was not doing any writing at the moment of the incident. In this case,
> the client that hung was one of the clients that rarely writes. At that
> time I used it to delete the files.
>
> > What web server are you using? How many threads/processes does it use?
> > How many requests per second are we talking about?
>
> I run Apache 2.0.54 with mpm_prefork_module on every client and it is
> currently set up for 200 MaxClients. The prefork module uses only one
> thread per process. The average number of requests varies from 20 to 50
> requests per client per second.
>
> > What is logged in /var/log/coda/venus.err when the client crashes?
> >
> > Jan
>
> Thank you for your time.
> Florin
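
As an aside, the queueing behaviour Jan describes in the quoted message (a pool of about 20 worker threads that can degrade to one request at a time while a volume lock is held) can be illustrated with a small back-of-the-envelope model. This is only a sketch and not Venus code: the function name drain_time and the two per-request timings (1 ms for a cache hit, 50 ms for a fetch from the server) are assumptions made up for illustration. Only the worker count and the figure of about 900 files come from the thread.

```python
import math


def drain_time(requests, workers, service_time, volume_locked):
    """Seconds needed to work through `requests` queued requests.

    Toy model only: every request takes `service_time` seconds. When
    `volume_locked` is true, each request is assumed to hold an exclusive
    volume lock, so only one worker makes progress at a time.
    """
    if requests < 0 or workers < 1 or service_time < 0:
        raise ValueError("invalid parameters")
    effective_workers = 1 if volume_locked else workers
    batches = math.ceil(requests / effective_workers)
    return batches * service_time


WORKERS = 20          # "about 20 worker threads" (from the thread)
QUEUED = 900          # about 900 files were deleted (from the thread)
CACHE_HIT = 0.001     # ASSUMED: 1 ms per request served from the local cache
SERVER_FETCH = 0.050  # ASSUMED: 50 ms per request that has to go to the server

warm = drain_time(QUEUED, WORKERS, CACHE_HIT, volume_locked=False)
cold = drain_time(QUEUED, WORKERS, SERVER_FETCH, volume_locked=True)
print(f"warm cache, 20 workers in parallel: {warm:.3f} s")
print(f"cold cache, serialized by the lock: {cold:.1f} s")
```

With these assumed timings the same queue takes roughly a thousand times longer to drain once requests are serialized. At an assumed 50 ms per serialized request the client completes at most 20 requests per second, which is no more than the 20 to 50 requests per second reported above, so the backlog would not shrink until the cache is warm again.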