From: Ivan Popov <pin@medic.chalmers.se>
> nice to see a detailed and structured report.

Nice to see a detailed and structured reply! Thanks for your post, Ivan.

> (yet I do not seem to find how big your client caches were)

Pretty big, varying from 100Mb to 10Gb.

>> 4b0fb02ca5944804cc403b6ff1f3797a ./affection/audio_01.inf
>> md5sum: ./affection/track05.cdda.flac: Connection timed out
>> find: ./affection/audio_08.inf: Connection timed out

> I think it has been discussed some time. There are situations when it
> takes a long time for the server to answer. Combined with some packet
> loss, it may lead the client to the conclusion that the server is
> unreachable. It is a fundamental functionality in Coda - the client
> decides itself

"Functionality" is not the word I would choose. I wouldn't even use the
word, "feature."

> when it goes disconnected. We do not want the client to stall too much.

Why not? We're talking about *reads* here. Were it a write, then, yes,
we could cache it and do it later. But it's a read. The choice is either
stall or abort. Coda aborts.

> Maybe it is possible to improve the protocol and the algorithm, but it
> is not at all "that bad", and it is hard to make changes to. There is
> room for improvement and Jan is well aware of the problem, but I think
> it is rather low on the priority list.

I would assert that it *is* that bad. It's broken. Let me summarise the
failure:

- Both the client and the server have a high-quality net connection.
  No phone lines. No cable modems. Real honest-to-goodness Internet.
- In this mode, coda will just arbitrarily blow away file reads.
- So you have no assurance your ops will win.

Note that I was not "pushing" the filesystem hard. I was just using it.
My client computer was accessing no more than one file from the coda fs
at a time. Granted, these were 3-10Mb files -- so what? I didn't have
1,000 processes, each of which had many files open, with concurrent
operations hitting the same files, which were also being accessed on
other servers. *That* would be pushing the filesystem.

The picture that I am developing here of coda is that if you tune it
just right, and your network connection has certain good properties, and
your patterns of access stay within some (unspecified) envelope, you
have good odds of winning most of the time. But you had better be
prepared to deal with failures whenever you operate on the filesystem;
they do happen.

In http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2004/6115.html
Jan wrote

| I've said this many times before, there is no such thing as guaranteed
| connected operation in Coda. If anything goes wrong during a
| write/store operation the client will silently switch to
| write-disconnected operation (logging state). If the server is slow to
| respond we switch to a logging state. And reversely, when the client
| can't be reached by the server, the server triggers the disconnect;
| we are likely to switch to a logging state.
|
| The only thing that cfs strong does is prevent the client from
| listening to the often incorrect 'bandwidth estimates' from the RPC2
| communication layer, so that transitions only happen in error cases
| and not based on incorrect estimates. In fact, if you were already
| write-disconnected before calling cfs strong, the client will never
| discover that the network actually has good bandwidth and will never
| transition to the connected state.
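To make Jan's point concrete, here is roughly what pinning a client by
hand looks like at the shell. A sketch only, assuming the stock cfs
subcommands ("strong" and "adaptive" are the documented switches for the
bandwidth-estimation behaviour; "checkservers" and "listvol" are the
long forms of "cs" and "lv"):

    # Sketch: tell venus to ignore the RPC2 bandwidth estimates, so the
    # client leaves connected mode only on actual errors.
    cfs strong

    # Probe the servers and check the volume state by hand -- the same
    # "dance" as cfs cs / cfs lv .
    cfs checkservers
    cfs listvol .

    # Restore the default, estimate-driven behaviour when done.
    cfs adaptive

Note Jan's caveat above, though: if the client is already
write-disconnected when you run "cfs strong", it will never notice that
the network is fine and will never transition back to connected mode on
its own.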
>> So the real-world operation of coda here is that if you start writing
>> a lot of data, you disconnect, and then your writes just fail. So you
>> can't ever count on some operation actually working; it could very
>> easily fail mid-stream.

> It depends on the operation and the circumstances. If you start the
> operation during good connectivity and then your mobile phone
> connection goes down, then both reading (obviously) and writing (say
> when you do not have enough space in the client cache or in the cml)
> can fail. Of course we do not want the connection to be treated as
> unavailable while the net and the server are still there. It will get
> better as time goes on, but for the moment you have to take
> precautions with bulk copies.

I note that it has been more than 10 years. And it appears to be, in
some sense, a deep part of coda's design philosophy.

>> that access my coda files sometimes win and sometimes seem to drive
>> the system into disconnected state, and then I must go through a
>>     cfs wr
>>     cfs cs
>>     cfs lv .
>> dance to reconnect. This happens when I am on a client with a
>> completely stable connection to the ethernet. We are not talking
>> phone lines here. This essentially renders coda unusable.

> I am familiar with the problem, still I find Coda usable. One
> workaround I had to use when my servers or network were slow was to
> run a loop of "cfs cs", which helps against disconnections.

That kind of voodoo is a symptom that something is really wrong. Let me
restate it: if I can reliably hang a network filesystem on a *connected
client* simply by doing

    find . -type f -exec md5sum {} \;

then the filesystem is broken. It's not my fault. The filesystem is
broken.

>> 2. Do other people lose in this way? / Are other people winning?

> It is a known problem ("unnecessary disconnections" while a retry or
> extra wait would help). A lot of complaints may raise the priority of
> a fix... There is probably a certain way to get the fixes done: just
> fund the work :) I'd rather accept these inconveniences in exchange
> for more important fixes and improvements.

>> 3. Is coda not ready for really big repositories (800Gb filesys, 1Gb
>> rvm metadata)?

> I am running with 768Mb rvm, but as my files are small - "typical
> Unix" :) - it maps to at most about 30G of data. It should not be any
> problem to fill more space with bigger files.

>> 4. Any advice at all?

> Coda offers unique possibilities - for some price. The usage pattern
> has to be "Coda friendly" - and probably will have to be, even after
> the ultimate fixes and improvements.

I am -- very regretfully -- concluding that coda is not something I can
use & am abandoning my attempts to use it. Bummer.

I've left out some other bad behaviours I've encountered. I managed this
week to generate this df output on a coda client:

    [root@northpoint coda]$ df
    Filesystem  1K-blocks                  Used  Available Use% Mounted on
    /dev/hda2    10321168               3625088    6171796  38% /
    /dev/hda1       99043                 38128      55801  41% /boot
    /dev/hda5    15481496               3280176   11414908  23% /var
    /dev/hda6    10321136               2553020    7243832  27% /home
    none            30916                     0      30916   0% /dev/shm
    coda           100000  -18446744069414692531 4294952835 101% /coda

I've also observed "cfs lv ." hang until I killed a venus on a
*different* client, which hinted to me that there was some kind of
locking going on that I didn't understand, but which, in any event, just
increased my feelings of unease.
    -Olin