On Fri, Jan 26, 2001 at 01:38:09PM -0500, Douglas C. MacKenzie wrote:
> As a thank you to this mailing list and the Coda developers
> I would like to pass on my thoughts and experiences with Coda.

We appreciate that. Any feedback is good feedback. I've commented on your annoyances with Coda, not to make them seem less important, but to show some of my views and my attempts to delve into the causes.

> I ran Coda for about 8 months on a small office cluster of
> 4 workstations and one server. I really liked the promise
> of disconnected operations, and looked forward to running
> a coda client on my win98 laptop, but it never sounded
> stable enough to bother with loading it up.

There are several big hurdles we have taken, and still have to take, to get Coda nicely integrated on Windows platforms.

First of all there is the problem of `multitasking'. Contrary to common belief, the Win32 API implementation for 95/98/ME is not reentrant and all applications are running in the same virtual machine (VM). Bouncing VFS calls up to userspace and going back into the kernel leads to deadlocks. Luckily Michael Callahan found a workaround: DOS boxes happen to run in their own VM, and he added socket and mmap APIs for DOS applications.

Then there is the difference in how applications look at the filesystem. Windows filesystems are case insensitive and, even worse, still have the 8.3 filenames deep in their bowels. A filesystem like Coda is hit by nice things like "Creat file_blah.txt", "Store FILE_BL~.TXT", "Open FiLe_bLaH.tXt" (all from the same `save file to disk' operation) and is expected to do the right thing.

> I had three major Coda problems over the 8 months. The first
> was due to clock skew. Coda needs to have the clocks set pretty
> closely and we kept getting reconnect conflicts until we
> started running ntpd. The daylight savings time roll over was
> really a pain.

The only effect clock skew has is on applications that use RVM, when they are restarted after the time has warped back. The client-server interaction definitely has no time-dependent parts; we wouldn't even dare consider calling this a _Distributed_ File System if it did. The only timestamps that are ever transferred are the file mtimes, and these are never used by either clients or servers. We have the non-time-based version vectors for conflict detection and resolution.

> The second problem was on-going, the clients would continually
> disconnect and reconnect, even when on a fast network connection.
> This caused no end of random clients running disconnected.

Were you going through masquerading firewalls, or is your network very congested? I know of some connectivity problems on `normal' networks, but those are not really `fast reliable' network connections. PPP connections suffer from the default in-kernel route queues, which hold only 10 packets, while SFTP sends 8-packet bursts, so there is a high likelihood that the last packets in the sequence are dropped. ADSL lines suffer because RPC2 assumes a symmetric connection and fails to get a proper RTT estimate, so it times out too quickly.

> A major problem with Coda is that there is no way for a casual
> user (the developers on our network) to quickly decide if they
> are running connected or disconnected. There should be some
> obvious alarm given when a client disconnects. Something like
> the popup dialogs that UPS software provides when the power fails.

But there are several feedback mechanisms. There is `cmon', which shows the running status of our production servers. I've also got a modified WindowMaker dockapp, which shows server names and a little green/red `led' to indicate up/down status, while clicking the server name opens an ssh connection to the machine. Then there is smon, which pulls down statistics and records them in an RRD database (sort of like MRTG).
And there is a machine in the lab running netsaint + rpc2ping, which sends me email directly whenever any of our Coda servers doesn't respond to the ping.

> Losing your network connection is easily as critical. Anyway,
> I spent a lot of time helping people get their clients back
> reconnected. (I saw an e-mail on the list which suggested
> that this problem was fixed in the latest version, but I gave
> up before I got to try it.)

That is why there already is a lot of software available to monitor your network. Netsaint is one example, but there are also MON, MRTG, etc. You normally shouldn't need Coda to tell you when you've `pulled the plug'. To keep an eye on what venus is doing, there is codacon. As long as it displays "Create", "Store", etc., you are connected. When it doesn't, you are disconnected.

In fact, as far as venus is concerned, network connectivity is almost like Schroedinger's paradigm. You won't know whether you are connected until you try to make an RPC2 call. And in most cases venus really doesn't know until we're already in disconnected mode.

> The final problem was the killer. One day the coda server
> core dumped with an assert and wouldn't restart. I fooled
> around with it for a day and got the server running and
> found that read operations from clients would work OK but
> that the server core dumped again on the first write operation.

Sounds like RVM allocations failed. We had that on both verdi and viotti. It turns out that the RVM allocator assumes it can defragment as a last resort, i.e. when allocations start failing. However, by that time there are hardly any fragments left to merge. Viotti couldn't allocate 32KB even though 120MB was still `free'. The only solution I have at the moment is

    norton-reinit -dump state / reinit RVM / -restore state

> My basic conclusion is that Coda is not usable by anyone
> other than very dedicated researchers until you get rid
> of all the asserts in the software and replace them with
> meaningful error messages.
> My biggest frustration was
> trying to track down what a particular assert really meant.

An assert implies that something took a wrong turn somewhere and things have really, really gone wrong. Luckily RVM is transactional, and the last transaction is aborted. When the server is restarted it doesn't even remember it took a wrong turn. In some places we might be able to add enough code to get out safely, but going through every possible path that might lead to the error case, and making sure we handle those return paths correctly, is a lot of work.

> Dumping the asserts, adding an alert mechanism to report
> when clients disconnect, and modernizing the conflict
> repair mechanism are the three short comings that I would

Instead of dumping the asserts, I'd rather try improving the code so that we avoid making those bad turns in the first place.

An alert mechanism was done 3 or 4 years ago in the AdviceMonitor. It never really got off the ground.

Modernizing the conflict repair: well, we've been working hard at both avoiding `false conflicts' and getting repair to do the right thing most of the time. Several improvements already went into the repair tools, venus, and the 2.4 Linux kernels.

> suggest working on first to make Coda ready for the
> real world. As it was, I (Ph.D. in Computer Science
> and used to advanced system administration problems)
> just was spending way to much time keeping it going
> and didn't see how any of my users would ever be able
> to take over any Coda system admin. When that happens
> I'm ready to give it another try.
>
> Thanks for all your help,
>
> Doug

I'm sorry to see anybody become disappointed in Coda, and sincerely hope that some day we are able to achieve those goals and see you back on the list.

Jan

Received on 2001-01-27 21:51:24
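
For readers unfamiliar with the version vectors mentioned in the clock-skew answer, here is a minimal sketch in Python of how two replicas' update histories can be compared without looking at any clock. This is not Coda's actual code: the function name, the Order enum, and the representation of a version vector as a list of per-replica update counters are all assumptions made for illustration.

```python
from enum import Enum


class Order(Enum):
    """Possible outcomes of comparing two version vectors (hypothetical names)."""
    EQUAL = "equal"          # both copies have seen exactly the same updates
    DOMINATES = "dominates"  # first copy has seen everything the second has, and more
    DOMINATED = "dominated"  # second copy has seen everything the first has, and more
    CONFLICT = "conflict"    # each copy has updates the other one lacks


def compare_version_vectors(a, b):
    """Compare two version vectors.

    Each vector is a sequence of update counters, one slot per replica.
    No timestamps are involved, so clock skew cannot affect the result.
    """
    if len(a) != len(b):
        raise ValueError("version vectors must have one slot per replica")
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and b_ahead:
        return Order.CONFLICT
    if a_ahead:
        return Order.DOMINATES
    if b_ahead:
        return Order.DOMINATED
    return Order.EQUAL
```

In this sketch, compare_version_vectors([2, 1], [1, 1]) reports that the first copy dominates, so the newer copy can simply replace the older one, while compare_version_vectors([2, 1], [1, 2]) reports a conflict, because each side has an update the other has not seen.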