On Mon, Nov 01, 2004 at 02:14:50PM -0800, redirecting decoy wrote:
> So tell me, how is it that I can't make this nice
> little filesystem function correctly. I "guess", the

Not sure; my current guess is that there is something whacky with the
networking setup. Your servers might be multi-homed or firewalled, or
there may be some other reason why the server cannot set up a working
connection back to the client.

> replication works. But now it seems that I am having
> trouble with my clients. First of all, whenever I
> stop venus, in order to restart it I have to do "venus
> -init", otherwise it turns into a zombie. Once I do
> that, then I can stop venus and restart it without the
> "-init". What is causing this? Second it seems that

Maybe bad memory, or some other problem that causes RVM to get
corrupted. Are you by any chance using an unusual filesystem as a
backing store? I tend to use ext2 or ext3 and don't trust reiserfs
too much.

Other possibilities are that you're using a gcc version that
miscompiles some code, or incorrectly pads some structure. Venus is an
interesting mix of C and aging C++ code that doesn't necessarily
always follow the latest standard. We've been bitten before by
compiler changes: we used to assume that a C struct would be laid out
similarly to a C++ class with identical members, and when exceptions
were introduced they broke LWP threads (or the other way around). The
latest one was glibc built with wide characters, which doubled the
stack usage of vsprintf and caused stack overflows in some threads.

It also depends on which versions of LWP/RPC2/RVM you are linking
against. Especially when you build your own libraries in
/usr/local/lib but still have an older version installed in /usr/lib,
we may build against one version while the dynamic linker uses the
other at runtime (although I've been pretty consistent about bumping
the library version numbers on any incompatible changes).

> propagate, I have to do a combination of the
> following commands, each of which has its own
> problems.
>
> 1) cfs cs myrealm
> *Note: I try this command, and sometimes it works,
> other times it doesn't.

    'cfs cs'                - check/rebind connections with all known servers
    'cfs cs servername ...' - check/rebind connections with specified servers

There is no 'cfs cs <realmname>'; it will interpret that name as the
name of a 'server' and try to bind to it. 'cfs cs coda.cs.cmu.edu'
will try to bind to our webserver, which happens to run
testserver.coda.cs.cmu.edu, but that server isn't part of the
coda.cs.cmu.edu realm. So even when the connection succeeds it
doesn't do anything for the connections to the coda.cs.cmu.edu realm.

> 2) cfs reconnect
> *Note: I have never been able to get this command to work.

cfs disconnect activates a 'fail filter' in the RPC2 layer. Basically
this filter quietly drops any packet we receive or try to send, and
cfs reconnect removes the filter again. There are other filters
available that slow down packets to simulate a modem (low bandwidth)
or a satellite link (high latency) connection, but those are mostly
for experiments and not enabled by default.

> 3) cfs fr /coda/myrealm/storage
> *Note: Sometimes I have to do this in order for
> changes to be made to the server. Isn't there a
> better, automatic way?

I accidentally broke forcereintegrate a week or two before the 6.0.7
release. It is fixed in CVS, and I will hopefully make a new release
soon.

> 4) echo -n "pwd" | clog user
> *Note: I run this command on all my machines, and
> sometimes it reconnects the volume, sometimes it
> doesn't.

Functionality is sporadic. Obtaining new tokens doesn't necessarily
reprobe the servers, but it will kill off any old connections for that
user and rebind those. You could also put the password in a file and
use 'clog user < pwdfile'; that way the password won't be visible in a
'ps auxwww' listing.
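To make the two 'cfs cs' forms from (1) concrete, a quick sketch (the
server names here are made-up placeholders, substitute your own):

    cfs cs                                           # reprobe all known servers
    cfs cs server1.example.org server2.example.org   # reprobe only the named servers

Note again that the arguments are server names, never a realm name.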
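For (2), one way to watch the fail filter from 'cfs disconnect' do its
job (these are the literal commands, no placeholders):

    cfs disconnect   # install the fail filter, all server traffic is dropped
    cfs cs           # the servers should now be reported as down
    cfs reconnect    # remove the filter again
    cfs cs           # rebind, the servers should come back up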
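And for (4), the password-file variant mentioned above might look like
this (the file name, user name, and password are of course
hypothetical):

    echo -n "mypassword" > ~/.codapass
    chmod 600 ~/.codapass            # keep other users from reading it
    clog myuser < ~/.codapass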
> Basically, one of the applications I am attempting to
> run is an mpi version of povray. MpiPovray works best
> with shared storage. If I have 8 machines total, then
...
> to the working dir. I am unsure if this particular
> application sends data back to the master node for
> final output to file, or if each worker node writes
> to the file from its respective machine. So, my

I looked at the patch (http://www.verrall.demon.co.uk/mpipov/). The
image data is sent back to the master process over MPI messages, so
that is safe. It also looks like the slaves have the DISPLAY and
VERBOSE flags cleared, so they shouldn't try to display anything or
send stuff to stdout.

However, if povray has some 'append stuff to logfile' option, then I
assume that all slaves are trying to append to the same file, which
will just create one big conflict.

For the rest it depends on what the scene files specify. The scene
language is pretty much a scripting language, and it allows read/write
access to the directory where the scene files are located, so
depending on what the scene contains we might have concurrent writes
to the same file from several clients (i.e. mpipov slaves).

> BTW: in my server.conf file on both servers I have
> "mapprivate=1" enabled. And in venus.conf would
> "dontuservm=1" make any difference to my situation. I
> am unsure of its proper use.

dontuservm does exactly what it says: it doesn't store the cached
metadata in recoverable virtual memory, but simply uses malloced
areas. This way locally cached state cannot survive a venus restart,
so it is similar to always starting venus with -init. We used this on
iPaqs to avoid wearing out the flash. With the venus.cache in a
ramdisk, the dontuservm flag enabled, and no swap, we never actually
wrote anything to flash.

Jan

Received on 2004-11-01 21:36:23