On Thu, Jul 15, 2004 at 11:28:18AM -0500, Troy Benjegerdes wrote:
> Is there going to be any reasonable way to run testcases or otherwise
> audit the code for all the potential endian and 64 bit problems?

There are two issues here; I'll deal with them one at a time.

Endianness:

Anything that only uses RPC2 request/reply should be OK. That code has seen enough use. However, in some places we send things around with the side effect (SFTP) as a large buffer. Reintegration is one of these, and that code actually borrows a lot of the marshalling/unmarshalling functionality from RPC2, and it works without a problem.

Then there is resolution. The code dealing with resolution logs isn't all that pretty, and I am not sure whether it even tries to marshal the logs before sending them to the other side. So there could be endian-related problems there.

The other issue, 64-bit:

Big problems just about everywhere. The only code that I've cleaned up and actually tested on a 64-bit Alpha machine is LWP and RVM.

In this area, RPC2 actually causes a lot of problems. RPC2_Integer is defined as a long integer instead of an int32_t. This effectively leaked into everything that uses RPC2, so almost everywhere an unsigned or signed long is currently used, we should really be talking about ints.

I think I also saw somewhere that a string was originally a pointer in a struct, but when the struct is 'flattened' for storage in RVM or for sending across the network, the pointer is replaced with an offset. These structures will either have to use a 64-bit integer for that field, or not alternate between pointers and offsets in the same field.

> I managed to kill the x86 server on a resolve this time...
>
> The last thing in 'SrvLog' is this:
>
> 10:58:04 rsle::InitFromBuf Bad begin stamp 0x84ea32fb
> 10:58:04 rsle::InitFromBuf Bad begin stamp 0x84ea32fb

That does look like resolution logs are not marshalled to be platform independent and are simply dumped straight from RVM.

> 10:58:03 Incomplete host set in COP2.
> 10:58:03 Incomplete host set in COP2.

These happen when an operation does not complete successfully on all replicas. Often caused by a crashed server, or a client that (believes it) is disconnected from a server.

> 10:58:03 CheckRetCodes: server 209.234.73.41 returned error 102
> 10:58:03 ViceResolve:Couldnt lock volume 7f000001 at all accessible
> servers

I had to do some searching, but 102 is VNOVNODE. To me this says that the object we're trying to resolve doesn't yet exist on all servers. The real problem in this case is in fact with the parent directory. The client should automatically go up one level and try to resolve the directory, which would create the directory entry as well as a runt (empty) object. Only then can we resolve the contents of the file.

This is done to avoid creating lots of orphan objects where we don't even know where they belong in the tree. It could even be that this involves a removed file and the server that returned VNOVNODE is the one that actually is correct here, although that is probably unlikely if you only have a single client.

Jan

Received on 2004-07-16 13:54:42
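
As an illustration of the marshalling point above, here is a minimal sketch of packing a log-entry header into a platform-independent buffer instead of dumping the in-memory struct. It is not the actual Coda code: `struct log_hdr`, its fields, `LOG_HDR_WIRE_SIZE`, and the pack/unpack helpers are all hypothetical names made up for the example. It assumes fixed-width 32-bit fields, big-endian byte order on the wire, and an offset rather than a pointer for the name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log-entry header; NOT the real rsle layout. */
struct log_hdr {
    uint32_t begin_stamp;  /* magic value marking the start of an entry */
    int32_t  opcode;       /* fixed 32 bits wide, unlike a plain long */
    uint32_t name_offset;  /* offset of the name in the flattened record,
                              never a pointer */
};

/* Three 32-bit fields, written without padding. */
#define LOG_HDR_WIRE_SIZE 12

/* Store a 32-bit value most significant byte first. */
static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Load a 32-bit value stored most significant byte first. */
static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write the header field by field; returns the number of bytes written. */
size_t log_hdr_pack(const struct log_hdr *h, unsigned char *buf)
{
    assert(h != NULL && buf != NULL);
    put_u32(buf,     h->begin_stamp);
    put_u32(buf + 4, (uint32_t)h->opcode);
    put_u32(buf + 8, h->name_offset);
    return LOG_HDR_WIRE_SIZE;
}

/* Read the header field by field; returns the number of bytes consumed. */
size_t log_hdr_unpack(struct log_hdr *h, const unsigned char *buf)
{
    assert(h != NULL && buf != NULL);
    h->begin_stamp = get_u32(buf);
    h->opcode      = (int32_t)get_u32(buf + 4);
    h->name_offset = get_u32(buf + 8);
    return LOG_HDR_WIRE_SIZE;
}
```

Because every field is written byte by byte at a fixed width, the buffer looks the same whether it was produced on a little-endian or big-endian host, or on a 32-bit or 64-bit one. The receiving side never interprets raw struct memory from another machine.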