On Thu, Aug 22, 2002 at 12:17:39PM +0200, Ivan Popov wrote:
> I suspect the earlier projects were abandoned because it is hard to use
> consequently. It may be my fault, but I am not aware of people actually
> deploying overlay filesystems, on any platform.

It seems like a logical way to extend the apparent size of a filesystem beyond the storage capacity of a single host, without having to create or rearrange mountpoints. It's easy to add capacity from remote servers, simply by NFS-mounting their available storage.

Mountpoints are a nuisance though, because they separate the total filespace into two or more (large) chunks, and I as the administrator have to cut my filesystem: I have to find a single directory which is big enough (including its subdirectories) to make a significant difference to the free space on my original filesystem. Moving the files from the cut point to the new server requires manual effort and possibly downtime. It's also possible that there is no appropriate directory at which to cut the filesystem in two in a way that makes a difference.

If I had a 100 gig filesystem which was full, containing 100 top-level directories of approximately 1 gig each, and I wanted to add an additional 50 gigs on a second server, I could balance the utilisation only by moving 33 directories (a third of the data) across and setting up 33 symlinks from the original names into a new mountpoint. That's not desirable.
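To make that concrete, each of those 33 moves would look something like this (the server name, export path and directory names are invented for the example):

    # make the second server's spare space visible on the first
    mount -t nfs newserver:/export/overflow /mnt/overflow

    # relocate one directory, then leave a symlink at the old name
    mv /data/dir17 /mnt/overflow/dir17
    ln -s /mnt/overflow/dir17 /data/dir17

And because the mv crosses filesystems it is really a copy-and-delete, so every gig moved travels over the wire and is temporarily duplicated.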
So if I have explained the problem well, perhaps you can see why I think a union filesystem would be a useful tool: it can extend the amount of storage available without the administrator having to make artificial cuts in the filespace or move files around to balance utilisation.

Now, Coda solves some of the problems I have outlined above. If my understanding is correct, one can add a new server to a cell and the new server's available disk becomes equivalent to the existing server's disk, from the point of view of clients. One must still move volumes from server to server to balance utilisation, but at least this can be done without downtime. The balancing problem becomes easier because the volumes themselves are typically smaller, and one does not have to set up a "symlink farm" after moving them. The only proviso is that each volume must reside completely on one server (i.e. it can't be split between two servers). That only becomes an issue when volumes grow _too big_, for example bigger than 50% of the physical drive capacity.

So I think Coda's architecture, as I described it above, is one which permits an "infinitely scalable filespace", probably far better than a union filesystem; the only problem is that Coda's data structures and resource requirements don't scale. A cheap <$1000 fileserver can attach a quantity of disk that is far beyond the ability of the Coda server to serve (due to the physical memory and VM requirements, and I guess the x86 memory architecture as well).

So I'm still stuck in a dilemma, sorta. I might end up sharing the bulk of my data over NFS... that at least scales to any capacity on a single server, and quite well with multiple servers; it just doesn't scale with multiple clients. I don't have dozens of clients, only a few PeeCees around my home, so the ability to handle raw capacity definitely ranks higher than local (client) caching.

By the way, I joined the InterMezzo list and asked whether their client is really a proper "cache". In other words, can the server serve 50 gigs while the client is configured with 1 gig, and still give access to the entire 50 gigs, although obviously not all at once? The answer was that they want it to work that way, with the client wiping unused files from its cache, but it just doesn't do that yet, so the possibility is there that the client will exhaust its local backing store and fail.

Finally, one possible answer to the dilemma I raised above ("what do you do when your server is full, can't take any more physical devices, and you have to move to a multi-server model?"): the Network Block Device (NBD) driver allows a remote host's block devices to be accessed by the local host as if they were locally attached storage. If NBD devices can be used by LVM, then extending a filesystem is a matter of adding the NBD device to the volume group containing the local host's logical volumes, extending whichever logical volume is full onto it, and then resizing the filesystem on that logical volume (a rough command sketch is in the P.S. below). I believe this scales to at least 256 gigs with the current Linux LVM implementation, and possibly further if additional volume groups are defined.

Considering the local host (the one with the directly attached disk plus lots of NBD storage) as a server, clients continue to talk to only one server, which simply appears to have more space. This is inefficient of course, because the data might have to travel over the network twice; ideally the client would send its request directly to the second server... but that sort of smarts requires a whole different architecture from the simple sharing arrangement I started out describing.

That idea brings me back to the Berkeley xFS concept of serverless sharing, where hosts provide disk resource as peers and clients somehow locate and access their files. I wish their website were working; somehow I think this is a project that will never be completed. This is more of a cluster filesystem idea... it would be nice if we had a cluster filesystem to go with cluster processing, so that each additional CPU or storage device increases the capacity of the cluster incrementally, with no real effort on the administrator's part. MOSIX and OpenMOSIX claim to balance CPU and VM utilisation across a cluster, but they don't balance filesystem utilisation; any migrated process which does file or device I/O has its I/O transparently redirected to its home system, so there's no unified filespace. Perhaps this could be achieved through some co-operation between *MOSIX and one of the distributed filesystem projects. Has anybody here played with either of the two MOSIXs?

Nick.
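P.S. For the NBD-plus-LVM idea above, the sequence I have in mind is roughly the following. It's only a sketch and untested; the host name, port, device names, volume group and LV names, and sizes are all invented, and the exact nbd-client device name and resize step depend on your kernel and filesystem:

    # on the machine donating the space: export a spare block device
    nbd-server 2000 /dev/sdb1

    # on the server that owns the full filesystem:
    nbd-client storagebox 2000 /dev/nbd0   # attach the remote device locally
    pvcreate /dev/nbd0                     # turn it into an LVM physical volume
    vgextend vg00 /dev/nbd0                # add it to the existing volume group
    lvextend -L +45G /dev/vg00/data        # grow the full logical volume onto it
    resize2fs /dev/vg00/data               # grow the filesystem (unmount first,
                                           # or use whatever resizer the fs provides)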