On Thu, Aug 22, 2002 at 01:33:09AM +1000, Nick Andrew wrote:
> On Wed, Aug 21, 2002 at 10:17:09AM +0200, Ivan Popov wrote:
> > > I am in the process of setting up a _home_ fileserver with twin 80-gig
> > > disks (RAID-1 mirrored) and am looking for a distributed filesystem
> >
> > It should be doable, depending on how big your files are - i.e. it is the
> > ===> number of files <===
> > being the limitation, not the total size of the files.
>
> At present I have about 150,000 files consuming about 40 gigs. The
> average file size will probably increase over time.
>
> > The 4%-estimation is based on "typical" file size distribution that can
> > vary a lot.
>
> Are you working on some number of bytes in the RVM per file?

Yup, only the file metadata is stored in RVM: pathnames, version vectors,
creator/owner/last author, references to copy-on-write versions of the file
in backup volumes, etc.

> One thing which Coda's documentation does not explain clearly is
> its relationship to the underlying filesystem. There's a note in
> the docs which says "do not fsck this filesystem" but it doesn't
> explain why.

Ah, that's from the time that Coda servers used a special 'inodefs' hack to
get direct access to the underlying filesystem. Nowadays we store files in
a tree structure, which adds a bit of overhead but is far more generic, and
fsck can't mess with us (too much) anymore.

> As I was trying to figure it out, I considered some diverse possibilities,
> like (a) Coda (server) implements its own filesystem structure (allocation
> algorithms, etc) completely replacing any other filesystem, to (b) Coda
> creates huge files within the underlying filesystem, one per volume, and
> stores all managed files within each, to (c) Coda stores one managed
> file per physical file in the underlying filesystem. Each of those
> rationales had some problem:
>
> (c) ...
> would be a disaster for performance (I recall somewhere in
> the documentation it said that Coda did not create directories). The
> size of the directories (remember I'm looking at 150k files) would
> kill the system, I'd have to use reiserfs as a base. Surely the Coda
> developers did not do this. Plus it doesn't explain the fsck issue.

Ah, we do have directories, but we store them as files. The only remaining
problem is that there are no double or triple indirect blocks in the
in-file directory representation. As a result, Coda's directories are
limited to about 256KB in size, i.e. it is impossible to have even a single
directory holding all the RFCs.

It is in a way really funny: the directory lookup code uses extensive
hashes to ensure that a lookup can be done very quickly even when we have
huge directories, but the actual directory data structure can't scale to
such sizes. I'd rather have had a scalable structure with a dumb but simple
linear search, because that would have been easier to fix and optimize.

> Finally I arrived at rationale (d) which I hope you'll confirm ...

That used to be the case, but I dropped the whole (device,inode) file
access; it was causing too many problems when the underlying filesystem was
trying to do journalling etc. We could basically use Coda only on ext2;
now we have no problems with ext3, reiserfs, tmpfs, ramfs, vfat, etc.
Probably even XFS will work fine now. The access-through-a-filehandle code
only really stabilized recently, in the linux-2.4.19 pre patches.

> For example with 32-bit inode numbers (0x12345678) a 3-level
> 2-character directory tree could be used, so the stored file
> would be "12/34/56/78" ... 256 top-level directories, under

That's exactly what we do in the venus cache and in the /vicepa partitions
on the server.

> > Windows client is considered in alpha stage but I haven't seen complaints
> > on the list, so it may work rather well.
>
> Ok. Were there no Windows clients a few years ago?
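[As an aside, the "12/34/56/78" container-file layout described above is easy to sketch. The function below is purely illustrative (the name and exact formatting are mine, not Coda's actual code); it maps a 32-bit identifier onto a 3-level, 2-hex-digit directory tree, giving 256 entries per level.]

```python
def container_path(fileid: int) -> str:
    """Map a 32-bit identifier to a 3-level, 2-hex-digit tree.

    0x12345678 -> "12/34/56/78": the first three byte pairs name
    nested directories, the last names the stored file. Keeping
    each directory to at most 256 entries avoids huge directories
    in the underlying filesystem.
    """
    return "/".join("%02x" % ((fileid >> shift) & 0xFF)
                    for shift in (24, 16, 8, 0))
```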
> I first considered Coda in around 1998 as I was looking for a way to
> share disk and improve redundancy in my ISP systems infrastructure.
> I thought there were Windows clients then.

There were flaky Windows 95/98 clients. The latest Windows client uses an
external filesystem development kit (from OSR), and seems to be getting
pretty reliable.

> > I think you have to stay below 2G of metadata yet. Not sure.
> > And you have to have more *virtual* memory than your RVM - that is
> > the sum of your physical RAM and your swap has to exceed that size,
> > say for 1.9G RVM you would need say 1G RAM and 1G swap giving 2G
> > virtual memory.
>
> I guess it's a linux thing but I can't figure out why an mmap'ed
> file needs to be backed by swap capacity. If the host runs short
> on memory, I don't see why it can't just page out a block from
> the mmap'ed area back to disk, after all it can read it back anytime.

Simple: it is not a shared mapping, but an anonymous mapping. I.e. the
code 'mallocs' as much space as the RVM-data partition and reads all of it
into memory(/swap). Again, hysterical raisins: Coda started off running
under Mach and interacted directly with the OS's pager/VM systems. The
first ports to 'normal' UNIX systems already had enough obscure things to
deal with, so they took the easy way out in some areas. Phil Nelson
implemented private mappings about 2 years ago, which greatly improved
startup times for large servers. And only dirtied pages have to go to
swap, so swap only slowly fills up.

We can't just page out a dirty block from the mmapped area (i.e. no shared
mappings) because, first of all, there can be 'committed changes' mixed up
with 'uncommitted changes', and we don't know when the OS would write the
data back. It would be possible to munmap/mremap a known, fully committed
private page, but there are some efficiency issues here. Once we munmap
the page, the only way to get it back is to read it from disk.
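[The key point above is the difference between a shared mapping, which the kernel can write back to the file at any time, and a private (copy-on-write) mapping, where dirtied pages become anonymous memory backed only by swap. A small sketch using Python's mmap module illustrates the semantics; RVM's real implementation is of course in C with mmap(2)/MAP_PRIVATE.]

```python
import mmap
import os
import tempfile

def private_mapping_demo():
    """Show that a write through a private (copy-on-write) mapping
    dirties an anonymous page and never reaches the file on disk."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * mmap.PAGESIZE)
        # ACCESS_COPY is Python's spelling of a MAP_PRIVATE mapping.
        m = mmap.mmap(fd, 0, access=mmap.ACCESS_COPY)
        m[0:1] = b"B"                  # modifies only our private copy
        in_map = m[0:1]
        on_disk = os.pread(fd, 1, 0)   # the file itself is untouched
        m.close()
        return in_map, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

With a shared mapping the kernel could flush the modified page back to the file whenever it pleased, which is exactly what RVM must avoid while uncommitted changes sit in the same pages.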
We also don't know whether the system really needs us to free up some
dirty memory, so we might be actively reading back data on a system that
has more than enough memory available; or, on the other side of the coin,
we might be freeing up pages that have already been swapped out (forcing a
swap-read, vma split-up, disk read, vma merge). And if we're repeatedly
modifying the same page, the system could at least use page-aging to know
whether it is worth writing it to swap or keeping it around in memory.

> One possible issue with InterMezzo is update delay from server to
> client - it occurs on file close. But Coda is the same, right?

Correct, AFS semantics. It is far more efficient for any userspace file
manager, because we don't get hit by a context switch on every write and
don't have to play games by actively purging memory pages from the VM to
provide reliable UNIX sharing semantics.

> you can be sure that I will be replicating my email, so I might
> start using "maildir" as a storage arrangement. If you're not
> familiar with maildir it's a non-locking mailbox storage arrangement

Maildir works fine with Coda as long as you replace the 'link/unlink'
sequence with 'rename'. We don't allow cross-directory links, but our
rename is atomic, and Coda declares a 'conflict' whenever it notices that
a rename is trying to remove a file that the client didn't know about (an
update/delete sharing conflict).

> > Union mounts are available for linux, see e.g.
> > http://kernelnewbies.org/status/latest.html

Those are union mounts, not the union/overlay filesystem that people
always talk about.

Jan

Received on 2002-08-21 13:43:50
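[A footnote on the maildir point above: delivery can be done with rename() alone, which is the variant that works on Coda. This is only a sketch under standard maildir assumptions (a tmp/new/cur layout); the helper name and the unique-name scheme are illustrative, not a complete maildir implementation.]

```python
import os
import socket
import tempfile
import time

def maildir_deliver(maildir: str, data: bytes) -> str:
    """Deliver a message maildir-style using rename() instead of the
    traditional link()/unlink() pair: Coda forbids cross-directory
    links, but its rename is atomic, and a concurrent delete of the
    target surfaces as an update/delete conflict rather than silent
    corruption."""
    unique = "%d.%d.%s" % (time.time(), os.getpid(), socket.gethostname())
    tmp = os.path.join(maildir, "tmp", unique)
    new = os.path.join(maildir, "new", unique)
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # make sure the data hits stable storage
    os.rename(tmp, new)        # atomic hand-off into new/
    return new
```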