Coda File System

Re: CODA Scalability

From: Nick Andrew <lists_at_nick-andrew.net>
Date: Thu, 22 Aug 2002 01:33:09 +1000
G'day,

On Wed, Aug 21, 2002 at 10:17:09AM +0200, Ivan Popov wrote:
> > I am in the process of setting up a _home_ fileserver with twin 80-gig
> > disks (RAID-1 mirrored) and am looking for a distributed filesystem
>
> It should be doable, depending on how big your files are - i.e. it is the
> ===> number of files <===
> being the limitation, not the total size of the files.

At present I have about 150,000 files consuming about 40 gigs. The
average file size will probably increase over time.

> The 4%-estimation is based on "typical" file size distribution that can
> vary a lot.

Are you working from an estimate of some number of bytes of RVM
per file?
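
To make the question concrete, this is the sort of back-of-envelope
arithmetic I have in mind; the 4K-per-file figure is purely my own
placeholder, not anything from the Coda docs:

    #include <stdio.h>

    /* Compare two ways of estimating RVM size for my setup: a
     * per-file cost (the figure I'm asking about; 4K is a guess)
     * versus the 4%-of-data rule of thumb mentioned above. */
    int main(void)
    {
        const double files          = 150000.0; /* current file count   */
        const double bytes_per_file = 4096.0;   /* assumed per-file RVM */
        const double data_gigs      = 40.0;     /* total file data      */

        double rvm_per_file = files * bytes_per_file
                              / (1024.0 * 1024.0 * 1024.0);
        double rvm_percent  = data_gigs * 0.04;

        printf("per-file estimate: %.2f gigs of RVM\n", rvm_per_file);
        printf("4%% estimate:       %.2f gigs of RVM\n", rvm_percent);
        return 0;
    }

With those numbers the per-file estimate comes out around 0.6 gigs
and the 4% rule gives 1.6 gigs, which is why I'd like to know which
way you are calculating it.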

One thing which Coda's documentation does not explain clearly is
its relationship to the underlying filesystem. There's a note in
the docs which says "do not fsck this filesystem" but it doesn't
explain why.

As I was trying to figure it out, I considered some diverse possibilities,
ranging from (a) the Coda server implements its own filesystem structure
(allocation algorithms, etc.), completely replacing any other filesystem,
through (b) Coda creates huge files within the underlying filesystem, one
per volume, and stores all managed files within each, to (c) Coda stores
one managed file per physical file in the underlying filesystem. Each of
those hypotheses had some problem:

(a) ... it would just destroy the previous filesystem, so there's no
need to actually _have_ a filesystem on the managed area. Plus I feel
uncomfortable about the idea of Coda managing disk allocation policies
(as filesystems do), because it's a black art; I'd rather see a
distribution system which layered on top of a non-distributed
filesystem for recoverability/speed/flexibility.

(b) ... it can be efficient so long as the files are fairly large, but
each container file might be limited to 2 gigs by the underlying
filesystem. This hypothesis fails to explain why an fsck is bad, though.

(c) ... would be a disaster for performance (I recall that somewhere in
the documentation it said Coda did not create directories). The size of
the directories (remember I'm looking at 150k files) would kill the
system; I'd have to use reiserfs as a base. Surely the Coda developers
did not do this. Plus it doesn't explain the fsck issue.

Finally I arrived at rationale (d) which I hope you'll confirm ...
the Coda kernel module interacts with the underlying filesystem
at the inode level, not the file level, so it creates inodes in
the underlying filesystem which are not linked into any directory.
Thus the underlying filesystem is used, but it cannot be checked, or
else fsck will find many thousands of unlinked files and attempt to
link them all into lost+found.
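
If that's right, the situation is loosely analogous to what this little
test program sets up (only an analogy to show an inode with no directory
entry; I'm not claiming this is literally how the server allocates its
container files):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Create a file, then remove its only directory entry.  The
         * inode and its data stay alive as long as we hold the
         * descriptor, but no path in the filesystem leads to it. */
        int fd = open("unlinked-demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (unlink("unlinked-demo") < 0) { perror("unlink"); return 1; }

        /* We can still read and write through the descriptor. */
        if (write(fd, "still here\n", 11) < 0) perror("write");

        pause();  /* keep the inode alive until the process is killed */
        return 0;
    }

If the machine died while something like that was running, the next
fsck would be the one to notice the orphaned inode; a server that
deliberately keeps thousands of directory-less inodes around would
give fsck a similar surprise on every run.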

Assuming my guess (d) is correct, it would be good if that could
actually be documented somewhere, so people like me don't have to
guess so much in future.

Also, I can't help but wonder whether it would be much of a performance
hit to maintain a directory structure _anyway_, so that an fsck on the
filesystem would work and come back clean. I assume that Coda would
still access files directly by their inode number, but when creating a
file there should be no problem in also creating a directory entry to
link it. Once created, the directory entry never needs to be updated
for a file; only new files need one.

For example with 32-bit inode numbers (0x12345678) a 3-level
2-character directory tree could be used, so the stored file
would be "12/34/56/78" ... 256 top-level directories, under
each there would be 2 levels of 256 directories and at the
bottom level each directory would contain up to 256 files.
There's no need to fully populate the directory tree, just
make the nodes as needed during file creation. 256 entries
in a directory don't hurt performance, and anyway Coda
would not use those directories itself; their purpose is
to maintain the filesystem in a clean state, and provide
some chance of file recovery by a non-Coda system.
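
In code it would be something like this (just my sketch of the idea,
with /vicepa standing in for wherever the server keeps its files):

    #include <limits.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Map a 32-bit inode number such as 0x12345678 to the path
     * "base/12/34/56/78", creating the three intermediate directory
     * levels on demand. */
    static void inode_path(const char *base, unsigned long ino,
                           char *out, size_t outlen)
    {
        size_t len = snprintf(out, outlen, "%s", base);
        int shift;

        for (shift = 24; shift >= 0; shift -= 8) {
            len += snprintf(out + len, outlen - len, "/%02lx",
                            (ino >> shift) & 0xff);
            if (shift > 0)
                mkdir(out, 0700);  /* intermediate level; EEXIST is fine */
        }
    }

    int main(void)
    {
        char path[PATH_MAX];

        inode_path("/vicepa", 0x12345678UL, path, sizeof path);
        printf("%s\n", path);      /* prints /vicepa/12/34/56/78 */
        return 0;
    }

The file itself would be created under the returned name; Coda could
keep addressing it by inode number internally, and the directory entry
is only there so that fsck, backups and ordinary tools can see it.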

Anyway getting back to my fileserver planning ...

> > drives I already have), scalability (up to the 200+ gig range) and
>
> Depends on if it suits you to have 200 G split on say 4 servers (possibly
> running on the same machine). That is, the limitation lies in
> addressability inside one client-server pair.

I could put 320 gigs of disk in the fileserver's housing in a RAID-5
arrangement, which would leave 240 gigs of usable disk. At the 4%
estimate that's nearly 10 gigs of RVM, so I'd probably have to run 10
server processes on that machine and provide nearly 10 gigs of swap,
due to Coda's architectural limitation. I think I'll look for a
cleaner solution first.

> Windows client is considered in alpha stage but I haven't seen complaints
> on the list, so it may work rather well.

Ok. Were there no Windows clients a few years ago? I first considered
Coda in around 1998 as I was looking for a way to share disk and
improve redundancy in my ISP systems infrastructure. I thought there
were Windows clients then.

> I think you have to stay below 2G of metadata yet. Not sure.
> And you have to have more *virtual* memory than your RVM - that is
> the sum of your physical RAM and your swap has to exceed that size,
> say for 1.9G RVM you would need say 1G RAM and 1G swap giving 2G
> virtual memory.

I guess it's a Linux thing, but I can't figure out why an mmap'ed
file needs to be backed by swap capacity. If the host runs short
on memory, I don't see why it can't just page out a block from the
mmap'ed area back to the file on disk; after all, it can read it
back at any time.
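
My mental model (and it's only a guess at what RVM does internally; I
haven't read the source) is the difference between a shared and a
private mapping. A MAP_SHARED page can be written back to the file it
came from, but a MAP_PRIVATE page, once dirtied, has no file to go
back to and must live in RAM or swap - and a recovery system
presumably can't let modified data leak back into its data file before
the log commits, so private copies would be the natural choice. The
file name below is made up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("rvm_data", O_RDWR);     /* made-up file name */
        if (fd < 0) { perror("open"); return 1; }

        off_t len = lseek(fd, 0, SEEK_END);

        /* MAP_PRIVATE: writes never reach "rvm_data"; every page we
         * dirty becomes an anonymous copy that only RAM or swap can
         * hold.  With MAP_SHARED the kernel could simply write the
         * page back to the file and drop it, needing no swap. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] ^= 1;      /* dirty one page: it is now swap-backed */

        munmap(p, len);
        close(fd);
        return 0;
    }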

> I haven't ever tried InterMezzo but I assume it does not cache the whole
> data on each client.

I think you're right here. I downloaded and read one of the PDF files;
it indicates that both partial replication (caching) and full
replication are available. I'll have to test it to see for myself
what actually happens.

One possible issue with InterMezzo is update delay from server to
client - it occurs on file close. But Coda is the same, right?

I don't expect there is much likelihood of two hosts in my network
writing to the same file at the same time, so update propagation
delay should not be important. There's only one area where I can
see it having an impact, and that's mail delivery versus mail
reading. Single-host systems use a lock file or kernel advisory
locking to ensure exclusivity, and locking has never worked properly
on NFS ... in a distributed arrangement you can be sure that I will
be replicating my email, so I might start using "maildir" as a
storage arrangement. If you're not familiar with maildir, it's a
lock-free mailbox storage format where each message is stored in a
separate file with a unique name. It was invented by Dan Bernstein,
the author of qmail. Maildir's claim to fame is that it works over
shared filesystems: a mail-reading/deleting client is not going to
interfere with a mail-writing/delivering server.
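
For the curious, a maildir delivery looks roughly like this
(simplified; real maildir names also include the hostname and a
per-delivery counter). The reason no locking is needed is that the
message only becomes visible to readers once the final rename happens:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    /* Deliver one message into a maildir: write it under a unique
     * name in tmp/, then atomically rename it into new/. */
    int deliver(const char *maildir, const char *msg, size_t len)
    {
        char tmpname[512], newname[512];
        long now = (long)time(NULL);
        int  pid = (int)getpid();

        snprintf(tmpname, sizeof tmpname, "%s/tmp/%ld.%d",
                 maildir, now, pid);
        snprintf(newname, sizeof newname, "%s/new/%ld.%d",
                 maildir, now, pid);

        int fd = open(tmpname, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0) return -1;

        if (write(fd, msg, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            unlink(tmpname);
            return -1;
        }
        close(fd);

        /* rename() is atomic within one filesystem: a reader either
         * sees the complete message in new/ or nothing at all. */
        return rename(tmpname, newname);
    }

    int main(void)
    {
        /* Assumes ./Maildir/tmp and ./Maildir/new already exist. */
        const char msg[] = "From: nick\n\nhello maildir\n";
        return deliver("Maildir", msg, sizeof msg - 1) == 0 ? 0 : 1;
    }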

> > What would be nice is a transparent filesystem or union filesystem
> > so that I can acquire data in different ways and put it all together
> > into one namespace which never has to change (just gets bigger)
> > but I think that linux doesn't have a union filesystem, and if
> > it did, it's not clear how that would be distributed.
>
> I suppose there would be all kinds of semantical and manageability
> problems.

Yes, quite. The first thing about a Union filesystem is that the
system decides which underlying filesystem writes will go to, and
most of the implementations don't seem to be built around the idea
of "share all available space".

So a Union filesystem might be useful during a transition or migration
to a bigger raw data store. But I am trying to avoid the need for a
migration step entirely, by putting in an architecture from Day One
which will scale to more disks on the server, and more servers, without
any significant downtime on the filespace. In other words, once this
system is set up, I don't mind unmounting, growing and remounting, but
I do mind having to copy all my data in order to grow the filespace
from 80 to, say, 240 gigs.

> Union mounts are available for linux, see e.g.
> http://kernelnewbies.org/status/latest.html

That's a great URL; I hadn't seen it before. I spent a couple of
hours reading about some of the cool projects listed therein.

My current thoughts on what my server will look like are:

0:  RAID-1 on the bottom layer, for disk redundancy.

1:  Logical Volume Manager (LVM or EVMS) on that, to provide growable
    block devices, so my filesystems can span disks as needed. I can
    grow each logical volume on the fly.

2:  EXT3 filesystems on that, with journalling. I can grow each EXT3
    filesystem on the fly to fit its newly-grown logical volume. The
    journal, like everything else, will be mirrored by the RAID
    underneath.

3:  InterMezzo distributing one or more filesystems to clients. I
    don't know if one InterMezzo server can distribute several
    filesystems, or if several processes can run on the same server.
    I'll have to install and play with it to find out.

If I had filesystems which were not distributed using InterMezzo, I'd
probably export them with NFS. It's not necessary for my colocated
server to have access to my MP3 collection, for example, and I
probably don't need it to be replicated because I have CD-R backups.

Each Unix client with a large disk drive would also be a server (and
this is where the tricky part comes in, because I'd like to take
advantage of those disks for more storage, yet still provide a single
namespace to clients). It's not yet clear to me whether InterMezzo
scales to multiple servers. Probably only on different mount points.

Nick.
Received on 2002-08-21 11:35:39