Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Tue, 20 May 2003 10:36:12 -0400

On Tue, May 20, 2003 at 02:32:35PM +0000, lou wrote:
> I was wondering (while going through rvm papers), is there a
> possibility to map rvm data onto swap space?

If you don't have the map_private flag enabled, which used to be the
default for both clients and servers, RVM will allocate a large chunk of
memory and read the RVM data into this allocated memory. As a result the
complete RVM data segment will end up in swap space. We can't mmap a raw
partition on Linux, so it will always do this when RVM is on a raw
partition.

However, this is very inefficient: the OS caches the data as we read it
into memory, and as soon as we try to load more than the size of
physical memory the system will be heavily thrashing its disks trying to
interleave the reads from the RVM data file with the writes to
swap space.

Now if private mmaps are enabled, the data is loaded from the RVM data
file on demand. When there is memory pressure, unmodified pages are
simply discarded, as they can be loaded again later on, and dirty pages
are written to swap as a result of the private mmap. So technically it
ends up being pretty much the same thing, but it thrashes the disk
a lot less and is easier on the memory usage because we don't dirty RVM
memory all that quickly.
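
To illustrate the private mapping behaviour, here is a minimal sketch in
Python rather than RVM's own C code. The scratch file and the function
name private_map_demo are made up for the example; mmap.ACCESS_COPY asks
for a copy-on-write mapping, so the write dirties a page in memory but
never reaches the file.

```python
import mmap
import os
import tempfile


def private_map_demo(size=4096):
    """Map a scratch file copy-on-write and modify it in memory only."""
    # Hypothetical stand-in for an RVM data file.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * size)
        # ACCESS_COPY requests a private (copy-on-write) mapping.
        with mmap.mmap(fd, size, access=mmap.ACCESS_COPY) as m:
            m[0:4] = b"RVM!"              # dirties one page, in memory only
            in_memory = bytes(m[0:4])
        on_disk = os.pread(fd, 4, 0)      # the file itself is untouched
        return in_memory, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

The untouched pages of such a mapping can be dropped and re-read from the
file, while the modified page has nowhere to go but swap.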

> Other thing that I'm still looking at is the fsync() on Linux,
> according to the man page, fsync() actually synchronizes data (with
> the one on the disk) rather than synchronizing writes(basically
> synchronous writes). And basically if the IO is done  synchronous,
> coda will operate actually faster and in more safe manner?

fsync walks the pagetables and schedules writeout for modified pages.
Because RVM already tries to coalesce ranges of modifications it should
work pretty well if we open the files O_SYNC. However, RVM assumes a 512
byte block size independent of whatever the filesystem uses, so it won't
coalesce modifications across a 512 byte boundary. In reality, if you
have your RVM files on an ext2/3 file system with 4KB blocks, RVM
will end up sending 8 independent writes which all modify the same file
system block. If we open the file with O_SYNC we end up with,

RVM application			kernel
read(512)			read(8 * 512)
* merge modifications
write(512)			* hopefully page is still cached
				* modify page
				write(8 * 512)
read(512)			* hopefully page is still cached
* merge modifications
write(512)			* hopefully page is still cached
				* modify page
				write(8 * 512)
read(512)	...
* merge modifications
write(512)	...
... repeat another 5 times ...

If there is any memory pressure the 'hopefully page is still cached'
cases will end up as yet another disk read, but we're already suffering
from the 8 individual write operations. Even when we've modified more
than a whole page of data, the kernel still needs to read the original
page from disk because it only sees 512 byte writes.
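
The arithmetic behind the "8 independent writes" can be sketched as
follows. This is a model of the behaviour described above, not RVM code;
the names sector_writes and writes_per_fs_block are hypothetical, and the
4096 byte file system block is just the ext2/3 example from the text.

```python
SECTOR = 512      # block size RVM assumes, per the text above
FS_BLOCK = 4096   # assumed file system block size (ext2/3 example)


def sector_writes(offset, length):
    """Return (offset, length) for each 512 byte write covering a range.

    Assumes length > 0. Modifications are never merged across a sector
    boundary, so every touched sector becomes its own write.
    """
    first = offset // SECTOR
    last = (offset + length - 1) // SECTOR
    return [(s * SECTOR, SECTOR) for s in range(first, last + 1)]


def writes_per_fs_block(offset, length):
    """Count how many sector writes land in each file system block."""
    counts = {}
    for off, _ in sector_writes(offset, length):
        block = off // FS_BLOCK
        counts[block] = counts.get(block, 0) + 1
    return counts
```

For a fully modified 4KB range, writes_per_fs_block(0, 4096) reports 8
writes against file system block 0.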

When we use fsync it will look more like the following.

RVM application			kernel
read(512)			read(8 * 512)
* merge modifications
write(512)			* hopefully page is still cached
				* modify page
read(512)			* hopefully page is still cached
* merge modifications
write(512)			* hopefully page is still cached
				* modify page
read(512)	...
* merge modifications
write(512)	...
... repeat another 5 times ...
fsync(RVM data)			write(8 * 512)

So we end up reading once and writing once, except when there is memory
pressure and we degenerate to the previous case. As the kernel still
only sees 512 byte writes, it is always forced to read the original data
from disk.
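
The two call patterns compared so far look roughly like this at the
system call level. It is a sketch in Python, not what RVM actually does;
write_sectors and its sync_each parameter are invented for the example,
and it assumes a platform that provides os.O_SYNC and os.pwrite.

```python
import os
import tempfile

SECTOR = 512  # write size RVM uses, per the text above


def write_sectors(path, sectors, sync_each):
    """Write (offset, data) sectors, syncing per write or once at the end."""
    # O_SYNC makes every write wait for writeout; otherwise writes only
    # dirty the page cache until fsync is called.
    flags = os.O_RDWR | (os.O_SYNC if sync_each else 0)
    fd = os.open(path, flags)
    try:
        for offset, data in sectors:
            os.pwrite(fd, data, offset)
        if not sync_each:
            os.fsync(fd)  # one flush covering all modified pages
    finally:
        os.close(fd)
```

Both variants leave the same bytes in the file; the difference is only in
how many times the kernel has to push the 4KB block out to disk.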

Now RVM _could_ be modified to work on a PAGE_SIZE basis. If the
underlying filesystem has a smaller block size, we end up writing out
more data than was really modified. It also makes RVM non-portable
across platforms with a different PAGE_SIZE, but my educated guess is
that we already can't reliably move RVM data from one machine to
another.

Ideally if we used O_SYNC and no fsync, we would then end up with,

RVM application			kernel
read(4096)			read(8 * 512)
* merge modifications
write(4096)			write(8 * 512)

I did try this once, and had some problems with the fact that RVM
internally had some things aligned on a 512 byte boundary. Either my
change was too intrusive, or those cases might have been resolved when
map_private was introduced. And if the whole page of data was modified,
we could even skip the read operation completely.
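
A page-sized read-modify-write with the "skip the read" shortcut could
look something like the sketch below. It is only an illustration under
assumptions: update_page is a hypothetical helper, PAGE_SIZE is fixed at
4096 here although it is platform-dependent, and none of this reflects
how RVM lays out its data.

```python
import os
import tempfile

PAGE_SIZE = 4096  # assumed; the real value depends on the platform


def update_page(fd, page_no, offset_in_page, data):
    """Apply one modification to one page; return True if a read was needed."""
    assert offset_in_page + len(data) <= PAGE_SIZE
    base = page_no * PAGE_SIZE
    if offset_in_page == 0 and len(data) == PAGE_SIZE:
        # The whole page is replaced, so the old contents are irrelevant.
        os.pwrite(fd, data, base)
        return False
    # Partial update: read the page, merge the modification, write it back.
    page = bytearray(os.pread(fd, PAGE_SIZE, base).ljust(PAGE_SIZE, b"\0"))
    page[offset_in_page:offset_in_page + len(data)] = data
    os.pwrite(fd, bytes(page), base)
    return True
```

With the file opened O_SYNC, each call would then correspond to at most
one read and exactly one page-sized write.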

> Anyways these are just assumptions...
> 
> am i going the wrong way?

No, these ideas are exactly what we need to consider as options to make
RVM perform better, which will end up making Coda servers a lot faster
and more responsive.

Jan

Re: rvm & other questions