On Tue, May 20, 2003 at 02:32:35PM +0000, lou wrote:
> I was wondering (while going through rvm papers), is there a
> possibility to map rvm data onto swap space?

If you don't have the map_private flag enabled, which used to be the
default for both clients and servers, RVM will allocate a large chunk of
memory and read the RVM data into this allocated memory. As a result the
complete RVM data segment will end up in swap space. We can't mmap a raw
partition on Linux, so it will always do this when RVM is on a raw
partition.

However, this is very inefficient: the OS caches the data as we read it
into memory, and as soon as we try to load more than the size of physical
memory the system will be heavily thrashing its disks trying to
interleave the reads from the RVM data file with the writes to swap space.

Now if private mmaps are enabled, the data is loaded from the RVM data
file on demand. When there is memory pressure, unmodified pages are
simply discarded, as they can be loaded again later on, and dirty pages
are written to swap as a result of the private mmap. So technically it
ends up being pretty much the same thing, but it thrashes the disk a lot
less and is easier on memory usage because we don't dirty RVM memory all
that quickly.

> Other thing that I'm still looking at is the fsync() on Linux,
> according to the man page, fsync() actually synchronizes data (with
> the one on the disk) rather than synchronizing writes(basically
> synchronous writes). And basically if the IO is done synchronous,
> coda will operate actually faster and in more safe manner?

fsync walks the pagetables and schedules writeout for modified pages.
Because RVM already tries to coalesce ranges of modifications, it should
work pretty well if we open the files O_SYNC. However, RVM assumes a 512
byte block size independent of whatever the filesystem uses, so it won't
coalesce modifications across a 512 byte boundary.
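
To make the private mmap case from earlier in this message concrete,
here is a minimal Python sketch. It is not RVM code: the temporary file
only stands in for an RVM data file, and demo_private_mapping is a
hypothetical name. It maps the file copy-on-write with mmap.ACCESS_COPY
(which corresponds to MAP_PRIVATE on Unix), dirties a page, and then
checks that the file on disk is unchanged.

```python
import mmap
import os
import tempfile


def demo_private_mapping():
    """Map a file privately, modify the mapping, and report what changed."""
    # Hypothetical stand-in for an RVM data file: one page filled with 'A'.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * mmap.PAGESIZE)

        # ACCESS_COPY requests a copy-on-write (private) mapping: pages are
        # read from the file on demand, and writes stay in memory only.
        with mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_COPY) as m:
            before = m[:4]
            m[:4] = b"BBBB"      # dirties the page in memory
            in_memory = m[:4]

        # Re-read the file through the descriptor to see what is on disk.
        os.lseek(fd, 0, os.SEEK_SET)
        on_disk = os.read(fd, 4)
        return before, in_memory, on_disk
    finally:
        os.close(fd)
        os.unlink(path)


print(demo_private_mapping())
```

The dirty page never reaches the data file, which is why under memory
pressure it has to go to swap, while untouched pages can simply be
dropped and read again from the file.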

In reality, if you have your RVM files on an ext2/3 file system with 4KB
blocks, RVM will end up sending 8 independent writes which all modify
the same file system block. If we open the file with O_SYNC we end up
with,

    RVM application          kernel
    read(512)                read(8 * 512)
    * merge modifications
    write(512)               * hopefully page is still cached
                             * modify page
                             write(8 * 512)
    read(512)                * hopefully page is still cached
    * merge modifications
    write(512)               * hopefully page is still cached
                             * modify page
                             write(8 * 512)
    read(512)                ...
    * merge modifications
    write(512)               ...
    ... repeat another 5 times ...

If there is any memory pressure the 'hopefully page is still cached'
cases will end up as yet another disk read, but we're already suffering
from the 8 individual write operations. Even when we've modified more
than a whole page of data, the kernel still needs to read the original
page from disk because it only sees 512 byte writes.

When we use fsync it will look more like the following.

    RVM application          kernel
    read(512)                read(8 * 512)
    * merge modifications
    write(512)               * hopefully page is still cached
                             * modify page
    read(512)                * hopefully page is still cached
    * merge modifications
    write(512)               * hopefully page is still cached
                             * modify page
    read(512)                ...
    * merge modifications
    write(512)               ...
    ... repeat another 5 times ...
    fsync(RVM data)          write(8 * 512)

So we end up reading once and writing once, except when there is memory
pressure and we degenerate to the previous case. As the kernel still
only sees 512 byte writes, it is always forced to read the original data
from disk.

Now RVM _could_ be modified to work on a PAGE_SIZE basis. If the
underlying filesystem has a smaller block size, we end up writing out
more data than was really modified. It also makes RVM non-portable
across platforms with a different PAGE_SIZE, but my educated guess is
that we already can't reliably move RVM data from one machine to
another.
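
As a rough illustration of the 512 byte versus PAGE_SIZE trade-off, the
Python sketch below counts how many block-sized writes a set of
modifications turns into at a given granularity. This is only a model
under stated assumptions, not RVM's actual coalescing code: block_writes
and the sample workloads are hypothetical, and each touched block is
treated as one independent write, with no merging across block
boundaries.

```python
def block_writes(ranges, block_size):
    """Return the offsets of the blocks touched by (offset, length) ranges.

    Modifications that fall inside the same block are coalesced into one
    write of that block; neighbouring blocks are NOT merged with each other.
    """
    blocks = set()
    for offset, length in ranges:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return [b * block_size for b in sorted(blocks)]


# Hypothetical workload: a 16 byte change in each 512 byte sector of one
# 4KB file system block.
mods = [(i * 512 + 100, 16) for i in range(8)]
for size in (512, 4096):
    writes = block_writes(mods, size)
    print(size, "byte blocks:", len(writes), "writes,",
          len(writes) * size, "bytes written")

# A single small change shows the cost of the larger granularity.
one = [(100, 16)]
for size in (512, 4096):
    print(size, "byte blocks:", len(block_writes(one, size)) * size,
          "bytes written for a 16 byte change")
```

With the first workload the 512 byte granularity produces eight separate
writes into the same 4KB block, while the 4096 byte granularity produces
one. With the second workload the larger granularity writes a full 4096
bytes for a 16 byte change, which is the extra data mentioned above.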

Ideally if we used O_SYNC and no fsync, we would then end up with,

    RVM application          kernel
    read(4096)               read(8 * 512)
    * merge modifications
    write(4096)              write(8 * 512)

I did try this once, and had some problems with the fact that RVM
internally had some things aligned on a 512 byte boundary. Either my
change was too intrusive, or those cases might have been resolved when
map_private was introduced.

And if the whole page of data was modified we can even skip the read
operation completely.

> Anyways these are just assumptions...
>
> am i going the wrong way?

No, these ideas are exactly what we need to consider as options to make
RVM perform better, which will end up making Coda servers a lot faster
and more responsive.

Jan

Received on 2003-05-20 10:38:00