On Thu, Dec 25, 2003 at 06:45:06PM +0000, Joerg Sommer wrote:
> firstly, I wish all a merry christmas.

And a Happy New Year :)

> http://www.inter-mezzo.org/docs/bottlenecks.pdf
> Does anyone know this paper? What's about the things mentioned in it?
> Are any improvements done for this? Or are any improvements planed?

Of course I know about this paper. Peter ran these tests while he was
at CMU, in the office across the hall from mine :)

The numbers are right, but I do not necessarily agree with all the
premises or conclusions. First of all, we're clearly not competing with
local disk filesystems. Those do not have to deal with replication or
centralized storage, and some have weaker consistency constraints.
Especially comparing write performance with ext2, which is
asynchronous, is a bit unfair. Ext2 doesn't care when the bits hit the
disk, because people can only look at the data through the ext2
filesystem. When I write a file in Coda, the underlying filesystem
(i.e. on the servers) has to be consistent across all replicas. We also
don't have an extensive fsck that can fix up various inconsistencies on
the servers after a power failure, so we only perform synchronous
operations on the servers. But it is nice to know where the
'theoretical limits' of performance are; I don't expect to ever be
faster than the underlying filesystem we use for our cached data.

Most of these tests were run on 2.0 (or early 2.2) kernels. One of the
main culprits for the bottleneck, the context switch, has improved
tremendously over time. Several factors contribute to this: the basic
kernel tick got bumped from 100Hz to around 1000Hz, processors have
become faster by leaps and bounds, and there is more memory available
for caching.

Some changes in the kernel module have helped as well. When a file is
opened, instead of passing the (device, inode) number pair, we now pass
an open file descriptor for the container file. This is only a small
change, but when we fetch a file from the server we used to write the
data to the container file, fsync the bits to disk, and then pass the
device/inode numbers down. Now we can hand down the still-open file
descriptor, and in most cases we can be reading the still in-memory
data before the bits have actually hit the disk. (A rough sketch of the
two approaches follows after the numbers below.)

The numbers seem to indicate that the basic 'ls -lR <dir>' test was a
recursive ls on the Coda source tree. I probably don't have the exact
same tree in /coda, but I'm getting:

    client: P4 3.06GHz, 1GB RAM, udma5 IDE, 100Base-T network
    OS:     Linux-2.6.0-test7
    server: identical to Peter's test server
    test:   ls -lR on coda-4.6.0 (516 directories, 1548 files)

    cold cache: 3.09s
    warm cache: 0.05s

The numbers for local disk (ext2) are:

    cold cache: 3.57s
    warm cache: 0.03s

We're pretty much on par in the cold-cache case. I unmounted and
remounted the local filesystem to get rid of any cached data in the
ext2 cache, and used 'cfs flush' for Coda. The slight speed difference
(in this case Coda being faster) can be attributed to the fact that the
disk caches on the server were in fact still warm, so we probably never
even had to touch the disk on the server.

Peter's numbers were:

    client: PII 266MHz, 80MB RAM, IDE, probably 100Base-T network
    OS:     Linux-2.1/2.2?
    test:   ls -lR on unknown tree (~300 directories, ~1500 files)

    cold cache: 0m9.710s
    warm cache: 0m0.500s

So currently we are about 3 times faster on a cold cache and are in
fact approaching the speed of ext2 pulling the same data off the local
disk.
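To make the container-file change a bit more concrete, here is a rough,
hypothetical sketch in plain POSIX C. It is not the actual venus/kernel
upcall interface; fetch_from_server(), the container path, and the
"reply" printfs are made up purely for illustration of the difference
between the old fsync-then-pass-(dev,ino) style and the new
pass-the-open-descriptor style.

    /* Hypothetical sketch, not real Coda code. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Pretend this fills the container file with data from the server. */
    static void fetch_from_server(int fd)
    {
        const char data[] = "file contents fetched over the network\n";
        if (write(fd, data, sizeof(data) - 1) < 0)
            perror("write");
    }

    /* Old style: force the bits to disk, then hand down (device, inode)
     * so the kernel can look the container file up by itself. */
    static void reply_by_inode(const char *container)
    {
        int fd = open(container, O_RDWR | O_CREAT | O_TRUNC, 0600);
        struct stat st;

        fetch_from_server(fd);
        fsync(fd);                 /* wait until the data hits the disk */
        fstat(fd, &st);
        close(fd);

        printf("old reply: dev=%lu ino=%lu\n",
               (unsigned long)st.st_dev, (unsigned long)st.st_ino);
    }

    /* New style: keep the descriptor open and hand it down directly;
     * no fsync needed, the still in-memory pages can be read through
     * this descriptor before they ever reach the disk. */
    static int reply_by_fd(const char *container)
    {
        int fd = open(container, O_RDWR | O_CREAT | O_TRUNC, 0600);

        fetch_from_server(fd);
        printf("new reply: fd=%d\n", fd);
        return fd;
    }

    int main(void)
    {
        int fd;

        reply_by_inode("/tmp/container.old");
        fd = reply_by_fd("/tmp/container.new");
        close(fd);
        return 0;
    }

The point of the sketch is simply that the old path always paid for an
fsync before the kernel could touch the data, while the new path lets
the kernel read straight from the page cache.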
This is nice because the network, server hardware, and server OS
haven't changed over the past years. Viotti is basically still running
the same RedHat 5 release modulo some security fixes, so on the server
side any improvements are attributable to better Coda server code.

We're about 10 times faster in the warm cache case, which is not such a
big surprise because my current CPU is approximately 10x faster. On the
other hand, my client cache is probably about 10x larger as well, which
slows down my client significantly because it has more objects to
manage and there are still codepaths that are O(n^2), comparing every
cached file against all other files (see the P.S. below). We're
probably still about twice as slow as ext2, but 0.02 seconds isn't such
a big deal.

Best of all, we got all this by simply cleaning up code instead of
trying to optimize :)

Jan
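P.S. For anyone wondering what the O(n^2) remark means in practice,
here is a tiny, made-up C illustration (not the actual venus code, and
scan_cache() is invented for this example): if handling each cached
object involves looking at every other cached object, then a 10x larger
cache does roughly 100x the work.

    /* Toy illustration of a quadratic cache scan, not real Coda code. */
    #include <stdio.h>

    struct cached_file {
        int id;
        long blocks;
    };

    /* For every cached object, look at every other cached object. */
    static long scan_cache(const struct cached_file *cache, int n)
    {
        long comparisons = 0;

        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j)
                    comparisons++;   /* O(n^2) pairwise work */

        return comparisons;
    }

    int main(void)
    {
        struct cached_file cache[4] = {
            { 1, 8 }, { 2, 16 }, { 3, 4 }, { 4, 32 },
        };

        printf("%d objects -> %ld comparisons\n",
               4, scan_cache(cache, 4));
        return 0;
    }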