On Fri, Mar 28, 2003 at 04:58:19AM -0500, Steve Simitzis wrote:
> i'm running a 2.4.18 linux kernel, coda 5.3.20, and the latest rpc2
> 1.15. from what i understand, the latest rpc2 was meant to fix
> problems related to slow write performance, but it hasn't helped my
> installation at all.

Well, it fixes extreme packet loss during the client->server bulk
transfers. That is only part of what you look at as write performance.

 * I just reread this email before sending it off and it is a bit
 * technical. But there are some interesting numbers hidden between
 * all of it.

Coda works a lot like a (classic) BSD FFS filesystem, i.e. all metadata
updates are synchronous. Every single create, chmod, or chown won't
'complete' until the server is absolutely positive that the update has
hit the disk. This is already very different (and a lot slower) than
what you are probably used to: ext2 asynchronously writes back
modifications and relies on fsck to fix things up if the power fails
before everything is written back. BSD has embraced 'softupdates',
which order 'dependent writes' and as such allow a similar async
writeback while keeping the filesystem in a consistent state at all
times.

Performance-wise: an application performs a mutating operation, then we
get the context switch to the userspace cache manager, which commits
the operation locally. It then sends the operation to the server
(network latency); the server performs a transaction to validate and
apply the operation, and relies heavily on fsync to make sure all
updates have been written to the disk. The server then returns a
response (network latency again), and the client registers success of
the operation. Only then can we return the result to userspace (context
switch again). So we have at least 2 context switches, the network RTT,
and the time it takes for the server to perform the transaction.
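To make that ordering concrete, here is a minimal C sketch of the server
side of one mutating operation, with a plain log file standing in for
the server's metadata store; server_perform_op and its layout are my
illustration, not actual Coda server code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the server half of one mutating RPC: the reply cannot be
 * sent until fsync() confirms the update is on disk, so every create,
 * chmod, or chown pays a full synchronous write. (Illustrative only,
 * not actual Coda server code.) */
int server_perform_op(int logfd, const char *record)
{
    size_t len = strlen(record);

    if (write(logfd, record, len) != (ssize_t)len)
        return -1;

    /* The expensive part: block until the update is durable. Only
     * after this returns may the server send its RPC reply. */
    if (fsync(logfd) < 0)
        return -1;

    return 0; /* now it is safe to reply to the client */
}
```

The point is that the client-visible latency of every small metadata
operation includes this fsync, on top of the two context switches and
the network round trip.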
Now on Linux, fsync should probably be called fsuck. It is extremely
slow: it has to walk through the pagetables to find dirty pages and
schedule the writes, and it seems to sync not just the file we called
fsync on, but all pending writes on the server. This includes updates
to logfiles etc. And because all the filesystems on the server are
typically doing async writeback, there often is a lot of data that
needs to go to disk, and our process is the one that's paying for it
because we actually care about consistency. Peter Braam once calculated
that it was not possible to perform more than about 100 of these
synchronous transactions per second on one of our servers.

Now a typical 'file write' with tar involves at least about 5 or 6
remote calls (create, store, chown, chmod, set timestamps, rename). So
if we're dealing with the creation of files with no data, we would
probably be able to deal with about 20 files per second. And if we took
out the consistency guarantees, we would still have an overhead of 6
times the network latency, which on a 100Base-T network would be in the
order of a millisecond or two, but on a PPP link is probably more than
a second.

> my coda server is an unloaded, 4 CPU machine, and i've been testing
> it against a single dual CPU client machine.

Because all Coda programs are single threaded, adding more CPUs won't
help. In the long run we want to slim down the Coda server process and
make it easier (or trivial ;) to run multiple server processes on a
single machine.

> what i've observed is that writes are very slow, and seem to be
> hanging on the close() (at least, on the client side). untarring
> tar archives is also very slow.

The slowness during untar is totally dependent on how fast we can make
our RPCs. And as you can see from the description above, there is no
trivial way to speed it up. Hanging on close() would typically indicate
a problem in the 'ViceStore rpc'.
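The arithmetic above can be written down explicitly. A small C sketch;
the constants (around 100 transactions/second, 5-6 RPCs per file) are
the rough estimates from this message, not measurements, and the helper
names are mine.

```c
/* Back-of-the-envelope helpers for the estimates above: with ~100
 * synchronous transactions per second and 5-6 RPCs per 'file write',
 * we end up around 20 empty files per second. */
double files_per_second(double txns_per_second, int rpcs_per_file)
{
    return txns_per_second / rpcs_per_file;
}

/* Even without the fsync cost we pay one round trip per RPC, so the
 * per-file latency floor is rpcs * rtt: roughly 6 * 1ms on 100Base-T,
 * but easily more than a second over a PPP link. */
double per_file_latency(int rpcs_per_file, double rtt_seconds)
{
    return rpcs_per_file * rtt_seconds;
}
```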
We don't write back data until the file is closed, so if you have a
large file, it will block until all the data is sent to the server.

Also, even with the reduced packet loss, our bulk transfers don't have
the 'sophistication' of current TCP; we run the whole deal from
userspace, which costs us some. There is a fixed window and it is tuned
towards wireless networks, where packet loss is typically caused by
corruption and not congestion. I.e. we don't use 'slow start' and ramp
up, but use simple stop-and-wait until all outstanding packets are
acknowledged, and then kick back to pushing data at the 'estimated
bandwidth' of the link until something goes wrong and we fall back to
stop-and-wait.

Between a client and a single server, I'm seeing about 3.9MB/s. When
talking to 2 servers it goes up to about 4.2MB/s, and because the data
is sent to both servers at the same time, this is in fact a little more
than 8.4MB/s on the wire. With 3 servers I seem to hit a limit, 2.6MB/s
average, however that is still more than 7.8MB/s going through the
wire.

Some things that can be useful: running a codacon process on the client
should give some indication of what RPC operations it is performing.
Then there are rpc2 and sftp packet statistics. The server should dump
these to the log once in a while, but you can force them with 'volutil
printstats'. Half of the statistics seem to go to stdout, but the
interesting ones in this case are dumped to /vice/srv/SrvLog.

10:00:28 RPC Packets retried = 25, Invalid packets received = 158, Busies sent = 335

* Retries, how many operations the server had to resend because a
  client or server wasn't responding.
* Invalid, number of packets that were not decoded as 'proper' RPC2
  packets.
* Busies, a client retried an operation before we were done, and the
  server tells it to back off for a while.
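The replication numbers follow directly from each data packet being
sent to every replica. A trivial sketch (the function name is mine,
purely illustrative):

```c
/* With a replicated volume the client sends each data packet to every
 * server, so the rate on the wire is the per-volume rate times the
 * number of replicas: 4.2MB/s to two servers is 8.4MB/s on the wire,
 * and 2.6MB/s to three servers is 7.8MB/s on the wire. */
double wire_rate(double volume_rate_mbs, int nservers)
{
    return volume_rate_mbs * nservers;
}
```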
10:00:28 SFTP: datas 915, datar 196824, acks 25605, ackr 152, retries 0, duplicates 0
10:00:28 SFTP: timeouts 1, windowfulls 0, bogus 0, didpiggy 279

* datas/datar/acks/ackr, sftp data packets sent, data packets received,
  acks sent, acks received. If everything goes right, we should see
  fewer acks sent than the number of received data packets divided by
  8, and the same for data packets sent versus acks received.
* retries and duplicates, typically no good; it means that we are
  sending or receiving data twice.
* timeouts, windowfulls, bogus. Before rpc2-1.15, bogus would be
  enormous; we were dropping packets because the sftp thread hadn't
  been scheduled, so 'nobody' was waiting for the incoming packet.
  Windowfulls should really be more than '0'; the bandwidth estimate is
  probably too conservative, so we never really have a full window of
  32 packets on the wire.
* didpiggy, the amount of data the client requested was so small that
  we just stuck it on the back of the rpc2 reply and saved ourselves
  the overhead of transferring the data separately.

I also monitor all servers with 'smon', which generates data for
rrdtool. This data contains similar numbers, so I can look at graphs of
the average number of rpc2 operations or server CPU load over the past
year for any server.

Jan

Received on 2003-03-28 10:32:03
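[Editorial sketch] If you want to watch these counters over time
without reading SrvLog by hand, they are easy to scrape. A hedged
sketch in C, assuming the exact line format shown above; the function
names are mine, not Coda identifiers.

```c
#include <stdio.h>

/* Pull the counters out of an "SFTP: datas ..." line as dumped to
 * /vice/srv/SrvLog. The format string mirrors the sample line above;
 * %*s skips the leading timestamp. (Illustrative, not Coda code.) */
int parse_sftp_stats(const char *line, long *datas, long *datar,
                     long *acks, long *ackr)
{
    return sscanf(line,
                  "%*s SFTP: datas %ld, datar %ld, acks %ld, ackr %ld",
                  datas, datar, acks, ackr) == 4;
}

/* Rule of thumb from the text: with one ack per 8 data packets, acks
 * sent should be roughly the received data packets divided by 8. */
long expected_acks(long datar)
{
    return datar / 8;
}
```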