On Wed, Jan 31, 2007 at 10:07:59AM +0100, Reiner Dassing wrote:
> I have setup two coda servers, replicated and a client.
> Venus can connect, a token is there and I see my volumes.
>
> When I am testing the connection from the client to the servers by
>     cp -rv . /coda/tarzan1.net/usr/iersdc
> this copy is very very very slow:
> small files, some kilobytes, are taking 10 seconds and more.
> There is a 100 MBit/s net between the client and the servers
> and tests via scp are performing as expected.
>
> What is the "normal" speed to expect for cp to /coda?
> Where to look for bottleneck?

I would expect the actual data transfers to run at something like 3MB/s. But file and directory creation will be pretty slow compared to a local file system: all updates are synchronous, and we wait for the servers to sync every update to disk before continuing.

However, if your client is operating in write-disconnected mode, it should be much faster for small copies (where "small" means smaller than the available client cache size) and only slightly slower when a large amount of data is copied. In write-disconnected operation the client keeps a local log of pending updates which are written back periodically in the background; we call this reintegration. These reintegrations combine and commit up to 100 operations at a time, which is considerably faster than sending individual operations. Reintegration attempts only happen once every 5 seconds or so; during this time we can build up a nice little backlog and optimize away useless operations (e.g. a compilation may create temporary files that are removed immediately, and these never have to be sent to the server). If we fill up the cache too quickly, the application is slowed down or even blocks until reintegration has a chance to catch up.

Most of the time when you see unusual slowdowns, it is because we need to break callbacks to clients that have disappeared.
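The batching and optimization described above can be sketched roughly like this. This is only an illustration of the idea, not Coda's actual client-modify-log code; the operation names and functions are made up:

```python
# Rough sketch of write-disconnected log optimization: pending updates are
# queued locally, create/remove pairs for the same file cancel out, and the
# survivors are reintegrated in batches of up to 100 operations.
# Illustrative only -- not Coda's real CML implementation.

BATCH_LIMIT = 100  # operations committed per reintegration attempt

def optimize(log):
    """Drop all operations on files that were created and then removed
    again before reintegration ran (e.g. compiler temporaries)."""
    created = {path for op, path in log if op == "create"}
    removed = {path for op, path in log if op == "remove"}
    cancelled = created & removed
    return [(op, path) for op, path in log if path not in cancelled]

def reintegrate(log):
    """Yield batches of at most BATCH_LIMIT operations for the server."""
    pending = optimize(log)
    for i in range(0, len(pending), BATCH_LIMIT):
        yield pending[i:i + BATCH_LIMIT]

log = [("create", "a.o"), ("store", "a.o"), ("remove", "a.o"),
       ("create", "prog"), ("store", "prog")]
batches = list(reintegrate(log))
# Everything touching the temporary a.o is optimized away; only the two
# operations on "prog" reach the server, combined into a single batch.
```

This is why a build that churns through short-lived temporary files can be cheaper over Coda in write-disconnected mode than the raw operation count would suggest.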
Sometimes it is simply a mobile client that lost network connectivity, or a client that was restarted. Sometimes it is caused by a masquerading firewall that times out its internal state too soon, so that when the client reprobes it is assigned a different outbound port on the firewall.

One thing I noticed on our servers is that the new crypto code was actually draining the entropy pool of /dev/random pretty quickly during backups. Every time a Coda/RPC2 application starts it reads about 48 bytes to seed the internal random number generator. In our case, when backing up a few hundred volumes, we would fork off a couple of hundred volutil commands to check the last backup time, clone the volume and dump the data over a TCP connection to the Amanda server. Once the pool is drained, processes start to block and some backups would fail. The same thing happened when I tried to use pam_coda authentication for some pages on the web server, which forked off a new clog process for every page hit. But I wouldn't expect you to hit such situations when running just venus and codasrv, since those aren't restarted all the time.

I'm not sure why your connections seem so slow. Maybe there is a network issue that isn't triggered by TCP connections. For one, we don't really take the link MTU into account and assume that IP packets will be fragmented and reassembled by the underlying network when they are too big. Also, the data transfer protocol (SFTP) doesn't scale its window down to a single packet the way TCP can. We assume that once we get an ack we can send at least 8 or 9 packets (1KB each); a router may end up consistently dropping the last couple of packets in such a series, resulting in a timeout and retransmission of the missing packets.

If you run 'vutil -swap ; vutil -stats', this will rotate the venus.log file (-swap) and dump a lot of statistics (-stats) into venus.log. The file is either at /var/log/coda/venus.log or /usr/coda/etc/venus.log.
At the end of that file you can find the RPC2 and SFTP statistics:

    RPC Packets:
    RPC2:
      Sent:              Total           Retrys  Busies  Naks
        Uni:    211716 : 28257968          1739      15     0
        Multi:       0 : 0                    0       0     0
      Received:          Total           Replys        Reqs           Busies   Bogus  Naks
        Uni:    211037 : 21419816    152008 : 3    56214 : 1592    1111 : 0      112     0
        Multi:       0 : 0                0 : 0        0 : 0           0 : 0       0     0
    SFTP:
      Sent:              Total           Starts        Datas         Acks   Naks  Busies
        Uni:     71120 : 61930461             0    57685 : 5        13435      0       0
        Multi:       0 : 0                    0        0 : 0            0      0       0
      Received:          Total           Starts        Datas         Acks   Naks  Busies
        Uni:     82414 : 74268329          2764    71200 : 0         8450      0       0
        Multi:       0 : 0                    0        0 : 0            0      0       0

These numbers are from my client, and looking at them now I wonder why we're not getting any MultiRPC or MultiSFTP numbers. But some of these numbers may give an indication of why your system seems slow. For instance, my client has sent 211K RPC2 requests and less than 1% needed to be retransmitted. My client sent a minimal number of BUSY packets, and bogus packets are received when tokens expire and we were trying to send something to the server based on an expired key. I don't see SFTP retransmissions counted anywhere, though, so maybe this isn't really enough information to analyze the problem.

Jan

Received on 2007-02-07 16:25:03
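For what it's worth, the "less than 1%" figure above falls straight out of the Uni "Sent" counters. A tiny hypothetical helper (not part of Coda) to compute it from any such dump:

```python
# Compute the RPC2 retransmission ratio from the venus.log counters above.
# 'sent' is the Uni "Sent: Total" packet count and 'retries' the "Retrys"
# count; a healthy client should stay well under a few percent.

def retransmit_ratio(sent, retries):
    return retries / sent

ratio = retransmit_ratio(211716, 1739)
print(f"{ratio:.2%} of RPC2 requests were retransmitted")
# 1739 / 211716 is about 0.82%, i.e. under the 1% mentioned above
```

A sustained ratio much higher than that would point at the kind of packet loss described earlier (MTU fragmentation problems or a router dropping the tail of SFTP bursts).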