On Thu, Mar 17, 2005 at 10:02:35AM -0300, Gabriel B. wrote:
> I let the client copy some files overnight. The logs are below.
>
> My first thought was that it was a lack of inodes (already happened 3
> times here) but when the server tries to restart, it mentions that
> /vicepa has 200k inodes used, and most of the 16M inodes free. Also
> has lots of free space.
>
> the full error log:
> No waiters, dropped incoming sftp packet (some 30 lines of that)
> No waiters, dropped incoming sftp packet
> rvm_malloc: error -9 size c7c00 file recovc.cc line 365
> RVMLIB_ASSERT: error in rvmlib_malloc

Ah, growvnodes... haven't seen that one in a while. The problem is that
the server uses an array for all vnodes (file system objects) in the
volume, and when this array fills up it does something quite similar to
the following.

    BeginTransaction();
    new = rvm_malloc((size+128) * sizeof(vnode *));
    memcpy(new, array, size * sizeof(vnode *));
    memset(&new[size], 0, 128 * sizeof(vnode *));
    rvm_free(array);
    array = new;
    EndTransaction();

Now there are 2 things happening here.

First of all, everything happens in a single transaction, which means we
record the old and the new state of all modifications so that we can
roll back an aborted (or failed) operation, or reapply a completed
transaction that hasn't actually been committed to RVM data yet. This is
why the server is able to restart without a problem after the crash. But
it also means that our worst case log usage is the size of the newly
allocated memory plus the size of the released memory, possibly even
including the changes made by the memcpy operation, because I believe we
only optimize when applying the logged operations to the RVM data. So if
we were growing the array from 204288 to 204544 vnodes, we'd need the
RVM log to be able to contain at least

    (204544 [malloc] + 204544 [state before memcpy+memset] +
     204288 [free] + 204544 [committed state]) * 4 bytes

plus whatever overhead the logged RVM operations have, so that would be
more than 3MB. However the peak usage is only hit around the
EndTransaction operation, so I'm guessing you're not running out of log
space in this situation.

The other problem we hit here is that we need to be able to allocate a
single contiguous chunk of (size+128) * 4 bytes to satisfy the
rvm_malloc request. But we're allocating before we free the previously
used space, and the available free space in RVM is probably fragmented
by now. RVM tends to fragment over time, and unlike normal
(non-persistent) allocators this fragmentation remains even across a
restart. So even if there is still more than 100MB of RVM available, it
might be that the largest available chunk is only a hundred kilobytes or
less.

The fragmentation is worse when we are only filling a single volume: if
files never get deleted, any space we previously used for the vnode
array is by definition too small for the next, larger array. The
allocation over time would go something like this
(A=array, v=vnode, .=unused)

    Avv............
    (let's say we filled the array and need to grow)
    .vvAAvv........
    .vv..vvAAAvv...

At this point we're stuck: even though there is enough space in total to
allocate the new array, there is no contiguous area of 4 pages, so we
can't actually use any of the free space.

Now if we did the same, but using several volumes instead of just one,
we'd see something more like

    Avv............
    .vvAAvv........
    (let's say we are starting to fill the second volume)
    AvvAAvvvv......

At this point we can already store as many vnodes as we had in the
previous case, but we still have a large consecutive chunk of
allocatable memory. If the first volume happens to grow first, the space
it leaves behind can be used to fit the array of the second volume.
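To put a number on the stuck single-volume picture, here is a tiny
stand-alone C snippet (an illustration only, not Coda code; the arena
string is just the last diagram from the single-volume case) that
reports total free space versus the largest contiguous free run, which
is what rvm_malloc actually needs:

    /* Illustration only: measure fragmentation in the ".vv..vvAAAvv..."
     * picture above ('.' = unused, 'A' = vnode array, 'v' = vnodes). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *arena = ".vv..vvAAAvv...";
        int total_free = 0, run = 0, largest = 0;
        int need = 4;            /* the grown array needs 4 contiguous slots */

        for (size_t i = 0; i < strlen(arena); i++) {
            if (arena[i] == '.') {
                total_free++;
                if (++run > largest) largest = run;
            } else {
                run = 0;
            }
        }
        printf("free slots: %d, largest contiguous run: %d, needed: %d\n",
               total_free, largest, need);
        printf(largest >= need ? "allocation would succeed\n"
                               : "allocation fails despite enough total free space\n");
        return 0;
    }

It prints 6 free slots but a largest run of only 3, so the 4-slot
request fails even though there is "enough" free space overall; the same
measurement on the multi-volume picture still finds a contiguous run of
6 at the end.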
Now the actual allocation has been made a bit smarter in avoiding
fragmentation, which is probably why we haven't seen a growvnodes report
in a while, but this is most likely your problem.

In any case, enough of the details; what can be done about it? Well, the
real solution to the problem in a way already exists. It was implemented
by Michael German, who got a server running with a million files or
more; the time it took to fill a server up became a bottleneck, and by
that time he also started to get similar fragmentation problems on his
clients. Instead of using a growing array of vnodes, it relies on a
fixed-size hash table off of which we have chains of vnodes hanging (a
rough sketch of the idea follows at the end of this message). The size
of the hash table can be adjusted by the administrator at run time, so
that the vnode chains on frequently used volumes can be kept short and
quick.

The reasons why these fixes haven't been committed to the main codebase
are:

- Some code walks through the list of vnodes and expects to see them in
  a steadily increasing order. Volume salvage during startup was one
  case, but that one got fixed pretty quickly. The other is when we try
  to incrementally update an existing backup volume during a backup
  clone operation, which still remains broken. Resolution might have
  some problems, i.e. it hasn't even been tested yet, but I don't really
  expect serious issues in that area.

- It is totally incompatible with the RVM data layout of existing
  servers. There is no way to smoothly upgrade except by copying
  everything off of the existing server and building a new server from
  scratch. And since it requires such an invasive upgrade, it might be
  interesting to consider what other major (future) changes we could
  anticipate and allow for during the upgrade.

- The new code might not even be able to restore existing Coda format
  backup dumps. Luckily the impact of this problem is greatly reduced by
  Satya's codadump2tar conversion tool, which will allow us to at least
  convert the old dumps to a tarball and restore data that way.

Maybe this code is interesting enough to start an experimental Coda
branch which may completely break compatibility a couple of times as
needs arise (i.e. don't assume that you can actually restart your
servers after a 'cvs update; make; make install').

The second option, which is probably more realistic at the moment, is to
increase the size of your RVM data segment,

    http://www.coda.cs.cmu.edu/misc/resize-rvm.html

(and in the long term to store data on more than just a single volume)

Jan
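As a rough sketch of the hash table approach described above (this is
not Michael German's actual code; the names, fields, and per-volume
sizing here are made up for illustration):

    /* Minimal sketch of a fixed-size vnode hash table with per-bucket
     * chains. NOT the real Coda/RVM data structures. */
    struct vnode {
        unsigned int  vnode_number;   /* identifies the object in the volume */
        struct vnode *hash_next;      /* next vnode in the same bucket chain */
        /* ... remaining vnode state ... */
    };

    struct vnode_table {
        unsigned int   nbuckets;      /* tunable per volume; more buckets keep
                                       * chains on busy volumes short and quick */
        struct vnode **buckets;       /* nbuckets chain heads, fixed in size */
    };

    static struct vnode **bucket_for(struct vnode_table *t, unsigned int vnum)
    {
        return &t->buckets[vnum % t->nbuckets];
    }

    /* Inserting a vnode is a small constant-size update: no array to grow,
     * no copy of all existing entries, no large contiguous rvm_malloc. */
    void vnode_insert(struct vnode_table *t, struct vnode *v)
    {
        struct vnode **head = bucket_for(t, v->vnode_number);
        v->hash_next = *head;
        *head = v;
    }

    struct vnode *vnode_lookup(struct vnode_table *t, unsigned int vnum)
    {
        struct vnode *v = *bucket_for(t, vnum);
        while (v && v->vnode_number != vnum)
            v = v->hash_next;
        return v;
    }

Note that walking the buckets no longer yields vnodes in increasing
vnode-number order, which is exactly the assumption that breaks the
incremental backup clone mentioned above.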