On Fri, Feb 01, 2008 at 01:29:18PM +0000, coda_at_bobich.net wrote:
> Is there a particular reason why 256K directory size is used? What code
> changes would be required to increase this to, for example, 64M?

Directory data is allocated in units of 2KB pages. One page just stores pointers to the other pages, so a directory uses at least 2 pages: one for the pointers and one that holds the data for the '.' and '..' entries. The 2KB page of pointers can contain 128 pointers (I am not even sure whether this page of pointers is allocated as 4KB on a 64-bit system). But 128 pointers to 2KB pages results in a 256KB directory size limit.

There are a bunch of other assumptions, such as that directory entries will never cross one of these page boundaries and that they are allocated in multiples of 32 bytes. At some point I think the kernel modules were reading the data out of directories in 2KB units as well. There is also a very similar variant without the pointer page, which is used to send the directory data between servers and from the servers to the client.

Technically it should be possible to double the basic page size. This would double the limit, or even quadruple it if we also double the size of the pointer page. However, this breaks the way servers store directory data, so all servers would need to be rebuilt from scratch. It also affects clients: a client that is still using the old page size will, after unpacking even a smaller directory, be confronted with entries that cross the former 2KB boundary, and may get more directory data than it can handle. Also, when a directory is serialized to be sent to the client, the data is not collected via scatter-gather but copied into a single large buffer. Finally, it may affect the ability of various kernel modules to read the directory contents.

And really that is just too much trouble for a change that doesn't really fix anything.
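The limits discussed above are plain multiplication, and a small Python sketch makes the numbers explicit. The constant and function names here are made up for illustration and are not taken from the Coda sources; the sketch also ignores any per-page header overhead, so treat the results as upper bounds.

```python
# Sketch of the size arithmetic only; names are illustrative assumptions,
# not identifiers from the Coda code base.
PAGE_SIZE = 2048       # bytes per directory data page (from the text)
POINTER_SLOTS = 128    # pointers held by the pointer page (from the text)
ENTRY_UNIT = 32        # entries are allocated in multiples of 32 bytes


def dir_size_limit(page_size=PAGE_SIZE, pointer_slots=POINTER_SLOTS):
    """Maximum bytes of directory data reachable through the pointer page."""
    return page_size * pointer_slots


def entry_allocation(nbytes, unit=ENTRY_UNIT):
    """Round an entry's raw size up to the next multiple of the allocation unit."""
    return -(-nbytes // unit) * unit


current = dir_size_limit()                                        # today's layout
doubled_pages = dir_size_limit(page_size=2 * PAGE_SIZE)           # 4KB data pages
doubled_both = dir_size_limit(2 * PAGE_SIZE, 2 * POINTER_SLOTS)   # and 256 pointers

# Limits in KB for the three layouts.
print(current // 1024, doubled_pages // 1024, doubled_both // 1024)
```

Doubling only the data page size doubles the limit, and doubling the pointer page as well quadruples it, which is where the 1MB figure comes from.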
So after all that we would have a 1MB directory size limit, which would be around 16000 maildir messages. That was once enough to store one month of linux-kernel mail, but not anymore.

A real solution would not only have direct pointers to data pages, but also indirect, double indirect and possibly even triple indirect pointers. Or a btree layout. And definitely a better way to send data across the net that doesn't involve copying everything into a large memory buffer. The same goes for passing directory contents to the kernel: we should avoid having to copy/convert the directory data every time a directory is opened.

It may even be possible to just store the directory contents in an on-disk container file instead of in RVM. From some statistics and calculations I estimate that about 50% of the RVM data allocated on my servers is used to store directory contents. Of course, there are advantages to storing directories in RVM: the transactional and recovery guarantees, as well as the performance benefit of having everything in memory and accessible with just one pointer dereference.

Jan

Received on 2008-02-02 01:54:06