Coda File System

Re: crash in rvmlib_free (not necessarily) during repair

From: Piotr Isajew <pki_at_ex.com.pl>
Date: Mon, 8 Jul 2013 13:20:00 +0200
After the last problem I reported, I did some more tests with a
two-server Coda cell.

My goal was to create a replicated volume and populate it with
some files (mailboxes in maildir format, and some documentation
I keep in HTML and PDF for offline access).

The SCM machine runs Coda 6.9.5. I started with the same version
on the non-SCM, but recently upgraded it to the latest git sources
to see if that changes anything. Both servers communicate over
WiFi, so it's possible that they lose connectivity for a while.

I have no problems operating on small amounts of data.
Manipulating a few files and a few megabytes works well.

For now, however, I'm pretty sure that I just cannot do any
"massive" copy operation against those servers, as a crash is
just a matter of time.

I start by creating a new replicated volume on the SCM:

createvol_rep docs otwieracz.localnet/vicepa kontrabanda.localnet/vicepa 

On the client machine I can mount this volume without problems.

I want to copy the manuals directory into it:

cp -r manuals /coda/coda.localnet/docs/

The manuals directory is not really big:

$ du -sh manuals
629M    manuals

$ find manuals | wc -l
31216

(I'm sure I don't exceed the 4k files per directory limit: there
are no more than 100 files in any directory.)
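
The per-directory figure above can be checked rather than
estimated. The following is only a sketch: the function name
max_dir_entries is made up for illustration, it assumes a find
that supports -mindepth/-maxdepth (GNU and BSD find do), and it
assumes no file names contain newlines. It counts the direct
children of every directory and prints the largest count together
with the directory it belongs to.

```shell
# max_dir_entries DIR (hypothetical helper, not part of Coda):
# print "<count> <path>" for the directory under DIR that has the
# most direct entries (files and subdirectories).
max_dir_entries() {
    find "$1" -type d | while IFS= read -r dir; do
        # Count direct children only, not the whole subtree.
        count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l)
        printf '%d %s\n' "$count" "$dir"
    done | sort -rn | head -n 1
}
```

Running "max_dir_entries manuals" should then report a count well
below the limit if the statement above holds.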

At some point during the copy operation both servers crash:

SCM SrvErr:

No waiters, dropped incoming sftp packet
No waiters, dropped incoming sftp packet
(many lines like that)
No waiters, dropped incoming sftp packet
RVMLIB_ASSERT: Error in rvmlib_free

Assertion failed: 0, file "rvmlib.c", line 258
***BackTrace***
/usr/sbin/codasrv(coda_assert+0x5f)[0x4a4bff]
/usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2ab1]
/usr/sbin/codasrv(_ZN5recle8FreeVarlEv+0x1aa)[0x4726da]
/usr/sbin/codasrv(_Z11TruncateLogP6VolumeP5VnodeP7vmindex+0xf9)[0x471729]
/usr/sbin/codasrv(_Z12InternalCOP2iP11ViceStoreIdP17ViceVersionVector+0x5fa)[0x4327ba]
/usr/sbin/codasrv(FS_ViceCOP2+0xad)[0x432a5d]
/usr/sbin/codasrv(srv_ExecuteRequest+0xf52)[0x454c62]
/usr/sbin/codasrv[0x41f8b4]
/usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7fa21e1bbfe2]
/lib64/libc.so.6(+0x36aa0)[0x7fa21d91caa0]
/lib64/libc.so.6(sigsuspend+0x16)[0x7fa21d91cd76]
/usr/lib64/../lib64/liblwp.so.2(lwp_makecontext+0x10e)[0x7fa21e1bc13e]
/lib64/libc.so.6(fflush+0x6b)[0x7fa21d95381b]
/lib64/libc.so.6(_longjmp+0x2b)[0x7fa21d91c8ab]
/usr/lib64/../lib64/liblwp.so.2(+0x5f72)[0x7fa21e1bbf72]
/usr/lib64/../lib64/liblwp.so.2(lwp_swapcontext+0x22)[0x7fa21e1bc022]
/usr/lib64/../lib64/liblwp.so.2(LWP_DispatchProcess+0x3bd)[0x7fa21e1baf7d]
/usr/lib64/../lib64/liblwp.so.2(LWP_QWait+0x57)[0x7fa21e1bb987]
/usr/sbin/codasrv(main+0xdc6)[0x41c286]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa21d907a95]
/usr/sbin/codasrv[0x41cee9]
EXITING! Bye!


non-SCM SrvErr:

No waiters, dropped incoming sftp packet
No waiters, dropped incoming sftp packet
(repeated many times)
No waiters, dropped incoming sftp packet
RVMLIB_ASSERT: Error in rvmlib_free

Assertion failed: 0, file "rvmlib.c", line 258
***BackTrace***
/usr/sbin/codasrv(coda_assert+0x5f)[0x4a4bef]
/usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2aa1]
/usr/sbin/codasrv(_ZN5recle8FreeVarlEv+0x1aa)[0x4726da]
/usr/sbin/codasrv(_Z11TruncateLogP6VolumeP5VnodeP7vmindex+0xf9)[0x471729]
/usr/sbin/codasrv(_Z12InternalCOP2iP11ViceStoreIdP17ViceVersionVector+0x5fa)[0x4327ba]
/usr/sbin/codasrv(FS_ViceCOP2+0xad)[0x432a5d]
/usr/sbin/codasrv(srv_ExecuteRequest+0xf52)[0x454c62]
/usr/sbin/codasrv[0x41f8b4]
/usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7f4c1dc8ffe2]
/lib64/libc.so.6(+0x36aa0)[0x7f4c1d3f0aa0]
/lib64/libc.so.6(sigsuspend+0x16)[0x7f4c1d3f0d76]
/usr/lib64/../lib64/liblwp.so.2(lwp_makecontext+0x10e)[0x7f4c1dc9013e]
/lib64/libc.so.6(fflush+0x6b)[0x7f4c1d42781b]
/lib64/libc.so.6(_longjmp+0x2b)[0x7f4c1d3f08ab]
/usr/lib64/../lib64/liblwp.so.2(+0x5f72)[0x7f4c1dc8ff72]
/usr/lib64/../lib64/liblwp.so.2(lwp_swapcontext+0x22)[0x7f4c1dc90022]
/usr/lib64/../lib64/liblwp.so.2(LWP_DispatchProcess+0x3bd)[0x7f4c1dc8ef7d]
/usr/lib64/../lib64/liblwp.so.2(LWP_QWait+0x57)[0x7f4c1dc8f987]
/usr/sbin/codasrv(main+0xdc6)[0x41c286]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4c1d3dba95]
/usr/sbin/codasrv[0x41cee9]
EXITING! Bye!


Of course I can restart both servers and resume the copy
operation (which continues into the local cache anyway), but this
leads to server/server conflicts sooner or later.

I have to populate that volume in some way. Maybe I should just
shut down the non-SCM and copy to the SCM only? Or maybe running a
client directly on the SCM, just for that copy operation, would be
a better option?


Piotr
Received on 2013-07-08 07:21:06