Hi,

I've encountered a problem I'm not sure how to deal with. I have a two-server Coda 6.9.5 cell on Linux (kernel 3.9.8). Recently I was copying a relatively large amount of data (a few gigabytes of files) into Coda, and during the copy the SCM dropped off the network. Copying continued to the secondary server and failed some time later (I don't remember why, probably a venus crash).

After restarting venus, doing purgeml and restoring connectivity, I could reach both servers in the cell, but the SCM's replica was not up to date: it was missing some files that had been copied to the non-SCM replica. This triggered several server/server conflicts on the directories involved. I was surprised that this kind of conflict is not resolved automatically, so I tried to use repair on the directories causing problems. This appears to be the road to hell.

I'm able to beginrepair, and comparedirs generates a reasonable fix:

replica 192.168.9.6 02000001
    removed java
replica 192.168.10.1 01000002

But if I invoke dorepair or removeinc, the non-SCM server crashes. repair just reports an error due to lost connectivity with the non-SCM.

The SrvErr log on the non-SCM (192.168.9.6) shows the following:

XMIT: Sent long packet (subsys 5893, opcode 20, length 2236)
XMIT: Sent long packet (subsys 5893, opcode 20, length 2236)
No waiters, dropped incoming sftp packet
XMIT: Sent long packet (subsys 5893, opcode -8, length 2224)
repair_getdfile: starting : Success
repair_getdfile: file opened: Success
repair_getdfile: list created: Success
repair_getdfile: replicas parsed: Success
repair_getdfile: replica processed: Success
repair_getdfile: completed!: Success
RVMLIB_ASSERT: Error in rvmlib_free
Assertion failed: 0, file "rvmlib.c", line 258
***BackTrace***
/usr/sbin/codasrv(coda_assert+0x5f)[0x4a4bff]
/usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2ab1]
/usr/sbin/codasrv(_ZN5recle8FreeVarlEv+0x1aa)[0x4726da]
/usr/sbin/codasrv(_Z8PurgeLogP9rec_dlistP6VolumeP7vmindex+0x86)[0x4715c6]
/usr/sbin/codasrv(_Z10PutObjectsiP6VolumeiP5dlistiii+0x9c1)[0x4263d1]
/usr/sbin/codasrv(FS_ViceRepair+0x105)[0x430455]
/usr/sbin/codasrv[0x449aac]
/usr/sbin/codasrv(srv_ExecuteRequest+0x125c)[0x454f6c]
/usr/sbin/codasrv[0x41f8b4]
/usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7f8db5446fe2]
/lib64/libc.so.6(+0x36aa0)[0x7f8db4ba7aa0]
/lib64/libc.so.6(sigsuspend+0x16)[0x7f8db4ba7d76]
/usr/lib64/../lib64/liblwp.so.2(lwp_makecontext+0x10e)[0x7f8db544713e]
/lib64/libc.so.6(fflush+0x6b)[0x7f8db4bde81b]
/lib64/libc.so.6(_longjmp+0x2b)[0x7f8db4ba78ab]
/usr/lib64/../lib64/liblwp.so.2(+0x5f72)[0x7f8db5446f72]
/usr/lib64/../lib64/liblwp.so.2(lwp_swapcontext+0x22)[0x7f8db5447022]
/usr/lib64/../lib64/liblwp.so.2(LWP_DispatchProcess+0x3bd)[0x7f8db5445f7d]
/usr/lib64/../lib64/liblwp.so.2(LWP_QWait+0x57)[0x7f8db5446987]
/usr/sbin/codasrv(main+0xdc6)[0x41c286]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8db4b92a95]
/usr/sbin/codasrv[0x41cee9]
EXITING! Bye!

After restarting everything I still have the conflict in the same node or its parent node, depending on the situation. Is there any hack that would allow me to recover from this situation?

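For anyone reading the trace, the call chain inside codasrv is easier to follow once the frames are pulled apart. The following is a minimal Python sketch, not part of Coda: the names `parse_frames` and `server_symbols` are made up for illustration, and the regular expression assumes frames in the `binary(symbol+0xoffset)[0xaddress]` shape seen in the log above.

```python
import re

# Matches backtrace frames in the shape seen in SrvErr, e.g.
#   /usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2ab1]
#   /usr/sbin/codasrv[0x449aac]
#   /usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7f8db5446fe2]
FRAME_RE = re.compile(
    r"(?P<binary>/\S+?)"
    r"(?:\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-f]+)\))?"
    r"\[(?P<address>0x[0-9a-f]+)\]"
)


def parse_frames(text):
    """Return one dict per backtrace frame found in text."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "binary": m.group("binary"),
            # Frames without a resolved symbol get None here.
            "symbol": m.group("symbol") or None,
            "offset": m.group("offset"),
            "address": m.group("address"),
        })
    return frames


def server_symbols(text, binary_name="codasrv"):
    """Return the resolved symbol names that belong to the given binary."""
    return [
        f["symbol"]
        for f in parse_frames(text)
        if f["binary"].endswith("/" + binary_name) and f["symbol"]
    ]


# A few frames copied from the log above.
SAMPLE = """\
/usr/sbin/codasrv(coda_assert+0x5f)[0x4a4bff]
/usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2ab1]
/usr/sbin/codasrv(_ZN5recle8FreeVarlEv+0x1aa)[0x4726da]
/usr/sbin/codasrv[0x449aac]
/usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7f8db5446fe2]
"""

if __name__ == "__main__":
    for name in server_symbols(SAMPLE):
        print(name)
```

The mangled C++ names it prints can then be piped through `c++filt`; for example `_ZN5recle8FreeVarlEv` demangles to `recle::FreeVarl()`. Read that way, the assertion in `rvmlib_free` is reached from `FS_ViceRepair` via `PutObjects`, `PurgeLog` and `recle::FreeVarl`.
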
Bests,
Piotr

Received on 2013-07-04 01:40:18