Coda File System

Re: The system crashed...

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 5 Dec 2003 19:10:15 -0500
On Fri, Dec 05, 2003 at 03:30:05AM +0100, Lionix wrote:
> I'm currently using rsync to transfer data from nfs to coda....
> Browsing the ML I know that it's not the best idea:
> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5122.html

Well as I said in that email...

>>>>> If everyone sends more email to codalist, it will cause more
>>>>> mailarchive rebuilds and I get more of these conflicts. Perhaps
>>>>> that will push this higher up in the list of things that really
>>>>> need to be looked at.

But in fact things have been improving steadily. I haven't really had
many problems lately.

> The change date of codaproc2.cc was 9 months ago, so it's in the 6.0.x builds.
> And I can't consider spamming the ML a "politically correct" solution.
> But I recognize I'm writing a lot these last weeks... :o)
> 
> So I started playing with the cache size parameter, keeping an eye on the
> load average....
> Resolving problems as I could... practicing and trying to improve...
> 
> For the first time I got a server-side problem during rsync.
> 
> ====SrvErr====
> No waiters, dropped incoming sftp packet
> No waiters, dropped incoming sftp packet
> [....]
> No waiters, dropped incoming sftp packet
> Assertion failed: l, file "recov_vollog.cc", line 309
> EXITING! Bye!
> ==========
> 
> OK! Some trouble recovering the volume log...
> Let me guess, and correct me if I'm wrong...
> Something like: this function returns a pointer to the volume log after
> trying to grow it (grow the index?) because it wants to insert a new
> transaction log record?

That sure looks like it; the server logs an entry for every operation.
The entry is removed for one of two reasons: either the client sends a
'COP2' message indicating success at all replicas of the volume, or we
just successfully triggered server-server resolution on the object
related to the logged operation.

The log has a fixed length; it can be made larger with 'volutil
setlogparms', but that typically won't help unless the underlying
problem is fixed.
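If you do want to enlarge the log, 'volutil setlogparms' is the knob. A
dry-run sketch follows; the 'reson 4' / 'logsize N' argument names are an
assumption from memory of the volutil manpage, and the replica id and size
are placeholders, so verify against your own server before running it for
real:

```shell
# Sketch only: enlarge the resolution log for one replica with volutil.
# Argument names ('reson', 'logsize') are an assumption -- check your
# volutil manpage. Values below are placeholders.
REPLICA=0x1000013       # placeholder replica id
LOGSIZE=8192            # example number of log entries
# Print the command instead of executing it, so it can be reviewed first:
echo volutil setlogparms "$REPLICA" reson 4 logsize "$LOGSIZE"
```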

Now one question is: is this a replicated volume, or only a single
replica? Because singly replicated volumes are never resolved, we
disable resolution logging, but perhaps your server got into the
disabled-logging code anyway.

> ======SrvErr
> Assertion failed: size == s, file "recov_vollog.cc", line 386
> EXITING! Bye!
> 
> Uhu.... the SalvageLog function... I don't understand too much, but I see
> it browsing the volume log to check whether space could be freed, otherwise
> trying to increase the log size, no?

There are more (or fewer) log records than expected; I'm not entirely
sure what is going on here.

> 00:43:14 SalvageIndex:  Vnode 0x2ae has no inodeNumber
> 00:43:14 SalvageIndex: Creating an empty object for it

Hmm, the server was asked to create files, but the client never actually
stored data into them. Possibly data that has not yet been reintegrated.

> 00:43:14 Entering DCC(0x1000013)
> 00:43:22 DCC: Salvaging Logs for volume 0x1000013
> 
> Reading this, it seems I wasn't syncing only one volume....
> 
> Another restart failed with the same SrvErr message... and the same
> story for the SrvLog.
> Should I keep restarting until it succeeds in starting? :-?

You could create an entry for this volume in /vice/srv/skipsalvage, or
was it /vice/vol/skipsalvage... In any case, the content would be

1
0x1000013

The 1 indicates that one volume id will follow, and then the volumeid
that should be ignored during salvage. This should bring your server up
with everything but this one problematic volume.
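That two-line format can be produced with a quick shell sketch. The file is
written to the current directory purely for illustration; on a real server it
would go at whichever of the /vice paths mentioned above your installation
uses:

```shell
# Sketch: build the skipsalvage file the salvager reads at startup.
# Writing to a local file here; on a real server this would be the
# /vice/srv/skipsalvage (or /vice/vol/skipsalvage) path.
SKIPFILE=./skipsalvage
{
  echo 1           # count of volume ids that follow
  echo 0x1000013   # replica id the salvager should skip
} > "$SKIPFILE"
cat "$SKIPFILE"
```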

> I then went to the other server, restarting it in the hope of putting it
> in the same state as its "brother", but it's still refusing to start,
> with no errors... It freezes at:
> 02:55:33 Main thread just did a RVM_SET_THREAD_DATA
> 02:55:33 Setting Rvm Truncate threshhold to 5.

Hmm, is an old server maybe still running or something? killall -9
codasrv and retry. If you manage to get this one up and running without
a problem, and the 0x1000013 volume is part of a replicated volume, you
can get everything back up and running without too much hassle.

You would have to delete the corrupt replica (0x1000013), and then
recreate it as an empty volume. If everything is done right,
server-server resolution will simply copy everything back from the
surviving replica.

You need to know a few things: the replicated volume id for 0x1000013,
something like 0x7f0000XX, and the unique volume name, which depends a
bit on what the other replica happens to be named. Let's say your
replicated volume is 'volume', then one replica will be 'volume.0' and
the other 'volume.1', so you have to check (volutil info?) what the name
of the surviving replica is. Finally, you need to know/decide where this
volume should be stored (/vicepa).

With all of this info we can do:

    volutil purge 0x1000013 volume.X
    volutil create_rep /vicepa volume.X 0x7f0000XX 0x1000013

All of this info might still be listed in one of the files in
/vice/vol/remote/XXXX on the SCM. Finally, ls -lR on the client in the
related volume should trigger runt-resolution and bring all the data
back.
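Put together, the recovery sequence can be sketched as a dry-run script.
Every value is a placeholder to be replaced with what you recover from
'volutil info' or /vice/vol/remote (0x7f0000XX is deliberately left
unresolved, as above):

```shell
# Dry-run sketch of the purge-and-recreate procedure; nothing is
# executed, the commands are only printed for review. All values are
# placeholders to substitute with your real ids, name, and partition.
BAD_REPLICA=0x1000013      # corrupt underlying replica id
REP_VOLID=0x7f0000XX       # replicated (group) volume id -- look it up
REPLICA_NAME=volume.1      # whichever name the surviving replica does NOT use
PARTITION=/vicepa          # server partition to hold the recreated replica

echo volutil purge "$BAD_REPLICA" "$REPLICA_NAME"
echo volutil create_rep "$PARTITION" "$REPLICA_NAME" "$REP_VOLID" "$BAD_REPLICA"
```

After the recreated replica exists, the 'ls -lR' trick mentioned above
triggers runt-resolution and repopulates it from the surviving replica.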

Jan
Received on 2003-12-05 19:12:02