(Illustration by Gaich Muramatsu)
Of course the cleanest solution would be to implement some sort of operation logging. The fundamental problem is undo -- undoing the data changes can clobber the heap if the heap has been manipulated between the time the object was freed and the time of the abort. If you logged an operation instead, you could undo it and still maintain the integrity of the heap. (I think GC would have the same problem, but I haven't thought that through entirely). Another quick-and-dirty hack would be to check for the double-free during fake_free, and again at do_free. I can't remember if I did that or not. This would catch a majority of bugs, and miss only those in which the first free was done between the time of the fake free and the end of the transaction. Sorry for the inconsistency between the man page and the code. In these situations its probably better to trust the code, as I live by Old Ben's famous words: "use the force, read the source":) david. ps, we (at least used to) have the practice of keeping very large RVM logs on the server in order to have a larger update history. This allowed us to find these sorts of bugs much more easily. I think Josh or PK hacked some support for this form of debugging into rvmutl. At 10:39 AM 4/3/98 -0500, you wrote: > >Your idea did occur to me -- but as you'll see below I was mainly >changing infrastructure, not (yet) rds. > >We use magic constants in guards before and after the regions >allocated. I suppose that rds_fake_free could change the constant >from "malloced" to "fakefreed" and then rds_do_free would change it >again to "freed". In that way we could stop right at the offending >point. > >First of all, there seems to be a misunderstanding either in the code >or in the man page for rds_fake_free. The man page says that >rds_do_free is to be called just BEFORE rvm_end_transaction. I think >this is not safe and must be called AFTER rvm_end_transaction. >Calling AFTER is safe -- only the risk of memory leak exists. By >calling rds_do_free first there is a risk of reallocating memory and >clobbering which is much more serious. So I think the man page for >rds_fake_free is wrong and the code is right. > >One would have to think _extremely_ carefully about the following >problem: which transaction (if any) is this change of the guard from >"malloced" to "fakefreed" part of? What happens if the system crashes >between "rvm_end_transaction" and "rds_do_free" -- should we end with >"malloced" or "fakefreed" in the constants. Currently this situation >can lead to a memory leak because we keep malloced. In any case it >cannot be put to "freed" because another thread would then be able to >reallocate and clobber. The memory leadk is very unlikely since we do >not flush rvm_end_transaction, we only flush at rds_do_free even if >the transaction is flush. (This happens in rvmlib_end_transaction -- >the version you have is pretty unintelligible, but the new one is >pretty clean.) > >It might be best to make the "fakefreed" constant part of the main >transaction and additionally weave the fake-freed area into a (new) >"fakefreed_list" -- rvmlib_rec_free should do this. Then the >following holds: > >- main transaction: changes guard from malloced to fakefreed. Moves >area onto the fakefreed list. rds_malloc cannot reallocate. >- rds_do_free transaction changes guard from fakefreed to free. Moves >area from the fakefreed list to the free list. > >The crashing behaviour is then as follows: >- crash before rvm_end_transaction: we end up with the state before >begin transaction. Good. >- crash after rvm_end_transaction and before rds_do_free: the >transaction went through and the faked blocks are visible on the >fakefreed_list. We can garbage collect them on the next startup of >rds. We have eliminated the memory leak. > >In fact most of the work I needed to put in was to rewrite the >rvmlib.{c,h} files to make it possible to step into the routines with >gdb; I also removed the non-local gotos used for aborting >transactions. Then the printing was a mess (it printed addresses that >were totally wrong, then only wrong by sizes of guards etc. and hard >to understand). Finally the assertions were wrong since they >removed a stack frame from the stack, and sadly they always removed >the most important one. So basically I just went straight ahead and >didn't modify the rds_fake_free routine. > >Unfortunately, introducing an fakefreed_list might mean a >reinitialization of rvm which is not pleasant... > >Peter > >J.A. Harkes writes: > > > > "Peter J. Braam" wrote: > > > > > > This is a message about improvements I have made to the RVM handling, with > > > a view to increasing our debugging capability dramatically. > > > > > <snip> > > > > Isn't the same checking possible by using magic-numbers. A the state of > > a segement (allocated/freed) is marked using some `magic' value. A > > subsequent begin_transaction or free operation can then test whether the > > segment is already `marked' free, and complain loudly when it is. > > > > Jan > > >Received on 1998-04-03 13:31:36