Either I'm confused or the gc is

Fri Oct 23 19:14:34 UTC 2020

On 10/23/20 2:31 PM, donallen wrote:
> On Friday, 23 October 2020 at 15:49:49 UTC, Steven Schveighoffer wrote:
>> On 10/23/20 8:47 AM, donallen wrote:

>> This all looks ok. The only think I can possibly consider is that you 
>> are using bind_text a lot, to presumably bind sqlite prepared 
>> statements to D strings.
>>
>> This means it's *possible* that the prepared statements have pointers 
>> to the strings, but nothing else does (hard to work it out from just 
>> reading the code, I can't tell when things go out of scope easily, and 
>> your prepared statements clearly exist outside these functions). If a 
>> prepared statement is a C-malloc'd item, then it's not scanned by the 
>> GC. This means that what possibly might be happening is that you bind 
>> a D string to a text parameter, the D string goes out of scope, the GC 
>> collects it and allocates it elsewhere, and then sqlite tries to use 
>> it to send some requests.
>>
>> To test this theory, maybe try a version of the code that when calling 
>> bind_text, you append the string itself to a GC allocated array stored 
>> in a global. Might not prove anything, but if it's fixing the problem, 
>> maybe that's where the GC is collecting things that it shouldn't.
> 
> It's a good theory, but I don't think it explains this.
> 
> What you are hypothesizing is that there's an instance where the 
> lifetime of a D string that is used in a sqlite bind_text is shorter 
> than the lifetime of the sqlite binding. I don't think that's the case.
> 
> The calls to bind_text all immediately precede calls to run_dm_stmt, 
> one_row, etc. Those calls either run an insert or update, or in the case 
> of one_row, a select. When those
> calls return, the statement has been reset, which means, among other 
> things, that the
> bindings have been released, ending the binding's life.

OK, I didn't know how that worked.

> 
> The other case to look at is the use of next_row_available_p in a loop. 
> There, bindings are established before entering the loop, which 
> processes multiple rows coming back from a select. When 
> next_row_available_p runs out of rows, it calls the cleanup function, 
> usually reset_stmt (which just calls sqlite3_reset; I found that I could 
> not pass an extern (c) function to next_row_available_p, so used this 
> workaround; maybe there is a way to do this, but I just used this simple 
> expedient).

extern(C) function types are different than extern(D) function types. So 
it depends on what the parameter type is.

> So at that point the bindings are cleared and 
> next_row_available_p returns false, ending the loop and life of the 
> bindings. Unless I've missed something, I think you will find that all 
> the strings passed to bind_text (which get massaged by toStringz, so 
> they look like C strings) live beyond the point where the statements 
> they are bound to are reset.
> 
> Most of the calls to bind_text in the tree walker bind fields in 
> account, which is an argument to the tree walker and thus lives until it 
> returns. The others are to local variables that are declared before they 
> are used in bind-text calls and the statement gets reset before the 
> scope ends.
> 
> Yes, your theory could be tested using some variant of what you 
> describe, in case my there's a flaw in my hand-waving above.
> 
> Some food for thought:
> 
> If I insert a writeln at the very beginning of walk_account_tree, 
> printing, say, the name and guid of the account, the problem goes away 
> -- the program behaves correctly.

as with many memory corruption issues, introducing new code paths can 
change where the corruption occurs, or change that the corruption occurs 
at all.

You might want to try instead using printf, to avoid affecting anything 
in the D world. In the past to try and find corruption issues like this 
that are finnicky, I've also pre-allocated data to write to, and don't 
touch it for the duration of the problematic area, and then print it after.

> 
> If I change
> children[i].path = format("%s:%s", account.path, children[i].name);
> to
> children[i].path = format("%s:%s", account.path, children[i].name).dup;
> 
> in the loop that fills in the child Account structs in children, same 
> thing -- correct behavior. The dup of the format should not be 
> necessary, since format returns a new GC-allocated string.

Hm... that's interesting. You are correct that it allocates a new GC 
block. The dup is (should be) completely unnecessary. I don't think 
Appender (which is what format uses) has GC memory corruption issues, 
because it's used literally everywhere, it should be rock-solid. Always 
possible though.

> I suspect all of these odd "cures" have the effect they do because they 
> change the allocation pattern of this section of the code. Without any 
> of them, it appears that an allocation is triggering a GC at exactly the 
> wrong moment. What is not understood is what is wrong with the moment.

Yeah, I agree. For sure most of these constructs you are using are very 
well tested. There are many programs which use them, for many years, 
which still points me back at the sqlite library wrapper/binding. Not 
impossible that format and the GC have a bug, but it seems unlikely. Any 
time you have GC memory being prematurely collected, it smells like 
something was stored where it shouldn't be. The other major cause of 
this is generally alignment. But it doesn't seem like you are specifying 
alignment anywhere.

In any case, I understand if you don't want to go through any more 
exercises, but if you do, it might help find something that is 
super-obscure. Thanks for taking the time to investigate as much as you 
have!

-Steve