converting D's string to use with C API with unicode

tsbockman thomas.bockman at gmail.com
Sat Dec 5 20:45:40 UTC 2020


On Saturday, 5 December 2020 at 19:51:14 UTC, Jack wrote:
>>version(Windows) extern(C) export
>>struct C_ProcessResult
>>{
>>	wchar*[] output;

In D, `T[]` (where T is some element type, `wchar*` in this case) 
is a slice structure that bundles a length and a pointer 
together. It is NOT the same thing as `T[]` in C. You will get 
memory corruption if you try to use `T[]` directly when 
interfacing with C.

Instead, you must use a bare pointer, plus a separate length/size 
if the C API accepts one. I'm guessing that 
`C_ProcessResult.output` should have type `wchar**`, but I can't 
say for sure without seeing the Windows API documentation or C 
header file in which the C structure is detailed.
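
To make the layout difference concrete, here is a minimal sketch.
The `outputLen` field and the `makeResult` helper are assumptions
for illustration only; the real field list has to come from the C
header.

```d
// Hypothetical C-compatible layout: a bare pointer plus an explicit
// element count. `outputLen` is an assumed field, not taken from any
// real header.
extern(C) struct C_ProcessResult
{
    wchar** output;   // points at the first element of an array of wchar*
    size_t outputLen; // number of elements, passed separately
    bool ok;
}

// A D slice is a (length, pointer) pair, so it occupies two machine
// words where C expects a single pointer.
static assert((wchar*[]).sizeof == 2 * (wchar**).sizeof);

// Hypothetical helper: split a D slice into the two pieces C needs.
C_ProcessResult makeResult(wchar*[] items, bool ok)
{
    C_ProcessResult r;
    r.output = items.ptr;       // bare pointer for the C side
    r.outputLen = items.length; // length travels as its own field
    r.ok = ok;
    return r;
}
```

The `static assert` is the reason a slice field cannot stand in
for a pointer field: every member after it would sit at the wrong
offset from the C side's point of view.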

>>	bool ok;
>>}

>>struct ProcessResult
>>{
>>	string[] output;
>>	bool ok;
>>
>>	C_ProcessResult toCResult()
>>	{
>>		auto r = C_ProcessResult();
>>		r.ok = this.ok; // just copy, no conversion needed
>>		foreach(s; this.output)
>>			r.output ~= cast(wchar*)s.ptr;

This is incorrect, and will corrupt memory. `cast(wchar*)` is a 
reinterpret cast, and an invalid one at that. It says, "just take 
my word for it, the data at the address stored in `s.ptr` is 
UTF-16 encoded." But that's not true: the data is UTF-8 encoded, 
because `s` is a `string`. The cast converts nothing, so the text 
will be garbled. A D `string` is also not guaranteed to be 
zero-terminated, so if the C side expects a terminator it will 
likely read past the end of the buffer.

What you need to do instead is allocate a separate `wchar[]` 
buffer for each string, and then use the UTF-8 to UTF-16 
conversion algorithm to fill that buffer based on the `char` 
elements in `s`.

The conversion algorithm is non-trivial, but the standard library 
can do it for you: see the `std.encoding` module, or `toUTF16` 
and `toUTF16z` in `std.utf`.
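
As a rough sketch of that conversion step, the following uses
`std.utf.toUTF16z` rather than `std.encoding`, since it also
appends the zero terminator that most C APIs expect. The helper
name `toWideStrings` is made up for this example.

```d
import std.utf : toUTF16z;

// Hypothetical helper: convert each UTF-8 `string` into a freshly
// allocated, zero-terminated UTF-16 buffer and collect the pointers.
const(wchar)*[] toWideStrings(string[] lines)
{
    const(wchar)*[] result;
    result.length = lines.length;
    foreach (i, s; lines)
        result[i] = toUTF16z(s); // real transcoding, not a cast
    return result;
}

unittest
{
    // "h\u00E9llo" is 6 UTF-8 code units but 5 UTF-16 code units.
    auto wide = toWideStrings(["h\u00E9llo", "\U0001D11E"]);
    assert(wide[0][0 .. 5] == "h\u00E9llo"w);
    assert(wide[0][5] == 0);

    // U+1D11E lies outside the BMP, so it becomes a surrogate pair.
    assert(wide[1][0 .. 2] == "\U0001D11E"w);
    assert(wide[1][2] == 0);
}
```

The buffers returned here are allocated by D's GC, so the
lifetime caveats discussed in this post apply to them.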

>>		return r;
>>	}
>>}
>

Note also that when exchanging heap-allocated data (such as most 
strings or arrays) with a C API, you must figure out who is 
responsible for de-allocating the memory at the proper time - and 
NOT BEFORE. If you allocate memory with D's GC (using `new` or 
the slice concatenation operators `~` and `~=`), watch out that 
you keep a reference to it alive on the D side until after the C 
API is completely done with it. Otherwise, D's GC may not realize 
it's still in use, and may de-allocate it early, causing memory 
corruption in a way that is very difficult to debug.
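
One possible way to keep such memory alive is to register it as a
GC root while the C side holds the pointer. This is only a
sketch: `pinForC` and `unpinFromC` are hypothetical names, and
simply keeping the slice in a long-lived D variable is often
enough.

```d
import core.memory : GC;
import std.utf : toUTF16z;

// Hypothetical helper: build a GC-allocated array of zero-terminated
// UTF-16 strings and register it as a root, so the array and the
// strings reachable through it stay alive even if no D variable
// refers to them any more.
const(wchar)** pinForC(string[] lines)
{
    const(wchar)*[] ptrs;
    ptrs.length = lines.length;
    foreach (i, s; lines)
        ptrs[i] = toUTF16z(s);
    if (ptrs.length)
        GC.addRoot(ptrs.ptr);
    return ptrs.ptr; // null when `lines` is empty
}

// Call this only once the C API is completely done with the data.
void unpinFromC(const(wchar)** p)
{
    if (p !is null)
        GC.removeRoot(p);
}
```

Every `pinForC` call needs a matching `unpinFromC`; a forgotten
one leaks the memory instead of corrupting it, which is the safer
failure of the two.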

