Thin UTF8 string wrapper

Fri Dec 6 17:42:07 UTC 2019

On Friday, 6 December 2019 at 16:48:21 UTC, Joseph Rushton 
Wakeling wrote:
> Hello folks,
>
> I have a use-case that involves wanting to create a thin struct 
> wrapper of underlying string data (the idea is to have a type 
> that guarantees that the string has certain desirable 
> properties).
>
> The string is required to be valid UTF-8.  The question is what 
> the most useful API is to expose from the wrapper: a sliceable 
> random-access range?  A getter plus `alias this` to just treat 
> it like a normal string from the reader's point of view?
>
> One factor that I'm not sure how to address w.r.t. a full range 
> API is how to handle iterating over elements: presumably they 
> should be iterated over as `dchar`, but how to implement a 
> `front` given that `std.encoding` gives no way to decode the 
> initial element of the string that doesn't also pop it off the 
> front?
>
> I'm also slightly disturbed to see that 
> `std.encoding.codePoints` requires `immutable(char)[]` input: 
> surely it should operate on any range of `char`?
>
> I'm inclining towards the "getter + `alias this`" approach, but 
> I thought I'd throw the problem out here to see if anyone has 
> any good experience and/or advice.
>
> Thanks in advance for any thoughts!
>
> All the best,
>
>      -- Joe

Good questions. I don't have answers to them all but I hope this 
information is helpful.

I use wrapper structs to represent properties in this way as 
well.  For example, my "mar" library has the SentinelPtr and 
SentinelArray types, which guarantee that the underlying pointer 
and/or array is terminated by some value (i.e. like a 
null-terminated C string).

If I'm creating and using these wrapper types inside a 
self-contained program then I don't really care about API 
compatibility, so I would use a simple, powerful mechanism like 
"alias this".  For libraries where the API boundary is important, 
I implement the most limited API I can.  The reason is that it 
lets you see every possible interaction with the type.  
This way, when you need to change the API, you know all the 
existing ways it can be used and can iterate on the API 
design accordingly.  This is the case for SentinelPtr and 
SentinelArray: I only implement the operations I 
know are being used, and I made this easy by creating a simple 
module I call "wrap.d" 
(https://github.com/dragon-lang/mar/blob/master/src/mar/wrap.d).

If you have a struct that wraps a string and guarantees it's UTF8 
encoded, wrap.d lets you declare that it's a wrapper type and 
allows you to mixin the operations you want to expose like this:

struct Utf8String
{
     private string str;
     import mar.wrap;

     // this verifies the size of the wrapper struct and the
     // underlying field are the same, and creates the
     // wrappedValueRef method that the other wrapper mixins use
     // to access the underlying wrapped value
     mixin WrapperFor!"str";

     // Now you can mixin different operations, for example
     mixin WrapOpCast;
     mixin WrapOpIndex;
     mixin WrapOpSlice;
}
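
For comparison, a hand-written wrapper with a deliberately small API might look roughly like the following. This is only a sketch and does not use wrap.d: the name CheckedUtf8 and its members are hypothetical, and it assumes that validating once at construction with std.utf.validate from Phobos is acceptable for your use case.

```d
import std.utf : validate, UTFException;

// Hypothetical wrapper: the only way to build one is through the
// constructor, which rejects malformed UTF-8.
struct CheckedUtf8
{
    private string str;

    this(string s)
    {
        validate(s); // throws UTFException on invalid UTF-8
        str = s;
    }

    // Read-only access to the underlying string.
    string get() const { return str; }

    // Length in UTF-8 code units, not code points.
    size_t length() const { return str.length; }

    // Index by code unit.
    char opIndex(size_t i) const { return str[i]; }
}

void main()
{
    import std.exception : assertThrown;

    auto s = CheckedUtf8("héllo");
    assert(s.get == "héllo");
    assert(s.length == 6); // 'é' takes two code units
    assert(s[0] == 'h');

    // A lone 0xFF byte is never valid UTF-8.
    immutable(ubyte)[] bad = [0xFF];
    assertThrown!UTFException(CheckedUtf8(cast(string) bad));
}
```

Because only get, length and opIndex are exposed, those are the only ways other code can interact with the type, which is the point of keeping the API limited.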

On the topic of immutable(char)[] vs const(char)[]: if a function 
takes const data, I take it to mean that the function won't 
change the data.  If it takes immutable data, I take it to mean 
that the function won't change it AND the caller must ensure the 
data won't change while the function has it.  However, in practice, 
functions that require immutable data still declare their 
parameters as "const" instead of "immutable".  I think this is 
because declaring them as immutable would require extra boilerplate 
all over your code to cast data to immutable all the time.  So most 
functions end up using const even though they require immutable.
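
As a rough illustration of the difference at the call site (the function names countSpaces and keep are made up for this example):

```d
// Takes const data: callable with mutable, const or immutable
// slices. The function only promises not to modify the data.
size_t countSpaces(const(char)[] s)
{
    size_t n;
    foreach (c; s)
        if (c == ' ')
            ++n;
    return n;
}

// Takes immutable data (string is immutable(char)[]): the caller
// must supply data that will never change.
string keep(string s)
{
    return s;
}

void main()
{
    char[] buf = "a b c".dup; // mutable buffer
    string lit = "a b c";     // immutable literal

    assert(countSpaces(buf) == 2);
    assert(countSpaces(lit) == 2);

    // keep(buf); // does not compile: char[] is not implicitly a string
    assert(keep(buf.idup) == "a b c"); // needs an explicit immutable copy
    assert(keep(lit) is lit);          // same slice, no copy
}
```

The .idup (or a cast) needed for every mutable buffer is the kind of boilerplate I mean.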