Range of chars (narrow string ranges)

Tue Apr 28 09:48:46 PDT 2015

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
> Would it be much work to have example code or even an 
> experimental module that gets rid of auto-decoding, so we could 
> see what would be affected in general and how actual code we 
> have would be affected by it?
>
> The topic keeps coming up again and again, and while I'm in 
> favor of anything that enhances performance, I'm afraid of 
> having to refactor large chunks of my code. However, this fear 
> may be unfounded, but I would need some examples to visualize 
> the problem.

Honestly, most code won't care. If we just switched out all of 
the auto-decoding right now, pretty much anything using only 
ASCII would just work, and most anything that's trying to 
manipulate ASCII characters in a Unicode string will just work, 
whereas code that's specifically manipulating Unicode characters 
might have problems (e.g. comparing front with a dchar will no 
longer have the same result, since front would just be the first 
code unit rather than necessarily the first code point). Since 
most Phobos range-based functions which operate on strings are 
special-cased on strings already, many of them would continue to 
just work (e.g. find returns the same range type as what's passed 
to it even if it's given a string, so it might just work with the 
change, or it might need to be tweaked slightly), and those that 
wouldn't would then generally either need to call encode on an 
argument to make it match the string type in cases where string 
types mix (e.g. 
"foo".find("fo"d) would need to call encode on "fo"d to make it a 
string for comparison), or the caller would need to use 
std.utf.byDchar or std.uni.byGrapheme to operate on code points 
or graphemes rather than code units.
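
A minimal sketch of the difference, runnable against current Phobos 
(which still autodecodes). It uses std.utf.byCodeUnit to approximate 
what front would return without autodecoding; that stand-in is an 
assumption of this example, not a description of how the change would 
be implemented. For the mixed-string-type case it transcodes the whole 
needle with std.conv.to rather than calling encode per character, which 
is just one way to get both sides into the same encoding.

```d
import std.algorithm.searching : find;
import std.conv : to;
import std.range.primitives : ElementType, front;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "\u00E9a";          // U+00E9 takes two UTF-8 code units: 0xC3 0xA9
    enum dchar eAcute = '\u00E9';

    // Today: front on a string autodecodes and yields the first code point.
    static assert(is(ElementType!string == dchar));
    assert(s.front == eAcute);

    // byCodeUnit stands in for "front without autodecoding":
    // it yields the first code unit instead.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.front == 0xC3);
    assert(units.front != eAcute); // comparing front with a dchar changes result

    // byDchar asks for code points explicitly.
    assert(s.byDchar.front == eAcute);

    // Mixed string types: transcode the needle to the haystack's encoding.
    dstring needle = "fo"d;
    assert("foo".find(needle) == "foo");           // works today via decoding
    assert("foo".find(needle.to!string) == "foo"); // same result, same encoding
}
```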

The two biggest places in Phobos that would potentially have 
problems are functions that special-cased strings but still used 
front and those which have to return a new range type. e.g. 
filter would be a good example, because it's forced to return a 
new range type. Right now, it would filter on dchars, but with 
the change, it would filter on the code unit type (most typically 
char). If you're filtering on ASCII characters, it wouldn't 
matter aside from the fact that the resulting range would have an 
element type of char rather than dchar, but if you're filtering 
on Unicode characters, it wouldn't work anymore. For situations 
like that, you'd be forced to use std.utf.byDchar or 
std.uni.byGrapheme. However, since most string code tends to 
operate on substrings rather than characters, I don't know how 
common it even is to use a function like filter on a string (as 
opposed to a range of strings). Such code might actually be 
fairly rare.
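
To make the filter case concrete, here is a small sketch that runs on 
current Phobos. As above, byCodeUnit is used as an assumed stand-in for 
a string range whose front does not autodecode.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "h\u00E9llo";
    enum dchar eAcute = '\u00E9';

    // Today: filter sees dchars, so a Unicode character can be removed.
    auto decoded = s.filter!(c => c != eAcute);
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    assert(decoded.equal("hllo"));

    // On code units the same predicate never matches: U+00E9 is encoded
    // as 0xC3 0xA9, and neither code unit equals 0xE9. Nothing is removed.
    auto units = s.byCodeUnit.filter!(c => c != eAcute);
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.equal(s.byCodeUnit));

    // Filtering on an ASCII character still works on code units;
    // only the element type differs.
    auto ascii = s.byCodeUnit.filter!(c => c != 'l');
    assert(ascii.equal("h\u00E9o".byCodeUnit));

    // byDchar restores the decoded behavior explicitly.
    auto restored = s.byDchar.filter!(c => c != eAcute);
    assert(restored.equal("hllo"));
}
```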

So, there _are_ a few functions which would stop working the same way 
in a potentially silent manner if we just made it so that front 
didn't autodecode anymore. However, in general, because Phobos 
almost always special-cases strings, calls to Phobos functions 
probably wouldn't need to change in most cases, and when they do, 
a call to byDchar would restore the old behavior. But of course, 
we'd want to do the transition in a way that didn't result in 
silent behavioral changes that would break code, even though in 
most cases, it wouldn't matter, because most code will be 
operating on ASCII strings even if the strings themselves contain 
Unicode - e.g. unicodeString.find(asciiString) is far more common 
than unicodeString.find(otherUnicodeString).
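
As an illustration of the common case, this sketch searches a string 
containing non-ASCII characters for an ASCII needle, once through the 
decoding path and once on code units (again using byCodeUnit as an 
assumed stand-in for non-decoding strings). Both land on the same 
position.

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : find;
import std.utf : byCodeUnit;

void main()
{
    string haystack = "na\u00EFve caf\u00E9 menu";

    // Current behavior: find on strings, returning a string.
    auto decoded = haystack.find("caf");
    assert(decoded == "caf\u00E9 menu");

    // Code-unit search: no decoding involved, same match position.
    auto units = haystack.byCodeUnit.find("caf".byCodeUnit);
    assert(units.equal(decoded.byCodeUnit));
}
```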

I suspect that the code that's at the greatest risk is code that 
checks for is(Unqual!(ElementType!Range) == dchar) to operate on 
strings and wrapper ranges around strings, since it would then 
only match the cases where byDchar had been used. In general 
though, the code that's going to run into the most trouble is 
user code that contains range-based functions similar to what you 
might find in Phobos rather than code that's simply using the 
Phobos functions like startsWith and find - i.e. if you're 
writing range-based code that worries about doing stuff like 
special-casing strings or which specifically needs to operate on 
code points, then you're going to have to make changes, whereas 
to a great extent, if all you're doing is passing strings to 
Phobos functions, your code will tend to just work.
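
A sketch of the kind of user-side check that would be affected. The 
trait name isDcharRange is hypothetical, and byCodeUnit is once more 
used as an assumed stand-in for what a plain string would look like as 
a range without autodecoding.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical user-defined trait of the sort described above.
enum isDcharRange(R) = isInputRange!R && is(Unqual!(ElementType!R) == dchar);

// Today a plain string passes, because its element type is dchar.
static assert(isDcharRange!string);

// A code-unit view does not pass...
static assert(!isDcharRange!(typeof("".byCodeUnit)));

// ...and only an explicit byDchar range would still match.
static assert(isDcharRange!(typeof("".byDchar)));

void main() {}
```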

To actually see what the impact would be, we'd have to just 
change Phobos, I think, and then see what the impact was on user 
code. It could be surprising how much or how little it affects 
things, though in most cases, I expect that it'll mean that code 
will just work. And if we really wanted to do that, we could 
create a version flag that turned off autodecoding and version the 
changes in Phobos appropriately to see what we got. In many 
cases, if we simply made sure that Phobos functions which 
special-cased strings didn't use front directly but instead 
didn't care whether they were operating on ranges of char, wchar, 
or dchar, then we wouldn't even need to version anything (e.g. 
find could easily be made to work that way if it doesn't 
already), but some functions (like filter) would need to be 
versioned differently.
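
Roughly what such versioning could look like. Both the version 
identifier NoAutodecode and the helper elementsOf are hypothetical and 
exist only for this sketch; nothing like them is in Phobos.

```d
import std.range.primitives : ElementType;
import std.traits : Unqual, isNarrowString;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: picks how a narrow string is iterated,
// depending on a hypothetical version identifier.
auto elementsOf(S)(S s)
if (isNarrowString!S)
{
    version (NoAutodecode)
    {
        return s.byCodeUnit; // code units, no decoding
    }
    else
    {
        return s.byDchar;    // code points, as front yields today
    }
}

void main()
{
    auto r = elementsOf("h\u00E9llo");

    version (NoAutodecode)
    {
        static assert(is(Unqual!(ElementType!(typeof(r))) == char));
    }
    else
    {
        static assert(is(ElementType!(typeof(r)) == dchar));
    }
}
```

Compiling with -version=NoAutodecode selects the code-unit branch; 
without it, the decoding branch is used.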

So, maybe what we need to do to start is to just go through 
Phobos and make as many functions as possible not care about 
whether they're dealing with strings as ranges of char, wchar, or 
dchar. And at least then, we'd minimize how much code would have 
to be versioned differently if we were to test out getting rid of 
autodecoding with versioning.
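
As a small example of a function that doesn't care about the character 
type, here is a hypothetical helper (countAsciiSpaces is not a Phobos 
function) that works on code units and gives the same answer for 
string, wstring, and dstring. This is safe for ASCII because an ASCII 
value never appears as part of a multi-unit sequence in UTF-8 or UTF-16.

```d
import std.traits : isSomeString;
import std.utf : byCodeUnit;

// Hypothetical helper: counts ASCII spaces without decoding.
size_t countAsciiSpaces(S)(S s)
if (isSomeString!S)
{
    size_t n;
    foreach (c; s.byCodeUnit) // iterates char, wchar, or dchar elements
    {
        if (c == ' ')
            ++n;
    }
    return n;
}

void main()
{
    assert(countAsciiSpaces("a \u00E9 b") == 2);
    assert(countAsciiSpaces("a \u00E9 b"w) == 2);
    assert(countAsciiSpaces("a \u00E9 b"d) == 2);
}
```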

- Jonathan M Davis