Range of chars (narrow string ranges)

Wed Apr 29 03:02:08 PDT 2015

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
>> Would it be much work to have example code or even an 
>> experimental module that gets rid of auto-decoding, so we 
>> could see what would be affected in general and how the actual 
>> code we have would be affected by it?
>>
>> The topic keeps coming up again and again, and while I'm in 
>> favor of anything that enhances performance, I'm afraid of 
>> having to refactor large chunks of my code. This fear may be 
>> unfounded, but I would need some examples to visualize the 
>> problem.
>
> Honestly, most code won't care. If we just switched out all of 
> the auto-decoding right now, pretty much anything using only 
> ASCII would just work, and most anything that's trying to 
> manipulate ASCII characters in a Unicode string will just work, 
> whereas code that's specifically manipulating Unicode 
> characters might have problems (e.g. comparing front with a 
> dchar will no longer have the same result, since front would 
> just be the first code unit rather than necessarily the first 
> code point). Since most Phobos range-based functions which 
> operate on strings are special-cased on strings already, many 
> of them would continue to just work (e.g. find returns the same 
> range type as what's passed to it even if it's given a string, 
> so it might just work with the change, or it might need to be 
> tweaked slightly), and those that wouldn't would then generally 
> either need to call encode on an argument to make it match the string 
> type in the cases string types mix (e.g. "foo".find("fo"d) 
> would need to call encode on "fo"d to make it a string for 
> comparison), or the caller would need to use std.utf.byDchar or 
> std.uni.byGrapheme to operate on code points or graphemes 
> rather than code units.
>
> The two biggest places in Phobos that would potentially have 
> problems are functions that special-cased strings but still 
> used front and those which have to return a new range type. 
> e.g. filter would be a good example, because it's forced to 
> return a new range type. Right now, it would filter on dchars, 
> but with the change, it would filter on the code unit type 
> (most typically char). If you're filtering on ASCII characters, 
> it wouldn't matter aside from the fact that the resulting range 
> would have an element type of char rather than dchar, but if 
> you're filtering on Unicode characters, it wouldn't work 
> anymore. For situations like that, you'd be forced to use 
> std.utf.byDchar or std.uni.byGrapheme. However, since most 
> string code tends to operate on substrings rather than 
> characters, I don't know how common it even is to use a 
> function like filter on a string (as opposed to a range of 
> strings). Such code might actually be fairly rare.
>
> So, there _are_ a few functions which stop working the same way 
> in a potentially silent manner if we just made it so that front 
> didn't autodecode anymore. However, in general, because Phobos 
> almost always special-cases strings, calls to Phobos functions 
> probably wouldn't need to change in most cases, and when they 
> do, a call to byDchar would restore the old behavior. But of 
> course, we'd want to do the transition in a way that didn't 
> result in silent behavioral changes that would break code, even 
> though in most cases, it wouldn't matter, because most code 
> will be operating on ASCII strings even if the strings 
> themselves contain Unicode - e.g. 
> unicodeString.find(asciiString) is far more common than 
> unicodeString.find(otherUnicodeString).
>
> I suspect that the code that's at the greatest risk is code 
> that checks for is(Unqual!(ElementType!Range) == dchar) to 
> operate on strings and wrapper ranges around strings, since it 
> would then only match the cases where byDchar had been used. In 
> general though, the code that's going to run into the most 
> trouble is user code that contains range-based functions 
> similar to what you might find in Phobos rather than code 
> that's simply using the Phobos functions like startsWith and 
> find - i.e. if you're writing range-based code that worries 
> about doing stuff like special-casing strings or which 
> specifically needs to operate on code points, then you're going 
> to have to make changes, whereas to a great extent, if all 
> you're doing is passing strings to Phobos functions, your code 
> will tend to just work.
>
> To actually see what the impact would be, we'd have to just 
> change Phobos, I think, and then see what the impact was on 
> user code. It could be surprising how much or how little it 
> affects things, though in most cases, I expect that it'll mean 
> that code will just work. And if we really wanted to do that, 
> we could create a version flag that turned off autodecoding and 
> version the changes in Phobos appropriately to see what we got. 
> In many cases, if we simply made sure that Phobos functions 
> which special-cased strings didn't use front directly but 
> instead didn't care whether they were operating on ranges of 
> char, wchar, or dchar, then we wouldn't even need to version 
> anything (e.g. find could easily be made to work that way if it 
> doesn't already), but some functions (like filter) would need 
> to be versioned differently.
>
> So, maybe what we need to do to start is to just go through 
> Phobos and make as many functions as possible not care about 
> whether they're dealing with strings as ranges of char, wchar, 
> or dchar. And at least then, we'd minimize how much code would 
> have to be versioned differently if we were to test out getting 
> rid of autodecoding with versioning.
>
> - Jonathan M Davis

This sounds like a good starting point for a transition plan. One 
important thing, though, would be to do some benchmarking with 
and without autodecoding, to see if it really boosts performance 
in a way that would justify the transition.
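
As a rough idea of what such a benchmark could look like, here is 
a minimal sketch. It assumes a reasonably recent Phobos that 
provides std.utf.byCodeUnit and std.datetime.stopwatch.benchmark. 
The helper names (countDecoded, countCodeUnits), the sample text 
and the iteration count are made up for illustration, and the 
timings will depend on compiler, flags and input, so treat it as a 
starting point rather than a real measurement.

```d
import std.algorithm.searching : count;
import std.datetime.stopwatch : benchmark;
import std.range.primitives : ElementType, front;
import std.stdio : writefln;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical helper: uses the string as a range directly, so
// every element is auto-decoded into a dchar (code point).
size_t countDecoded(string haystack, char needle)
{
    return haystack.count(needle);
}

// Hypothetical helper: wraps the string with byCodeUnit, so the
// range yields raw UTF-8 code units and no decoding happens.
size_t countCodeUnits(string haystack, char needle)
{
    return haystack.byCodeUnit.count(needle);
}

// Module-level state so the benchmarked functions take no arguments.
string text;
size_t sink; // accumulates results so the calls are not dead code

void runDecoded()   { sink += countDecoded(text, 'a'); }
void runCodeUnits() { sink += countCodeUnits(text, 'a'); }

void main()
{
    // A string as a range has element type dchar (auto-decoding);
    // byCodeUnit exposes the underlying char code units instead.
    static assert(is(ElementType!string == dchar));
    static assert(is(Unqual!(ElementType!(typeof("".byCodeUnit))) == char));

    // front differs once the first character is not ASCII:
    // U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9.
    assert("\u00E9".front == '\u00E9');
    assert("\u00E9".byCodeUnit.front == 0xC3);

    // Mostly ASCII sample text with a few multi-byte characters.
    foreach (i; 0 .. 10_000)
        text ~= "na\u00EFve caf\u00E9, plain ascii text; ";

    // For an ASCII needle both approaches agree on the result.
    assert(countDecoded(text, 'a') == countCodeUnits(text, 'a'));

    auto times = benchmark!(runDecoded, runCodeUnits)(100);
    writefln("auto-decoded: %s", times[0]);
    writefln("code units:   %s", times[1]);
    writefln("(checksum: %s)", sink);
}
```

Counting an ASCII character is the case where dropping 
auto-decoding should change nothing but speed. Searching for a 
non-ASCII character would additionally need byDchar (or an encoded 
needle) on the code unit side, so that case would have to be 
benchmarked separately.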