Range of chars (narrow string ranges)
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Wed Apr 29 03:02:08 PDT 2015
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
> On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
>> Would it be much work to have example code or even an
>> experimental module that gets rid of auto-decoding, so we
>> could see what would be affected in general and how actual
>> code we have would be affected by it?
>>
>> The topic keeps coming up again and again, and while I'm in
>> favor of anything that enhances performance, I'm afraid of
>> having to refactor large chunks of my code. However, this fear
>> may be unfounded, but I would need some examples to visualize
>> the problem.
>
> Honestly, most code won't care. If we just switched out all of
> the auto-decoding right now, pretty much anything using only
> ASCII would just work, and most anything that's trying to
> manipulate ASCII characters in a Unicode string will just work,
> whereas code that's specifically manipulating Unicode
> characters might have problems (e.g. comparing front with a
> dchar will no longer have the same result, since front would
> just be the first code unit rather than necessarily the first
> code point). Since most Phobos range-based functions which
> operate on strings are special-cased on strings already, many
> of them would continue to just work (e.g. find returns the same
> range type as what's passed to it even if it's given a string,
> so it might just work with the change, or it might need to be
> tweaked slightly), and those that wouldn't would then generally
> either need to call encode on an argument to make it match the
> string type in cases where string types mix (e.g.
> "foo".find("fo"d) would need to call encode on "fo"d to make it
> a string for comparison), or the caller would need to use
> std.utf.byDchar or std.uni.byGrapheme to operate on code points
> or graphemes rather than code units.
>
> The two biggest places in Phobos that would potentially have
> problems are functions that special-cased strings but still
> used front and those which have to return a new range type.
> e.g. filter would be a good example, because it's forced to
> return a new range type. Right now, it would filter on dchars,
> but with the change, it would filter on the code unit type
> (most typically char). If you're filtering on ASCII characters,
> it wouldn't matter aside from the fact that the resulting range
> would have an element type of char rather than dchar, but if
> you're filtering on Unicode characters, it wouldn't work
> anymore. For situations like that, you'd be forced to use
> std.utf.byDchar or std.uni.byGrapheme. However, since most
> string code tends to operate on substrings rather than
> characters, I don't know how common it even is to use a
> function like filter on a string (as opposed to a range of
> strings). Such code might actually be fairly rare.
>
> So, there _are_ a few functions which stop working the same way
> in a potentially silent manner if we just made it so that front
> didn't autodecode anymore. However, in general, because Phobos
> almost always special-cases strings, calls to Phobos functions
> probably wouldn't need to change in most cases, and when they
> do, a call to byDchar would restore the old behavior. But of
> course, we'd want to do the transition in a way that didn't
> result in silent behavioral changes that would break code, even
> though in most cases, it wouldn't matter, because most code
> will be operating on ASCII strings even if the strings
> themselves contain Unicode - e.g.
> unicodeString.find(asciiString) is far more common than
> unicodeString.find(otherUnicodeString).
>
> I suspect that the code that's at the greatest risk is code
> that checks for is(Unqual!(ElementType!Range) == dchar) to
> operate on strings and wrapper ranges around strings, since it
> would then only match the cases where byDchar had been used. In
> general though, the code that's going to run into the most
> trouble is user code that contains range-based functions
> similar to what you might find in Phobos rather than code
> that's simply using the Phobos functions like startsWith and
> find - i.e. if you're writing range-based code that worries
> about doing stuff like special-casing strings or which
> specifically needs to operate on code points, then you're going
> to have to make changes, whereas to a great extent, if all
> you're doing is passing strings to Phobos functions, your code
> will tend to just work.
>
> To actually see what the impact would be, we'd have to just
> change Phobos, I think, and then see what the impact was on
> user code. It could be surprising how much or how little it
> affects things, though in most cases, I expect that it'll mean
> that code will just work. And if we really wanted to do that,
> we could create a version flag that turned off autodecoding and
> version the changes in Phobos appropriately to see what we got.
> In many cases, if we simply made sure that Phobos functions
> which special-cased strings didn't use front directly but
> instead didn't care whether they were operating on ranges of
> char, wchar, or dchar, then we wouldn't even need to version
> anything (e.g. find could easily be made to work that way if it
> doesn't already), but some functions (like filter) would need
> to be versioned differently.
>
> So, maybe what we need to do to start is to just go through
> Phobos and make as many functions as possible not care about
> whether they're dealing with strings as ranges of char, wchar,
> or dchar. And at least then, we'd minimize how much code would
> have to be versioned differently if we were to test out getting
> rid of autodecoding with versioning.
>
> - Jonathan M Davis
This sounds like a good starting point for a transition plan. One
important thing, though, would be to do some benchmarking with
and without autodecoding, to see if it really boosts performance
in a way that would justify the transition.
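
To make the behavioral difference described above concrete, here is a
minimal sketch in D. It is only an approximation under stated
assumptions: std.utf.byCodeUnit (not mentioned in the quoted message)
is used as a stand-in for what front might yield on a string if
autodecoding were removed, and the sample string is made up for
illustration.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType, empty, front;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // Hypothetical sample: U+00E9 takes two UTF-8 code units (0xC3 0xA9).
    string s = "\u00E9t\u00E9";

    // Today: front on a string autodecodes and yields a code point (dchar).
    static assert(is(ElementType!string == dchar));
    assert(s.front == '\u00E9');

    // Assumption: byCodeUnit stands in for a string without autodecoding.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.front == 0xC3);      // first code unit, not the code point
    assert(units.front != '\u00E9');  // same comparison, different result

    // filter over code points finds both occurrences of U+00E9 ...
    assert(s.filter!(c => c == '\u00E9').equal("\u00E9\u00E9"d));
    // ... while the same predicate over code units silently finds none.
    assert(units.filter!(c => c == '\u00E9').empty);
    // An explicit byDchar operates on code points again.
    assert(s.byDchar.filter!(c => c == '\u00E9').equal("\u00E9\u00E9"d));
}
```

The filter case is the kind of silent change a transition would have to
guard against: the code still compiles, but the result differs for
non-ASCII input.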