VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Sun Jan 16 13:20:13 PST 2011
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 1/15/11 10:45 PM, Michel Fortin wrote:
>> No doubt it's easier to implement it that way. The problem is that in
>> most cases it won't be used. How many people really know what is a
>> grapheme?
>
> How many people really should care?
I think the only people who should *not* care are those who have
validated that the input does not contain any combining code point. If
you know the input *can't* contain combining code points, then it's
safe to ignore them.
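
As a rough illustration of that validation step, here is a minimal sketch in Python (not D, and not anything that exists in Phobos). The helper name `has_combining_marks` is made up for this example, and it assumes that "combining code point" means any code point whose Unicode general category is a mark (Mn, Mc or Me).

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    """Return True if any code point in text has a mark category (Mn, Mc, Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)

precomposed = "no\u00ebl"   # U+00EB: e with diaeresis as a single code point
decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS

print(has_combining_marks(precomposed))  # False
print(has_combining_marks(decomposed))   # True
```

Only input that passes a check of this kind can safely be handled one code point at a time.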
If we don't make correct Unicode handling the default, someday someone
is going to ask a developer to fix a problem where his system doesn't
handle some text correctly. Later that day, he'll come to the
realization that almost none of his D code and none of the D libraries
he uses handle Unicode correctly, and he'll say: can't fix this. His
peer working on a similar Objective-C program will have a good laugh.
Sure, correct Unicode handling is slower and more complicated to
implement, but at least you know you'll get the right results.
>> Of those, how many will forget to use byGrapheme at one time
>> or another? And so in most programs string manipulation will misbehave
>> in the presence of combining characters or unnormalized strings.
>
> But most strings don't contain combining characters or unnormalized strings.
I think we should expect combining marks to be used more and more as
OS text systems and fonts start supporting them better. They may be
rare today, but what do you know about tomorrow?
A few years ago, many Unicode symbols didn't even show up correctly on
Windows. Today, we have Unicode domain names and people are starting to
put funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
yet, but we'll surely see combining characters in domain names soon
enough (if only as a way to make fun of programs that can't handle
Unicode correctly). Well, let me be the first to make fun of such
programs: <http://☺̭̏.michelf.com/>.
Also, not all combining characters are marks used by foreign
languages. Some are used in mathematics, for instance. Or you
could use U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
indicating some kind of prohibition.
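
To make the overlay case concrete, the following Python sketch (an illustration only, with the letter "P" picked arbitrarily as the base character) looks up U+20E0 in the Unicode database and shows that a single visible symbol built with it is stored as two code points.

```python
import unicodedata

overlay = "\u20e0"
prohibited = "P" + overlay   # base letter followed by the prohibition overlay

print(unicodedata.name(overlay))      # COMBINING ENCLOSING CIRCLE BACKSLASH
print(unicodedata.category(overlay))  # Me (enclosing mark)
print(len(prohibited))                # 2 code points for one visible symbol
```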
>> If you want to help D programmers write correct code when it comes to
>> Unicode manipulation, you need to help them iterate on real characters
>> (graphemes), and you need the algorithms to apply to real characters
>> (graphemes), not the approximation of a Unicode character that is a code
>> point.
>
> I don't think the situation is as clean cut, as grave, and as urgent as
> you say.
I agree it's probably not as clean cut as I say (I'm trying to keep
complicated things simple here), but it's something important to decide
early because the cost of changing it increases as more code is written.
Quoting the first part of the same post (out of order):
> Disagreement as that might be, a simple fact that needs to be taken
> into account is that as of right now all of Phobos uses UTF arrays for
> string representation and dchar as element type.
>
> Besides, for one I do dispute the idea that a grapheme element is
> better than a dchar element for iterating over a string. The grapheme
> has the attractiveness of being theoretically clean but at the same
> time is woefully inefficient and helps languages that few D users need
> to work with. At least that's my perception, and we need some serious
> numbers instead of convincing rhetoric to make a big decision.
You'll no doubt get more performance from a grapheme-aware specialized
algorithm working directly on code points than by iterating on
graphemes returned as string slices. But both will give *correct*
results.
Implementing a specialized algorithm of this kind becomes an
optimization, and it's likely you'll want an optimized version for most
string algorithms.
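
The two approaches can be sketched as follows, in Python rather than D. This is a simplified model and not the full Unicode segmentation rules: it assumes a grapheme is one code point plus any mark code points (categories Mn, Mc, Me) that follow it. The names `graphemes`, `count_via_slices` and `count_direct` are invented for the example.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    """True for code points in a mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def graphemes(text: str):
    """Yield simplified graphemes as string slices: a code point plus trailing marks."""
    start = 0
    for i in range(1, len(text)):
        if not is_mark(text[i]):
            yield text[start:i]
            start = i
    if text:
        yield text[start:]

def count_via_slices(text: str) -> int:
    """Generic approach: iterate over grapheme slices and count them."""
    return sum(1 for _ in graphemes(text))

def count_direct(text: str) -> int:
    """Specialized approach: count grapheme starts on code points, no slices built."""
    return sum(1 for i, ch in enumerate(text) if i == 0 or not is_mark(ch))

word = "noe\u0308l"                 # 5 code points, 4 graphemes
print(list(graphemes(word)))        # ['n', 'o', 'ë', 'l'] with 'ë' stored as two code points
print(count_via_slices(word))       # 4
print(count_direct(word))           # 4
```

Both functions agree on the result. The second one avoids creating slices, which is the kind of optimization meant here, but this sketch says nothing about how large the difference would be in practice.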
I'd like to have some numbers too about performance, but I have none at
this time.
> It's all a matter of picking one's trade-offs. Clearly ASCII is out as
> no serious amount of non-English text can be trafficked without
> diacritics. So switching to UTF makes a lot of sense, and that's what D
> did.
>
> When I introduced std.range and std.algorithm, they'd handle char[] and
> wchar[] no differently than any other array. A lot of algorithms simply
> did the wrong thing by default, so I attempted to fix that situation by
> defining byDchar(). So instead of passing some string str to an
> algorithm, one would pass byDchar(str).
>
> A couple of weeks went by in testing that state of affairs, and before
> long I figured that I need to insert byDchar() virtually _everywhere_.
> There were a couple of algorithms (e.g. Boyer-Moore) that happened to
> work with arrays for subtle reasons (needless to say, they won't work
> with graphemes at all). But by and large the situation was that the
> simple and intuitive code was wrong and that the correct code
> necessitated inserting byDchar().
>
> So my next decision, which understandably some of the people who didn't
> go through the experiment may find unintuitive, was to make byDchar()
> the default. This cleaned up a lot of crap in std itself and saved a
> lot of crap in the yet-unwritten client code.
But were your algorithms *correct* in the first place? I'd argue that
by making byDchar the default you've not saved yourself from any crap
because dchar isn't the right layer of abstraction.
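
A small sketch of what goes wrong at the code point level, written in Python for illustration (Python strings, like dchar ranges, are sequences of code points). The two spellings of "noël" below are canonically equivalent, yet comparison, length and reversal all treat them differently.

```python
import unicodedata

precomposed = "no\u00ebl"   # 'ë' as the single code point U+00EB
decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS

# Code point comparison and length see two different strings.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 4 5

# Normalizing to NFC first makes the comparison succeed.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Reversing by code point moves the diaeresis from 'e' onto 'l'.
print(decomposed[::-1] == "l\u0308eon")     # True
```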
> I think it's reasonable to understand why I'm happy with the current
> state of affairs. It is better than anything we've had before and
> better than everything else I've tried.
It is indeed easy to understand why you're happy with the current state
of affairs: you never had to deal with multi-code-point characters and
can't imagine yourself having to deal with them on a semi-frequent
basis. Other people won't be so happy with this state of affairs, but
they'll probably notice only after most of their code has been written
unaware of the problem.
> Now, thanks to the effort people have spent in this group (thank you!),
> I have an understanding of the grapheme issue. I guarantee that
> grapheme-level iteration will have a high cost incurred to it:
> efficiency and changes in std. The languages that need composing
> characters for producing meaningful text are few and far between, so it
> makes sense to confine support for them to libraries that are not the
> default, unless we find ways to not disrupt everyone else.
We all are more aware of the problem now, that's a good thing. :-)
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/