VLERange: a range in between BidirectionalRange and RandomAccessRange
Jonathan M Davis
jmdavisProg at gmx.com
Sun Jan 16 19:14:11 PST 2011
On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
> > Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
> >> On 1/16/11 3:20 PM, Michel Fortin wrote:
> >>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> >>>
> >>> <SeeWebsiteForEmail at erdani.org> said:
> >>>> But most strings don't contain combining characters or unnormalized
> >>>> strings.
> >>>
> >>> I think we should expect combining marks to be used more and more as
> >>> our OS text system and fonts start supporting them better. Them being
> >>> rare might be true today, but what do you know about tomorrow?
> >>
> >> I don't think languages will acquire more diacritics soon. I do hope, of
> >> course, that D applications gain more usage in the Arabic, Hebrew etc.
> >> world.
> >
> > So why does D use unicode anyway?
> > If you don't care about not-often used languages anyway, you could have
> > used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
> > which encoding he wants/needs).
> >
> > You could as well say "we don't need to use dchar to represent a proper
> > code point, wchar is enough for most use cases and has fewer overhead
> > anyway".
>
> I consider UTF8 superior to all of the above.
>
> >>>> I think it's reasonable to understand why I'm happy with the current
> >>>> state of affairs. It is better than anything we've had before and
> >>>> better than everything else I've tried.
> >>>
> >>> It is indeed easy to understand why you're happy with the current state
> >>> of affairs: you never had to deal with multi-code-point character and
> >>> can't imagine yourself having to deal with them on a semi-frequent
> >>> basis.
> >>
> >> Do you, and can you?
> >>
> >>> Other people won't be so happy with this state of affairs, but
> >>> they'll probably notice only after most of their code has been written
> >>> unaware of the problem.
> >>
> >> They can't be unaware and write said code.
> >
> > Fun fact: Germany recently introduced a new ID card and some of the
> > software that was developed for this and is used in some record sections
> > fucks up when a name contains diacritics.
> >
> > I think especially when you're handling names (and much software does, I
> > think) it's crucial to have proper support for all kinds of chars.
> > Of course many programmers are not aware that, if Umlaute and ß works it
> > doesn't mean that all other kinds of strange characters work as well.
> >
> >
> > Cheers,
> > - Daniel
>
> I think German text works well with dchar.
I think that whether dchar will be enough will depend primarily on where the
unicode is coming from and what the programmer is doing with it. There's plenty
of code which will just work regardless of whether code points are pre-combined or
not, and other code which will have subtle bugs if they're not.
For the most part, Western languages should have pre-combined characters, but
whether a program sees them in combined form or not will depend on where the
text comes from. If it comes from a file, then it all depends on the program
which wrote the file. If it comes from the console, then it depends on what that
console does. If it comes from a socket or pipe or whatnot, then it depends on
whatever program is sending the data.
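To make the pre-combined vs. separate distinction concrete, here is a small sketch (in Python rather than D, purely because the concept is language-agnostic): the same visible character "é" can arrive either as one precomposed code point or as a base letter plus a combining mark, and code-point-level operations disagree about the two forms.

```python
import unicodedata

precombined = "\u00E9"   # 'é' as a single precomposed code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

# Both render identically on screen, but at the code-point level they differ:
print(precombined == decomposed)   # False
print(len(precombined))            # 1 code point
print(len(decomposed))             # 2 code points

# Normalizing to NFC maps the decomposed form back to the single code point.
print(unicodedata.normalize("NFC", decomposed) == precombined)   # True
```

Any program that compares or slices strings at the code-point level will treat these two inputs differently unless it normalizes first.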
So, the question becomes: what is the norm? Are unicode characters normally pre-
combined or left as separate code points? The majority of English text will be
fine regardless, since English only uses accented characters and the like when
including foreign words, but almost any other European language will have accented
characters and then it's an open question. If it's more likely that a D program
will receive pre-combined characters than not, then many programs will likely be
safe treating a code point as a character. But if the odds are high that a D
program will receive characters which are not yet combined, then certain sets of
text will invariably result in bugs in your average D program.
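It's also worth noting that normalization only goes so far: Unicode deliberately does not define a precomposed code point for every base-plus-mark combination, so for some text the "one code point per character" assumption fails no matter which form the input arrives in. A quick Python illustration:

```python
import unicodedata

# 'x' with a combining acute accent: Unicode defines no precomposed form for
# this combination, so even NFC normalization leaves it as two code points.
s = "x\u0301"
print(len(unicodedata.normalize("NFC", s)))   # 2 - still two code points
```

So for such text, a program treating one code point as one character is buggy regardless of where the input came from.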
I don't think that there's much question that from a performance standpoint and
from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing
code, we should continue to treat a code point - a dchar - as an abstract
character. Moving to graphemes could really harm performance - and there _are_
plenty of programs that couldn't care less about unicode. However, it's quite
clear that in a number of circumstances, that's going to result in buggy code.
The question then is whether it's okay to take a performance hit just to
correctly handle unicode. And I expect that a _lot_ of people are going to say
no to that.
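For a rough sense of what "moving to graphemes" entails, here is a deliberately simplified sketch in Python. It clusters combining marks (general category M*) with the preceding base character; real grapheme segmentation per Unicode's UAX #29 handles many more cases (Hangul jamo, emoji ZWJ sequences, etc.), which is part of why the cost is nontrivial.

```python
import unicodedata

def graphemes(s):
    """Very rough grapheme clustering: attach combining marks (categories
    Mn/Mc/Me) to the preceding base character. A real implementation would
    follow the full UAX #29 rules."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # combining mark joins the previous cluster
        else:
            clusters.append(ch)  # new base character starts a new cluster
    return clusters

text = "re\u0301sume\u0301"          # "résumé" with decomposed accents
print(len(text))                     # 8 code points
print(len(graphemes(text)))          # 6 user-perceived characters
```

Even this toy version shows the shift: every string operation now works on variable-length clusters rather than fixed-size elements, which is where the performance hit comes from.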
D already does better at handling unicode than many other languages, so it's
definitely a step up as it is. The cost for handling unicode completely correctly
is quite high from the sounds of it - all of a sudden you're effectively (if not
literally) dealing with arrays of arrays instead of arrays. So, I think that
it's a viable option to say that the default path that D will take is the
_mostly_ correct but still reasonably efficient path, and then - through 3rd party
libraries or possibly even with a module in Phobos - we'll provide a means to
handle unicode 100% correctly for those who really care.
At minimum, we need the tools to handle unicode correctly, but if we can't
handle it both correctly and efficiently, then I'm afraid that it's just not going
to be reasonable to handle it correctly - especially if we can handle it
_almost_ correctly and still be efficient.
Regardless, the real question is how likely a D program is to deal with unicode
which is not pre-combined. If the odds are relatively low in the general case,
then sticking to dchar should be fine. But if the odds are relatively high, then
not going to graphemes could mean that there will be a _lot_ of buggy D programs
out there.
- Jonathan M Davis
More information about the Digitalmars-d mailing list