VLERange: a range in between BidirectionalRange and RandomAccessRange

Jonathan M Davis jmdavisProg at gmx.com
Sun Jan 16 19:14:11 PST 2011


On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
> > Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
> >> On 1/16/11 3:20 PM, Michel Fortin wrote:
> >>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
> >>> 
> >>> <SeeWebsiteForEmail at erdani.org> said:
> >>>> But most strings don't contain combining characters or unnormalized
> >>>> strings.
> >>> 
> >>> I think we should expect combining marks to be used more and more as
> >>> our OS text system and fonts start supporting them better. Them being
> >>> rare might be true today, but what do you know about tomorrow?
> >> 
> >> I don't think languages will acquire more diacritics soon. I do hope, of
> >> course, that D applications gain more usage in the Arabic, Hebrew etc.
> >> world.
> > 
> > So why does D use unicode anyway?
> > If you don't care about not-often used languages anyway, you could have
> > used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
> > which encoding he wants/needs).
> > 
> > You could as well say "we don't need to use dchar to represent a proper
> > code point, wchar is enough for most use cases and has fewer overhead
> > anyway".
> 
> I consider UTF8 superior to all of the above.
> 
> >>>> I think it's reasonable to understand why I'm happy with the current
> >>>> state of affairs. It is better than anything we've had before and
> >>>> better than everything else I've tried.
> >>> 
> >>> It is indeed easy to understand why you're happy with the current state
> >>> of affairs: you never had to deal with multi-code-point character and
> >>> can't imagine yourself having to deal with them on a semi-frequent
> >>> basis.
> >> 
> >> Do you, and can you?
> >> 
> >>> Other people won't be so happy with this state of affairs, but
> >>> they'll probably notice only after most of their code has been written
> >>> unaware of the problem.
> >> 
> >> They can't be unaware and write said code.
> > 
> > Fun fact: Germany recently introduced a new ID card and some of the
> > software that was developed for this and is used in some record sections
> > fucks up when a name contains diacritics.
> > 
> > I think especially when you're handling names (and much software does, I
> > think) it's crucial to have proper support for all kinds of chars.
> > Of course many programmers are not aware that, if Umlaute and ß works it
> > doesn't mean that all other kinds of strange characters work as well.
> > 
> > 
> > Cheers,
> > - Daniel
> 
> I think German text works well with dchar.

I think that whether dchar will be enough will depend primarily on where the 
unicode is coming from and what the programmer is doing with it. There's plenty of 
code which will just work regardless of whether code points are pre-combined or 
not, and there's other code which will have subtle bugs if they're not pre-combined.

For the most part, Western languages should have pre-combined characters, but 
whether a program sees them in combined form or not will depend on where the 
text comes from. If it comes from a file, then it all depends on the program 
which wrote the file. If it comes from the console, then it depends on what that 
console does. If it comes from a socket or pipe or whatnot, then it depends on 
whatever program is sending the data.
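The core issue is that the same rendered character can arrive as one code point 
or as several. A quick illustration (in Python rather than D, since the point is 
language-independent), using the standard unicodedata module:

```python
import unicodedata

# "é" can arrive precomposed (U+00E9) or decomposed (U+0065 + combining
# U+0301); both render identically but are different code point sequences.
precomposed = "\u00e9"
decomposed = "e\u0301"

assert precomposed != decomposed        # naive code-point comparison fails
assert len(precomposed) == 1
assert len(decomposed) == 2

# Normalizing both to NFC (composed form) makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

So a program which never normalizes its input sees whichever form the producer 
happened to emit.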

So, the question becomes: what is the norm? Are unicode characters normally 
pre-combined or left as separate code points? The majority of English text will be 
fine regardless, since English only uses accented characters and the like when 
including foreign words, but most any other European language will have accented 
characters and then it's an open question. If it's more likely that a D program 
will receive pre-combined characters than not, then many programs will likely be 
safe treating a code point as a character. But if the odds are high that a D 
program will receive characters which are not yet combined, then certain sets of 
text will invariably result in bugs in your average D program.
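The kind of bug in question is easy to demonstrate. Sketched in Python for 
brevity, here is what happens when code treats each code point as a character and 
the input is not pre-combined:

```python
# Treating a code point as a character hits subtle bugs on decomposed text.
s = "cafe\u0301"               # "café" with a combining acute accent
assert len(s) == 5             # 5 code points, but 4 user-perceived characters
assert s[::-1][0] == "\u0301"  # naive reversal strands the combining mark,
                               # detaching the accent from its base letter
```

The same reversal on precomposed input ("caf\u00e9") would be fine, which is 
exactly why such bugs go unnoticed until decomposed text shows up.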

I don't think that there's much question that from a performance standpoint and 
from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing 
code, we should continue to treat a code point - a dchar - as an abstract 
character. Moving to graphemes could really harm performance - and there _are_ 
plenty of programs that couldn't care less about unicode. However, it's quite 
clear that in a number of circumstances, that's going to result in buggy code. 
The question then is whether it's okay to take a performance hit just to 
correctly handle unicode. And I expect that a _lot_ of people are going to say 
no to that.

D already does better at handling unicode than many other languages, so it's 
definitely a step up as it is. The cost for handling unicode completely correctly 
is quite high from the sounds of it - all of a sudden you're effectively (if not 
literally) dealing with arrays of arrays instead of arrays. So, I think that 
it's a viable option to say that the default path that D will take is the 
_mostly_ correct but still reasonably efficient path, and then - through 3rd party 
libraries or possibly even with a module in Phobos - we'll provide a means to 
handle unicode 100% correctly for those who really care.
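To give a feel for what such a library would do, here is a deliberately naive 
sketch (in Python; the function name is my own) of grouping code points into 
grapheme-like clusters. Real grapheme segmentation per Unicode UAX #29 handles 
far more cases (ZWJ sequences, Hangul jamo, regional indicators), which is 
precisely why it carries the cost described above:

```python
import unicodedata

def grapheme_clusters(s):
    """Naive sketch: attach trailing combining marks to their base code
    point. Full UAX #29 segmentation is considerably more involved."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # combining mark joins the previous cluster
        else:
            clusters.append(ch)  # new base character starts a new cluster
    return clusters

# 5 code points, but 4 clusters -- each cluster is itself a small string,
# hence the "arrays of arrays" cost mentioned above.
assert grapheme_clusters("cafe\u0301") == ["c", "a", "f", "e\u0301"]
```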

At minimum, we need the tools to handle unicode correctly, but if we can't 
handle it both correctly and efficiently, then I'm afraid that it's just not going 
to be reasonable to handle it correctly - especially if we can handle it 
_almost_ correctly and still be efficient.

Regardless, the real question is how likely a D program is to deal with unicode 
which is not pre-combined. If the odds are relatively low in the general case, 
then sticking to dchar should be fine. But if the odds are relatively high, then 
not going to graphemes could mean that there will be a _lot_ of buggy D programs 
out there.

- Jonathan M Davis
