VLERange: a range in between BidirectionalRange and RandomAccessRange

Fri Jan 14 15:02:32 PST 2011

Andrei Alexandrescu Wrote:

> That's a strong indicator, but we shouldn't get ahead of ourselves.
> 
> D took a certain risk by defaulting to Unicode at a time where the 
> dominant extant systems languages left the decision to more or less 
> exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
> languages were just starting to adopt Unicode.
> 
> I think that risk was justified because the relative loss in speed was 
> often acceptable and the gains were there. Even so, there are people in 
> this who protest against the loss in efficiency and argue that life is 
> harder for ASCII users.
> 
> Switching to variable-length representation of graphemes as bundles of 
> dchars and committing to that through and through will bring with it a 
> larger hit in efficiency and an increased difficulty in usage. I agree 
> that at a level that's the "right" thing to do, but I don't have yet the 
> feeling that combining characters are a widely-adopted winner. For the 
> most part, fonts don't support combining characters, and as a font 
> dilettante I can tell that putting arbitrary sets of diacritics on top 
> of characters is not what one should be doing as it'll look terrible. 
> Unicode is begrudgingly acknowledging combining characters. Only a 
> handful of libraries deal with them. I don't know how many applications 
> need or care for them, versus how many applications do fine with 
> precombined characters. I have trouble getting combining characters to 
> combine on this machine in any of the applications I use - and this is a 
> Mac.
> 
> 
> Andrei

Combining marks do need to be supported.
Some languages use combining marks extensively (see my other post), and of course fonts for those languages exist and do support them. The Mac doesn't support all languages, so I'm not sure it's the best example out there.
Here's an example, the Hebrew Bible:
http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm

Just look at any of the PDFs there to see what Hebrew looks like with all sorts of different marks.
In the same vein, I could have found a Japanese text with ruby (where a kanji character has hiragana text on top of it that tells you how to read it).

Using a dchar as a string element instead of a proper grapheme will make it really hard to work with texts in such languages. 
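
To make that concrete, here is a rough sketch of the mismatch. It assumes a Phobos that provides std.uni.byGrapheme (a later addition, not something available as I write this), and the helper names codePointCount and graphemeCount are made up for illustration. The sample is a single pointed Hebrew letter: bet followed by a dagesh and a sheva.

```d
import std.range : walkLength;
import std.uni : byGrapheme; // assumption: a Phobos version that ships byGrapheme

// Hypothetical helper: number of code points (dchar elements) in a UTF-8 string.
size_t codePointCount(string s)
{
    // Iterating a string as a range decodes it to dchar, one code point at a time.
    return s.walkLength;
}

// Hypothetical helper: number of grapheme clusters (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // U+05D1 HEBREW LETTER BET, U+05BC POINT DAGESH, U+05B0 POINT SHEVA
    string bet = "\u05D1\u05BC\u05B0";

    assert(bet.length == 6);          // UTF-8 code units
    assert(codePointCount(bet) == 3); // dchars: the letter plus two marks
    assert(graphemeCount(bet) == 1);  // what a reader sees as one letter
}
```

Anything that indexes, slices, or reverses by dchar can separate the marks from their base letter; working per grapheme keeps them together.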

Regarding efficiency concerns for ASCII users: there's no rule that forces us to have a single string type. For comparison, just look at how many integral types D has. I believe the correct thing is to make a 'universal string' type the default (just as int is the default integral type) and to provide additional types for other commonly useful encodings such as ASCII.

A geneticist for instance should use a 'DNA' type that encodes the four DNA letters instead of an ASCII string or even worse, a universal (Unicode) string.
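
As a rough illustration of what such a specialized type could look like, here is a minimal sketch that packs four bases into each byte. The names Nucleotide and DnaSeq are hypothetical, not an existing library, and a real type would want range primitives, slicing and so on.

```d
// Hypothetical names throughout; this is a sketch, not an existing library type.
enum Nucleotide : ubyte { A, C, G, T }

struct DnaSeq
{
    private ubyte[] data; // four bases per byte, two bits each
    private size_t len;   // number of bases stored

    // Append one base, starting a new byte every fourth base.
    void put(Nucleotide n)
    {
        if (len % 4 == 0)
            data ~= 0;
        data[$ - 1] |= cast(ubyte)(n << (2 * (len % 4)));
        ++len;
    }

    // Random access is O(1) because every element has the same width.
    Nucleotide opIndex(size_t i) const
    {
        assert(i < len);
        return cast(Nucleotide)((data[i / 4] >> (2 * (i % 4))) & 3);
    }

    size_t length() const { return len; }

    size_t bytesUsed() const { return data.length; }
}

void main()
{
    DnaSeq s;
    foreach (n; [Nucleotide.G, Nucleotide.A, Nucleotide.T,
                 Nucleotide.T, Nucleotide.C])
        s.put(n);

    assert(s.length == 5);
    assert(s[0] == Nucleotide.G);
    assert(s[4] == Nucleotide.C);
    assert(s.bytesUsed == 2); // five bases fit in two bytes
}
```

The same sequence takes five bytes as an ASCII string, and the type rules out any letter other than the four bases by construction.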