VLERange: a range in between BidirectionalRange and RandomAccessRange

Sun Jan 16 19:38:42 PST 2011

On 17.01.2011 03:45, Andrei Alexandrescu wrote:
> On 1/16/11 6:42 PM, Daniel Gibson wrote:
>> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>>> <SeeWebsiteForEmail at erdani.org> said:
>>>>> But most strings don't contain combining characters or unnormalized
>>>>> strings.
>>>>
>>>> I think we should expect combining marks to be used more and more as our
>>>> OS text system and fonts start supporting them better. Them being rare
>>>> might be true today, but what do you know about tomorrow?
>>>
>>> I don't think languages will acquire more diacritics soon. I do hope, of
>>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>>> world.
>>>
>>
>> So why does D use unicode anyway?
>> If you don't care about not-often used languages anyway, you could have
>> used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
>> which encoding he wants/needs).
>>
>> You could as well say "we don't need to use dchar to represent a proper
>> code point, wchar is enough for most use cases and has less overhead
>> anyway".
>
> I consider UTF8 superior to all of the above.
>

Really? UTF32 - maybe. But IMHO even when not considering graphemes and such 
UTF8 sucks hard in comparison to those because one code point consists of 1-4 
code units (even in German 1-2 code units).
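To make the code unit vs. code point distinction concrete, here is a minimal sketch using std.utf. The sample word and the escapes are chosen purely for illustration; the point is only that .length counts code units, so the UTF-8 length of German text differs from its code point count.

```d
import std.utf : codeLength, count;

unittest
{
    // "Grüße", written with escapes so the example does not depend on
    // how the source file itself is encoded.
    string  s8  = "Gr\u00FC\u00DFe";   // UTF-8 code units
    wstring s16 = "Gr\u00FC\u00DFe"w;  // UTF-16 code units
    dstring s32 = "Gr\u00FC\u00DFe"d;  // UTF-32 code units

    assert(s8.length  == 7);  // ü and ß take two UTF-8 code units each
    assert(s16.length == 5);
    assert(s32.length == 5);
    assert(s8.count   == 5);  // code points, found by decoding

    // Code units needed per code point in UTF-8: anywhere from 1 to 4.
    assert(codeLength!char('a') == 1);
    assert(codeLength!char('\u00FC') == 2);      // ü
    assert(codeLength!char('\u20AC') == 3);      // euro sign
    assert(codeLength!char('\U0001D11E') == 4);  // musical G clef

    // UTF-16 is variable-width too: outside the BMP it needs a surrogate pair.
    assert(codeLength!wchar('\U0001D11E') == 2);
}
```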

>>>>> I think it's reasonable to understand why I'm happy with the current
>>>>> state of affairs. It is better than anything we've had before and
>>>>> better than everything else I've tried.
>>>>
>>>> It is indeed easy to understand why you're happy with the current state
>>>> of affairs: you never had to deal with multi-code-point characters and
>>>> can't imagine yourself having to deal with them on a semi-frequent
>>>> basis.
>>>
>>> Do you, and can you?
>>>
>>>> Other people won't be so happy with this state of affairs, but
>>>> they'll probably notice only after most of their code has been written
>>>> unaware of the problem.
>>>
>>> They can't be unaware and write said code.
>>>
>>
>> Fun fact: Germany recently introduced a new ID card and some of the
>> software that was developed for this and is used in some record sections
>> fucks up when a name contains diacritics.
>>
>> I think especially when you're handling names (and much software does, I
>> think) it's crucial to have proper support for all kinds of chars.
>> Of course many programmers are not aware that umlauts and ß working
>> doesn't mean that all other kinds of strange characters work as well.
>>
>>
>> Cheers,
>> - Daniel
>
> I think German text works well with dchar.
>

Yes, but even in Germany there are people whose names contain "strange" 
characters ;)
Is it common to have programs that deal with text in a specific language but not 
with names?

I do understand your reluctance to support Unicode properly - it's a lot of 
trouble and makes things inefficient (more inefficient than UTF8/16 already are 
because of that code point != code unit thing).
Another thing is that, due to poor support in fonts or console/GUI technology, 
one grapheme may (quite often) *not* be displayed as a single character, which 
messes up formatting anyway. (Still, you probably shouldn't cut a string within 
a grapheme.)
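As a rough illustration of what cutting within a grapheme means even when working with dchar, here is a sketch built on std.uni.graphemeStride and std.uni.isMark. The helper isGraphemeBoundary is a hypothetical name made up for this example, not an existing Phobos function.

```d
import std.uni : graphemeStride, isMark;

/// Hypothetical helper: true if index i (counted in code points) lies
/// between two graphemes rather than inside one.
bool isGraphemeBoundary(const(dchar)[] s, size_t i)
{
    size_t pos = 0;
    // Hop from grapheme start to grapheme start until we reach or pass i.
    while (pos < i && pos < s.length)
        pos += graphemeStride(s, pos);
    return pos == i;
}

unittest
{
    // 'x', then 'a' + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE, then 'y'
    dstring s = "xa\u0302\u0303y"d;

    assert(s.length == 5);              // five code points ...
    assert(graphemeStride(s, 1) == 3);  // ... but 'a' and its marks form one grapheme

    // Slicing at a code point boundary can still cut inside the grapheme:
    dstring tail = s[2 .. $];
    assert(isMark(tail[0]));            // tail starts with an orphaned combining mark

    assert( isGraphemeBoundary(s, 1));  // between 'x' and 'a'
    assert(!isGraphemeBoundary(s, 2));  // between 'a' and its first mark
    assert(!isGraphemeBoundary(s, 3));  // between the two marks
    assert( isGraphemeBoundary(s, 4));  // just before 'y'
}
```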

So here's what I think can be done (and, at least the first two points, 
especially the first, should be done):

1. Mention the Grapheme and Digraph situation in string related documentation 
(std.string and maybe string-related stuff in std.algorithm like Splitter) to 
make sure people who use Phobos are aware of the problem. Then at least they 
can't say that nobody told them when their Objective-C using colleagues are 
laughing at their broken unicode-support ;)

2. Maybe add some functions that *do* deal with this.
Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)", so people can 
check for themselves whether they just split their string within a grapheme or 
something.

3. Include a proper Unicode-string type/module, if somebody has the time and 
knowledge to develop one. spir has already started something like that AFAIK, 
and Steven Schveighoffer is even working on a complete string type - maybe 
these efforts could be combined?
I guess default strings will stay mostly the way they are (but please add an 
ASCII type or allow ubyte[] asciiStr = "asdf";).
Having an additional type in Phobos that works correctly in all cases (e.g. 
Arabic, Hebrew, Japanese, ..) would be really great, though.

   UniString uStr = new UniString("sdfüñẫ");
   UniString uStr2 = uStr[3..$]; // "üñẫ"
   UniGraph ug = uStr[5]; // 'ẫ'
   size_t i = uStr2.length; // 3
Something like that, maybe (plus, of course, a lot of other stuff like proper 
comparison for different encodings of the same char, like the modified icmp() 
discussed before).
But something like
   string str = "sdfüñẫ";
   size_t len = uniLen(str); // 6
   string s = uniSlice(str, 3, len); // == str.uniSlice(3, len), i.e. "üñẫ"
etc. may be just as good.
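For what it's worth, the free-function variant could be sketched roughly as follows. This assumes a Phobos whose std.uni provides byGrapheme, byCodePoint and normalize; uniLen, uniSlice and uniEqual are hypothetical names, with indices counted in graphemes. The comparison normalizes both sides to NFC, which is a stand-in chosen for this sketch and not the modified icmp() discussed before.

```d
import std.conv : text;
import std.range : drop, take, walkLength;
import std.uni : byCodePoint, byGrapheme, normalize, NFC;

/// Hypothetical helper: number of graphemes in str.
size_t uniLen(string str)
{
    return str.byGrapheme.walkLength;
}

/// Hypothetical helper: graphemes [begin, end) of str as a new string.
string uniSlice(string str, size_t begin, size_t end)
{
    assert(begin <= end);
    return str.byGrapheme
              .drop(begin)
              .take(end - begin)
              .byCodePoint
              .text;
}

/// Hypothetical helper: equality after normalizing both sides to NFC.
bool uniEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    // "sdfüñẫ" written with combining marks: 10 code points, 6 graphemes.
    string str = "sdfu\u0308n\u0303a\u0302\u0303";

    assert(str.walkLength == 10);  // code points
    assert(uniLen(str) == 6);      // graphemes

    string s = uniSlice(str, 3, uniLen(str));
    assert(s == "u\u0308n\u0303a\u0302\u0303");

    // The precomposed spelling differs code point by code point,
    // yet compares equal once both sides are normalized.
    assert(s != "\u00FC\u00F1\u1EAB");
    assert(uniEqual(s, "\u00FC\u00F1\u1EAB"));
}
```

Note that both uniLen and uniSlice walk the string from the start, so they are O(n); a dedicated string type could cache grapheme boundaries instead.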

(I hope this all made sense)

>
> Andrei

Cheers,
- Daniel