VLERange: a range in between BidirectionalRange and RandomAccessRange

Sun Jan 16 19:48:24 PST 2011

On 17.01.2011 04:38, Daniel Gibson wrote:
> On 17.01.2011 03:45, Andrei Alexandrescu wrote:
>> On 1/16/11 6:42 PM, Daniel Gibson wrote:
>>> On 17.01.2011 00:58, Andrei Alexandrescu wrote:
>>>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>>>> <SeeWebsiteForEmail at erdani.org> said:
>>>>>> But most strings don't contain combining characters or unnormalized
>>>>>> strings.
>>>>>
>>>>> I think we should expect combining marks to be used more and more as our
>>>>> OS text system and fonts start supporting them better. Them being rare
>>>>> might be true today, but what do you know about tomorrow?
>>>>
>>>> I don't think languages will acquire more diacritics soon. I do hope, of
>>>> course, that D applications gain more usage in the Arabic, Hebrew etc.
>>>> world.
>>>>
>>>
>>> So why does D use Unicode anyway?
>>> If you don't care about not-often used languages anyway, you could have
>>> used UCS-2 like Java. Or plain 8bit ISO-8859-* (the user can decide
>>> which encoding he wants/needs).
>>>
>>> You could as well say "we don't need to use dchar to represent a proper
>>> code point, wchar is enough for most use cases and has less overhead
>>> anyway".
>>
>> I consider UTF8 superior to all of the above.
>>
>
> Really? UTF32 - maybe. But IMHO even when not considering graphemes and such
> UTF8 sucks hard in comparison to those because one code point consists of 1-4
> code units (even in German 1-2 code units).
>
>>>>>> I think it's reasonable to understand why I'm happy with the current
>>>>>> state of affairs. It is better than anything we've had before and
>>>>>> better than everything else I've tried.
>>>>>
>>>>> It is indeed easy to understand why you're happy with the current state
>>>>> of affairs: you never had to deal with multi-code-point characters and
>>>>> can't imagine yourself having to deal with them on a semi-frequent
>>>>> basis.
>>>>
>>>> Do you, and can you?
>>>>
>>>>> Other people won't be so happy with this state of affairs, but
>>>>> they'll probably notice only after most of their code has been written
>>>>> unaware of the problem.
>>>>
>>>> They can't be unaware and write said code.
>>>>
>>>
>>> Fun fact: Germany recently introduced a new ID card and some of the
>>> software that was developed for this and is used in some record sections
>>> fucks up when a name contains diacritics.
>>>
>>> I think especially when you're handling names (and much software does, I
>>> think) it's crucial to have proper support for all kinds of chars.
>>> Of course many programmers are not aware that, just because umlauts and ß
>>> work, it doesn't mean that all other kinds of strange characters work as well.
>>>
>>>
>>> Cheers,
>>> - Daniel
>>
>> I think German text works well with dchar.
>>
>
> Yes, but even in Germany there are people whose names contain "strange"
> characters ;)
> Is it common to have programs that deal with text in a specific language but not
> with names?
>
>
> I do understand your reluctance to support Unicode properly - it's a lot of
> trouble and makes things inefficient (more inefficient than UTF8/16 already are
> because of that code point != code unit thing).
> Another thing is that due to bad support from fonts or console/GUI technology it
> may happen (quite often) that one grapheme is *not* displayed as a single
> character, thus messing up formatting anyway (Still you probably should cut a
> string within a grapheme).

I meant you should *not* cut a string within a grapheme.
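
As a rough sketch of what "not cutting within a grapheme" could look like in code: the helper below shortens a string to a given number of graphemes instead of a given number of code units. It assumes a `std.uni` that provides `graphemeStride` (recent Phobos releases do); the name `truncateGraphemes` is made up for illustration and is not part of Phobos.

```d
import std.uni : graphemeStride;

/// Returns the longest prefix of `s` that holds at most `maxGraphemes`
/// graphemes, so the cut never lands inside a grapheme.
/// (Hypothetical helper, not part of Phobos.)
string truncateGraphemes(string s, size_t maxGraphemes)
{
    size_t i = 0; // offset into s, counted in UTF-8 code units
    for (size_t n = 0; n < maxGraphemes && i < s.length; ++n)
        i += graphemeStride(s, i); // code units used by the grapheme at i
    return s[0 .. i];
}

unittest
{
    // 'a' + COMBINING CIRCUMFLEX + COMBINING TILDE, followed by 'b'
    string s = "a\u0302\u0303b";
    assert(truncateGraphemes(s, 1) == "a\u0302\u0303"); // marks stay attached
    assert(truncateGraphemes(s, 2) == s);
    assert(truncateGraphemes(s, 0) == "");
}
```

Slicing the same string by code units or code points (for example `s[0 .. 1]`) would keep the bare 'a' and drop its combining marks, which is exactly the kind of cut to avoid.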

>
> So here's what I think can be done (and, at least the first two points,
> especially the first, should be done):
>
> 1. Mention the Grapheme and Digraph situation in string related documentation
> (std.string and maybe string-related stuff in std.algorithm like Splitter) to
> make sure people who use Phobos are aware of the problem. Then at least they
> can't say that nobody told them when their Objective-C using colleagues are
> laughing at their broken Unicode support ;)
>
> 2. Maybe add some functions that *do* deal with this.
> Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can
> check themselves, if they just split their string within a grapheme or something.
>
> 3. Include a proper Unicode-string type/module, if somebody has the time and
> knowledge to develop one. spir already started something like that AFAIK and
> Steven Schveighoffer is also working on a complete string type - maybe
> these efforts could be combined?
> I guess default strings will stay mostly the way they are (but please add an
> ASCII type or allow ubyte[] asciiStr = "asdf";).
> Having an additional type in Phobos that works correctly in all cases (e.g.
> Arabic, Hebrew, Japanese, ..) would be really great, though.
>
> UniString uStr = new UniString("sdfüñẫ");
> UniString uStr2 = uStr[3..$]; // "üñẫ"
> UniGraph ug = uStr[5]; // 'ẫ'
> size_t i = uStr2.length; // 3

of course I forgot:
   string s = uStr2.toString();
   dstring s2 = uStr2.toDString();
to convert it back to a "normal" string

> something like that maybe (of course plus a lot of other stuff like proper
> comparison for different encodings of the same char like a modified icmp()
> discussed before).
> But something like
> string str = "sdfüñẫ";
> size_t len = uniLen(str); // 6
> string s = uniSlice(str, 3, len); // == str.uniSlice(3, len); "üñẫ"
> etc may be just as good.
>
> (I hope this all made sense)
>
>>
>> Andrei
>
> Cheers,
> - Daniel
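
For illustration, the free-function variant (`uniLen` / `uniSlice`) could be sketched roughly as follows. This is only a sketch under assumptions: it relies on `byGrapheme` and `graphemeStride` from `std.uni` as found in recent Phobos releases, the names `uniLen` and `uniSlice` are the hypothetical ones from the proposal, and `graphemeOffset` is an extra helper invented here. It does no normalization, so differently encoded forms of the same character still compare unequal.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Number of graphemes in `s` (hypothetical name from the proposal).
size_t uniLen(string s)
{
    return s.byGrapheme.walkLength;
}

/// Code-unit offset of the n-th grapheme of `s`, or `s.length` if `s`
/// has fewer than `n` graphemes. (Helper invented for this sketch.)
size_t graphemeOffset(string s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        i += graphemeStride(s, i);
        --n;
    }
    return i;
}

/// Slice of `s` covering graphemes [from, to), indices counted in graphemes.
string uniSlice(string s, size_t from, size_t to)
{
    assert(from <= to, "uniSlice: from must not exceed to");
    immutable start = graphemeOffset(s, from);
    immutable end = start + graphemeOffset(s[start .. $], to - from);
    return s[start .. end];
}

unittest
{
    // "sdf" + u-diaeresis, n-tilde and a-circumflex-tilde, all written
    // with combining marks instead of precomposed characters
    string str = "sdfu\u0308n\u0303a\u0302\u0303";
    assert(str.length == 14);     // UTF-8 code units
    assert(str.walkLength == 10); // code points
    assert(uniLen(str) == 6);     // graphemes
    assert(uniSlice(str, 3, uniLen(str)) == "u\u0308n\u0303a\u0302\u0303");
    assert(str.uniSlice(5, 6) == "a\u0302\u0303");
}
```

Both functions walk the string from the front, so they are O(n); a dedicated string type could cache grapheme boundaries to make indexing cheaper.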