Why the hell doesn't foreach decode strings

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sat Oct 29 07:42:54 PDT 2011


On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
> <simen.kjaras at gmail.com> wrote:
>
>> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
>> <schveiguy at yahoo.com> wrote:
>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point.
>>
>> If this is correct (and it is), then decoding to dchar is simply not
>> enough.
>> You seem to advocate decoding to graphemes, which is a whole different
>> matter.
>
> I am advocating that. And it's a matter of perception. D can say "we
> only support code-point decoding" and what that means to a user is, "we
> don't support language as you know it." Sure it's a part of unicode, but
> it takes that extra piece to make it actually usable to people who
> require unicode.
>
> Even in English, fiancé has an accent. To say D supports unicode, but
> then have it fail a simple search on a file that contains a certain
> *valid* encoding of that word is disingenuous, to say the least.

Why doesn't that simple search work?

import std.algorithm, std.stdio;

void main() {
    foreach (line; stdin.byLine()) {
        if (line.canFind("fiancé")) {
            writeln("There it is.");
        }
    }
}
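
To pin down what is at stake, here is a minimal, self-contained sketch of
the two spellings mentioned above. The sample strings are invented, and the
result comments assume Phobos compares the strings as ranges of code points
rather than graphemes:

import std.algorithm : canFind;
import std.stdio : writeln;

void main() {
    string needle  = "fianc\u00E9";                     // precomposed: é is U+00E9
    string nfcLine = "her fianc\u00E9 arrived";         // same precomposed spelling
    string nfdLine = "her fianc\u0065\u0301 arrived";   // decomposed: e + combining acute

    writeln(nfcLine.canFind(needle)); // true:  the code points match one-to-one
    writeln(nfdLine.canFind(needle)); // false: U+0065 U+0301 is not U+00E9
}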

> D needs a fully unicode-aware string type. I advocate D should use it as
> the default string type, but it needs one whether it's the default or
> not in order to say it supports unicode.

How do you define "supports Unicode"? For my money, the main sin of 
(w)string is that it offers [] and .length with potentially confusing 
semantics, so if I could I'd curb, not expand, its interface.
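
To make that potential confusion concrete, a small sketch; the counts in the
comments assume the precomposed U+00E9 spelling and a UTF-8 source, and
walkLength comes from std.range:

import std.range : walkLength;
import std.stdio : writeln;

void main() {
    string s = "fianc\u00E9";     // 6 characters as a reader counts them

    writeln(s.length);            // 7: .length counts UTF-8 code units
    writeln(s.walkLength);        // 6: iterating the range yields decoded code points
    writeln(cast(ubyte) s[5]);    // 195: indexing returns one byte of the two-byte é
}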


Andrei

