Why the hell doesn't foreach decode strings

Mon Oct 31 06:20:52 PDT 2011

On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
>> On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
>> <simen.kjaras at gmail.com> wrote:
>>
>>> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
>>> <schveiguy at yahoo.com> wrote:
>>>
>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>> character, but may be a separate code point.
>>>
>>> If this is correct (and it is), then decoding to dchar is simply not
>>> enough.
>>> You seem to advocate decoding to graphemes, which is a whole different
>>> matter.
>>
>> I am advocating that. And it's a matter of perception. D can say "we
>> only support code-point decoding" and what that means to a user is, "we
>> don't support language as you know it." Sure it's a part of unicode, but
>> it takes that extra piece to make it actually usable to people who
>> require unicode.
>>
>> Even in English, fiancé has an accent. To say D supports unicode, but
>> then won't do a simple search on a file which contains a certain *valid*
>> encoding of that word is disingenuous to say the least.
>
> Why doesn't that simple search work?
>
> foreach (line; stdin.byLine()) {
>      if (line.canFind("fiancé")) {
>         writeln("There it is.");
>      }
> }

I think Jonathan answered that quite well, nothing else to add...
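
For anyone reading the archive without the rest of the thread, here is a minimal sketch of one way such a search can miss, assuming a current Phobos (std.uni.normalize did not exist in this form when the thread started). The helper name containsNormalized is hypothetical, and the example assumes the input contains the decomposed spelling of the word:

```d
import std.algorithm.searching : canFind;
import std.uni : normalize, NFC;

// Hypothetical helper: compare after bringing both sides to NFC.
bool containsNormalized(const(char)[] haystack, const(char)[] needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    // Precomposed: the accented e is the single code point U+00E9.
    string composed = "fianc\u00E9";
    // Decomposed: plain 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "fiance\u0301";

    string line = "my " ~ decomposed ~ " called";

    // The code point sequences differ, so a plain search misses the word.
    assert(!line.canFind(composed));

    // With both sides in the same normalization form, it is found.
    assert(containsNormalized(line, composed));
}
```

Both spellings are valid Unicode and render the same, which is why a code-point-level comparison alone is not enough here.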

>
>> D needs a fully unicode-aware string type. I advocate D should use it as
>> the default string type, but it needs one whether it's the default or
>> not in order to say it supports unicode.
>
> How do you define "supports Unicode"? For my money, the main sin of  
> (w)string is that it offers [] and .length with potentially confusing  
> semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that I'm trying to figure out what  
sin(string) (as in sine) means :)

I think there are two problems with [] and .length.  First, that they  
imply "get nth character" and "number of characters" respectively, and  
second, that many times they *actually are* those things.
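
A small sketch of both problems, assuming a current Phobos with std.uni.byGrapheme (graphemeCount is a hypothetical helper, not an existing API). For plain ASCII the three notions of length agree, which is exactly what makes [] and .length look right; add one combining accent and they diverge:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters.
size_t graphemeCount(const(char)[] s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string ascii = "fiance";
    string accented = "fiance\u0301"; // 'e' + U+0301 combining acute

    // ASCII: code units, code points and graphemes all coincide.
    assert(ascii.length == 6);
    assert(ascii.walkLength == 6);
    assert(graphemeCount(ascii) == 6);

    // With the combining accent, each level gives a different answer.
    assert(accented.length == 8);          // UTF-8 code units
    assert(accented.walkLength == 7);      // code points
    assert(graphemeCount(accented) == 6);  // graphemes

    // Indexing yields a code unit, here the first byte of U+0301.
    assert(accented[6] == 0xCC);
}
```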

So I agree with you that the proposed string type needs to curb that  
interface, while giving us a fully character/grapheme-aware interface  
(which is currently lacking).

I made an early attempt at doing this, and I will eventually get around to  
finishing it.  I was in the middle of creating an algorithm to  
as-efficiently-as-possible delineate a grapheme, when I got sidetracked by  
other things :)  There are still lingering issues with the language that  
make this a less-than-ideal replacement (arrays currently enjoy a lot of  
"extra features" that custom types do not).

-Steve