Why the hell doesn't foreach decode strings
Steven Schveighoffer
schveiguy at yahoo.com
Mon Oct 31 06:20:52 PDT 2011
On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
> On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
>> On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
>> <simen.kjaras at gmail.com> wrote:
>>
>>> On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
>>> <schveiguy at yahoo.com> wrote:
>>>
>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>> character, but may be a separate code point.
>>>
>>> If this is correct (and it is), then decoding to dchar is simply not
>>> enough.
>>> You seem to advocate decoding to graphemes, which is a whole different
>>> matter.
>>
>> I am advocating that. And it's a matter of perception. D can say "we
>> only support code-point decoding" and what that means to a user is, "we
>> don't support language as you know it." Sure it's a part of unicode, but
>> it takes that extra piece to make it actually usable to people who
>> require unicode.
>>
>> Even in English, fiancé has an accent. To say D supports unicode, but
>> then won't do a simple search on a file which contains a certain *valid*
>> encoding of that word is disingenuous to say the least.
>
> Why doesn't that simple search work?
>
> foreach (line; stdin.byLine()) {
>     if (line.canFind("fiancé")) {
>         writeln("There it is.");
>     }
> }
I think Jonathan answered that quite well, nothing else to add...
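For anyone reading this without the rest of the thread, here is a minimal sketch of the failure. It assumes a Phobos that provides std.uni.normalize, which is my assumption and not something the quoted code relies on, and foundAfterNormalizing is just a hypothetical helper name for illustration. The same word is stored once with a precomposed é (U+00E9) and once as 'e' followed by the combining acute accent (U+0301); a code-point search treats them as different text unless both sides are normalized first.

```d
import std.algorithm.searching : canFind;
import std.uni : normalize, NFC;

// Hypothetical helper: normalize both sides to NFC before searching.
bool foundAfterNormalizing(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    // Two valid encodings of the same word.
    string precomposed = "fianc\u00E9";  // é as the single code point U+00E9
    string decomposed  = "fiance\u0301"; // 'e' followed by U+0301 (combining acute)

    // canFind compares code points, so the decomposed text
    // does not contain the precomposed needle.
    assert(!decomposed.canFind(precomposed));

    // With both sides in the same normalization form the search succeeds.
    assert(foundAfterNormalizing(decomposed, precomposed));
}
```

Normalizing every line costs extra work, so this is only one possible way to handle it, not a claim about what the default should be.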
>
>> D needs a fully unicode-aware string type. I advocate D should use it as
>> the default string type, but it needs one whether it's the default or
>> not in order to say it supports unicode.
>
> How do you define "supports Unicode"? For my money, the main sin of
> (w)string is that it offers [] and .length with potentially confusing
> semantics, so if I could I'd curb, not expand, its interface.
LOL, I'm so used to programming that I'm trying to figure out what
sin(string) (as in sine) means :)
I think there are two problems with [] and .length. First, that they
imply "get nth character" and "number of characters" respectively, and
second, that many times they *actually are* those things.
So I agree with you that the proposed string type needs to curb that
interface, while giving us a fully character/grapheme-aware interface
(which is currently lacking).
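Both problems fit in a few lines. This is a rough sketch, assuming std.uni.byGrapheme and std.utf.byDchar are available in your Phobos; codePointCount and graphemeCount are hypothetical helper names, not part of any proposed string type. For plain ASCII, .length really is the number of characters, which is exactly why it misleads as soon as an accent shows up.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Hypothetical helpers, for illustration only.
size_t codePointCount(string s) { return s.byDchar.walkLength; }
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // ASCII only: code units, code points and graphemes all agree.
    string plain = "fiance";
    assert(plain.length == 6);
    assert(codePointCount(plain) == 6);
    assert(graphemeCount(plain) == 6);

    // Same word with 'e' + U+0301 (combining acute accent).
    string accented = "fiance\u0301";
    assert(accented.length == 8);          // UTF-8 code units
    assert(codePointCount(accented) == 7); // code points
    assert(graphemeCount(accented) == 6);  // what a reader counts as characters

    // Indexing yields a code unit: here the bare 'e', without its accent.
    assert(accented[5] == 'e');
}
```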
I made an early attempt at doing this, and I will eventually get around to
finishing it. I was in the middle of creating an algorithm to
as-efficiently-as-possible delineate a grapheme, when I got sidetracked by
other things :) There are still lingering issues with the language that
make this a less-than-ideal replacement (arrays currently enjoy a lot of
"extra features" that custom types do not).
-Steve