Why the hell doesn't foreach decode strings

Steven Schveighoffer schveiguy at yahoo.com
Mon Oct 24 12:41:57 PDT 2011


On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<simen.kjaras at gmail.com> wrote:

> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer  
> <schveiguy at yahoo.com> wrote:
>
>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  
>> <newshound2 at digitalmars.com> wrote:
>>
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would  
>>> be less efficient if decoding was done.
>>
>> Searching that does not do decoding is fundamentally incorrect.  That  
>> is, if you want to find a substring in a string, you cannot just  
>> compare chars.
>
> Assuming both strings are valid UTF-8, you can. Continuation bytes can
> never be confused with the first byte of a code point, and the first
> byte always identifies how many continuation bytes there should be.
>

As others have pointed out to me in the past (and I once thought the same
as you), the same character can be encoded in *different ways*.  The
strings must be normalized before they can be compared accurately.
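
For example (a quick sketch -- it assumes a Phobos whose std.uni provides
a normalize function, so treat normalize!NFC below as hypothetical for
now), the precomposed and decomposed encodings of é only compare equal
after normalization:

import std.uni;

void main()
{
    string precomposed = "\u00E9";   // é as a single code point, U+00E9
    string decomposed  = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

    assert(precomposed != decomposed);  // raw comparison sees different bytes
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}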

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.  If it falls on the last
character of a word such as fiancé, then searching for fiance will produce
a match without proper decoding!  But if fiancé uses a precomposed é, it
won't match.  So of two valid representations of the same word, one
matches and the other doesn't.  It's a complete mess without proper
Unicode decoding.
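
To make that concrete (again just a sketch, using canFind from
std.algorithm, plus the same std.uni normalize!NFC assumption as above):

import std.algorithm;
import std.uni;

void main()
{
    string decomposed  = "fiance\u0301"; // 'e' followed by a combining acute
    string precomposed = "fianc\u00E9";  // precomposed é, U+00E9

    // Raw search gives two different answers for the same word:
    assert(decomposed.canFind("fiance"));    // matches -- a false positive
    assert(!precomposed.canFind("fiance"));  // doesn't match

    // Normalize both haystack and needle first, and the answers agree:
    assert(!normalize!NFC(decomposed).canFind(normalize!NFC("fiance")));
}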

-Steve

