Why the hell doesn't foreach decode strings
Steven Schveighoffer
schveiguy at yahoo.com
Mon Oct 24 12:41:57 PDT 2011
On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<simen.kjaras at gmail.com> wrote:
> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
> <schveiguy at yahoo.com> wrote:
>
>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>> <newshound2 at digitalmars.com> wrote:
>>
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would
>>> be less efficient if decoding was done.
>>
>> Searching that does not do decoding is fundamentally incorrect. That
>> is, if you want to find a substring in a string, you cannot just
>> compare chars.
>
> Assuming both strings are valid UTF-8, you can. Continuation bytes can
> never be confused with the first byte of a code point, and the first
> byte always identifies how many continuation bytes should follow.
>
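Simen's point about self-synchronization can be checked directly. A quick sketch (Python used for illustration, since the thread's D code isn't shown): a multi-byte UTF-8 needle can only match at a real code-point boundary, because continuation bytes (0x80-0xBF) never resemble a leading byte.

```python
# Byte-level substring search on valid UTF-8 (illustration in Python, not D).
haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

idx = haystack.find(needle)
print(idx)  # 7 — "naïve " occupies 7 bytes, since "ï" encodes as two

# The match lands on a clean code-point boundary: everything before it
# decodes without error.
assert haystack[:idx].decode("utf-8") == "naïve "
```

So for byte-identical encodings, a raw byte search does find the right offset; the dispute below is about when the encodings are *not* byte-identical.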
As others have pointed out to me in the past (and I once thought as you
do), the same characters can be encoded in *different ways*. The strings
must be normalized to compare accurately.
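For instance, a minimal sketch of that normalization step (Python's unicodedata module used here for illustration):

```python
import unicodedata

precomposed = "fianc\u00e9"   # é as a single code point (U+00E9)
combining   = "fiance\u0301"  # e followed by a combining acute (U+0301)

# Code-point (or byte) comparison says the two spellings differ...
assert precomposed != combining

# ...but after normalizing both to NFC, they compare equal.
assert (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", combining))
```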
Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point. If it falls on the last
character of a word, as in fiancé, then searching for fiance will
produce a match without proper decoding! And if fiancé uses a
precomposed é, it won't match at all. So two valid representations of
the same word either match or they don't, depending on encoding. It's
just a complete mess without proper Unicode decoding.
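Both failure modes described above can be reproduced in a few lines (Python again, for illustration):

```python
precomposed = "fianc\u00e9"   # fiancé with a precomposed é (U+00E9)
combining   = "fiance\u0301"  # fiancé as e + combining acute (U+0301)

# Naive (code-unit) search: false positive on the combining form,
# because "fiance" is a literal prefix of it...
assert "fiance" in combining

# ...and a miss on the precomposed form of the very same word.
assert "fiance" not in precomposed
```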
-Steve
More information about the Digitalmars-d mailing list