Why the hell doesn't foreach decode strings

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Oct 24 13:18:57 PDT 2011


On 24.10.2011 23:41, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
> <simen.kjaras at gmail.com> wrote:
>
>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>> <schveiguy at yahoo.com> wrote:
>>
>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>> <newshound2 at digitalmars.com> wrote:
>>>
>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>> Which operations do you believe would be less efficient?
>>>>
>>>> All of the ones that don't require decoding, such as searching,
>>>> would be less efficient if decoding was done.
>>>
>>> Searching that does not do decoding is fundamentally incorrect. That
>>> is, if you want to find a substring in a string, you cannot just
>>> compare chars.
>>
>> Assuming both strings are valid UTF-8, you can. Continuation bytes can
>> never be confused with the first byte of a code point, and the first
>> byte always identifies how many continuation bytes there should be.
>>
>
> As others have pointed out in the past to me (and I thought as you did
> once), the same characters can be encoded in *different ways*. They must
> be normalized to accurately compare.
>

As long as language support stays at the stage of "a code point is a 
character", it's totally expected to ignore combining modifiers and 
compare identically normalized UTF without decoding. Yes, it risks 
hitting certain issues.
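
For example (a minimal sketch of mine, not something the language does 
for you; it assumes both sides are valid UTF-8 in the same normalization 
form -- here NFC, with a precomposed é -- and the only library call is 
std.algorithm.find):

// Sketch, assuming identically normalized, valid UTF-8 on both sides.
import std.algorithm : find;

void main()
{
    // Both sides use the precomposed é (U+00E9), i.e. the same (NFC) form.
    string hay    = "her fianc\u00E9 arrived";
    string needle = "fianc\u00E9";

    // A ubyte[] view, so nothing below auto-decodes.
    auto h = cast(immutable(ubyte)[]) hay;
    auto n = cast(immutable(ubyte)[]) needle;

    // Byte-level substring search; valid UTF-8 guarantees the hit can only
    // start on a code point boundary, since continuation bytes (10xxxxxx)
    // never look like lead bytes.
    auto hit = find(h, n);
    assert(hit.length != 0);

    // Plain array equality likewise compares code units, no decoding needed.
    assert(cast(string) hit[0 .. n.length] == needle);
}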

> Plus, a combining character (such as an umlaut or accent) is part of a
> character, but may be a separate code point. If that's on the last
> character in the word such as fiancé, then searching for fiance will
> result in a match without proper decoding!

Now, if you are going to handle real characters... If the source and the 
needle are both normalized, you can still avoid a lot of work by 
searching without decoding. All you need to decode is a single code 
point per successful match, to see whether a combining modifier follows 
the matched portion.
It also depends on how you want to match: a case-insensitive search will 
be a lot more complicated, but it still boils down to this:
1) do an inexact search to get a likely match (false positives are OK, 
false negatives are not) - no decoding here;
2) once a candidate is found, verify it (or the relevant parts of it) 
with proper decoding (sketched below).

There are cultural subtleties that complicate these steps if you take 
them into account, but it's doable.
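
Roughly, the two steps could look like this (a rough sketch of mine, not 
code from anywhere; hasWholeMatch is a made-up name, it leans on 
std.algorithm.find, std.utf.decode and std.uni.isMark, and it assumes 
haystack and needle are valid UTF-8 in the same normalization form):

import std.algorithm : find;
import std.uni : isMark;
import std.utf : decode;

// Hypothetical helper: true if `needle` occurs in `hay` and the occurrence
// is not immediately followed by a combining mark (which would make the
// last matched code point part of a longer user-perceived character).
bool hasWholeMatch(string hay, string needle)
{
    auto h = cast(immutable(ubyte)[]) hay;
    auto n = cast(immutable(ubyte)[]) needle;

    while (true)
    {
        // Step 1: inexact, byte-only search. False positives are fine here,
        // false negatives are not -- and valid UTF-8 means no misaligned hits.
        auto rest = find(h, n);
        if (rest.length == 0)
            return false;               // no candidate left

        // Step 2: decode exactly one code point after the matched portion.
        auto tail = cast(string) rest[n.length .. $];
        if (tail.length == 0)
            return true;                // match runs to the end of the string
        size_t i = 0;
        if (!isMark(decode(tail, i)))
            return true;                // a clean, whole match

        // The candidate ended inside a combined character; move one code
        // unit past its start and keep searching.
        h = rest[1 .. $];
    }
}

unittest
{
    // "e" + combining acute (U+0301): the byte search alone would match,
    // the verification step rejects it.
    assert(!hasWholeMatch("fiance\u0301", "fiance"));
    // Precomposed é on both sides: the byte search is already exact.
    assert(hasWholeMatch("her fianc\u00E9 arrived", "fianc\u00E9"));
    // Plain ASCII tail: nothing to reject.
    assert(hasWholeMatch("her fiance arrived", "fiance"));
}

(Compile with -unittest to run the checks.)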

> Or if fiancé uses a
> precomposed é, it won't match. So two valid representations of the word
> either match or they don't. It's just a complete mess without proper
> unicode decoding.

It's a complete mess even with proper decoding ;)


>
> -Steve


-- 
Dmitry Olshansky

