Why the hell doesn't foreach decode strings

Mon Oct 24 15:12:46 PDT 2011

On 10/24/2011 10:52 PM, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
> <dmitry.olsh at gmail.com> wrote:
>
>> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>>> <simen.kjaras at gmail.com> wrote:
>>>
>>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>>> <schveiguy at yahoo.com> wrote:
>>>>
>>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>>> <newshound2 at digitalmars.com> wrote:
>>>>>
>>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>>> Which operations do you believe would be less efficient?
>>>>>>
>>>>>> All of the ones that don't require decoding, such as searching,
>>>>>> would be less efficient if decoding was done.
>>>>>
>>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>>> is, if you want to find a substring in a string, you cannot just
>>>>> compare chars.
>>>>
>>>> Assuming both string are valid UTF-8, you can. Continuation bytes can
>>>> never
>>>> be confused with the first byte of a code point, and the first byte
>>>> always
>>>> identifies how many continuation bytes there should be.
>>>>
>>>
>>> As others have pointed out in the past to me (and I thought as you did
>>> once), the same characters can be encoded in *different ways*. They must
>>> be normalized to accurately compare.
>>>
>>
>> Assuming language support stays on stage of "codepoint is a character"
>> it's totaly expected to ignore modifiers and compare identically
>> normalized UTF without decoding. Yes, it risks to hit certain issues.
>
> Again, the "risk" is that it fails to achieve the goal you ask of it!
>
> D-language: Here, use this search algorithm, it works most of the time,
> but may not work correctly in some cases. If you run into one of those
> cases, you'll have to run a specialized search algorithm for strings.
> User: How do I know I hit one of those cases?
> D-language: You'll have to run the specialized version to find out.
> User: Why wouldn't I just run the specialized version first?
> D-language: Well, because it's slower!
> User: But don't I have to use both algorithms to make sure I find the data?
> D-language: Only if you "care" about accuracy!
>
> Call me ludicrous, but is this really what we want to push on someone as
> a "unicode-aware" language?
>
>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point. If that's on the last
>>> character in the word such as fiancé, then searching for fiance will
>>> result in a match without proper decoding!
>>
>> Now if you are going to do real characters... If source/needle are
>> normalized you still can avoid lots of work by searching without
>> decoding. All you need to decode is one codepoint on each successful
>> match to see if there is a modifier at end of matched portion.
>> But it depends on how you want to match if it's case-insensitive
>> search it will be a lot more complicated, but anyway it boils down to
>> this:
>> 1) do inexact search, get likely match ( false positives are OK,
>> negatives not) no decoding here
>> 2) once found check it (or parts of it) with proper decoding
>>
>> There are cultural subtleties, that complicate these steps if you take
>> them into account, but it's doable.
>
> I agree with you that simple searches using only byte (or dchar)
> comparison does not work, and can be optimized based on several factors.
> The easiest thing is to find a code unit sequence that only has one
> valid form, then search for that without decoding. Then when found,
> decode the characters around it. Or if that isn't possible, create all
> the un-normalized forms for one grapheme (based on how likely it is to
> occur), and search for one of those in the undecoded stream.
>
> This can all be built into a specialized string type. There's actually
> some really interesting problems to solve in this space I think. I've
> created a basic string type that has lamented in my unfinished pile of
> stuff to do. I think it can be done in a way that is close to as
> efficient as arrays for the most common operations (slicing, indexing,
> etc.), but is *correct* before it is efficient. You should always be
> able to drop into "array mode" and deal with the code-units.
>
>> Or if fiancé uses a
>>> precomposed é, it won't match. So two valid representations of the word
>>> either match or they don't. It's just a complete mess without proper
>>> unicode decoding.
>>
>> It's a complete mess even with proper decoding ;)
>
> Yes, all the more reason to solve the problem correctly so the hapless
> unicode novice user doesn't have to!
>
> -Steve

The answer is probably to get proper unicode normalization into Phobos. 
The efficient versions can then specify that they are guaranteed to work 
perfectly accurate on normalized input only.