Why the hell doesn't foreach decode strings

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Oct 24 15:25:25 PDT 2011


On 25.10.2011 0:52, Steven Schveighoffer wrote:
> On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
> <dmitry.olsh at gmail.com> wrote:
>
>> On 24.10.2011 23:41, Steven Schveighoffer wrote:
>>> On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
>>> <simen.kjaras at gmail.com> wrote:
>>>
>>>> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
>>>> <schveiguy at yahoo.com> wrote:
>>>>
>>>>> On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
>>>>> <newshound2 at digitalmars.com> wrote:
>>>>>
>>>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>>>> Which operations do you believe would be less efficient?
>>>>>>
>>>>>> All of the ones that don't require decoding, such as searching,
>>>>>> would be less efficient if decoding was done.
>>>>>
>>>>> Searching that does not do decoding is fundamentally incorrect. That
>>>>> is, if you want to find a substring in a string, you cannot just
>>>>> compare chars.
>>>>
>>>> Assuming both strings are valid UTF-8, you can. Continuation bytes can
>>>> never
>>>> be confused with the first byte of a code point, and the first byte
>>>> always
>>>> identifies how many continuation bytes there should be.
>>>>
>>>
>>> As others have pointed out in the past to me (and I thought as you did
>>> once), the same characters can be encoded in *different ways*. They must
>>> be normalized to accurately compare.
>>>
>>
>> As long as language support stays at the "a code point is a character"
>> stage, it's entirely expected to ignore modifiers and compare identically
>> normalized UTF without decoding. Yes, it risks hitting certain issues.
>
> Again, the "risk" is that it fails to achieve the goal you ask of it!
>

Last time I checked, the legion of "everything is ASCII" zealots was 
pretty large ;) So the "goal" is a blurry line: "search for a substring 
in this chunk of Unicode text" can mean very different things, and be 
correct in its own sense, on about three different levels (code units, 
code points, graphemes). There is no single default, so I'm not sure we 
have to embed all of this complexity into the language right now. Phobos 
is where we must implement it first; once that works, we can revisit our 
pick of default comparison method, etc.
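
To make those "levels" concrete, here is a quick sketch in D (just an 
illustration; normalize is assumed to be the kind of API std.uni would 
gain, it's not in Phobos yet):

import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni : normalize;   // assumed: not in today's std.uni

void main()
{
    string haystack = "her fiance\u0301 arrived"; // 'e' + U+0301 combining acute
    string needle   = "fianc\u00E9";              // precomposed U+00E9

    // Level 1: raw code units -- fast, but the two encodings don't match.
    writeln(canFind(haystack, needle));                         // false

    // Level 2: normalize both sides first -- a plain search works again.
    writeln(canFind(normalize(haystack), normalize(needle)));   // true

    // Level 3: graphemes -- "fiance" should NOT match here, yet it does
    // at the code-unit/code-point level because the accent trails it.
    writeln(canFind(haystack, "fiance"));                       // true (!)
}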

> D-language: Here, use this search algorithm, it works most of the time,
> but may not work correctly in some cases. If you run into one of those
> cases, you'll have to run a specialized search algorithm for strings.
> User: How do I know I hit one of those cases?
> D-language: You'll have to run the specialized version to find out.
> User: Why wouldn't I just run the specialized version first?
> D-language: Well, because it's slower!
> User: But don't I have to use both algorithms to make sure I find the data?
> D-language: Only if you "care" about accuracy!
>
> Call me ludicrous, but is this really what we want to push on someone as
> a "unicode-aware" language?
>

No, but we might want string search to do a little more: namely, check 
whether the match split a grapheme (assuming the normalizations match). 
BTW, adding normalization to std.uni is another thing to do right now. 
That should be a good enough start, and looking at the Unicode standard, 
things are rather fluid there, meaning that a "unicode-aware" language 
should be suffixed with a version number :)

>>
>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>> character, but may be a separate code point. If that's on the last
>>> character in the word such as fiancé, then searching for fiance will
>>> result in a match without proper decoding!
>>
>> Now if you are going to do real characters... If source and needle are
>> normalized, you can still avoid lots of work by searching without
>> decoding. All you need to decode is one code point on each successful
>> match, to see if there is a modifier at the end of the matched portion.
>> It depends on how you want to match: if it's a case-insensitive
>> search it will be a lot more complicated, but anyway it boils down to
>> this:
>> 1) do an inexact search to get a likely match (false positives are OK,
>> false negatives are not) - no decoding here
>> 2) once found, check it (or parts of it) with proper decoding
>>
>> There are cultural subtleties that complicate these steps if you take
>> them into account, but it's doable.
>
> I agree with you that simple searches using only byte (or dchar)
> comparison do not work, and can be optimized based on several factors.
> The easiest thing is to find a code unit sequence that only has one
> valid form, then search for that without decoding. Then when found,
> decode the characters around it. Or if that isn't possible, create all
> the un-normalized forms for one grapheme (based on how likely it is to
> occur), and search for one of those in the undecoded stream.
>

Yes, it all revolves around finding every variation of a substring.
If the user is willing to spend some time up front, this could be done 
in one pass, e.g. using the trick I employed in the new std.regex.
For a rough idea, see the Shift-Or string search, which is an easy and 
fast inexact search:
http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

There are even better variations of this stuff in the wild.
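
For reference, a bare-bones Shift-Or over raw UTF-8 code units looks 
like this (exact-match variant only, function name is mine; the inexact 
versions extend the same bit-parallel idea):

import std.stdio : writeln;

// Shift-Or (bitap) exact search over raw code units: no decoding in the
// scan loop. Pattern length is capped at one machine word; returns the
// byte offset of the first match, or -1.
ptrdiff_t shiftOr(const(ubyte)[] text, const(ubyte)[] pat)
{
    immutable m = pat.length;
    assert(m > 0 && m <= size_t.sizeof * 8, "pattern too long for one word");

    // Bit i of mask[c] is 0 iff pat[i] == c.
    size_t[256] mask = size_t.max;
    foreach (i, b; pat)
        mask[b] &= ~(cast(size_t) 1 << i);

    size_t state = size_t.max;                 // all prefixes "not matched"
    immutable size_t accept = cast(size_t) 1 << (m - 1);
    foreach (j, b; text)
    {
        state = (state << 1) | mask[b];        // extend every live prefix
        if ((state & accept) == 0)             // bit m-1 clear: full match
            return cast(ptrdiff_t)(j - m + 1);
    }
    return -1;
}

void main()
{
    auto text = cast(const(ubyte)[]) "поиск подстроки";
    auto pat  = cast(const(ubyte)[]) "подстрок";
    writeln(shiftOr(text, pat));               // byte offset of the match
}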

> This can all be built into a specialized string type. There's actually
> some really interesting problems to solve in this space I think. I've
> created a basic string type that has lamented in my unfinished pile of
> stuff to do. I think it can be done in a way that is close to as
> efficient as arrays for the most common operations (slicing, indexing,
> etc.), but is *correct* before it is efficient. You should always be
> able to drop into "array mode" and deal with the code-units.
>

Here is where we probably differ: to me there is no simple, safe, fast 
string type. Why? Because as soon as you need anything beyond "pedantic 
and safe above all", you end up working with an implicit data type such 
as "a chunk of UTF-8 bytes normalized to NFC", ...
I'd rather see strings as sealed arrays with a bunch of meta info, plus 
string functions that specialize on that info in order to do some cool 
speed hacks. Of course, drilling down to the raw data should be possible 
for the user as well.
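
As a sketch of what I mean (all names hypothetical, and normalize again 
assumed to be a future std.uni function):

import std.algorithm : canFind;
import std.stdio : writeln;
import std.uni : normalize;                 // assumed API

enum Norm { unknown, nfc, nfd }

// Hypothetical "sealed array + meta info" string: the code units are
// private, and a tag records what we already know about them.
struct UniString
{
    private string data;
    private Norm norm = Norm.unknown;

    this(string s, Norm n = Norm.unknown) { data = s; norm = n; }

    // Drilling down to the raw code units stays possible.
    @property string raw() const { return data; }
    @property Norm normalization() const { return norm; }
}

// A string function specializing on the meta info: when both sides are
// known to be NFC, search the bytes directly; otherwise normalize first.
bool contains(UniString haystack, UniString needle)
{
    if (haystack.normalization == Norm.nfc && needle.normalization == Norm.nfc)
        return canFind(haystack.raw, needle.raw);                   // fast path
    return canFind(normalize(haystack.raw), normalize(needle.raw)); // safe path
}

void main()
{
    auto a = UniString("her fianc\u00E9 left", Norm.nfc);
    auto b = UniString("fianc\u00E9", Norm.nfc);
    writeln(contains(a, b));                // true, via the fast path
}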

>> Or if fiancé uses a
>>> precomposed é, it won't match. So two valid representations of the word
>>> either match or they don't. It's just a complete mess without proper
>>> unicode decoding.
>>
>> It's a complete mess even with proper decoding ;)
>
> Yes, all the more reason to solve the problem correctly so the hapless
> unicode novice user doesn't have to!
>


A hapless Unicode novice doesn't stand a chance.
I'm serious. One needs to know a lot of these "basics" (encodings, 
normalizations, etc.) to understand, for example, why indexing by "true" 
character can be so slow. Yet we can save a metric ton of special-cased 
crap from ever hitting their eyes.


-- 
Dmitry Olshansky

