Why the hell doesn't foreach decode strings

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Oct 26 14:51:32 PDT 2011


On 26.10.2011 16:12, Steven Schveighoffer wrote:
<snip>
>
> I agree the string type needs to be fleshed out before the language
> adopts it. I think we could legitimately define a string type that
> auto-assigns from its respective char[] array type. Once it's shown
> that the type is nearly as fast as a char[] array, it can be used as the
> default type for string literals (the only real place where the language
> gets in the way).
>

That's it.
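
Something along these lines, perhaps? A minimal sketch (all names are mine)
just to pin down what "auto-assigns from its char[] array type" could look
like:

struct String
{
    private immutable(char)[] data;   // sealed code units

    this(immutable(char)[] s) { data = s; }
    void opAssign(immutable(char)[] s) { data = s; }

    // escape hatch back to the raw code units
    @property immutable(char)[] codeUnits() const { return data; }
}

unittest
{
    String s = "hello";   // constructed straight from a literal
    s = "world";          // and re-assigned from one
}

The interesting part is everything that is *not* in the sketch: iteration,
slicing and comparison, and whether those go by code unit, code point or
grapheme.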

>>> D-language: Here, use this search algorithm, it works most of the time,
>>> but may not work correctly in some cases. If you run into one of those
>>> cases, you'll have to run a specialized search algorithm for strings.
>>> User: How do I know I hit one of those cases?
>>> D-language: You'll have to run the specialized version to find out.
>>> User: Why wouldn't I just run the specialized version first?
>>> D-language: Well, because it's slower!
>>> User: But don't I have to use both algorithms to make sure I find the
>>> data?
>>> D-language: Only if you "care" about accuracy!
>>>
>>> Call me ludicrous, but is this really what we want to push on someone as
>>> a "unicode-aware" language?
>>>
>>
>> No, but we might want to fix a string search to do a little more -
>> namely check if it skewed a grapheme (assuming normalizations match).
>
> That's a big assumption. Valid unicode is valid unicode, even if it's
> not normalized.
>
>> BTW adding normalization to std.uni is another thing to do right now.
>> That should be a good enough start, and looking at the unicode standard,
>> things are rather fluid there, meaning that a "unicode-aware" language
>> should be suffixed by a version number :)
>
> I agree we need normalization and it is not necessary to involve the
> compiler in this. I'd suggest a two to three phase approach:
>
> 1. Leave phobos' schizo view of "char arrays are not arrays" for now,
> and build the necessary framework to get a string type that actually works.
> 2. Remove schizo view.
> 3. (optional) make compiler use library-defined string type for string
> literals.
>

Something like that. I may lend a hand here, as I plan to merge some
unicode stuff between std.regex and std.uni. As extras, probably striding
by grapheme and support for various degrees of case-insensitive string
comparison.
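
For the simplest of those degrees, I mean roughly the following (a sketch
only, assuming a dchar-level toLower in std.uni plus std.utf.decode; full
case folding and locale tailoring are the harder degrees):

import std.uni : toLower;
import std.utf : decode;

// naive per-codepoint case-insensitive equality: no normalization and
// no full case folding, just to show where the work happens
bool equalCaseless(const(char)[] a, const(char)[] b)
{
    size_t i, j;
    while (i < a.length && j < b.length)
    {
        if (toLower(decode(a, i)) != toLower(decode(b, j)))
            return false;
    }
    return i == a.length && j == b.length;
}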

>>
>>>>
>>>>> Plus, a combining character (such as an umlaut or accent) is part of a
>>>>> character, but may be a separate code point. If that's on the last
>>>>> character in the word such as fiancé, then searching for fiance will
>>>>> result in a match without proper decoding!
>>>>
>>>> Now if you are going to do real characters... If source/needle are
>>>> normalized you still can avoid lots of work by searching without
>>>> decoding. All you need to decode is one codepoint on each successful
>>>> match to see if there is a modifier at the end of the matched portion.
>>>> But it depends on how you want to match: if it's a case-insensitive
>>>> search it will be a lot more complicated, but anyway it boils down to
>>>> this:
>>>> 1) do inexact search, get likely match (false positives are OK,
>>>> negatives not) no decoding here
>>>> 2) once found check it (or parts of it) with proper decoding
>>>>
>>>> There are cultural subtleties that complicate these steps if you take
>>>> them into account, but it's doable.
>>>
>>> I agree with you that simple searches using only byte (or dchar)
>>> comparison do not work, and can be optimized based on several factors.
>>> The easiest thing is to find a code unit sequence that only has one
>>> valid form, then search for that without decoding. Then when found,
>>> decode the characters around it. Or if that isn't possible, create all
>>> the un-normalized forms for one grapheme (based on how likely it is to
>>> occur), and search for one of those in the undecoded stream.
>>>
>>
>> Yes, it all revolves around finding all of the variations of a substring.
>> If the user is willing to spend some time upfront, this could be done in
>> one pass, e.g. using the trick I employed in the new std.regex.
>> For a rough idea see ShiftOr string search,
>> which is an easy & fast inexact search:
>> http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060
>>
>> There are even better variations of this stuff in the wild.
>
> This sounds good. I'll have a look at it when I have time.
>
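
To make that more concrete, here is roughly the shape of a Shift-Or matcher
over raw UTF-8 code units. A sketch only: the needle is capped at the machine
word size, and a hit still has to be verified with proper decoding around the
match, as described above.

size_t shiftOrFind(const(char)[] haystack, const(char)[] needle)
{
    assert(needle.length > 0 && needle.length <= size_t.sizeof * 8);

    size_t[256] mask;
    mask[] = ~cast(size_t)0;                 // all ones: "no match here"
    foreach (i, c; needle)
        mask[cast(ubyte)c] &= ~(cast(size_t)1 << i);

    size_t state = ~cast(size_t)0;
    immutable hit = cast(size_t)1 << (needle.length - 1);
    foreach (i, c; haystack)                 // iterates code units, no decoding
    {
        state = (state << 1) | mask[cast(ubyte)c];
        if ((state & hit) == 0)
            return i + 1 - needle.length;    // index of the first code unit
    }
    return size_t.max;                       // not found
}

The fuzzy variations (classes of code units, case variants) build on the same
state machine by clearing more bits in the mask table.
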
>>
>>> This can all be built into a specialized string type. There are actually
>>> some really interesting problems to solve in this space, I think. I've
>>> created a basic string type that has languished in my unfinished pile of
>>> stuff to do. I think it can be done in a way that is close to as
>>> efficient as arrays for the most common operations (slicing, indexing,
>>> etc.), but is *correct* before it is efficient. You should always be
>>> able to drop into "array mode" and deal with the code-units.
>>>
>>
>> Here is where we probably differ: to me there is no simple, safe, fast
>> string type. Why? Because if you need anything "not pedantic and safe
>> above all" you work with an implicit data type of e.g. "a chunk of UTF-8
>> bytes that are normalized in NFC", ...
>
> Specialized types which dictate normalization could be created. We do
> not have to always assume the worst if we are optimizing.
>
> But in general, when you read a utf-8 file, it could have anything in
> it. It may not even be normalized.
>

There should be a way of doing e.g.:

auto my_string = normalize(readText("file.txt"), Normalize.NFKC);
// my_string is typed as normalized UTF-8 or some such

Yes, that should keep the general non-assuming case. Though it still can do
some clever things, since, as W3C points out, >90% of text on the web is
already normalized in NFC (I can search for a proof link, if hard pressed).

>> I'd rather see them as sealed array with a bunch of meta info, and
>> string functions to specialize on them in order to do some cool speed
>> hacks. Of course, drilling down to raw data should be possible for the
>> user as well.
>
> This is exactly what I am envisioning ;) Except I'd build the meta-info
> into the type.
>

Yeah, I meant that the type carries this info.
Looks like we are more in agreement than I suspected ;)
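
Roughly like this, that is (a sketch; every name in it is made up, and the
actual normalization routines plus the code point/grapheme iteration story
are where the real work is):

enum Norm { none, nfc, nfd, nfkc, nfkd }

struct NormalizedString(Norm form)
{
    private immutable(char)[] data;    // sealed UTF-8 code units

    // two strings known to be in the same form compare byte-wise,
    // no decoding and no renormalization needed
    bool opEquals(const NormalizedString rhs) const
    {
        return data == rhs.data;
    }

    // drill down to the raw data when you know what you are doing
    @property immutable(char)[] raw() const { return data; }
}

alias NormalizedString!(Norm.nfc) NFCString;

Functions like find could then specialize on the form and take the fast
path whenever both arguments carry the same one.
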

>>>>> Or if fiancé uses a
>>>>> precomposed é, it won't match. So two valid representations of the
>>>>> word
>>>>> either match or they don't. It's just a complete mess without proper
>>>>> unicode decoding.
>>>>
>>>> It's a complete mess even with proper decoding ;)
>>>
>>> Yes, all the more reason to solve the problem correctly so the hapless
>>> unicode novice user doesn't have to!
>>>
>>
>>
>> The hapless novice unicode user doesn't stand a chance.
>> I'm serious. One needs to know a lot of these "basics" to e.g. know
>> why indexing by "true" character could be so slow, about encodings,
>> normalizations, etc. Yet we can save a metric ton of special-cased crap
>> from ever hitting his eyes.
>
> I disagree. 99% of the time I use strings, I don't care about indexes. I
> just want to deal with whole strings and substrings. I rarely use
> arbitrary indexes, and when I do, I'm assuming ascii data.
>
> The hapless novice unicode user doesn't need to know unicode to do:
>
> writefln("hello, world!");
>
> Or to do:
>
> if(find(str, "xyz")) {...}
>
> Or to even use regex!
>

To print chars you don't need to know unicode, your OS needs to :)
Though that wasn't string processing. However, find/match is indeed a
good example. You are probably right: if we cover enough of the common
operations, most people might not have to "look inside".
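
That is, something like the following should just work, with no unicode
knowledge required from the user (a plain usage sketch with std.algorithm):

import std.algorithm : find;
import std.stdio : writeln;

void main()
{
    auto text = "her fiancé arrived";
    auto hit = find(text, "fiancé");
    writeln(hit.length ? "found" : "not found");
}

It only breaks down when the two sides disagree on how the é is encoded,
which is exactly the part a library string type should take off the user's
shoulders.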


-- 
Dmitry Olshansky

