Major performance problem with std.array.front()

Dmitry Olshansky dmitry.olsh at gmail.com
Sun Mar 9 11:12:33 PDT 2014


09-Mar-2014 21:16, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code
>> point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme are correct (except in certain language subsets).
>
> I suspect that code point iteration is the worst as it works only with
> ASCII and perchance with ASCII single-byte extensions. Then we have code
> unit iteration that works with a larger spectrum of languages.

Those two were clearly meant to be swapped: code point <--> code unit

> One
> question would be how large that spectrum it is. If it's larger than
> English, then that would be nice because we would've made progress.
>

Code points help only insofar as many (~all) high-level Unicode 
algorithms are described in terms of code points. Code points have 
properties; code units do not have anything. Code points with an 
assigned semantic value are "abstract characters".
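
To illustrate, a minimal sketch: std.uni's classification functions 
are defined on dchar, i.e. on code points, while a lone code unit 
tells you nothing:

    import std.uni : isAlpha;

    void main()
    {
        // Properties are queried per code point (dchar).
        assert(isAlpha('é'));
        // A single UTF-8 code unit of "é" (0xC3) carries no such
        // property on its own.
        string s = "é";
        assert(s.length == 2); // two code units, one code point
    }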

It's up to the programmer implementing a particular algorithm to 
either make it work "as if" decoding really happened while operating 
directly on code units, or to actually decode and work with code 
points, which is simpler.
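
For example, searching for an ASCII character needs no decoding at 
all, because in UTF-8 a byte below 0x80 never occurs inside a 
multi-byte sequence. A minimal sketch (findAscii is a made-up helper, 
not a Phobos API):

    // Behaves "as if" decoding happened, but touches only code units.
    size_t findAscii(string s, char needle)
    {
        assert(needle < 0x80);
        foreach (i; 0 .. s.length)
            if (s[i] == needle) // raw code unit comparison
                return i;
        return s.length;
    }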

The current std.uni offering mostly works on code points (and thus 
decodes); a crucial building block for working directly on code units 
is in review:

https://github.com/D-Programming-Language/phobos/pull/1685

> I don't know about normalization beyond discussions in this group, but
> as far as I understand from
> http://www.unicode.org/faq/normalization.html, normalization would be a
> one-step process, after which code point iteration would cover still
> more human languages. No? I'm pretty sure it's more complicated than
> that, so please illuminate me :o).

Technically, most apps just assume, say, "the input comes in UTF-8 in 
Normalization Form C". Others, such as browsers, strive for a uniform 
representation of any input and normalize everything they receive 
(oftentimes that normalization turns out to be just a no-op).
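
E.g. getting to that uniform representation with what std.uni already 
provides is a single call:

    import std.uni : normalize, NFC;

    void main()
    {
        string precomposed = "é";        // U+00E9
        string decomposed  = "e\u0301";  // 'e' + combining acute
        assert(precomposed != decomposed); // code units differ
        assert(normalize!NFC(decomposed) == precomposed);
        // Input already in NFC: effectively a no-op.
        assert(normalize!NFC(precomposed) == precomposed);
    }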


>> If you don't care about normalization then by code unit is just as good
>> as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is still
>> broken in the face of normalization, and (b) are there any real
>> applications that search a string for a specific non-ASCII character?
>
> What happened to counting characters and such?

Counting chars is dubious. But, for instance, collation is defined in 
terms of code points. Regex pattern matching is _defined_ in terms of 
code points (even the mystical Level 3 Unicode support for it). So 
there is certain merit to working at that level. But hacking strings 
so that they implicitly behave this way isn't the way to go.
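
A quick demo of why counting chars is dubious, since the answer 
depends entirely on what you count (byGrapheme lives in std.uni):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "noe\u0301l"; // "noél" with a combining acute
        assert(s.length == 6);                // code units (UTF-8)
        assert(s.walkLength == 5);            // code points (decoded)
        assert(s.byGrapheme.walkLength == 4); // user-perceived characters
    }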

The least intrusive change would be to generalize the current choice 
w.r.t. random-access ranges of char/wchar.
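
That is, today a narrow string presents itself as a range of dchar no 
matter what; the same treatment would be extended to any random-access 
range of char/wchar. A small demo of the current choice:

    import std.array : front;
    import std.range : ElementType;

    void main()
    {
        string s = "héllo";
        static assert(is(ElementType!string == dchar)); // decoded view
        assert(s.front == 'h'); // front decodes, even for ASCII
        assert(s[0] == 'h');    // raw code unit access, no decoding
    }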

-- 
Dmitry Olshansky

