Major performance problem with std.array.front()

Peter Alexander peter.alexander.au at gmail.com
Sun Mar 9 10:34:05 PDT 2014


On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by code point has utility.
>>
>> If you care about normalization then neither by code unit, by code point, nor by grapheme is correct (except in certain language subsets).
>
> I suspect that code unit iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code point iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress.
>
> I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).

It depends on what you mean by "cover" :-)

If we assume strings are normalized, then substring search, equality testing, and sorting all work the same with either code units or code points.


>> If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?
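
To make (a) concrete: if the string is stored decomposed (NFD), a code-point search for the precomposed 'é' finds nothing. A rough sketch of the failure mode, assuming std.uni's normalize is available:

import std.algorithm : canFind;
import std.uni;  // for normalize and NFC

void main()
{
    string s = "re\u0301sume\u0301";       // "résumé" in NFD: 'e' + combining acute
    assert(!s.canFind('é'));               // no U+00E9 code point present, so no match
    assert(s.normalize!NFC.canFind('é'));  // matches after recomposing to NFC
}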
>
> What happened to counting characters and such?

I can't think of any case where you would want to count characters.

* If you want an index to slice from, then you need code units.
* If you want a buffer size, then you need code units.
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with monospaced fonts anyway -- with variable-width fonts you need to add up the widths of those glyphs). The sketch below shows how the three counts diverge.
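
A rough illustration (my example; it assumes std.uni.byGrapheme is available and that the diaeresis is stored decomposed):

import std.range : walkLength;
import std.uni;  // for byGrapheme

void main()
{
    string s = "noe\u0308l";               // "noël" with a combining diaeresis
    assert(s.length == 6);                 // code units: U+0308 takes 2 bytes in UTF-8
    assert(s.walkLength == 5);             // code points: n, o, e, U+0308, l
    assert(s.byGrapheme.walkLength == 4);  // graphemes: n, o, ë, l
}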


>> To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?
>
> split(ter) comes to mind.

splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works on UTF-8-encoded strings without any need to decode).

All you need to do is ensure that a mismatched encoding in the delimiter is re-encoded to match the string (you want to do this once, up front, for performance anyway):

import std.algorithm, std.utf;

auto splitter(string str, dchar delim)
{
    char[4] enc;
    immutable len = encode(enc, delim);  // re-encode the delimiter once, up front
    // .idup: the lazy range we return must not keep a slice of this stack buffer
    return std.algorithm.splitter(str, enc[0 .. len].idup);
}
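
A quick sanity check of the sketch above -- the '→' delimiter encodes to three UTF-8 code units, so it exercises the multi-byte path:

unittest
{
    import std.algorithm : equal;
    assert(splitter("a→b→c", '→').equal(["a", "b", "c"]));
}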

