Major performance problem with std.array.front()

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sun Mar 9 10:16:00 PDT 2014


On 3/9/14, 4:34 AM, Peter Alexander wrote:
> I think this is the main confusion: the belief that iterating by code
> point has utility.
>
> If you care about normalization then neither by code unit, by code
> point, nor by grapheme are correct (except in certain language subsets).

I suspect that code unit iteration is the worst as it works only with 
ASCII and perchance with ASCII single-byte extensions. Then we have code 
point iteration that works with a larger spectrum of languages. One 
question would be how large that spectrum is. If it's larger than 
English, then that would be nice because we would've made progress.

I don't know about normalization beyond discussions in this group, but 
as far as I understand from 
http://www.unicode.org/faq/normalization.html, normalization would be a 
one-step process, after which code point iteration would cover still 
more human languages. No? I'm pretty sure it's more complicated than 
that, so please illuminate me :o).
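
To make the question concrete, here is a minimal sketch, assuming a reasonably recent Phobos in which std.uni.normalize is available. The two sample strings are illustrative only and are not taken from the thread. It shows the same "é" spelled two ways, how the code unit and code point views differ, and that the spellings agree after NFC normalization.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

void main()
{
    // "é" spelled two ways (illustrative data):
    // precomposed U+00E9, and 'e' followed by combining acute U+0301.
    string composed = "\u00E9";
    string decomposed = "e\u0301";

    // Code unit view: .length on a string counts UTF-8 code units.
    assert(composed.length == 2);
    assert(decomposed.length == 3);

    // Code point view: range primitives on string decode to dchar.
    assert(composed.walkLength == 1);
    assert(decomposed.walkLength == 2);

    // After NFC normalization the two spellings compare equal.
    assert(normalize!NFC(decomposed) == composed);
}
```

This only covers the easy case. Not every base-plus-combining sequence has a precomposed form, so normalization alone does not turn every user-perceived character into a single code point.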

> If you don't care about normalization then by code unit is just as good
> as by code point, but you don't need to specialise everywhere in Phobos.
>
> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
> but as Vladimir correctly points out: (a) by code point, this is still
> broken in the face of normalization, and (b) are there any real
> applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?
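
As a rough sketch of what "counting characters" can mean, assuming std.uni.byGrapheme is available in the Phobos at hand, the same string gives three different counts depending on the level of iteration. The sample word is illustrative. The first assertion also shows point (a) from the quote: a search for the precomposed 'é' misses the decomposed spelling.

```d
import std.algorithm : canFind;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "café" with the accent as a separate combining code point (illustrative data).
    string decomposed = "cafe\u0301";

    // Searching by code point for precomposed U+00E9 does not match
    // the decomposed spelling.
    dchar target = '\u00E9';
    assert(!decomposed.canFind(target));

    // The count depends on the level of iteration.
    assert(decomposed.length == 6);                 // UTF-8 code units
    assert(decomposed.walkLength == 5);             // code points
    assert(decomposed.byGrapheme.walkLength == 4);  // graphemes
}
```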

> To those that think the status quo is better, can you give an example of
> a real-life use case that demonstrates this?

split(ter) comes to mind.
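
One possible reading of this, sketched under the assumption that the separator is a single non-ASCII code point: with code point iteration the separator is one range element, so splitter can take it directly. The arrow separator and the input string are hypothetical examples.

```d
import std.algorithm : equal, splitter;

void main()
{
    // U+2192 (rightwards arrow) occupies three UTF-8 code units,
    // but it is a single element when iterating by code point.
    dchar sep = '\u2192';
    string s = "a\u2192b\u2192c";

    // Splitting on the single-code-point separator yields the three fields.
    assert(s.splitter(sep).equal(["a", "b", "c"]));
}
```

By code unit the same split would need the separator expressed as a multi-element sequence rather than as a single element.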

> I do think it's probably too late to change this, but I think there is
> value in at least getting everyone on the same page.

Awesome.


Andrei


