Major performance problem with std.array.front()
Peter Alexander
peter.alexander.au at gmail.com
Sun Mar 9 10:34:05 PDT 2014
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu wrote:
> On 3/9/14, 4:34 AM, Peter Alexander wrote:
>> I think this is the main confusion: the belief that iterating by
>> code point has utility.
>>
>> If you care about normalization then neither by code unit, by code
>> point, nor by grapheme is correct (except in certain language
>> subsets).
>
> I suspect that code point iteration is the worst as it works only
> with ASCII and perchance with ASCII single-byte extensions. Then we
> have code unit iteration that works with a larger spectrum of
> languages. One question would be how large that spectrum is. If it's
> larger than English, then that would be nice because we would've
> made progress.
>
> I don't know about normalization beyond discussions in this
> group, but as far as I understand from
> http://www.unicode.org/faq/normalization.html, normalization
> would be a one-step process, after which code point iteration
> would cover still more human languages. No? I'm pretty sure
> it's more complicated than that, so please illuminate me :o).
It depends what you mean by "cover" :-)
If we assume strings are normalized, then substring search, equality
testing, and sorting all work the same with either code units or code
points.
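
Here is a minimal sketch of why (my illustration, not from the
original post): UTF-8 is self-synchronizing, so a code-unit-level
search can never match starting inside a multi-byte sequence, and a
match on code units is exactly a match on code points.

import std.algorithm : find;

void main()
{
    string haystack = "naïve café"; // 'ï' and 'é' are each two code units in UTF-8
    // find can compare code units directly; no decoding is needed
    assert(find(haystack, "café") == "café");
}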
>> If you don't care about normalization then by code unit is just as
>> good as by code point, but you don't need to specialise everywhere
>> in Phobos.
>>
>> AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
>> but as Vladimir correctly points out: (a) by code point, this is
>> still broken in the face of normalization, and (b) are there any
>> real applications that search a string for a specific non-ASCII
>> character?
>
> What happened to counting characters and such?
I can't think of any case where you would want to count characters.

* If you want an index to slice from, then you need code units.
* If you want a buffer size, then you need code units.
* If you are doing something like word wrapping, then you need to
count glyphs, which is not the same as counting code points (and that
only works with monospaced fonts anyway; with variable-width fonts
you need to add up the widths of those glyphs). See the sketch below.
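
To make the code point/glyph distinction concrete, a small sketch (my
example, not from the original post; it uses std.uni.byGrapheme,
which folds a base character and its combining marks into one
grapheme):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(s.length == 3);                // 3 UTF-8 code units
    assert(s.walkLength == 2);            // 2 code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // 1 grapheme: "é" as displayed
}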
>> To those that think the status quo is better, can you give an
>> example of a real-life use case that demonstrates this?
>
> split(ter) comes to mind.
splitter is just an application of substring search, no? Substring
search works the same with both code units and code points (e.g.
strstr in C works with UTF-8 encoded strings without any need to
decode).

All you need to do is ensure that a delimiter with a mismatched
encoding is re-encoded to match the haystack (you want to do this for
performance anyway):
import std.algorithm, std.utf;

auto splitter(string str, dchar delim)
{
    char[4] enc;
    immutable len = encode(enc, delim); // UTF-8 encode the delimiter
    // .idup copies the needle so the lazy range doesn't slice stack memory
    return std.algorithm.splitter(str, enc[0 .. len].idup);
}
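
Hypothetical usage, with the overload above in scope:

void main()
{
    import std.algorithm : equal;
    assert(splitter("a,b,c", ',').equal(["a", "b", "c"]));
    assert(splitter("naïve", 'ï').equal(["na", "ve"])); // multi-code-unit delimiter
}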