Major performance problem with std.array.front()
Peter Alexander
peter.alexander.au at gmail.com
Sun Mar 9 07:57:30 PDT 2014
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to
> encounter a real-world case of normalization: only hand-written
> counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be
> aware of beforehand.
> 2. It can arguably be taken care of eagerly, or in a special pass.
>
> As for "the belief that iterating by code point has utility." I
> have to strongly disagree. Unicode is composed of codepoints,
> and that is what we handle. The fact that it can be encoded
> and stored as UTF is an implementation detail.
We don't "handle" code points (when have you ever wanted to
handle a combining character separate to the character it
combines with?)
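To make that concrete, here's a minimal sketch (U+0301 is
COMBINING ACUTE ACCENT):

    import std.stdio : writefln, writeln;
    import std.uni : byGrapheme;
    import std.range : walkLength;

    void main()
    {
        // "é" spelled as 'e' followed by a combining accent
        string s = "e\u0301";

        // By code point, the accent arrives on its own:
        foreach (dchar c; s)
            writefln("U+%04X", c);  // U+0065, then U+0301

        // By grapheme, it's one user-perceived character:
        writeln(s.byGrapheme.walkLength);  // 1
    }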
You are just thinking of a subset of languages and locales.
Normalization is an issue any time you have a user enter text
into your program and you then want to search for that text. I
hope we can agree this isn't a rare occurrence.
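A sketch of the kind of mismatch I mean, assuming the document
stores the decomposed form while the user types the precomposed
one:

    import std.algorithm : canFind;
    import std.uni : normalize, NFC;

    void main()
    {
        // The same word, stored two ways: decomposed ('e' plus
        // U+0301) versus precomposed (U+00E9).
        string stored = "re\u0301sume\u0301";
        string typed  = "r\u00E9sum\u00E9";

        // A straight code point (or code unit) search misses:
        assert(!stored.canFind(typed));

        // Normalizing both sides first finds the match:
        assert(normalize!NFC(stored).canFind(normalize!NFC(typed)));
    }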
>> AFAIK, there is only one exception, stuff like s.all!(c => c
>> == 'é'), but as Vladimir correctly points out: (a) by code
>> point, this is still broken in the face of normalization, and
>> (b) are there any real applications that search a string for a
>> specific non-ASCII character?
>
> But *what* other kinds of algorithms are there? AFAIK, the
> *only* type of algorithm that doesn't need decoding is
> searching, and you know what? std.algorithm.find does it
> perfectly well. This trickles into most other algorithms too:
> split, splitter, and findAmong don't decode if they don't have
> to.
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
I can't think of a single use case for searching for a non-ASCII
code point. You can search for strings, but searching by code
unit is just as good (and fast by default).
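For instance, a minimal sketch of substring search over raw code
units:

    import std.algorithm : find;
    import std.string : representation;

    void main()
    {
        string s = "hello wörld";

        // UTF-8 is self-synchronizing: a byte-level match can never
        // start or end mid-character, so searching code units gives
        // the same answer as searching decoded code points.
        auto hit = s.representation.find("wörld".representation);
        assert(hit.length == "wörld".length);  // found, no decoding
    }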
> AFAIK, the most common algorithm "case insensitive search"
> *must* decode.
But it must also normalize and take locales into account, so by
code point is insufficient (unless you are willing to ignore
languages like Turkish). See the Turkish I:
http://en.wikipedia.org/wiki/Turkish_I
Sure, if you just want to ignore normalization and several
languages, then by code point is just fine... but that's the
point: by code point is incorrect in general.
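To illustrate with what Phobos itself does today (a sketch;
std.uni.toLower applies Unicode's locale-free default mapping):

    import std.uni : toLower;

    void main()
    {
        // The default, locale-free case folding:
        assert("I".toLower == "i");

        // In Turkish, 'I' lowercases to 'ı' (U+0131, dotless i) and
        // 'İ' (U+0130) lowercases to 'i'. Folding code point by code
        // point can't know what language it's looking at, so the
        // assertion above is simply the wrong answer for Turkish.
    }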
> There may still be cases where it does not work as intended
> in the face of normalization, but it is still leaps and bounds
> better than what we get iterating by code unit.
>
> To turn it the other way around, *what* are you guys doing
> that doesn't require decoding, and where performance is such a
> killer?
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
The performance thing can be fixed in the library, but my concern
is that (a) it takes a significant amount of code to do so, and
(b) it complicates implementations. There are many, many
algorithms in Phobos that are special-cased for strings, and I
don't think it needs to be that way.
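Here's a hypothetical sketch of the pattern I mean (firstSpace is
invented for illustration, not actual Phobos source):

    import std.algorithm : countUntil;
    import std.string : representation;
    import std.traits : isSomeString;

    // The generic path auto-decodes, so a separate string branch
    // gets written just to avoid that overhead.
    ptrdiff_t firstSpace(R)(R r)
    {
        static if (isSomeString!R)
            return r.representation.countUntil(' '); // code-unit path
        else
            return r.countUntil(' ');                // generic path
    }

    void main()
    {
        assert(firstSpace("héllo world") == 6); // 'é' is 2 code units
    }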
>> To those that think the status quo is better, can you give an
>> example of a real-life use case that demonstrates this?
>
> I do not know of a single bug report with regard to buggy
> Phobos code that used front/popFront. Not_a_single_one (AFAIK).
>
> On the other hand, there are plenty of bugs caused by
> attempting not to decode strings, or by decoding them
> incorrectly. They are being corrected on a continuous basis.
Can you provide a link to a bug?
Also, you haven't answered the question :-) Can you give a
real-life example of a case where code point decoding was
necessary and code units wouldn't have sufficed?
You have mentioned case-insensitive searching, but I think I've
adequately demonstrated that this doesn't work in general by code
point: you need to normalize and take locales into account.