Major performance problem with std.array.front()

Brad Anderson eco at gnuk.net
Fri Mar 7 15:50:51 PST 2014


On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
>> On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
>> >On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>> >>No, it doesn't.
>> >>
>> >>import std.algorithm;
>> >>
>> >>void main()
>> >>{
>> >>   auto s = "cassé";
>> >>   assert(s.canFind('é'));
>> >>}
>> >>
>> >
>> >Hm, I'm not following? Works perfectly fine on my system?
>
> Probably because your browser is normalizing the unicode string when
> you copy-n-paste Vladimir's message? See below:
>
>
>> Something's messing with your Unicode. Try downloading and compiling
>> this file:
>> http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
>
> I downloaded the file and looked at it through `od -ctx1`: the first
> é is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301]
> (small letter e + combining diacritic acute accent), whereas the
> second é is encoded as c3 a9, that is, U+E9 (precomposed small letter
> e with acute accent).
>
> This illustrates one of my objections to Andrei's post: by
> auto-decoding behind the user's back and hiding the intricacies of
> unicode from him, it has masked the fact that codepoint-for-codepoint
> comparison of a unicode string is not guaranteed to always return the
> correct results, due to the possibility of non-normalized strings.
>
> Basically, to have correct behaviour in all cases, the user must be
> aware of, and use, the Unicode collation / normalization algorithms
> prescribed by the Unicode standard. What we have in std.algorithm
> right now is an incomplete implementation with non-working edge cases
> (like Vladimir's example) that has poor performance to start with.
> Its only redeeming factor is that the auto-decoding hack has given it
> the illusion of being correct, when actually it's not correct
> according to the Unicode standard. I don't see how this is
> necessarily superior to Walter's proposal.
>
>
> T
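
Before getting to my reply proper: the failing canFind above is
reproducible, and it passes once the input is normalized first
(std.uni.normalize defaults to NFC). A minimal sketch, with the two
encodings of 'é' written out as escapes so nothing gets normalized in
transit:

import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    // 'é' spelled as 'e' + U+0301 (combining acute), like the first
    // é in the quoted test.d
    string decomposed = "casse\u0301";

    // A code-point-level search for the precomposed U+00E9 fails,
    // because the code point sequences differ.
    assert(!decomposed.canFind('\u00E9'));

    // After canonical normalization the two spellings compare equal.
    assert(normalize(decomposed).canFind('\u00E9'));
}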

To me, the status quo feels like an OK compromise between performance 
and correctness. Everyone is pointing out that working at the code 
point level is bad because it's not correct, but working at the code 
unit level, as Walter proposes, is correct even less often, so that's 
not really an argument for moving there. It is, however, an argument 
for making the user decide what level of correctness and performance 
they need.

Walter's idea (code unit level) would be fastest but least correct.
The current behaviour (code point level) is somewhat fast and somewhat 
correct.
The next level, graphemes, would be slowest of all but the most 
correct.
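
To put rough numbers on those three levels: counting the same 
decomposed string each way gives three different answers. A minimal 
sketch (the counts I'd expect are in the comments):

import std.range : walkLength;
import std.uni : byGrapheme;
import std.stdio : writeln;

void main()
{
    // "cassé" with the accent stored as a combining character (e + U+0301)
    string s = "casse\u0301";

    writeln(s.length);                // 7: UTF-8 code units
    writeln(s.walkLength);            // 6: code points (what front()/popFront() iterate today)
    writeln(s.byGrapheme.walkLength); // 5: graphemes (user-perceived characters)
}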

It seems there is just no way to avoid the trade-off between speed and 
correctness, so we shouldn't try to avoid it; we should only force the 
user to make an explicit decision.

Maybe some more string types are in order (hrm). Listed from best 
performance to best correctness:

  string, wstring (code units)
  dstring         (code points)
+gstring         (graphemes)

(do graphemes completely normalize? If not, we probably need another 
level, say, nstring)

Then if a user needs correctness over performance, they just work with 
gstrings. If they need performance over correctness, they work with 
strings (assuming some of Walter's idea happens; otherwise they'd work 
with string.representation).
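
Nothing like gstring exists today, of course, but the two ends of that 
choice can already be spelled out with what's in Phobos, with 
byGrapheme standing in for the grapheme level and representation for 
the code unit level. A rough sketch:

import std.algorithm : count;
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    string s = "casse\u0301 time";  // decomposed 'é' plus some ASCII

    // Correctness over performance: iterate user-perceived characters.
    auto nChars = s.byGrapheme.walkLength;       // 10 graphemes

    // Performance over correctness: raw UTF-8 code units, no decoding.
    // Fine for pure-ASCII needles such as ' '.
    auto nSpaces = s.representation.count(' ');  // 1
}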

