Major performance problem with std.array.front()

Vladimir Panteleev vladimir at thecybershadow.net
Sat Mar 8 12:38:40 PST 2014


On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
wrote:
> Searching for characters in strings would be difficult to deem 
> inappropriate.

The notion of a "character" exists only in certain writing 
systems. Searching for individual characters is thus a flawed 
practice, and I think it should not be encouraged, as it will only 
make writing truly international software more difficult. A more 
correct approach is to search for a substring. If non-exact 
matching is needed (normalization, case insensitivity, etc.), then 
the appropriate solution is to use the Unicode algorithms.

If you look at the situation from this point of view, single code 
points become merely an implementation detail.
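
As a rough illustration of the substring approach, here is a minimal 
sketch. It assumes Phobos' std.uni.normalize and 
std.algorithm.searching.canFind; the helper name containsNormalized 
is made up for this example. Both operands are brought to the same 
normalization form before searching, so the caller never handles 
individual code points:

```d
import std.algorithm.searching : canFind;
import std.uni : normalize, NFC;

/// Hypothetical helper: substring search after NFC-normalizing both sides.
bool containsNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string text   = "un caf\u00E9 noir";  // precomposed U+00E9
    string needle = "cafe\u0301";         // 'e' + combining acute U+0301

    // Same text to a reader, but different code point sequences,
    // so a plain search misses it.
    assert(!text.canFind(needle));

    // After normalization the substring is found.
    assert(containsNormalized(text, needle));
}
```

This only covers canonical equivalence. Case-insensitive or 
collation-aware matching would need further Unicode algorithms on 
top of it.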

> 1. All algorithms would by default operate on strings at 
> char/wchar level (i.e. code unit). That would cause the usual 
> issues and confusions I was aware of from C++. Certain 
> algorithms would require specialization and/or the user using 
> byDchar for correctness.

As previously discussed, "correctness" here is conditional. I 
would not use that word; it is another extreme.
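
To make the "conditional" part concrete, here is a small sketch, 
assuming std.utf.byCodeUnit and std.uni.byGrapheme from Phobos. 
Neither the code unit view nor the code point view matches what a 
reader would call one character:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "e\u0301";   // 'e' followed by a combining acute accent

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units
    assert(s.walkLength == 2);             // code points (decoded dchars)
    assert(s.byGrapheme.walkLength == 1);  // user-perceived character
}
```

Decoding to dchar fixes the first mismatch but not the second, so 
an algorithm that is "correct" per code point can still split what 
the user sees as a single character.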

> From experience with C++ I knew (1) had a bad track record, and 
> (2) "generically conservative, specialize for speed" was a 
> successful pattern.
>
> What would you have chosen given that context?

Ideally, we would have had the Unicode algorithms in the standard 
library from day 1 and advocated their use throughout the 
documentation.

>> I'm inclined to say that the correct approach is to state that 
>> algorithms operate explicitly on a T.sizeof basis and that if 
>> the data contained in a particular range has some multi-element 
>> encoding then separate, specialized routines should be used 
>> when the T.sizeof behavior will not produce the desired result.
>
> That sounds quite like C++ plus ICU. It doesn't strike me as 
> the gold standard for Unicode integration.

Why not? It sounds like D needs exactly that, plus its amazing 
slicing and range capabilities, of course.

>> So the problem to me is that we're stuck not fixing something 
>> that's horribly broken just because it's broken in a way that 
>> people presumably now expect.
>
> Clearly I'm being subjective here but again I'd find it 
> difficult to get convinced we have something horribly broken 
> from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to use D to process 
text in a writing system such as Sanskrit's?

>> I'd personally like to see this fixed and I think the new 
>> behavior is preferable overall, but I do share Andrei's concern 
>> that such a big change might hurt the language anyway.
>
> I've said this once and I'm saying it again: the best way to 
> convert this discussion into something useful is to devise 
> ideas for useful non-breaking additions.

I disagree. As I've argued, I believe that currently most uses of 
dchars in an application are incorrect, and ultimately a time 
bomb for proper internationalization support. We need to apply 
the same procedure that we do with any language construct that 
was deemed to have been a poor decision: put it through a 
deprecation cycle and fix it.


More information about the Digitalmars-d mailing list