Major performance problem with std.array.front()

Marc Schütz <schuetzm at gmx.net>
Sun Mar 9 07:12:28 PDT 2014


On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
>> On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>> >On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
> [...]
>> >>Clearly one might argue that their app has no business dealing
>> >>with diacriticals or Asian characters. But that's the typical
>> >>provincial view that marred many languages' approach to UTF and
>> >>internationalization.
>> >
>> >So is yours, if you think that making everything magically a dchar
>> >is going to solve all problems.
>> >
>> >The TDPL example only showcases the problem. Yes, it works with
>> >Swedish. Now try it again with Sanskrit.
>> 
>> +1
>> In Indian languages, a character consists of one or more UNICODE
>> code points. For example, in Sanskrit "ddhrya"
>> http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
>> consists of 7 UNICODE code points. So to search for this char I have
>> to use string search.
> [...]
>
> That's what I've been arguing for. The most general form of character
> searching in Unicode requires substring searching, and similarly many
> character-based operations on Unicode strings are effectively
> substring-based operations, because said "character" may be a multibyte
> code point, or, in your case, multiple code points. Since that's the
> case, we might as well just forget about the distinction between
> "character" and "string", and treat all such operations as substring
> operations (even if the operand is supposedly "just 1 character long").
>
> This would allow us to get rid of the hackish auto-decoding of narrow
> strings, and thus eliminate the needless overhead of always decoding.

That won't work: your needle might be in a different normalization 
form than your haystack, in which case a byte-by-byte comparison 
will not find it.
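
To make the normalization point concrete, here is a minimal sketch in Python (chosen only for brevity; D's std.uni offers a comparable normalize function). The strings and the helper names `naive_find` and `normalized_find` are made up for illustration and are not from the thread. It searches for a precomposed "é" (U+00E9) in a haystack that spells the same character as "e" plus a combining acute accent (U+0301):

```python
import unicodedata

# Illustrative data: the same user-perceived character "é" in two
# canonically equivalent encodings.
NEEDLE_NFC = "\u00e9"         # precomposed: U+00E9
HAYSTACK_NFD = "cafe\u0301"   # decomposed: 'e' followed by U+0301

def naive_find(haystack: str, needle: str) -> int:
    """Byte-by-byte substring search over the UTF-8 encodings."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))

def normalized_find(haystack: str, needle: str, form: str = "NFC") -> int:
    """Bring both operands to the same normalization form, then search."""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)

if __name__ == "__main__":
    print(naive_find(HAYSTACK_NFD, NEEDLE_NFC))       # raw bytes differ
    print(normalized_find(HAYSTACK_NFD, NEEDLE_NFC))  # match after NFC
```

Note that the index returned by `normalized_find` refers to the normalized haystack, not the original one, so mapping a match back to the original bytes needs extra bookkeeping.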

