Major performance problem with std.array.front()

Sat Mar 8 13:45:06 PST 2014

On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
> Or more to the point, do you know of any experience that you can
> share about code that attempts to process these sorts of strings
> on a per character basis? My suspicion is that any code that
> operates on such strings, if they have any claim to correctness
> at all, must be substring-based, rather than character-based.

That's pretty much it. Unless you are working within the confines 
of certain languages (alphabets, scripts, etc.), many notions that 
are valid for English or other European languages lose their 
meaning in general. This includes the notion of "characters": in 
full generality, you can only treat a string as a stream of code 
units (or code points, if you wish, but as has been discussed to 
death, this is rarely useful).

An application which has to handle user text (possibly in any 
language) pretty much has to treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)
etc.
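
To make the list above concrete, here is a minimal sketch of how counting, slicing and case mapping go wrong on innocent-looking text. Python is used purely for illustration, and the sample strings are my own assumptions rather than anything from this thread:

```python
# Sample strings chosen for illustration only.
accented = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT, displayed as one "é"
german = "straße"

# Counting: the answer depends on what is being counted.
code_points = len(accented)                  # number of code points
code_units = len(accented.encode("utf-8"))   # number of UTF-8 code units

# Indexing and slicing: cutting after the first code point drops the accent.
first = accented[:1]

# Case mapping: upper-casing can change the number of code points.
upper = german.upper()

print(code_points, code_units, repr(first), upper)
```

None of the three numbers a program could report for `accented` matches the single "character" a user sees, and the upper-cased German word is longer than the original.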

All processing and transformations (line breaking, normalization, 
etc.) need to be done using the relevant Unicode algorithms.
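
As a small sketch of what "using the relevant Unicode algorithm" means for one of these cases, normalization, the following compares two canonically equivalent spellings of "é". It relies on Python's standard `unicodedata.normalize`; the helper name `canonically_equal` is hypothetical and only there for the example:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

naive = composed == decomposed                        # code-point-wise comparison
normalized = canonically_equal(composed, decomposed)  # comparison after NFC

print(naive, normalized)
```

The naive comparison treats the two strings as different even though they render identically; only the normalized comparison agrees with what the user sees.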

I've posted something earlier which I'd like to take back:

> [a-z] makes sense in English, and [а-я] makes sense in Russian

[а-я] makes sense for Russian, but it doesn't for Ukrainian, in 
the same way that [a-z] is useless for Portuguese. There are 
probably only a few such ranges in Unicode which encompass 
exactly one alphabet, due to how much letters overlap across 
alphabets of similar languages.
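
A quick sketch of the point about ranges, again in Python for illustration only. The sample words and the helper name `only_in_range` are my own assumptions:

```python
import re

def only_in_range(char_class: str, word: str) -> bool:
    """True if every code point of `word` matches the given regex class (hypothetical helper)."""
    return re.fullmatch(char_class + "+", word) is not None

russian = only_in_range("[а-я]", "привет")    # all letters lie in U+0430..U+044F
ukrainian = only_in_range("[а-я]", "їжак")    # 'ї' is U+0457, outside the range
portuguese = only_in_range("[a-z]", "ação")   # 'ç' and 'ã' are outside a-z
english = only_in_range("[a-z]", "hello")

print(russian, ukrainian, portuguese, english)
```

The same range that accepts the Russian word rejects an ordinary Ukrainian one, and [a-z] rejects an ordinary Portuguese one.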