Major performance problem with std.array.front()
Vladimir Panteleev
vladimir at thecybershadow.net
Sat Mar 8 13:45:06 PST 2014
On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
> Or more to the point, do you know of any experience that you can share
> about code that attempts to process these sorts of strings on a per
> character basis? My suspicion is that any code that operates on such
> strings, if they have any claim to correctness at all, must be
> substring-based, rather than character-based.
That's pretty much it. Unless you are working in the confines of
certain languages (alphabets, scripts, etc.), many notions that
are valid for English or European languages lose meaning in
general. This includes the notion of "characters" - at full
abstraction, you can only treat a string as a stream of code
units (or code points, if you wish, but as has been discussed to
death this is rarely useful).
An application that has to handle user text (which may be in any
language) pretty much has to treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)
etc.
All processing and transformations (line breaking, normalization,
etc.) need to be done using the relevant Unicode algorithms.
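A rough sketch of why the restrictions above matter, written in Python
purely for brevity (the variable names are made up for this
illustration, and the sample words are my own choice). It shows the
same visible text giving different counts, a slice separating an accent
from its base letter, and a case mapping that changes the length of the
string:

```python
import unicodedata

# The same visible text, spelled two ways.
composed = "caf\u00e9"       # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# "No counting anything": the code point counts differ.
counts = (len(composed), len(decomposed))

# "No slicing": cutting at a code point index drops the accent.
sliced = decomposed[:4]

# Comparison is only meaningful after applying a Unicode algorithm,
# here normalization to NFC.
equal_raw = composed == decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# "No toUpper/toLower": case mapping is not one-to-one.
upper = "stra\u00dfe".upper()   # 'ß' has no single-letter uppercase here

print(counts, repr(sliced), equal_raw, equal_nfc, upper)
```

Normalization is only one of the algorithms in question; it does not
make indexing or slicing safe by itself.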
I posted something earlier which I'd like to take back:
> [a-z] makes sense in English, and [а-я] makes sense in Russian
[а-я] makes sense for Russian, but it doesn't for Ukrainian, in
the same way that [a-z] is useless for Portuguese. There are
probably only a few such ranges in Unicode which encompass
exactly one alphabet, due to how much the letters overlap across
alphabets of similar languages.
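To make the range problem concrete, here is a small sketch in Python
(chosen only for brevity; the helper name and the sample words are
assumptions picked for this illustration). Each word is tested against
the range that looks natural for its script:

```python
import re

# The two ranges discussed above, written with escapes to avoid
# any ambiguity about which code points are meant.
latin_range = re.compile("[a-z]+")
cyrillic_range = re.compile("[\u0430-\u044f]+")   # [а-я]

def fully_matched(pattern, word):
    """True if every code point of `word` falls inside the range."""
    return pattern.fullmatch(word) is not None

english = "word"
portuguese = "cora\u00e7\u00e3o"   # "coração": 'ç' and 'ã' are outside a-z
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # "привет"
ukrainian = "\u0457\u0436\u0430\u043a"   # "їжак": 'ї' (U+0457) is outside а-я

results = {
    "english": fully_matched(latin_range, english),
    "portuguese": fully_matched(latin_range, portuguese),
    "russian": fully_matched(cyrillic_range, russian),
    "ukrainian": fully_matched(cyrillic_range, ukrainian),
}
print(results)
```

The English and Russian words pass, while the Portuguese and Ukrainian
ones are rejected even though they are ordinary words in the same
scripts.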