VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer
schveiguy at yahoo.com
Sat Jan 15 09:39:32 PST 2011
On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn
<lutger.blijdestijn at gmail.com> wrote:
> Steven Schveighoffer wrote:
>
> ...
>>> I think a good standard to evaluate our handling of Unicode is to see
>>> how easy it is to do things the right way. In the above, foreach would
>>> slice the string grapheme by grapheme, and the == operator would
>>> perform a normalized comparison. While it works correctly, it's
>>> probably not the most efficient way to do things, however.
>>
>> I think this is a good alternative, but I'd rather not impose this on
>> people like myself who deal mostly with English. I think this should be
>> possible to do with wrapper types or intermediate ranges which have
>> graphemes as elements (per my suggestion above).
>>
>> Does this sound reasonable?
>>
>> -Steve
>
> If it's a matter of choosing which is the 'default' range, I'd think
> proper Unicode handling is more reasonable than catering for English /
> ASCII only. Especially since this is already the case in Phobos string
> algorithms.
English and (if I understand correctly) most other languages. Any
language whose graphemes can each be composed into a single code point
would work. In fact, languages that use some graphemes that cannot be
composed will also work to some degree (for example, with opEquals).
What I'm proposing (or think I'm proposing) is not exactly catering to
English and ASCII; it is simply not catering to more complex languages
such as Hebrew and Arabic. What I'm trying to find is a middle ground
where most languages work and the code is simple and efficient, with
the possibility of jumping down to lower levels for performance (e.g.
switching to char[] when you know ASCII is all you are using) or
jumping up to full Unicode when necessary.
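To illustrate what "jumping down" to the array level could look like,
here is a minimal sketch. It assumes a Phobos release that provides
std.string.representation and std.algorithm.searching.all; isAllAscii
is a hypothetical helper name used only for this example, not an
existing library function or part of the proposal.

```d
import std.algorithm.searching : all;
import std.string : representation;

/// Hypothetical helper: true if every UTF-8 code unit is in the ASCII range.
bool isAllAscii(const(char)[] s)
{
    // representation() reinterprets the characters as raw ubytes,
    // so no decoding happens while scanning.
    return s.representation.all!(b => b < 0x80);
}

unittest
{
    assert(isAllAscii("hello"));
    assert(!isAllAscii("h\u00E9llo")); // é needs two UTF-8 code units

    // Once the text is known to be ASCII, indexing code units directly
    // is safe: one code unit is one character.
    auto bytes = "hello".representation;
    assert(bytes[1] == 'e');
}
```

The check is a single linear pass, after which the caller can treat the
data as a plain array.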
Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.

string_t!T (string, wstring, dstring) -- Specialized string types that
normalize to dchars but do not handle all graphemes perfectly. Works
with any algorithm that deals with bidirectional ranges. This is the
default string type, and the type for string literals. Represented
internally by a single char[], wchar[] or dchar[] array.

* utfstring_t!T -- Specialized string type for full Unicode, which may
perform worse than string_t but supports everything Unicode supports.
May require a battery of specialized algorithms.

* - name up for discussion
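The three levels can be seen as three ways of counting the same data.
The sketch below is only an analogy: string_t and utfstring_t do not
exist, so it uses facilities I believe Phobos offers (decoding of narrow
strings to dchar by the range primitives, and std.uni.byGrapheme). The
three function names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Array level: number of code units (what char[] sees).
size_t countCodeUnits(const(char)[] s) { return s.length; }

/// dchar level: number of decoded code points (roughly what string_t would see).
size_t countCodePoints(const(char)[] s) { return s.walkLength; }

/// Grapheme level: number of user-perceived characters
/// (roughly what utfstring_t would see).
size_t countGraphemes(const(char)[] s) { return s.byGrapheme.walkLength; }

unittest
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(countCodePoints(s) == 2); // two code points
    assert(countGraphemes(s) == 1);  // one grapheme
}
```

Each level costs more per element than the one below it, which is why
being able to pick the level explicitly matters.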
Also note that, as far as I can tell, Phobos currently does *no*
normalization for things like opEquals. Two char[]s that represent
equivalent strings with different code point sequences will compare as !=.
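A small sketch of the difference, assuming a Phobos release that ships
std.uni.normalize (normalizedEquals is a hypothetical helper, not an
existing function):

```d
import std.uni : NFC, normalize;

/// Hypothetical helper: compares two strings after normalizing both to NFC.
bool normalizedEquals(const(char)[] a, const(char)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // 'e' plus combining acute accent

    // Built-in array comparison looks only at code units.
    assert(precomposed != decomposed);

    // After normalization the two spellings compare equal.
    assert(normalizedEquals(precomposed, decomposed));
}
```

Normalizing on every comparison allocates when the input is not already
normalized, so a string type that wanted this behaviour would probably
normalize once, on construction.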
-Steve