Higher level built-in strings

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Mon Jul 19 21:50:30 PDT 2010


On 07/19/2010 11:29 PM, Walter Bright wrote:
> bearophile wrote:
>> Walter Bright:
>>> 1. most string operations, such as copying and searching, even
>>> regular expressions, work just fine using regular indices.
>>>
>>> 2. doing the operations in (1) using code points and having to
>>> continually decode the strings would result in disastrously slow code.
>>
>> In my original post I have forgotten another difference over arrays:
>> 5b) a method like ".unit()" that allows to index code units. So
>> "foo".unit(1) is always O(1). Lower level code can use this method as
>> [] is used for arrays.
>
> This is backwards. The [i] should behave as expected for arrays. As it
> turns out, indexing by byte is *far* more common than indexing by code
> point, in fact, I've never ever needed to index by code point.
>
> (Though it is sometimes necessary to step through by code point, that's
> different from indexing by code point.)

Exactly. And that's what the bidirectional range interface is doing for 
strings.
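
To illustrate the split, here is a minimal sketch, assuming a Phobos in 
which std.range exposes front, popFront, back, popBack and walkLength 
for built-in strings. The sample string is made up for the example. 
Indexing and .length stay on UTF-8 code units, while the range 
primitives decode as they step:

```d
import std.range; // front, popFront, back, popBack, walkLength

void main()
{
    // Hypothetical sample: U+00E9 takes two UTF-8 code units.
    string s = "h\u00E9llo";

    // Indexing and .length work on code units, in O(1).
    assert(s.length == 6);
    assert(s[1] == 0xC3);          // first byte of U+00E9, not a whole character

    // The range primitives decode, so they step by code point.
    assert(s.walkLength == 5);
    assert(s.front == 'h');
    s.popFront();
    assert(s.front == '\u00E9');   // two code units decoded into one dchar
    assert(s.back == 'o');
    s.popBack();
    assert(s.back == 'l');
}
```

Nothing here indexes by code point. The string is only walked from 
either end, which is all a bidirectional range promises.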

>>> 3. the user can always layer a code point interface over the strings,
>>> but going the other way is not so practical.
>>
>> This is true. But it makes the string usage unnecessarily low-level and
>> hard...
>
> I don't believe that manipulating strings in D is hard, even if you do
> have to work with multibyte characters. You do have to be aware they are
> multibyte, but I think that just comes with being a programmer.
>
>
>> A better design in a smart system language as D is to give strings a
>> default high level "interface" that sees strings as what they are at high
>> level, and add a second lower level interface when you need faster
>> lower-level fiddling (so they have [] that returns code points and unit()
>> that returns code units).
>
> I have some moderate experience with using utf. First there's the D
> javascript engine, which is fully utf'd. The D string design fits in
> with it perfectly. Then there are chunks of C++ ascii-only code I've
> translated to D, and it then worked with utf-8 without further
> modification.
>
> Based on that, I believe the D string design hits the sweet spot between
> efficiency and utility.

I agree. In fact there is no language I know that can compete with D at 
UTF string handling.


Andrei

