[review] new string type
Steven Schveighoffer
schveiguy at yahoo.com
Wed Dec 1 13:44:42 PST 2010
On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
<public at kyllingen.nospamnet> wrote:
> On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
>
>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
>> <jmdavisProg at gmx.com> wrote:
>>
>> [...]
>>
>>> 4. Indexing is no longer O(1), which violates the guarantees of the
>>> index operator.
>>
>> Indexing is still O(1).
>>
>>> 5. Slicing (other than a full slice) is no longer O(1), which violates
>>> the guarantees of the slicing operator.
>>
>> Slicing is still O(1).
>>
>> [...]
>
> It feels extremely weird that the indices refer to code units and not
> code points. If I write
>
>     auto str = mystring("hæ?");
>     writeln(str[1], " ", str[2]);
>
> I expect it to print "æ ?", not "æ æ" like it does now.

I don't think any implementation can do that without giving up O(1)
indexing, unless you want to use dchar[].
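
A minimal sketch of why code-point indexing costs more than O(1) on UTF-8
data. It assumes std.utf.decode from Phobos; nthCodePoint is a hypothetical
helper written only for illustration and is not part of the type under
review. Reaching the n-th code point means decoding every code point before
it, because each one may occupy 1 to 4 code units.

```d
import std.utf : decode;

/// Hypothetical helper: returns the n-th code point of a UTF-8 string.
/// The loop runs n times, so the cost grows with n instead of being O(1).
/// n must be smaller than the number of code points in s.
dchar nthCodePoint(string s, size_t n)
{
    size_t i = 0;                 // offset in code units
    foreach (_; 0 .. n)
        decode(s, i);             // decode advances i past one code point
    return decode(s, i);
}

void main()
{
    string s = "hæ?";
    assert(s.length == 4);        // 'æ' takes two UTF-8 code units
    assert(nthCodePoint(s, 1) == 'æ');
    assert(nthCodePoint(s, 2) == '?');
}
```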

But your point is well taken. I think what I'm going to do is throw an
exception when accessing an invalid index, i.e. one that lands in the
middle of a code point. While also surprising, it doesn't result in
"extra data". I feel it's probably very rare to access hard-coded indexes
like that unless you are sure of the data in the string, or to use a
for-loop to access characters.

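
One way the throwing behaviour could look, as a sketch only. SketchString is
a hypothetical name and this is not the implementation under review. It
assumes that indices are code-unit offsets and that opIndex returns the
whole code point starting at that offset, using std.utf.decode and
std.exception.enforce from Phobos.

```d
import std.exception : assertThrown, enforce;
import std.utf : decode;

/// Hypothetical wrapper for illustration, NOT the proposed string type.
struct SketchString
{
    string data;

    /// i is a code-unit offset; the result is the code point starting there.
    dchar opIndex(size_t i)
    {
        enforce(i < data.length, "index out of range");
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        enforce((data[i] & 0xC0) != 0x80,
                "index points into the middle of a code point");
        size_t j = i;
        return decode(data, j);   // reads at most 4 code units, so O(1)
    }
}

void main()
{
    auto str = SketchString("hæ?");
    assert(str[0] == 'h');
    assert(str[1] == 'æ');
    assertThrown(str[2]);         // second code unit of 'æ'
    assert(str[3] == '?');
}
```

With this scheme str[2] throws instead of returning 'æ' a second time, and
the '?' is found at index 3.
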
> On a side note: It seems to me that the only reason to have char, wchar,
> and dchar as separate types in the language is that arrays of said types
> are UTF-encoded strings. If a type such as the proposed one were to
> become the default string type in D, it might as well wrap an array of
> ubyte/ushort/uint, since direct user manipulation of the underlying array
> will generally only happen in the rare cases when one wants to deal
> directly with code units.

I'd still want a char[] array for easy manipulation and eventual
printing. Wrapping a ubyte[] with a string just to print would be
strange. My goal in this exercise is to give control over what to deal
with (code-points vs. code-units) back to the user. Right now, the
library forces you to view strings as code-points, and the compiler
forces you to view them as code-units (except via foreach).
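
The two views can be seen side by side in a short sketch. It assumes the
Phobos range primitives (front, popFront, walkLength, reachable through
std.range), which decode narrow strings into code points, while the
built-in length and index operators work on code units.

```d
import std.range : front, popFront, walkLength;

void main()
{
    string s = "hæ?";

    // Compiler view: length and indexing count code units.
    assert(s.length == 4);
    assert(s[1] == 0xC3);         // first UTF-8 byte of 'æ', not the character

    // Library view: range primitives decode code points.
    assert(s.walkLength == 3);
    auto r = s;
    r.popFront();                 // skips 'h'
    assert(r.front == 'æ');       // front yields a decoded dchar

    // foreach gives either view, depending on the declared element type.
    size_t units, points;
    foreach (char c; s)  ++units;
    foreach (dchar c; s) ++points;
    assert(units == 4 && points == 3);
}
```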

But it is an interesting idea (removing small char types).

-Steve