String implementations

Sat Jan 19 18:54:51 PST 2008

On Thu, 17 Jan 2008 13:40:12 -0800, Walter Bright wrote:

> Because I've worked with internationalized code in C/C++ where the
> encoding isn't specified, and it's very bad.

I was referring more to having to switch to and from the different UTF 
types just to change a few characters around. I wasn't really referring 
to letting the programmer decide what kind of string encoding to use.

> It is impractical (i.e. very inefficient) to index arrays otherwise,
> especially in getting array lengths, doing slicing, etc. In fact, it is
> rather rare to index by code units. The times you might want to do it
> are easily handled by foreach(dchar c, string).

Well yes, I'm sure there's a performance hit for changing how it is 
indexed, but at the same time, who would honestly prefer to index by 
code unit without first finding where the code points begin? You're 
practically stabbing in the dark if you try to slice a char[] array 
without first iterating over it with foreach to find those boundaries.
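
To make the slicing problem concrete, here is a rough sketch of what 
slicing by code point looks like when the boundaries have to be found 
first. It assumes std.utf.toUTFindex as found in current Phobos; 
sliceByCodePoint is a made-up helper name, not a library function.

```d
import std.utf : toUTFindex;

// Hypothetical helper: slice s by code point positions [from, to).
// toUTFindex walks the string from the start, so each call is O(n).
string sliceByCodePoint(string s, size_t from, size_t to)
{
    immutable lo = toUTFindex(s, from); // code unit index of code point `from`
    immutable hi = toUTFindex(s, to);   // code unit index of code point `to`
    return s[lo .. hi];
}

void main()
{
    string s = "na\u00EFve caf\u00E9";  // 10 code points, 12 UTF-8 code units
    assert(s.length == 12);
    assert(sliceByCodePoint(s, 0, 5) == "na\u00EFve");
    // A plain s[0 .. 3] would cut the two-byte sequence for U+00EF in half.
}
```

The extra walk on every slice is presumably the inefficiency Walter is 
talking about.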

> It does, see foreach. In general, I don't think it's a good idea for the
> language to try to completely hide the multibyte nature of UTF. For
> example, when you're allocating and copying strings around, you need the
> byte length, not the number of code points.

string str = "etc";
int strlen = str.length;   // what I'd want: number of characters
int arrsize = str.sizeof;  // what I'd want: size in bytes

Seems pretty simple to me. And you don't have to completely hide the 
multibyte nature. Casting to a byte[] would allow full access to each 
byte, which might sound hackish, but at the same time manipulating the 
individual bytes of a string sounds like you're more than likely doing 
something just as hackish.
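
For comparison, this is how the same three numbers come out in D as I 
understand it, sketched against current Phobos (std.utf.count and 
std.string.representation are assumed to be available; older library 
versions may spell these differently).

```d
import std.string : representation;
import std.utf : count;

void main()
{
    string str = "caf\u00E9";            // U+00E9 takes two UTF-8 code units

    size_t units  = str.length;          // code units (bytes for a string)
    size_t points = count(str);          // code points, found by walking the string
    assert(units == 5);
    assert(points == 4);

    // The raw bytes, without a cast.
    immutable(ubyte)[] raw = str.representation;
    assert(raw.length == 5);
    assert(raw[3] == 0xC3 && raw[4] == 0xA9);

    // .sizeof is the size of the slice itself (pointer plus length),
    // not the size of the text it refers to.
    assert(str.sizeof == 2 * size_t.sizeof);
}
```

So .length already plays the role of the byte size here, and the 
character count is the one that needs a library call.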

> I was surprised to discover that most indexing work in strings, such as
> searching, work more efficiently by *not* trying to index by code
> points. There are standard library functions in std.utf to index by code
> points, if you do need it.

Efficiency at the cost of the programmer. :(
Perhaps you could design methods to access a string by either code 
point or byte if you see the need to keep index-by-byte behaviour. 
Something like a toggle method would suit me just fine:
str.indexByByte(true);
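
Short of a toggle, a thin wrapper would give the same effect. This is 
only a sketch of the idea: CodePointView is a hypothetical type, not 
anything in Phobos, and it assumes std.utf.count, decode and toUTFindex 
from current Phobos.

```d
import std.utf : count, decode, toUTFindex;

// Hypothetical wrapper that indexes a UTF-8 string by code point.
// The underlying string stays reachable through .data for byte-level work.
struct CodePointView
{
    string data;

    // Number of code points; walks the whole string, so O(n).
    size_t length()
    {
        return count(data);
    }

    // The n-th code point; also an O(n) walk from the start.
    dchar opIndex(size_t n)
    {
        size_t i = toUTFindex(data, n); // code unit index of code point n
        return decode(data, i);         // decode advances i past the code point
    }
}

void main()
{
    auto v = CodePointView("caf\u00E9!");
    assert(v.data.length == 6);  // code units
    assert(v.length == 5);       // code points
    assert(v[3] == '\u00E9');
    assert(v[4] == '!');
}
```

Every index is a linear walk, so this trades speed for convenience, 
which is the trade I'd be happy to make for small strings.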

> I believe D already has found the right approach to handling UTF
> strings. All I can say is try it out for a while.

I am, and it's turning work with user-editable config files into an 
annoyance that Perl avoids very easily.