String implementations
Jarrod
qwerty at ytre.wq
Sat Jan 19 18:54:51 PST 2008
On Thu, 17 Jan 2008 13:40:12 -0800, Walter Bright wrote:
> Because I've worked with internationalized code in C/C++ where the
> encoding isn't specified, and it's very bad.
I was referring more to having to switch to and from different UTF
types just to change a few characters around; I wasn't really referring
to letting the programmer decide what kind of string encoding to use.
> It is impractical (i.e. very inefficient) to index arrays otherwise,
> especially in getting array lengths, doing slicing, etc. In fact, it is
> rather rare to index by code units. The times you might want to do it
> are easily handled by foreach(dchar c, string).
Well, yes, I'm sure there's a performance hit for changing how it is
indexed, but at the same time who would honestly prefer to index by
code unit without first finding where the code points begin? You're
practically stabbing in the dark if you try to slice a char[] array
without first iterating over it with foreach to find its code point
boundaries.
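
To make the slicing hazard concrete, here is a minimal sketch in D. It assumes a present-day Phobos, where std.utf provides toUTFindex and validate. The helper name prefixByPoints is made up for illustration and is not a library function.

```d
import std.exception : assertThrown;
import std.utf : toUTFindex, validate, UTFException;

// Hypothetical helper (not in Phobos): the first n code points of s.
string prefixByPoints(string s, size_t n)
{
    // toUTFindex maps a code point index to a code unit index.
    return s[0 .. toUTFindex(s, n)];
}

void main()
{
    string s = "na\u00EFve"; // U+00EF takes two UTF-8 code units
    assert(s.length == 6);   // .length counts code units

    // A blind slice at code unit 3 cuts U+00EF in half.
    assertThrown!UTFException(validate(s[0 .. 3]));

    // Locating the boundary first keeps the slice valid.
    assert(prefixByPoints(s, 3) == "na\u00EF");

    // foreach with an index yields the code unit offset of each code point.
    size_t[] starts;
    foreach (i, dchar c; s)
        starts ~= i;
    size_t[] expected = [0, 1, 2, 4, 5];
    assert(starts == expected);
}
```

The lookup in prefixByPoints is a linear scan, which is the cost being traded against the convenience of indexing by code point.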
> It does, see foreach. In general, I don't think it's a good idea for the
> language to try to completely hide the multibyte nature of UTF. For
> example, when you're allocating and copying strings around, you need the
> byte length, not the number of code points.
string str = "etc";
int strlen = str.length;   // proposed: number of code points
int arrsize = str.sizeof;  // proposed: size in bytes
Seems pretty simple to me. And you don't have to completely hide the
multibyte nature. Casting to a byte[] would allow full access to each
code unit, which might sound hackish, but at the same time manipulating
individual code units in a string sounds like you're more than likely
doing something just as hackish.
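
For comparison, this is roughly how the same three quantities can be obtained in D as it stands, assuming a present-day Phobos (std.utf.count and std.string.representation). Note that .sizeof on a string gives the size of the slice itself, not the byte length of the text, so it does not behave the way the snippet above proposes.

```d
import std.string : representation;
import std.utf : count;

void main()
{
    string str = "d\u00E9j\u00E0 vu";

    assert(str.length == 9);  // UTF-8 code units, i.e. bytes of text
    assert(count(str) == 7);  // code points, found by a linear scan

    // .sizeof is the size of the slice (pointer plus length), not the text.
    assert(str.sizeof == 2 * size_t.sizeof);

    // representation reinterprets the string as raw bytes without copying.
    immutable(ubyte)[] raw = str.representation;
    assert(raw.length == str.length);
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // the two bytes of U+00E9
}
```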
> I was surprised to discover that most indexing work in strings, such as
> searching, work more efficiently by *not* trying to index by code
> points. There are standard library functions in std.utf to index by code
> points, if you do need it.
Efficiency at the cost of the programmer. :(
Perhaps you could design methods to access a string by either code
point or byte index, if you see the need to keep the index-by-byte
behaviour. Something like a toggle method would suit me just fine.
str.indexByByte(true);
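
Something close to this can be sketched as a library type rather than a language change. The struct below is purely hypothetical (Utf8View, unitAt and pointAt are invented names, not part of Phobos), and it uses two separate methods instead of a toggle so that each call site states which kind of index it means. It assumes std.utf.toUTFindex and std.utf.decode from a present-day Phobos.

```d
import std.utf : decode, toUTFindex;

/// Hypothetical view over a UTF-8 string; not part of Phobos.
struct Utf8View
{
    string data;

    /// Code unit (byte) at code unit index i. Constant time.
    char unitAt(size_t i)
    {
        return data[i];
    }

    /// Code point at code point index n. Linear scan from the start.
    dchar pointAt(size_t n)
    {
        size_t i = toUTFindex(data, n); // code point index -> code unit index
        return decode(data, i);         // decodes one code point, advances i
    }
}

void main()
{
    auto v = Utf8View("h\u00E9llo");
    assert(v.unitAt(1) == 0xC3);      // first byte of U+00E9
    assert(v.pointAt(1) == '\u00E9'); // the whole code point
    assert(v.pointAt(2) == 'l');
}
```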
> I believe D already has found the right approach to handling UTF
> strings. All I can say is try it out for a while.
I am, and it's making working with user-editable config files an
annoyance that Perl avoids very easily.
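
For the config file case, here is a minimal sketch of splitting one line with code unit indices only. The "key = value" format and the helper name splitKeyValue are assumptions made for illustration. It relies on std.string.indexOf returning a code unit index, and on the fact that an ASCII byte such as '=' never occurs inside a multi-byte UTF-8 sequence, so slicing at that index cannot cut a character in half.

```d
import std.string : indexOf, strip;

// Hypothetical helper: splits "key = value" at the first '='.
string[2] splitKeyValue(string line)
{
    ptrdiff_t eq = line.indexOf('='); // code unit index, or -1 if absent
    if (eq < 0)
        return [line.strip, ""];
    return [line[0 .. eq].strip, line[eq + 1 .. $].strip];
}

void main()
{
    string line = "stra\u00DFe = M\u00FCnchen";

    // '=' is the 8th code point but sits at code unit index 8, not 7.
    assert(line.indexOf('=') == 8);

    auto kv = splitKeyValue(line);
    assert(kv[0] == "stra\u00DFe");
    assert(kv[1] == "M\u00FCnchen");
}
```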