Why Strings as Classes?

Benji Smith dlanguage at benjismith.net
Mon Aug 25 16:10:05 PDT 2008


Oh, man, I forgot one other thing... And it's a biggie...

The D _Arrays_ page says that "A string is an array of characters. 
String literals are just an easy way to write character arrays."

http://digitalmars.com/d/1.0/arrays.html

In my previous post, I also used the "character array" terminology.

Unfortunately, though, it's just not true.

A char[] is actually an array of UTF-8 encoded octets, where each 
character may consume one or more consecutive elements of the array. 
Retrieving the str.length property may or may not tell you how many 
characters are in the string. And pretty much any code that tries to 
iterate character-by-character through the array elements is 
fundamentally broken.

Take a look at this code, for example:

------------------------------------------------------------------
import tango.io.Stdout;

void main() {

    // Create a string with UTF-8 content
    char[] str = "mötley crüe";
    Stdout.formatln("full string value: {}", str);

    Stdout.formatln("len: {}", str.length);
    // --> "len: 13" ... but there are only 11 characters!

    Stdout.formatln("2nd char: '{}'", str[1]);
    // --> "2nd char: ''" ... where'd my character go?

    Stdout.formatln("first 3 chars: '{}'", str[0..3]);
    // --> "first 3 chars: 'mö'" ... why only 2?

    char o_umlat = 'ö';
    Stdout.formatln("char value: '{}'", o_umlat);
    // --> "char value: ''" ... where's my char?

}
------------------------------------------------------------------

So you can't actually iterate over the char elements of a char[] without 
risking that you'll turn your string data into garbage. And you can't 
trust that the length property tells you how many characters there are. 
And you can't trust that an index or a slice will return valid data.
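
For comparison, here is a minimal sketch of what does give character-level answers: letting foreach decode the UTF-8 into dchar values. The helper names countChars and byteOffset are made up for illustration and are not part of Phobos or Tango. The sketch counts code points, so it assumes precomposed characters like the ones in the example above.

```d
// Number of code points in a UTF-8 string (not the number of bytes).
size_t countChars(in char[] s) {
    size_t n = 0;
    foreach (dchar c; s) {   // foreach decodes each UTF-8 sequence into one dchar
        n++;
    }
    return n;
}

// Byte offset at which the n-th character (0-based) starts,
// or s.length if the string has n characters or fewer.
size_t byteOffset(in char[] s, size_t n) {
    size_t seen = 0;
    foreach (size_t i, dchar c; s) {   // i is the byte index where c begins
        if (seen == n) return i;
        seen++;
    }
    return s.length;
}

void main() {
    auto str = "mötley crüe";

    assert(str.length == 13);        // bytes
    assert(countChars(str) == 11);   // characters

    // Slice on a character boundary instead of a raw byte index.
    assert(str[0 .. byteOffset(str, 3)] == "möt");
}
```

Note that both helpers walk the string from the start, so "indexing" by character is O(n) rather than O(1). That is exactly the kind of detail a string class would hide.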

Also: take a look at the Phobos string "find" functions:

   int find(char[] s, dchar c);
   int ifind(char[] s, dchar c);
   int rfind(char[] s, dchar c);
   int irfind(char[] s, dchar c);

Huh?

To find a character in a char[] array, you have to use a dchar?

To me, that's like looking for a long within an int[] array.

So... if a char[] actually consists of dchar elements, does that mean I 
can append a dchar to a char[] array?

   dchar u_umlat = 'ü';
   char[] newString = "mötley crüe" ~ u_umlat;

No. Of course not. The compiler complains that you can't concatenate a 
dchar to a char[] array. Even though the "find" functions indicate that 
the array is truly a collection of dchar elements.
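
The append has to go through an explicit encoding step instead. A rough sketch, assuming Phobos and its std.utf.encode function, which appends the UTF-8 encoding of a dchar to a char[] (appendChar is a hypothetical wrapper, not a library function):

```d
import std.utf;   // Phobos; encode() appends the UTF-8 bytes of a dchar

// Hypothetical helper: returns s with c appended as UTF-8.
char[] appendChar(char[] s, dchar c) {
    encode(s, c);   // s grows by one or more bytes, depending on c
    return s;
}

void main() {
    dchar u_umlat = 'ü';

    char[] newString = appendChar("mötley cr".dup, u_umlat);
    newString ~= "e";

    assert(newString == "mötley crüe");
    assert(newString.length == 13);   // 11 characters, 13 bytes
}
```

So one dchar can add two (or more) elements to the array, which is one more special-purpose function call standing in for what looks like a plain array operation.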

Now, don't get me wrong. I understand why the string is encoded as 
UTF-8. And I understand that the encoding prevents accurate element 
iteration, indexing, slicing, and all the other nice array goodies.

The existing D string implementation is exactly what I'd expect to see 
inside the guts of a string class, because encodings are important and 
efficiency is important. But those implementation details shouldn't be 
exposed through a public API.

To claim that D strings are actually usable as character arrays is more 
than a little misleading, since direct access to the array elements can 
return fragmented garbage bytes.

If accurate string manipulation is impossible without a set of 
special-purpose functions, then I'll argue that the implementation is 
already equivalent to that of a class, but without any of the niceties 
of encapsulation and polymorphism.

--benji


