Why Strings as Classes?

superdan super at dan.org
Mon Aug 25 17:03:55 PDT 2008


Benji Smith Wrote:

> Oh, man, I forgot one other thing... And it's a biggie...
> 
> The D _Arrays_ page says that "A string is an array of characters. 
> String literals are just an easy way to write character arrays."
> 
> http://digitalmars.com/d/1.0/arrays.html
> 
> In my previous post, I also use the "character array" terminology.
> 
> Unfortunately, though, it's just not true.
> 
> A char[] is actually an array of UTF-8 encoded octets, where each 
> character may consume one or more consecutive elements of the array. 
> Retrieving the str.length property may or may not tell you how many 
> characters are in the string. And pretty much any code that tries to 
> iterate character-by-character through the array elements is 
> fundamentally broken.

try this:

foreach (dchar c; str)
{
    process(c); // c is a whole code point, decoded from utf-8 on the fly
}
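
a fuller sketch of the same idea, in case it helps. it assumes a current D compiler with phobos (std.stdio) rather than tango, and countCodePoints is a made-up helper name, not a library function. the literal uses \u escapes so the result doesn't depend on how the source file is encoded.

```d
import std.stdio;

// hypothetical helper: counts code points by letting foreach
// decode the utf-8 code units into dchar values
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string str = "m\u00F6tley cr\u00FCe"; // "mötley crüe"
    writeln(str.length);            // code units: 13
    writeln(countCodePoints(str));  // code points: 11
}
```

so .length counts storage units and the decoding foreach counts characters. both numbers are correct, they just answer different questions.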

> Take a look at this code, for example:
> 
> ------------------------------------------------------------------
> import tango.io.Stdout;
> 
> void main() {
> 
>     // Create a string with UTF-8 content
>     char[] str = "mötley crüe";
>     Stdout.formatln("full string value: {}", str);
> 
>     Stdout.formatln("len: {}", str.length);
>     // --> "len: 13" ... but there are only 11 characters!
> 
>     Stdout.formatln("2nd char: '{}'", str[1]);
>     // --> "2nd char: ''" ... where'd my character go?
> 
>     Stdout.formatln("first 3 chars: '{}'", str[0..3]);
>     // --> "first 3 chars: 'mö'" ... why only 2?
> 
>     char o_umlat = 'ö';
>     Stdout.formatln("char value: '{}'", o_umlat);
>     // --> "char value: ''" ... where's my char?
> 
> }
> ------------------------------------------------------------------
> 
> So you can't actually iterate the char elements of a char[] without 
> risking that you'll turn your string data into garbage. And you can't 
> trust that the length property tells you how many characters there are. 
> And you can't trust that an index or a slice will return valid data.

you can iterate with foreach or lib functions. an index or slice at an arbitrary offset won't return valid data indeed, but it couldn't anyway. there's no O(1) indexing by character into a string unless it's utf32.
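
here's a rough sketch of what "lib functions" means for slicing, assuming a current D compiler with phobos. std.utf.stride is a real phobos function; firstCodePoints is a hypothetical helper written for this example. note the loop: the cost is linear in the number of characters skipped, which is the point about there being no O(1) indexing.

```d
import std.stdio;
import std.utf : stride;

// hypothetical helper: returns the slice holding the first n code points
// of s, or all of s if it is shorter. it has to walk the string because
// each code point may take a different number of code units.
const(char)[] firstCodePoints(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        i += stride(s, i); // code units used by the code point starting at i
        --n;
    }
    return s[0 .. i];
}

void main()
{
    writeln(firstCodePoints("m\u00F6tley cr\u00FCe", 3)); // möt
}
```

the slice boundary always lands between code points, so the result is valid utf-8 as long as the input was.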

> Also: take a look at the Phobos string "find" functions:
> 
>    int find(char[] s, dchar c);
>    int ifind(char[] s, dchar c);
>    int rfind(char[] s, dchar c);
>    int irfind(char[] s, dchar c);
> 
> Huh?
> 
> To find a character in a char[] array, you have to use a dchar?
> 
> To me, that's like looking for a long within an int[] array.

because you're wrong. you look for a dchar which can represent all characters in an array of a given encoding. the comparison is off.
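
to make that concrete, a minimal sketch of how such a find could work. this is not the phobos implementation, just an illustration; findCodePoint is a hypothetical name and the code assumes a current D compiler. the needle has to be a dchar because a single char can't hold 'ü' at all.

```d
import std.stdio;

// hypothetical helper: returns the code unit index where needle starts
// in s, or -1 if it is not there. foreach decodes each code point and
// i is the index of its first code unit.
ptrdiff_t findCodePoint(const(char)[] s, dchar needle)
{
    foreach (i, dchar c; s)
    {
        if (c == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // 'ü' starts at code unit 10, because 'ö' earlier took two units
    writeln(findCodePoint("m\u00F6tley cr\u00FCe", '\u00FC'));
}
```

the returned index is usable for slicing the original char[] directly, since it always points at the start of a code point.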

> So.. If a char[] actually consists of dchar elements, does that mean I 
> can append a dchar to a char[] array?
> 
>    dchar u_umlat = 'ü';
>    char[] newString = "mötley crüe" ~ u_umlat;
> 
> No. Of course not. The compiler complains that you can't concatenate a 
> dchar to a char[] array. Even though the "find" functions indicate that 
> the array is truly a collection of dchar elements.

that's a bug in the compiler. report it.

> Now, don't get me wrong. I understand why the string is encoded as 
> UTF-8. And I understand that the encoding prevents accurate element 
> iteration, indexing, slicing, and all the other nice array goodies.

i know you understand. you should also understand 

> The existing D string implementation is exactly what I'd expect to see 
> inside the guts of a string class, because encodings are important and 
> efficiency is important. But those implementation details shouldn't be 
> exposed through a public API.

exactly at this point your argument kinda explodes. yes, you should see that stuff inside the guts of a string. which means builtin strings should be just arrays that you build larger stuff from. but wait. that's exactly what happens right now.

> To claim that D strings are actually usable as character arrays is more 
> than a little spurious, since direct access of the array elements can 
> return fragmented garbage bytes.

agreed.

> If accurate string manipulation is impossible without a set of 
> special-purpose functions, then I'll argue that the implementation is 
> already equivalent to that of a class, but without any of the niceties 
> of encapsulation and polymorphism.

and without the disadvantages.


