Why Strings as Classes?
Benji Smith
dlanguage at benjismith.net
Mon Aug 25 16:10:05 PDT 2008
Oh, man, I forgot one other thing... And it's a biggie...
The D _Arrays_ page says that "A string is an array of characters.
String literals are just an easy way to write character arrays."
http://digitalmars.com/d/1.0/arrays.html
In my previous post, I also used the "character array" terminology.
Unfortunately, though, it's just not true.
A char[] is actually an array of UTF-8 encoded octets, where each
character may consume one or more consecutive elements of the array.
Retrieving the str.length property may or may not tell you how many
characters are in the string. And pretty much any code that tries to
iterate character-by-character through the array elements is
fundamentally broken.
Take a look at this code, for example:
------------------------------------------------------------------
import tango.io.Stdout;

void main() {
    // Create a string with UTF-8 content
    char[] str = "mötley crüe";

    Stdout.formatln("full string value: {}", str);

    Stdout.formatln("len: {}", str.length);
    // --> "len: 13" ... but there are only 11 characters!

    Stdout.formatln("2nd char: '{}'", str[1]);
    // --> "2nd char: ''" ... where'd my character go?

    Stdout.formatln("first 3 chars: '{}'", str[0..3]);
    // --> "first 3 chars: 'mö'" ... why only 2?

    char o_umlat = 'ö';
    Stdout.formatln("char value: '{}'", o_umlat);
    // --> "char value: ''" ... where's my char?
}
------------------------------------------------------------------
So you can't actually iterate over the char elements of a char[] without
risking that you'll turn your string data into garbage. You can't
trust that the length property tells you how many characters there are.
And you can't trust that an index or a slice will return valid data.
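For what it's worth, one way around this is to decode the UTF-8 into a
dchar[] first, so that length, indexing, and slicing all operate on whole
code points. Here's a minimal sketch, assuming a D2 compiler (hence the
const(char)[] parameter). toCodePoints is a hypothetical helper name of
mine, not a library function. It relies only on foreach decoding UTF-8
when the loop variable is declared as dchar.

```d
// Hypothetical helper (not part of Phobos or Tango): decodes a UTF-8
// string into an array holding one dchar per code point.
dchar[] toCodePoints(const(char)[] s)
{
    dchar[] result;
    // Declaring the loop variable as dchar makes foreach decode the
    // UTF-8 sequence instead of handing back raw code units.
    foreach (dchar c; s)
        result ~= c;
    return result;
}
```

Assuming the literal uses precomposed letters, toCodePoints("mötley crüe")
has length 11 and element [1] is 'ö', at the cost of a copy and four bytes
per element. It counts code points, so a letter written as a base
character plus a combining mark would still count as two.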
Also: take a look at the Phobos string "find" functions:
int find(char[] s, dchar c);
int ifind(char[] s, dchar c);
int rfind(char[] s, dchar c);
int irfind(char[] s, dchar c);
Huh?
To find a character in a char[] array, you have to use a dchar?
To me, that's like looking for a long within an int[] array.
So... if a char[] actually consists of dchar elements, does that mean I
can append a dchar to a char[] array?
dchar u_umlat = 'ü';
char[] newString = "mötley crüe" ~ u_umlat;
No. Of course not. The compiler complains that you can't concatenate a
dchar to a char[] array. Even though the "find" functions indicate that
the array is truly a collection of dchar elements.
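The manual workaround is to encode the dchar into UTF-8 code units
yourself and append those. A rough sketch, assuming D2 Phobos and its
std.utf.encode overload that fills a char[4] buffer and returns the
number of code units written. appendCodePoint is a hypothetical name.

```d
import std.utf : encode;

// Hypothetical helper: returns s with the code point c appended as UTF-8.
string appendCodePoint(string s, dchar c)
{
    char[4] buf;                    // UTF-8 needs at most 4 code units
    immutable n = encode(buf, c);   // n = number of code units written
    return s ~ buf[0 .. n].idup;    // append only the units actually used
}
```

So appending 'ü' to "mötley cr" grows the array by two elements, not one,
which is exactly the kind of detail I'd rather not have to think about.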
Now, don't get me wrong. I understand why the string is encoded as
UTF-8. And I understand that the encoding prevents accurate element
iteration, indexing, slicing, and all the other nice array goodies.
The existing D string implementation is exactly what I'd expect to see
inside the guts of a string class, because encodings are important and
efficiency is important. But those implementation details shouldn't be
exposed through a public API.
To claim that D strings are actually usable as character arrays is more
than a little spurious, since direct access of the array elements can
return fragmented garbage bytes.
If accurate string manipulation is impossible without a set of
special-purpose functions, then I'll argue that the implementation is
already equivalent to that of a class, but without any of the niceties
of encapsulation and polymorphism.
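To make that concrete, here's roughly the kind of wrapper I have in mind.
This is only a sketch under my own assumptions: the name Text and its
members are hypothetical, it's written against D2, and it's deliberately
naive (length and indexing both walk the whole string).

```d
// Hypothetical wrapper class; names are illustrative only.
final class Text
{
    private string data;   // UTF-8 storage stays an implementation detail

    this(string utf8) { data = utf8; }

    // Number of code points, not code units. O(n) in this naive sketch.
    @property size_t length()
    {
        size_t n = 0;
        foreach (dchar c; data)
            ++n;
        return n;
    }

    // The code point at the given 0-based position. Also O(n).
    dchar opIndex(size_t index)
    {
        size_t i = 0;
        foreach (dchar c; data)
        {
            if (i == index)
                return c;
            ++i;
        }
        throw new Exception("index out of range");
    }

    // Explicit access to the encoded form for callers that need it.
    @property string utf8() { return data; }
}
```

The point isn't this particular design. It's that the caller sees whole
code points by default and only gets the encoded bytes by asking for them.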
--benji