Higher level built-in strings

bearophile bearophileHUGS at lycos.com
Sun Jul 18 18:48:14 PDT 2010


This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).

So a final jump is to make string types something different from normal arrays. This frees them to behave better as high level strings. Probably other people have had similar ideas, so if you think this post is boring or useless, please ignore it.

So strings can have some differences compared to arrays:

1) length returns code points, but it's immutable and stored in the string, so it's an O(1) operation (returns what std.utf.count() returns).

2) Assigning a string to another one is allowed:
string s1 = s2;
But changing the length manually is not allowed:
s1.length += 2; // error
This makes them more close to immutable, more similar to Python strings.

3) Another immutable string-specific attribute can be added that returns the number of code units in the string, for example .codeunits or .nunits or something similar.

4) s1[i] is not O(1), it's generally slower, and returns the i-th code point (there are ways to speed this operation up to O(log n) with some internal indexes). (Code points in dstrings can be accessed in O(1)).

5) foreach(c; s1) yields a code point dchars regardless if s1 is string, dstring, wstring.
But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses only 7 bit chars. But in such cases you can also use a immutable(ubyte)[]. Python3 does something similar, its has a str type that's always UTF16 (or UTF32) and a bytes type that is similar to a ubyte[]. I think D can do something similar. std.string functions can made to work on ubyte[] and immutable(ubyte)[] too.


Some more things, I am not sure about them:

6) Strings can contain their hash value. This field is initialized with a empty value (like -1) and computed lazily on-demand. (as done in Python). This can make them faster when they often are put inside associative arrays and sets, avoiding to compute their hash value again and again. So strings are not fully immutable, because this value gets initialized. But it's a pure value, it's determined only by the immutable contents of the string, so I don't think this can cause big problems in multi-thread programs. If two threads update it, they find the same result.

7) the validation (std.utf.validat(), or a cast) can be necessary to create a string. This means that the type string/dstring/wstring implies it's validated :-)

8) If strings are immutable, then the append can always create a new string. So the extra information at the end of the memory block (used for appending to dynamic arrays) is not necessary.

9)
- Today memory, even RAM, is cheap, but moving RAM to the CPU caches is not so fast, so a program that works on just the cache is faster.
- UFT8 and UTF16 make strings bidirectional ranges.
- UTF encodings are just data compression, but it's not so efficient.
So a smarter compression scheme can compress strings more in memory, and the decrease in cache misses can compensate for the increased CPU work to decompress them. (But keeping strings compressed can turn them from bidirectional ranges to forward ranges, this is not so bad).
There is a compressor that gives a decompression speed of about 3 times slower than memcpy():
http://www.oberhumer.com/opensource/lzo/
LZO can be used transparently to compress strings in RAM when strings become long enough.
Hash computation and equality tests done among compressed strings are faster.


Turning strings into something different from arrays looks like a loss, but in practice they already are not arrays, thinking about them as normal arrays is no so useful and it's UTF-unsafe, using [] to read the code units doesn't seem so useful. Code that manages strings/wstrings as normal arrays is not correct in general.

Bye,
bearophile


More information about the Digitalmars-d mailing list