Making all strings UTF ranges has some risk of WTF

Thu Feb 4 04:58:06 PST 2010

I'd like D to give the right results and to be not bug-prone by default (even if it's a bit slower) and do something more optimized but maybe faster when I want it and I know what I am doing. So I like the idea of making strings more correct.

I liked the idea of D strings acting like normal arrays, but they are different data structures, so it's better to accept them as something different, even if similar (especially UTF32 ones). So a specific struct, named like String, can be used to represent a string. String literals too return such struct.

In C language the strlen() is O(n), so people can learn that finding the length of most strings is O(n) in D too (UTF32 strings today are not common. If D gets success in a future the UTF32 strings can become the only used strings. Such things are happened many times in computer history). The String structs can cache inside themselves the length and hash value the first time such values are computed (so each String is a struct 4 words long).

Time ago you (Andrei) told me that you don't like a function like len() to have a different computational complexity, like O(n) or O(1), according to the input. But a simple possibility is for "foo".length property to return the length of the walk on all chars, that is O(n) the first time you call it (it's O(1) on UTF32 strings). Some algorithms or some parts of code (and string literals, etc.) can know what's the length of the string they return or use, so they can put this this length value inside the String cached value and there's never a need to compute it.

The struct String that represents the string supports the [] operator too (there can be a MutableString too, that is not used often), but it usually walks the string to find the right character to return. Then a method like "foo".ubyte(i) can be used to return the i-th ubyte of the string. So initially this [] is done with just a O(n) walk, later some performance optimization can be added, like a simple skip tree based on indexes.

Bye,
bearophile