String implementations

Thu Jan 17 04:00:31 PST 2008

Jarrod Wrote:

> On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
> > 
> > The algorithmic penalties would be O(n) for a indexed lookup then
> > instead of O(1).
> 
> I understand this, but the compiler could probably optimize this for most 
> situations. Most string access would be sequential and thus positions 
> could be cached on access when need be, and string literals that aren't 
> modified and have all single-byte chars could be optimized into normal 
> indexing. Furthermore, modern processors are incredibly good at 
> sequential iteration and I know from personal experience that they can 
> parse over massive chunks of memory in mere milliseconds (hashing entire 
> executables in memory for potential changes is a common example of this). 
> It shouldn't be noticeable at all to scan over a string. I do believe the 
> author of the article that bearophile linked agrees with me on this 
> regard, in his mention of charAt implementation.
> 
> > I think the correct method in this case is to convert to utf32 first,
> > then index.  Then at least you only take the O(n) penalty once.  
> 
> Well, converting to dchar[] means a full iteration over the entire string 
> to split up the units. Then the program has to allocate space, copy chars 
> over, and add padding. Is it really all that much more efficient? And why 
> should the programmer have to worry about the conversion anyway? Good 
> languages avoid cognitive load on the programmers.
> 
> > Or why not just use dchar[] instead of char[] to begin with?
> 
> Yes, you could just use dchar[] all the time, but how many people do 
> that? It's very space-inefficient which is the whole reason utf-8 exists. 
> If dchar[]s were meant to be used more often Walter probably would have 
> made them the default string type.
> 
> Eh, I guess this is just one of those annoying little 'loose ends' I see 
> when I look at D.

Certainly is a whole lot better than the loose ends in other languages; at least we're in UTF and not ASCII (or undefined language).

I personally prefer UTF-8.  I can write any UTF character in UTF8 if I accept that odd case of a UTF-32 character will be stored as \uXXXX.  To be honest, that's acceptable; and gives me the memory savings and O(1) as long as I've got the foresight to predict where the \u's are.

I love D's handling of strings, in fact it is my *favorite* feature in D.