String implementations

Jarrod qwerty at ytre.wq
Wed Jan 16 19:32:22 PST 2008


On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
> 
> The algorithmic penalties would be O(n) for an indexed lookup then
> instead of O(1).

I understand this, but the compiler could probably optimize this for most 
situations. Most string access is sequential, so positions could be 
cached on access when need be, and string literals that aren't modified 
and contain only single-byte chars could be optimized into normal 
indexing. Furthermore, modern processors are incredibly good at 
sequential iteration, and I know from personal experience that they can 
scan over massive chunks of memory in mere milliseconds (hashing entire 
executables in memory to detect changes is a common example of this). 
Scanning over a string shouldn't be noticeable at all. I believe the 
author of the article that bearophile linked agrees with me in this 
regard, in his mention of the charAt implementation.
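
To make the trade-off concrete, here's a rough sketch of that kind of 
O(n) charAt over raw UTF-8 bytes -- in Python rather than D, since it's 
easier to show self-contained here, and with a hypothetical `char_at` 
helper name (the byte patterns are just the standard UTF-8 lead-byte 
ranges):

```python
def char_at(buf: bytes, i: int) -> str:
    """Return the i-th code point of UTF-8 bytes in O(n), by stepping
    over whole code points (lead byte tells us the sequence length)."""
    def seq_len(b: int) -> int:
        if b < 0x80:   # ASCII, single byte
            return 1
        if b >= 0xF0:  # 4-byte sequence
            return 4
        if b >= 0xE0:  # 3-byte sequence
            return 3
        return 2       # 2-byte sequence

    pos = 0
    for _ in range(i):
        if pos >= len(buf):
            raise IndexError(i)
        pos += seq_len(buf[pos])
    n = seq_len(buf[pos])
    return buf[pos:pos + n].decode("utf-8")

s = "héllo✓".encode("utf-8")
assert char_at(s, 1) == "é"  # 'é' is two bytes, but code point index 1
assert char_at(s, 5) == "✓"  # three-byte code point at index 5
```

The linear walk is exactly the cost being debated: indexing by code 
point means skipping over however many continuation bytes precede it.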

> I think the correct method in this case is to convert to utf32 first,
> then index.  Then at least you only take the O(n) penalty once.  

Well, converting to dchar[] means a full iteration over the entire string 
to decode the units. Then the program has to allocate space, copy chars 
over, and add padding. Is it really all that much more efficient? And why 
should the programmer have to worry about the conversion anyway? Good 
languages avoid putting cognitive load on the programmer.
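
For what it's worth, Steven's "convert once, then index" idea can be 
sketched like this (again in Python as a stand-in for D's dchar[], with a 
hypothetical `at` helper): one O(n) decode pass to a fixed-width 
encoding, after which every lookup is plain arithmetic.

```python
s = "naïve ✓".encode("utf-8")  # variable-width: 1-3 bytes per code point

# One O(n) pass: decode, then re-encode as fixed-width UTF-32.
# An explicit endianness ("utf-32-le") avoids the BOM that Python's
# plain "utf-32" codec would prepend.
fixed = s.decode("utf-8").encode("utf-32-le")

def at(buf: bytes, i: int) -> str:
    # O(1): code point i always occupies bytes [4*i, 4*i + 4).
    return buf[4 * i : 4 * i + 4].decode("utf-32-le")

assert at(fixed, 2) == "ï"
assert at(fixed, 6) == "✓"
```

That one-time pass is the allocation-and-copy cost the paragraph above 
questions; whether it pays off depends on how many random lookups follow.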

> Or why not just use dchar[] instead of char[] to begin with?

Yes, you could just use dchar[] all the time, but how many people do 
that? It's very space-inefficient, which is the whole reason UTF-8 
exists. If dchar[]s were meant to be used more often, Walter probably 
would have made them the default string type.
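
The space cost is easy to quantify for the common case -- mostly-ASCII 
text like source code or markup quadruples in size under a fixed 
four-byte encoding (a Python sketch, standing in for char[] vs. dchar[]):

```python
text = "x" * 1000                 # mostly-ASCII payload
utf8 = text.encode("utf-8")       # one byte per ASCII code point
utf32 = text.encode("utf-32-le")  # four bytes per code point, always

assert len(utf8) == 1000
assert len(utf32) == 4000  # 4x the memory for the same content
```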

Eh, I guess this is just one of those annoying little 'loose ends' I see 
when I look at D.


