String implementations
Dan
murpsoft at hotmail.com
Thu Jan 17 04:00:31 PST 2008
Jarrod Wrote:
> On Wed, 16 Jan 2008 10:27:53 -0500, Steven Schveighoffer wrote:
> >
> > The algorithmic penalty would then be O(n) for an indexed lookup
> > instead of O(1).
>
> I understand this, but the compiler could probably optimize this for most
> situations. Most string access would be sequential and thus positions
> could be cached on access when need be, and string literals that aren't
> modified and have all single-byte chars could be optimized into normal
> indexing. Furthermore, modern processors are incredibly good at
> sequential iteration and I know from personal experience that they can
> parse over massive chunks of memory in mere milliseconds (hashing entire
> executables in memory for potential changes is a common example of this).
> It shouldn't be noticeable at all to scan over a string. I do believe the
> author of the article that bearophile linked agrees with me in this
> regard, in his mention of the charAt implementation.
>
> > I think the correct method in this case is to convert to utf32 first,
> > then index. Then at least you only take the O(n) penalty once.
>
> Well, converting to dchar[] means a full iteration over the entire string
> to split up the units. Then the program has to allocate space, copy chars
> over, and add padding. Is it really all that much more efficient? And why
> should the programmer have to worry about the conversion anyway? Good
> languages avoid cognitive load on the programmers.
>
> > Or why not just use dchar[] instead of char[] to begin with?
>
> Yes, you could just use dchar[] all the time, but how many people do
> that? It's very space-inefficient which is the whole reason utf-8 exists.
> If dchar[]s were meant to be used more often Walter probably would have
> made them the default string type.
>
> Eh, I guess this is just one of those annoying little 'loose ends' I see
> when I look at D.
It certainly is a whole lot better than the loose ends in other languages; at least we're in UTF and not ASCII (or undefined language).
I personally prefer UTF-8. I can write any Unicode character in UTF-8 if I accept that the odd character that would otherwise need UTF-32 will be stored as \uXXXX. To be honest, that's acceptable, and it gives me the memory savings and O(1) indexing as long as I've got the foresight to predict where the \u's are.
I love D's handling of strings; in fact, it is my *favorite* feature in D.
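
For anyone following the thread, here is a rough sketch of the trade-off being argued about, not a definitive implementation. It assumes Phobos' std.utf (stride, decode, toUTF32); codePointAt is a hypothetical helper written for this post, not a library function. Indexing a char[] is O(1) but counts UTF-8 code units, reaching the n-th character means walking from the start, and converting to UTF-32 pays that O(n) cost once at four bytes per character.

```d
import std.stdio : writeln;
import std.utf : decode, stride, toUTF32;

// Hypothetical helper (not a Phobos function): return the n-th code point
// of a UTF-8 string by walking from the start. This is the O(n) lookup.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;                 // byte offset into s
    foreach (_; 0 .. n)
        i += stride(s, i);        // skip one whole code point
    return decode(s, i);          // decode the code point at byte offset i
}

void main()
{
    string s = "na\u00EFve";      // U+00EF takes two bytes in UTF-8

    assert(s.length == 6);        // .length counts code units, not characters
    assert(s[1] == 'a');          // O(1), correct while the text is single-byte
    assert(codePointAt(s, 2) == '\u00EF');   // O(n) walk to the third character
    assert(codePointAt(s, 3) == 'v');

    // Convert once (O(n) time, 4 bytes per character), then index in O(1).
    dstring d = toUTF32(s);
    assert(d.length == 5);
    assert(d[2] == '\u00EF');
    assert(d[3] == 'v');

    writeln("code units: ", s.length, ", code points: ", d.length);
}
```

If you know where the \u's are, as above, plain s[i] stays safe for everything before the first multi-byte character; past that point the byte index and the character index drift apart.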