String implementations

Janice Caron caron800 at googlemail.com
Sun Jan 20 03:45:40 PST 2008


On 1/20/08, Jarrod <qwerty at ytre.wq> wrote:
> > But I still don't see what this has got to do with whether or not a[n]
> > should identify the (n+1)th character rather than the (n+1)th code unit.
>
> Because this issue isn't really to do with the input file itself, it's to
> do with the potential input characters given in the file.

You mean the plain text config file of unknown encoding?

> As far as I can
> tell (I'm using a C library to parse the input) it should be ascii or
> UTF-8 encoding.
> Anything else would probably cause the C lexer to screw up.

If it's an unknown encoding, you store it in a ubyte array. Then you
identify the encoding, convert it to UTF-8 and store the result in a
char array.
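
Roughly like this, say (just a sketch; the Latin-1 fallback is my own
assumption, not anything you've specified):

    import std.file;
    import std.utf;

    char[] loadAsUTF8(char[] filename)
    {
        // Bytes of unknown encoding.
        ubyte[] raw = cast(ubyte[]) std.file.read(filename);
        char[] text = cast(char[]) raw;
        try
        {
            validate(text);     // std.utf.validate throws on malformed UTF-8
            return text;        // already ASCII or UTF-8
        }
        catch (Exception e)
        {
            // Fall back: treat each byte as Latin-1 and re-encode as UTF-8.
            char[] result;
            foreach (ubyte b; raw)
                result ~= cast(dchar) b;   // appending a dchar UTF-8-encodes it
            return result;
        }
    }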


> > Cool. So what is the real world use case that necessitates that
> > sequences of UTF-8 code units must be addressable by character index as
> > the default?
>
> The most important one right now is splicing. I'm allowing both user-
> defined and program-defined macros in the input data. They can be
> anywhere within a string, so I need to splice them out and replace them
> with their correct counterparts.

That works right now with ordinary char arrays. Just use find(),
rfind(), etc., and slicing.
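
Something along these lines (a sketch; find() here is D1 std.string.find,
which returns a code-unit offset or -1):

    import std.string;

    char[] expandMacro(char[] text, char[] name, char[] value)
    {
        int i = find(text, name);
        if (i == -1)
            return text;     // macro not present
        // The slice boundaries come from find(), so they land on valid
        // character boundaries even though they count code units.
        return text[0 .. i] ~ value ~ text[i + name.length .. $];
    }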


> I hear the std lib provided with D is
> unreliable

Huh? Please elucidate.


> so I'm unwilling to use it.

That's your loss, but you can hardly expect Walter to consider adding
new language features just because you are unwilling to use Phobos.



> I also wish to cut off input at a certain letter count for spacing issues
> in both the GUI and dealing with the webscript.

Well, I hate to spoil things, but even /characters/ are not sufficient
to help you figure out spacing issues. For that, you need to be
working on the level of /glyphs/.

For example, consider the word "café". (Just in case that didn't
render properly, that's c, a, f, followed by e-with-an-acute-accent).
You can write this as either

    caf\u00E9

which consists of five UTF-8 code units, or four characters, or four
glyphs; or you can write it as

    cafe\u0301

which consists of six UTF-8 code units, or five characters, or four
glyphs. In the first case, the e-acute glyph is represented as a
single character; in the second case, it is represented as an e
character followed by a combining-acute character. In other words,
even indexing by character is not sufficient to achieve your goals.
You need to index by glyph.
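
You can check the counts above yourself (a quick sketch; toUTF32 is from
std.utf and gives one dchar per character):

    import std.stdio;
    import std.utf;

    void main()
    {
        char[] precomposed = "caf\u00E9";    // e-acute as one code point
        char[] decomposed  = "cafe\u0301";   // e plus combining acute

        writefln("%d code units, %d characters",
                 precomposed.length, toUTF32(precomposed).length);  // 5, 4
        writefln("%d code units, %d characters",
                 decomposed.length, toUTF32(decomposed).length);    // 6, 5
    }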

At some point, you have to say to yourself: wait a minute - this would
be better implemented in a library than in the primitive types of the
language.


> The other thing I'm using is single-letter replacement. Simple stuff like
> capitalising letters and replacing spaces with underscores.

I guess what you're getting at here is that uppercasing a character
might result in a UTF-8 string longer than that of the original
character. And so it might. On the other hand, if you use a foreach
loop to do this sort of thing, your problems are solved.
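
For example (a sketch; std.ctype.toupper only uppercases ASCII, which is
fine for illustration):

    import std.ctype;

    char[] fixUp(char[] input)
    {
        char[] result;
        // A dchar loop variable makes foreach decode whole code points.
        foreach (dchar c; input)
        {
            if (c == ' ')
                result ~= '_';                   // space -> underscore
            else
                result ~= std.ctype.toupper(c);  // appended dchar is re-encoded as UTF-8
        }
        return result;
    }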


> I can think of a lot of other situations that would benefit from proper
> multibyte support too,

UTF-8 support /is/ proper multibyte support. That's why D has it built in.


> ( think of the poor Japanese coders :< ).

Which is why D uses Unicode. Again, I say, D got it right.


> I've seen a lot of programs that print a specified number of characters
> before wrapping around or trailing off, too. The humble gnome console is
> a good example of that. Very handy to have character indexing in this
> case.

I don't agree. This is a problem in font rendering. If you happen to
be using a proportional font, then even character counting won't work.
You need to be counting rendered width in pixels - an operation which
should be generic enough to work for both fixed-width and proportional
fonts.


> In the end I think I'm just tired of having to jump through hoops when it
> comes to string manipulation. I want to be able to say 'this is a
> character, I don't care what it is. Store it, change it, splice it, print
> it.'

dchar.


> happen, and I'll be able to type char chr = 'Δ'; and chr will be a proper
> utf-8 character that I can print, insert into an array, and change.
> What a beautiful day that will be.

dchar.
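
That day is today, more or less (the Greek word below is just an arbitrary
example):

    import std.stdio;
    import std.utf;

    void main()
    {
        dchar chr = 'Δ';                // one variable, one whole character
        dchar[] word = "δοκιμή"d.dup;   // an array of characters
        word[0] = chr;                  // change it
        writefln("%s", toUTF8(word));   // print it as UTF-8
    }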

Put another way, you want to be insulated from the internal
representation. UTF-8 is an implementation detail, whereas what you
want is an array of Unicode characters (whose implementation is not
necessarily dchar[] but you want to be shielded from it anyway). Again
I say, this is a problem for a library class, not a builtin type. And
you're probably going to want even higher level abstractions dealing
with glyphs too (and then font-rendering tools after that).

D allows you to write such libraries.
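
A minimal sketch of what such a type might look like (CharString is a
made-up name, not anything in Phobos):

    import std.utf;

    class CharString
    {
        private dchar[] data;   // one element per Unicode character

        this(char[] utf8) { data = toUTF32(utf8); }

        dchar opIndex(size_t i)               { return data[i]; }
        void opIndexAssign(dchar c, size_t i) { data[i] = c; }

        size_t length() { return data.length; }

        char[] toUTF8() { return std.utf.toUTF8(data); }   // back to UTF-8 for output
    }

A real library would go further - indexing by glyph, normalisation, and so
on - but the point is that none of this needs to live in the builtin types.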

But the builtin types do exactly what it says on the tin. Their
behaviour is well-defined, and it's up to the programmer to understand
that behaviour.



