First Impressions
Chad J
"gamerChad\" at spamIsBad gmail.com
Sat Sep 30 17:15:56 PDT 2006
Anders F Björklund wrote:
>>> What is not powerful enough about the foreach(dchar c; str) ?
>>> It will step through that UTF-8 array one codepoint at a time.
>>
>>
>> I'm assuming 'str' is a char[], which would make that very nice. But
>> it doesn't solve correctly slicing or indexing into a char[].
>
>
> Well, it's also a lot "trickier" than that... For instance, my last name
> can be written in Unicode as Björklund or Bj¨orklund, both of which are
> valid - only that in one of them, the 'ö' occupies two full code points!
> It's still a single character, which is why Unicode avoids that term...
>
So it seems to me the problem is that those 2 bytes are both 2
characters and 1 character at the same time.
In this case, I'd prefer being able to index to a safe default (like the
ö, instead of the umlauts next to the o), or not being able to index at
all.
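To make the composed/decomposed distinction concrete, here is a minimal sketch. It assumes a current D2 compiler with Phobos (so "string" is an immutable char[], and Phobos decodes narrow strings to dchar when they are used as ranges). The helper names countCodePoints and countGraphemes are made up for this illustration.

```d
import std.range : walkLength;
import std.uni;  // byGrapheme, normalize, NFC

// Hypothetical helpers, named only for this illustration.
size_t countCodePoints(string s) { return s.walkLength; }            // decodes UTF-8 to dchar
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    string composed   = "Bj\u00F6rklund";  // 'ö' as the single code point U+00F6
    string decomposed = "Bjo\u0308rklund"; // 'o' followed by combining diaeresis U+0308

    // .length counts UTF-8 code units (bytes), not characters.
    assert(composed.length == 10);
    assert(decomposed.length == 11);

    // The code point counts differ between the two spellings...
    assert(countCodePoints(composed) == 9);
    assert(countCodePoints(decomposed) == 10);

    // ...but both contain nine user-perceived characters.
    assert(countGraphemes(composed) == 9);
    assert(countGraphemes(decomposed) == 9);

    // The strings compare unequal until they are normalized to the same form.
    assert(composed != decomposed);
    assert(normalize!NFC(decomposed) == composed);
}
```

So "index to the ö" only has a well-defined meaning once you pick a unit: bytes, code points, or graphemes.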
> As you know, if you need to access your strings by codepoint (something
> that the Unicode group explicitly recommends against, in their FAQ) then
> char[] isn't a very nice format - because of the conversion overhead...
> But it's still possible to translate, transform, and translate back ?
>
I read that FAQ at the bottom of this post, and didn't see anything
about accessing strings by codepoint. Maybe you mean a different FAQ
here, in which case, could I have a link please? I've been to the
Unicode site before, and all I remember was being confused and having a
hard time finding the info I wanted :(
Also I still am not sure exactly what a code point is. And that FAQ at
the bottom used the word "surrogate" a lot; I'm not sure about that one
either.
When you say char[] isn't a nice format, I wasn't thinking about having
the string class I mentioned earlier store the data ONLY as char[]. It
might be wchar[] or dchar[]. Then it would be automatically converted
between those encodings, either at compile time (when possible) or dynamically at
runtime (hopefully only when needed). So if someone throws a Chinese
character literal at it, there is a very big clue there to use UTF-32 or
something that can store all of the characters in a uniform width sort
of way, to speed indexing. Algorithms could be used so that a program
'learns' at runtime what kind of strings are dominating the program, and
uses algorithms optimized for those. Maybe this is a bit too complex,
but I can dream, hehe.
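A much simpler cousin of that idea can be sketched as follows: keep the UTF-8 data, and build a dchar copy only the first time someone indexes by code point. This is an illustration under assumptions, not a design from this thread. It assumes a current D2 compiler with Phobos, and the type and method names (CachedString, codePointAt, slice) are hypothetical.

```d
import std.conv : to;

// Hypothetical wrapper: stores UTF-8, decodes to UTF-32 lazily on first indexed access.
struct CachedString
{
    private string utf8;     // original UTF-8 data
    private dstring utf32;   // decoded copy, built on demand
    private bool decoded;    // true once utf32 has been filled in

    this(string s) { utf8 = s; }

    private void ensureDecoded()
    {
        if (!decoded)
        {
            utf32 = utf8.to!dstring;  // one-time O(n) conversion
            decoded = true;
        }
    }

    // Number of code points (not bytes).
    size_t length() { ensureDecoded(); return utf32.length; }

    // O(1) access by code point index after the first call.
    dchar codePointAt(size_t i) { ensureDecoded(); return utf32[i]; }

    // Slice by code point index, returned as UTF-8 again.
    string slice(size_t from, size_t to) { ensureDecoded(); return utf32[from .. to].to!string; }
}

void main()
{
    auto s = CachedString("Bj\u00F6rklund");
    assert(s.length == 9);
    assert(s.codePointAt(2) == '\u00F6');
    assert(s.slice(0, 3) == "Bj\u00F6");
}
```

The trade-off is memory: the decoded copy costs four bytes per code point, and it still indexes code points, not user-perceived characters.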
>> If nothing was done about this and I absolutely needed UTF support,
>> I'd probably make a class like so: [...]
>
>
> In my own mock String class, I cached the dchar[] codepoints on demand.
> (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
>
>> All in all it is a drag that we should have to learn all of this UTF
>> stuff. I want char[] to just work!
>
>
> Using Unicode strings and characters does require a little learning...
> (where http://www.unicode.org/faq/utf_bom.html is a very good page)
> And D does force you to think about string implementation, no question.
> This has both pros and cons, but it is a deliberate language decision.
>
> If you're willing to handle the "surrogates", then UTF-16 is a rather
> good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
> A downside is that it is not "ascii-compatible" (has embedded NUL chars)
> and that it is endian-dependent unlike the more universal UTF-8 format.
>
> --anders
My impression has gone from being quite scared of UTF to being not so
worried, but only for myself. D seems to be good at handling UTF, but
only if someone tells you to never handle strings as arrays of
characters. Unfortunately, the first thing you see in a lot of D
programs is "int main( char[][] args )" and there are some arrays of
characters being used as strings. This also means that some array
capabilities like indexing and the braggable slicing are more dangerous
than useful for string handling. It's a newbie trap.
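The trap can be shown in a few lines. This is a sketch assuming a current D2 compiler with Phobos; sliceByCodePoint is a hypothetical helper written for this example, built on std.utf.toUTFindex.

```d
import std.exception : assertThrown;
import std.utf : toUTFindex, validate, UTFException;

// Hypothetical helper: slice by code point index instead of byte index.
string sliceByCodePoint(string s, size_t from, size_t to)
{
    // toUTFindex maps a code point index to the matching code unit (byte) index.
    return s[toUTFindex(s, from) .. toUTFindex(s, to)];
}

void main()
{
    string s = "Bj\u00F6rklund";  // 'ö' is U+00F6, two bytes in UTF-8: 0xC3 0xB6

    // Plain indexing returns a code unit: here the first byte of 'ö', not a character.
    assert(s[2] == 0xC3);

    // Plain slicing happily cuts the two-byte sequence in half; the result is invalid UTF-8.
    assertThrown!UTFException(validate(s[0 .. 3]));

    // Slicing at code point boundaries keeps the 'ö' intact.
    assert(sliceByCodePoint(s, 0, 3) == "Bj\u00F6");
    assert(sliceByCodePoint(s, 3, 9) == "rklund");
}
```

Note that the helper walks the string from the start on every call, so it is O(n), and it still splits a decomposed 'o' + U+0308 pair if asked to.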
Like I said earlier, I either want to be able to index/slice strings
safely, or not at all (or better yet, not by any intuitive means).