First Impressions!

Jonathan M Davis newsgroup.d at jmdavisprog.com
Thu Nov 30 17:40:08 UTC 2017


On Thursday, November 30, 2017 13:18:37 A Guy With a Question via 
Digitalmars-d wrote:
> As long as you understand its limitations I think most bugs can
> be avoided. Where UTF16 breaks down is pretty well defined.
> Also, super rare. I think UTF32 would be great too, but it seems
> like just a waste of space 99% of the time. UTF8 isn't horrible,
> I am not going to never use D because it uses UTF8 (that would be
> silly). Especially when wstring also seems baked into the
> language. However, it can complicate code because you pretty much
> always have to assume character != codepoint outside of ASCII. I
> can see a reasonable person arguing that it forcing you to assume
> character != code point is actually a good thing. And that is a
> valid opinion.

The reality of the matter is that if you want to write code that handles
Unicode correctly, then you have to understand the differences between code
units, code points, and graphemes. Since it really doesn't make sense to
operate at the grapheme level for everything (it would be terribly slow and
is completely unnecessary for many algorithms), you pretty much have to
accept that, in the general case, you can't assume that something like a
char represents an actual character, regardless of the encoding. UTF-8 vs
UTF-16 doesn't change anything in that respect, except that more characters
fit in a single UTF-16 code unit than in a single UTF-8 code unit, so it's
easier to think that you're handling Unicode correctly when you actually
aren't. And if you're not dealing with Asian languages, UTF-16 generally
uses up more space than UTF-8. Either way, both are wrong if you're trying
to treat a code unit as a code point, let alone a grapheme. It's just that
we have a lot of programmers who only deal with English and thus don't as
easily hit the cases where their code is wrong. For better or worse, UTF-16
hides the problem better than UTF-8 does, but it exists in both.
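
To make the distinction concrete, here is a minimal sketch that counts UTF-8
code units, UTF-16 code units, code points, and (roughly) graphemes for a
few strings. It is written in Python rather than D purely for illustration.
The helper names (utf8_units, utf16_units, code_points, approx_graphemes)
are made up for this example, and approx_graphemes is only an approximation
that skips combining marks. It is not real grapheme segmentation as
specified by Unicode (UAX #29), which handles many more cases.

```python
import unicodedata


def utf8_units(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units (16-bit values) needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Number of code points; a Python str is a sequence of code points."""
    return len(s)


def approx_graphemes(s: str) -> int:
    """Rough grapheme count: code points that are not combining marks.

    ASSUMPTION: this is a simplification for illustration only. It does not
    implement the UAX #29 segmentation rules.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


samples = {
    "ASCII": "hello",
    "e + combining acute": "e\u0301",
    "CJK": "\u65e5\u672c\u8a9e",
    "emoji outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(
        label,
        utf8_units(text),
        utf16_units(text),
        code_points(text),
        approx_graphemes(text),
    )
```

For the ASCII string all four counts agree, which is why English-only code
so rarely hits the problem. The emoji fits in neither a single UTF-8 nor a
single UTF-16 code unit, and the accented "e" is two code points but a
single grapheme, so no fixed-width code unit, not even a UTF-32 one, maps
one-to-one onto what a user would call a character.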

- Jonathan M Davis


