First Impressions!
A Guy With a Question
aguywithanquestion at gmail.com
Thu Nov 30 18:26:19 UTC 2017
On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis
wrote:
> On Thursday, November 30, 2017 03:37:37 Walter Bright via
> Digitalmars-d wrote:
>> On 11/30/2017 2:39 AM, Joakim wrote:
>> > Java, .NET, Qt, Javascript, and a handful of others use
>> > UTF-16 too, some starting off with the earlier UCS-2:
>> >
>> > https://en.m.wikipedia.org/wiki/UTF-16#Usage
>> >
>> > Not saying either is better, each has their flaws, just
>> > pointing out it's more than just Windows.
>>
>> I stand corrected.
>
> I get the impression that the stuff that uses UTF-16 is mostly
> stuff that picked an encoding early on in the Unicode game and
> thought that they picked one that guaranteed that a code unit
> would be an entire character.
I don't think that's true though. Haven't you always been able to
combine two codepoints into one visual representation (Ä, for
example)? To me it's still two characters to look for when going
through the string, but the UI or text interpreter might choose
to combine them. So in certain domains, such as trying to
visually represent the character, yes, a codepoint is not a
character, if what you mean by character is the visual
representation. But what we are referring to as a character can
kind of morph depending on context. When you are running through
the data in the algorithm behind the scenes, though, you care
about the *information*, and therefore the codepoint. And we are
really just having a semantics battle if someone calls that a
character.
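
To make the Ä example concrete, here is a minimal Python sketch,
assuming only the standard library's unicodedata module. The names
precomposed, combined and codepoint_names are just for illustration.
It shows the same visible letter stored as one codepoint or as two,
and that normalization converts between the two forms:

```python
import unicodedata

# The same visible letter, stored two different ways.
precomposed = "\u00C4"   # one codepoint: LATIN CAPITAL LETTER A WITH DIAERESIS
combined = "A\u0308"     # two codepoints: 'A' followed by COMBINING DIAERESIS


def codepoint_names(text):
    """Return the Unicode name of every codepoint in text, in order."""
    return [unicodedata.name(ch) for ch in text]


# Python strings are sequences of codepoints, so len() counts codepoints,
# not what a reader would see on screen.
precomposed_length = len(precomposed)   # 1
combined_length = len(combined)         # 2

# A codepoint-by-codepoint comparison treats the two strings as different...
equal_as_stored = precomposed == combined

# ...while normalizing both to the same form (NFC here) makes them equal.
equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combined)
)
```

Whether "character" means the codepoint or the combined result is
exactly the semantics question: an algorithm that walks codepoints
sees two items in the second string, a renderer draws one.
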
> Many of them picked UCS-2 and then switched later to UTF-16,
> but once they picked a 16-bit encoding, they were kind of stuck.
>
> Others - most notably C/C++ and the *nix world - picked UTF-8
> for backwards compatibility, and once it became clear that
> UCS-2 / UTF-16 wasn't going to cut it for a code unit
> representing a character, most stuff that went Unicode went
> UTF-8.
That's only because C used ASCII, and thus a character was a byte.
UTF-8 is in line with this, so literally nothing needs to change to
get pretty much the same behavior. It makes sense. With this in
mind, it actually might make sense for D to use it.
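
As a rough illustration of both points (ASCII text is byte-for-byte
unchanged in UTF-8, and neither UTF-8 nor UTF-16 guarantees one code
unit per codepoint), here is a small Python sketch. The helper
code_units is a made-up name for this example, not a library
function:

```python
# Size in bytes of one code unit for each encoding used below.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(text, encoding):
    """Count how many code units text occupies in the given encoding."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]


# Pure ASCII text encodes to exactly the same bytes in UTF-8.
ascii_text = "hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# U+00C4 (Ä) fits in one UTF-16 code unit but needs two in UTF-8.
a_umlaut = "\u00C4"
a_umlaut_utf8 = code_units(a_umlaut, "utf-8")        # 2
a_umlaut_utf16 = code_units(a_umlaut, "utf-16-le")   # 1

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
# surrogate pair: one codepoint, two code units.
emoji = "\U0001F600"
emoji_utf8 = code_units(emoji, "utf-8")        # 4
emoji_utf16 = code_units(emoji, "utf-16-le")   # 2
emoji_utf32 = code_units(emoji, "utf-32-le")   # 1
```

Only UTF-32 keeps the one-unit-per-codepoint property, and even then
a combined sequence like A plus a combining diaeresis is still two
codepoints.
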