First Impressions!

A Guy With a Question aguywithanquestion at gmail.com
Thu Nov 30 18:26:19 UTC 2017


On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis 
wrote:
> On Thursday, November 30, 2017 03:37:37 Walter Bright via 
> Digitalmars-d wrote:
>> On 11/30/2017 2:39 AM, Joakim wrote:
>> > Java, .NET, Qt, Javascript, and a handful of others use 
>> > UTF-16 too, some starting off with the earlier UCS-2:
>> >
>> > https://en.m.wikipedia.org/wiki/UTF-16#Usage
>> >
>> > Not saying either is better, each has their flaws, just 
>> > pointing out it's more than just Windows.
>>
>> I stand corrected.
>
> I get the impression that the stuff that uses UTF-16 is mostly 
> stuff that picked an encoding early on in the Unicode game and 
> thought that they picked one that guaranteed that a code unit 
> would be an entire character.

I don't think that's true, though. Haven't you always been able to 
combine two codepoints into one visual representation (Ä, for 
example)? To me it's still two characters to look for when going 
through the string, but the UI or text interpreter might choose 
to combine them. So in certain domains, such as trying to 
visually represent the character, yes, a codepoint is not a 
character, if by "character" you mean the visual representation. 
But what we refer to as a character can morph depending on 
context. When an algorithm is running through the data behind 
the scenes, though, you care about the *information*, and 
therefore the codepoint. And we are really just having a 
semantics battle if someone calls that a character.
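
As a rough illustration (a minimal D sketch; the counts just 
follow from spelling the character with a combining mark, as in 
the comment):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "Ä" spelled as two codepoints: 'A' (U+0041) followed by
    // the combining diaeresis (U+0308)
    string s = "A\u0308";

    writeln(s.length);                // 3 -- UTF-8 code units
    writeln(s.walkLength);            // 2 -- codepoints
    writeln(s.byGrapheme.walkLength); // 1 -- what the UI draws
}

Same data, three defensible answers, depending on which layer 
you decide to call the "character".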

> Many of them picked UCS-2 and then switched later to UTF-16, 
> but once they picked a 16-bit encoding, they were kind of stuck.
>
> Others - most notably C/C++ and the *nix world - picked UTF-8 
> for backwards compatibility, and once it became clear that 
> UCS-2 / UTF-16 wasn't going to cut it for a code unit 
> representing a character, most stuff that went Unicode went 
> UTF-8.

That's only because C used ASCII, and thus a character was a 
single byte. UTF-8 is in line with this, so literally nothing 
needs to change to get pretty much the same behavior. It makes 
sense. With this in mind, it actually might make sense for D to 
use it.
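
To make that backwards compatibility concrete (a minimal sketch; 
the byte values are just the standard UTF-8 encodings):

import std.stdio : writefln;

void main()
{
    // Every ASCII byte is also a single-byte UTF-8 code unit, so
    // plain-ASCII C strings are already valid UTF-8 as-is.
    string ascii = "hello";
    assert(ascii.length == 5);
    assert(ascii[0] == 0x68); // 'h' has the same byte value in both

    // Non-ASCII codepoints become multi-byte sequences whose bytes
    // are all >= 0x80, so they never collide with ASCII values.
    string s = "\u00E9"; // é, U+00E9
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0xA9);

    writefln("%s is %s UTF-8 code units", s, s.length);
}

Old byte-oriented code keeps working on ASCII input and passes 
multi-byte sequences through untouched, which is a big part of 
why the C/*nix world could adopt UTF-8 so cheaply.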
