UTF-16 endianness

Fri Jan 29 14:43:26 PST 2016

On 1/29/16 5:36 PM, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.
>
> For example UTF-16 code for 'ł' character is 0x0142. But this program shows
> otherwise:
>
> import std.stdio;
>
> public void main () {
>     ubyte[] properOrder = [0x01, 0x42];
>     ubyte[] reverseOrder = [0x42, 0x01];
>     writefln( "proper: %s, reverse: %s",
>         cast(wchar[])properOrder,
>         cast(wchar[])reverseOrder );
> }
>
> output:
>
> proper: 䈁, reverse: ł
>
> Is there anything I should know about UTF endianness?

UTF-16 endianness is no different from the endianness of any other
multi-byte value.

In other words, a UTF-16 code unit is expected to be in the endianness of
the platform you are running on.

If you are on x86 or x86_64 (very likely), that is little-endian: the low
byte comes first, so the bytes [0x42, 0x01] are read as 0x0142.

If your source of data is big-endian (or otherwise opposite to your native
endianness), it has to be converted before it can be treated as a wchar[].
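
As a rough illustration of that conversion, here is a minimal sketch that
builds each code unit from its two bytes by hand, so the result is the same
on any host. The name fromBigEndianUtf16 is made up for this example; it is
not a Phobos function.

```d
/// Hypothetical helper: decode big-endian UTF-16 bytes into wchars.
/// Each code unit is assembled with shifts, so host byte order is irrelevant.
wchar[] fromBigEndianUtf16(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    auto result = new wchar[](bytes.length / 2);
    foreach (i, ref unit; result)
    {
        // High byte comes first in big-endian data.
        unit = cast(wchar)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return result;
}

unittest
{
    ubyte[] data = [0x01, 0x42];    // U+0142 as big-endian bytes
    assert(fromBigEndianUtf16(data) == "\u0142"w);
}
```

With this, the "proper" byte order from the original program yields the
expected character without relying on a cast.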

Note that the version identifiers BigEndian and LittleEndian can be used to
select the correct code at compile time.
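
For example, something along these lines would swap the bytes only when the
host is little-endian. This is just a sketch; fixBigEndianUnits is a
hypothetical name, and it assumes the data was big-endian and has already
been cast to wchar[].

```d
/// Hypothetical helper: fix up code units that were reinterpreted from
/// big-endian bytes (e.g. via cast(wchar[])) so they are in host order.
void fixBigEndianUnits(wchar[] units)
{
    version (LittleEndian)
    {
        // Host order differs from the data: swap the two bytes of each unit.
        foreach (ref u; units)
            u = cast(wchar)((u << 8) | (u >> 8));
    }
    // On a BigEndian host the data is already in native order: nothing to do.
}

unittest
{
    ubyte[] raw = [0x01, 0x42];        // big-endian bytes for U+0142
    auto units = cast(wchar[]) raw;    // reinterpreted in host order
    fixBigEndianUnits(units);
    assert(units == "\u0142"w);
}
```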

-Steve