UTF-16 endianness

Steven Schveighoffer via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Jan 29 15:58:17 PST 2016


On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>>> Is there anything I should know about UTF endianness?
>>
>> It's not any different from other endianness.
>>
>> In other words, a UTF16 code unit is expected to be in the endianness of
>> the platform you are running on.
>>
>> If you are on x86 or x86_64 (very likely), then it should be little endian.
>>
>> If your source of data is big-endian (or opposite from your native
>> endianness),
>
> To be precise - my case is IMAP UTF7 folder name encoding and I finally found
> out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
>
>> it will have to be converted before treating as a wchar[].
>
> Is there any clever way to do the conversion? Or do I need to swap the bytes
> manually?

No clever way, just the straightforward way ;)

Swapping the endianness of a 32-bit value can be done with
core.bitop.bswap. For 16 bits I believe you have to do bit shifting.
Something like:

foreach(ref elem; wcharArr)
    elem = cast(wchar)(((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff));
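
As a possible alternative to hand-written shifts, Phobos has
std.bitmanip.swapEndian, which as far as I know accepts character types
as well as integers. A minimal sketch, where the helper name
swapUnitsInPlace is made up for illustration:

```d
import std.bitmanip : swapEndian;

/// Hypothetical helper (not part of Phobos): reverse the byte order of
/// every UTF-16 code unit in place.
void swapUnitsInPlace(wchar[] units)
{
    foreach (ref unit; units)
        unit = swapEndian(unit); // swaps the two bytes of one code unit
}
```

This swaps unconditionally, so it should only be called when the data is
known to be in the opposite byte order from the host.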

Or you can swap the bytes directly, before casting the buffer to wchar[].
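
Working on the raw bytes could look roughly like the sketch below. It
assumes the input is big-endian UTF-16 with an even number of bytes, and
the function name fromUtf16BE is hypothetical. Because each code unit is
assembled arithmetically from its two bytes, the result does not depend
on the endianness of the machine running the code:

```d
/// Hypothetical helper: build native wchars from big-endian UTF-16 bytes.
wchar[] fromUtf16BE(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    auto result = new wchar[](bytes.length / 2);
    foreach (i, ref unit; result)
    {
        // In big-endian data the high byte comes first, then the low byte.
        unit = cast(wchar)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return result;
}
```

This allocates a new array instead of reinterpreting the input buffer,
which also avoids any alignment concerns with a cast from ubyte[] to
wchar[].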

>
>> Note the version identifiers BigEndian and LittleEndian can be used to
>> compile the correct code.
>
> This solution is of no use to me as I don't want to change the endianness in
> general.

What I mean is that you can annotate your code with version statements like:

version(LittleEndian)
{
    // perform the byteswap
    ...
}

so your code is portable to BigEndian systems (where you would not want 
to byte swap).
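
Filled in, the idea might look like the following sketch. It assumes the
wchar[] was produced by reinterpreting big-endian bytes without any
conversion, and the function name bigEndianUnitsToNative is invented for
this example:

```d
/// Hypothetical helper: fix up code units that hold big-endian UTF-16 data.
void bigEndianUnitsToNative(wchar[] units)
{
    version (LittleEndian)
    {
        // The host byte order differs from the data, so swap each unit.
        foreach (ref unit; units)
            unit = cast(wchar)(((unit << 8) & 0xff00) | ((unit >> 8) & 0x00ff));
    }
    // On a BigEndian host the block above is compiled out and the data
    // is left untouched, since it is already in native order.
}
```

The caller does not need to know which platform it runs on; the decision
is made at compile time.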

-Steve

