Something about Chinese Disorder Code

Tue Nov 24 11:41:01 PST 2015

Am Tue, 24 Nov 2015 17:08:33 +0000
schrieb BLM768 <blm768 at gmail.com>:

> On Tuesday, 24 November 2015 at 09:48:45 UTC, magicdmer wrote:
> > I display chinese string like:
> >
> > auto str = "你好，世界"
> > writeln(str)
> >
> > and The display is garbled。
> >
> > some windows api like MessageBoxA ,if string is chinese, it 
> > displays disorder code too
> >
> > i think i must use WideCharToMultiByte to convert it , is there 
> > any other answer to solve this question simplely
> 
> You can also try using a wide string literal, i.e. "你好，世界"w. The 
> suffix forces the string to use 16-bit characters.
> 
> The garbled display from writeln might be related to your console 
> settings. If you aren't using the UTF-8 codepage in cmd.exe, that 
> would explain why the text appears garbled. Unfortunately, 
> Windows has some significant bugs with UTF-8 in the console.

This is really our problem. We pretend the output terminal is
UTF-8 without providing any means to configure the terminal or
converting the output to the terminal's encoding. All OSs
provide conversion API that works for the most part (assuming
normalization form D for example), but we just don't use them.
Most contributers were on Linux or English Windows system
where the issue is not obvious. But even in Europe łáö will
come out garbled.

Now this is mostly not an issue with modern Linux and OS X as
the default is UTF-8, but some people refuse to change to
variable length encodings and Windows has always been
preferring fixed length encodings.

The proper way to solve this for the OP in a cross-platform
way is to replace writeln with something that detects whether
the output is a terminal and then converts the string to
something that will display correctly, which can be a simple
wchar[] on Windows or the use of iconv on Linux. Ketmar is
currently using iconv to print D string on his Linux set up
for Cyrillic KIO-8U and I think I used it in some code as
well. It is not perfect, but will make most printable strings
readable. ICU I think is the only library that gets it 100%
correct with regards to normalization and offering you
different error concealment methods.

-- 
Marco