how to localize console and GUI apps in Windows

Thu Jan 4 09:59:36 UTC 2018

On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
> On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via 
> Digitalmars-d-learn wrote:
>> This may be endurable if you write an application where 
>> Russian is only one of rare options, and what if your whole 
>> environment is totally Russian?
>
> You mean if your environment uses a non-UTF encoding?  If your 
> environment uses UTF, there is no problem.  I have code with 
> strings in Russian (and other languages) embedded, and it's no 
> problem because everything is in Unicode, all input and all 
> output.

No, I mean the difficulty of writing a program for a non-ASCII 
locale. Since C, learning any programming language has started 
with a "hello world" program, which every non-English programmer 
naturally tries to translate into their native language, only to 
get an unreadable mess on the screen. Thousands try, hundreds 
look for a solution, dozens find it, and a few continue with the 
new language. That's not because these programmers cannot read 
English textbooks; they can. It's because they want to write 
non-English programs for non-English people, and that's 
essential. And there are many programming languages (or rather 
their runtimes) which do not suffer from this deficiency.

That's the reason Unicode has been adopted all over the 
programming world, including in the D language. But what good is 
it to me if I can write a UTF-8 string with text in my native 
language in a D program and still get the same unreadable mess on 
the screen?

Yes, a language still in development can lack support for some 
features, but this forum thread shows that a simple and handy 
solution exists, yet nobody cares to bring it to the first pages 
of every textbook for beginners, at least as a footnote. Thus 
thousands of potential fans of the new language are lost from the 
start.

> But I understand that in Windows you may not have this luxury. 
> So you have to deal with codepages and what-not.
>
> Converting back and forth is not a big problem, and it actually 
> also solves the problem of string comparisons, because std.uni 
> provides utilities for collating strings, etc. But it only 
> works for Unicode, so you have to convert to Unicode internally 
> anyway.  Also, for static strings, it's not hard to make the 
> codepage mapping functions CTFE-able, so you can actually write 
> string literals in a codepage and have the compiler 
> automatically convert it to UTF-8.
>
> The other approach, if you don't like the idea of converting 
> codepages all the time, is to explicitly work in ubyte[] for 
> all strings. Or, preferably, create your own string type with 
> ubyte[] representation underneath, and implement your own 
> comparison functions, etc., then use this type for all strings. 
> Better yet, contribute this to code.dlang.org so that others 
> who have the same problem can reuse your code instead of 
> needing to write their own.

I'd definitely try this if I decide to use the D language for my 
purposes (which is not settled yet). But to decide I need some 
experience, and for now it has stalled at reading the user's 
input (as an exercise I intend to translate my recent, rather 
complex interactive C# program into D).

>> Still, this does not solve the localized input problem: any 
>> localized input throws an exception “std.utf.UTFException...  
>> Invalid UTF-8 sequence”.
>
> Is the exception thrown in readln() or in writeln()? If it's in
> writeln(), it shouldn't be a big deal, you just have to pass 
> the data returned by readln() to fromKOI8 (or whatever other 
> codepage you're using).
>
> If the problem is in readln(), then you probably need to read 
> the input in binary (i.e., as ubyte[]) and convert it manually. 
> Unfortunately, there's no other way around this if you're 
> forced to use codepages. The ideal situation is if you can just 
> use Unicode throughout your environment. But of course, 
> sometimes you have no choice.

It depends.

If I skip the console code page initialization, I see in the 
debugger that the runtime reads the user's input as CP866 
(MS-DOS) Cyrillic and then throws the "Invalid UTF-8 sequence" 
exception when trying to handle it as a UTF-8 string (in 
particular in the strip() or writeln() functions). This situation 
seems quite manageable with the code page conversions you've 
mentioned above. I tried the first library module I found 
(std.windows.charset) and got a rather fanciful but working 
statement:

// code page 1 is CP_OEMCP, the system OEM code page (CP866 here)
response = fromMBSz((readln()~"\0").ptr, 1).strip();

which assigns the correct Latin/Cyrillic content to the response 
variable.
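
For comparison, the same conversion can be sketched without the 
Windows-only module, along the lines of the CTFE-able mapping 
function suggested above. This is only a minimal sketch under 
stated assumptions: fromCP866 is a hypothetical helper name, not 
a Phobos function, and it decodes only ASCII and the Cyrillic 
letters of CP866, replacing every other byte (box-drawing 
characters and so on) with U+FFFD.

```d
/// Hypothetical helper (not part of Phobos): decodes CP866 bytes
/// into a UTF-8 string. Only ASCII and Cyrillic letters are mapped;
/// all other bytes become U+FFFD to keep the sketch short.
string fromCP866(const(ubyte)[] input) pure
{
    string result;
    foreach (b; input)
    {
        dchar c;
        if (b < 0x80)
            c = cast(dchar) b;                       // ASCII passes through
        else if (b <= 0xAF)
            c = cast(dchar)(0x0410 + (b - 0x80));    // А..Я, а..п
        else if (b >= 0xE0 && b <= 0xEF)
            c = cast(dchar)(0x0440 + (b - 0xE0));    // р..я
        else if (b == 0xF0)
            c = '\u0401';                            // Ё
        else if (b == 0xF1)
            c = '\u0451';                            // ё
        else
            c = '\uFFFD';                            // unmapped in this sketch
        result ~= c;  // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}

// The function is pure and allocation-only, so it can run at compile time:
static assert(fromCP866([0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2]) == "Привет");
```

With such a helper the raw bytes returned by readln() could be 
passed through it before strip() is applied, for example via 
std.string.representation, assuming the console really is in 
CP866.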

And if I initialize the console with a SetConsoleCP(65001) call, 
things get worse, as I've said above. Then readln() returns an 
empty string and something breaks inside the runtime, because any 
further readln() calls do not wait for user input and return 
empty strings immediately.