Read a unicode character from the terminal

Jacob Carlborg doob at me.com
Sun Apr 1 04:58:15 PDT 2012


On 2012-03-31 20:53, Ali Çehreli wrote:
> I recommend using stdin. The destiny of std.cstream is uncertain and
> stdin is sufficient. (I know that it lacks support for BOM but I don't
> need them.)

I thought std.cstream was a stream wrapper around stdin.

> The word 'character' used to mean characters of the Latin-based
> alphabets but with Unicode support that's not the case anymore. In D,
> 'character' means UTF code unit, nothing else. Unfortunately, although
> 'Unicode character' is just the correct term to use, it conflicts with
> D's characters which are not Unicode characters.
>
> 'Unicode code point' is the non-conflicting term that matches what we
> mean with 'Unicode character.' Only dchar can hold code points.
>
> That's the part about characters.

Yeah, exactly. When I think about it, I don't know why I thought "getc" 
would work since it only returns a "char" and not a "dchar".

> The other side is what is being fed into the program through its
> standard input. On my Linux consoles, the text comes as a stream of
> chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is
> capable of supporting Unicode through its settings. On Windows
> terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

I'm on Mac OS X, the terminal is capable of handling Unicode.

> Then, it is the program that must know what the data represents. If you
> are expecting a Unicode code point, then you may think that it should be
> as simple as reading into a dchar:
>
> import std.stdio;
>
> void main()
> {
>     dchar letter;
>     readf("%s", &letter); // <-- does not work!
>     writeln(letter);
> }
>
> The output:
>
> $ ./deneme
> ç
> Ã <-- will be different on different consoles

I tried that as well.

> The problem is, char can implicitly be converted to dchar. Since the
> letter ç consists of two chars (two UTF-8 code units), dchar gets the
> first one converted as a dchar.
>
> To see this, read and write two chars in a loop without a newline in
> between:
>
> import std.stdio;
>
> void main()
> {
>     foreach (i; 0 .. 2) {
>         char code;
>         readf("%s", &code);
>         write(code);
>     }
>
>     writeln();
> }
>
> This time two code units are read and then outputted to form a Unicode
> character on the console:
>
> $ ./deneme
> ç
> ç <-- result of two write(code) expressions
>
> The solution is to use ranges when pulling Unicode characters out of
> strings. std.stdio's stdin does not provide this yet, but it will eventually
> happen (so I've heard :)).
>
> For now, this is a way of getting Unicode characters from the input:
>
> import std.stdio;
>
> void main()
> {
>     string line = readln();
>
>     foreach (dchar c; line) {
>         writeln(c);
>     }
> }
>
> Once you have the input as a string, std.utf.decode can also be used.
>
> Ali
>

I'll give that a try, thanks.
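
For reference, a minimal sketch of the std.utf.decode route, assuming the 
whole line has already been read with readln. The helper name codePoints is 
made up for illustration and is not part of Phobos; only readln, writeln and 
std.utf.decode are library calls.

```d
import std.stdio;
import std.utf : decode;

// Hypothetical helper (not from the thread): turns a UTF-8 string into
// an array of code points, one dchar per code point.
dchar[] codePoints(string text)
{
    dchar[] result;
    size_t index = 0;

    while (index < text.length) {
        // decode reads the code point that starts at index and advances
        // index past all of its UTF-8 code units.
        result ~= decode(text, index);
    }

    return result;
}

void main()
{
    string line = readln();

    foreach (c; codePoints(line)) {
        writeln(c);
    }
}
```

The line returned by readln still contains the trailing newline, so that 
is decoded and written as well. As far as I know, decode throws a 
UTFException on malformed input, so real code would probably want to 
handle that.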

-- 
/Jacob Carlborg

