Reading dchar from UTF-8 stdin

spir denis.spir at gmail.com
Wed Mar 16 02:52:35 PDT 2011


On 03/15/2011 11:33 PM, Ali Çehreli wrote:
> Given that the input stream is UTF-8, it is understandable that the following
> program pulls just one code unit from the standard input (I think the console
> encoding is UTF-8 on my Ubuntu 10.10):
>
> import std.stdio;
>
> void main()
> {
>     char code;
>     readf(" %s", &code);
>     writeln(code); // <-- may write an incomplete character
> }
>
> ö is represented by two bytes in the UTF-8 encoding. When ö is fed to the input
> of the program, writeln expression does not produce a complete character on the
> output. That's understandable with char.
>
> Would you expect all of the bytes to be consumed when a dchar was used instead?
>
> import std.stdio;
>
> void main()
> {
>     dchar code; // <-- now a dchar
>     readf(" %s", &code);
>     writeln(code); // <-- BUG: uses a code unit as a code point!
> }

Well, when I try to run that bit of code, I get an error in
std.format.formattedRead (at the line near the end, marked with "***" below).

void formattedRead(R, Char, S...)(ref R r, const(Char)[] fmt, S args)
{
     auto spec = FormatSpec!Char(fmt);
     static if (!S.length)
     {
         spec.readUpToNextSpec(r);
         enforce(spec.trailing.empty);
     }
     else
     {
         // The function below accounts for '*' == fields meant to be
         // read and skipped
         void skipUnstoredFields()
         {
             for (;;)
             {
                 spec.readUpToNextSpec(r);
                 if (spec.width != spec.DYNAMIC) break;
                 // must skip this field
                 skipData(r, spec);
             }
         }

         skipUnstoredFields();
         alias typeof(*args[0]) A;
         static if (isTuple!A)
         {
             foreach (i, T; A.Types)
             {
                 //writeln("Parsing ", r, " with format ", fmt);
                 (*args[0])[i] = unformatValue!(T)(r, spec);
                 skipUnstoredFields();
             }
         }
         else
         {
             *args[0] = unformatValue!(A)(r, spec); // ***
         }
         return formattedRead(r, spec.trailing, args[1 .. $]);
     }
}

> When the input is ö, now the output becomes Ã.
>
> What would you expect to happen?

I would expect a whole code point representing 'ö'.
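
To illustrate where the Ã comes from, here is a minimal sketch. It is not what
readf does internally; it only assumes that 'ö' arrives as the two UTF-8 code
units 0xC3 0xB6 and that the source file itself is saved as UTF-8. The helper
names firstUnitAsCodePoint and firstCodePoint are made up for this example;
std.utf.decode is the Phobos function doing the real decoding.

```d
import std.stdio;
import std.utf : decode;

// Hypothetical helper: reinterprets the first UTF-8 code unit as if it
// were a complete code point (the behaviour described above).
dchar firstUnitAsCodePoint(string s)
{
    return s[0]; // a char converts implicitly to dchar, value unchanged
}

// Hypothetical helper: decodes the first complete code point and reports
// how many code units it occupied.
dchar firstCodePoint(string s, out size_t consumed)
{
    size_t index = 0;
    dchar c = decode(s, index); // advances index past the whole sequence
    consumed = index;
    return c;
}

void main()
{
    string input = "ö"; // U+00F6, stored as 0xC3 0xB6 in UTF-8

    size_t consumed;
    dchar wrong = firstUnitAsCodePoint(input);
    dchar right = firstCodePoint(input, consumed);

    writefln("first code unit as code point: U+%04X", cast(uint) wrong);
    writefln("decoded code point: U+%04X, code units consumed: %s",
             cast(uint) right, consumed);
}
```

The first value is U+00C3, which is 'Ã'; the second is U+00F6, which is 'ö',
with both code units consumed. That second result is what I would expect from
reading into a dchar.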

> Ali
>
> P.S. As what is written is not the same as what is read above, I am reminded of
> another issue: would you expect the strings "false" and "true" to be accepted
> as correct inputs when readf'ed to bool variables?

Yep!

Denis
-- 
_________________
vita es estrany
spir.wikidot.com


