std.utf.decode behaves unexpectedly - Bug?

Spacen Jasset via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Nov 6 11:41:49 PST 2015


On Friday, 6 November 2015 at 19:26:50 UTC, HeiHon wrote:
> Consider this:
>
> [code]
> import std.stdio, std.utf, std.exception;
>
> void do_decode(string txt)
> {
>     try
>     {
>         size_t idx;
>         writeln("decode ", txt);
>         for (size_t i = 0; i < txt.length; i++)
>         {
>             dchar dc = std.utf.decode(txt[i..i+1], idx);
>             writeln(" i=", i, " length=", txt[i..i+1].length,
>                     " char=", txt[i], " idx=", idx, " dchar=", dc);
>         }
>     }
>     catch(Exception e)
>     {
>         writeln(e.msg, " file=", e.file, " line=", e.line);
>     }
>     writeln();
> }
>
> void main()
> {
>     do_decode("abc");
> /+ result:
> decode abc
>  i=0 length=1 char=a idx=1 dchar=a
>  i=1 length=1 char=b idx=2 dchar=c
>  i=2 length=1 char=c idx=3 dchar=
> +/
>
>     do_decode("åbc");
> /+ result:
> decode åbc
> Attempted to decode past the end of a string (at index 1) file=D:\dmd2\windows\bin\..\..\src\phobos\std\utf.d line=1268
> +/
>
>     do_decode("aåb");
> /+ result:
> decode aåb
>  i=0 length=1 char=a idx=1 dchar=a
> core.exception.RangeError at std\utf.d(1265): Range violation
> ----------------
> 0x004054D4
> 0x0040214F
> 0x004045A7
> 0x004044BB
> 0x00403008
> 0x755D339A in BaseThreadInitThunk
> 0x76EE9EF2 in RtlInitializeExceptionChain
> 0x76EE9EC5 in RtlInitializeExceptionChain
> +/
> }
> [/code]
>
> I would expect:
> decode abc -> dchar a, dchar b, dchar c
> decode åbc -> dchar å, dchar b, dchar c
> decode aåb -> dchar a, dchar å, dchar b
>
> Am I using std.utf.decode wrongly or is it buggy?

I wouldn't have thought you would want to do this:

   dchar dc = std.utf.decode(txt[i..i+1], idx);

since txt is UTF-8, which is a multi-byte, variable-length 
encoding. txt[i..i+1] takes a single code unit (one byte), so 
for any non-ASCII character you end up with an invalid fragment 
of a UTF-8 sequence.

You probably just want to call decode(txt, i) instead. According 
to the documentation, it decodes one code point starting at 
index i and advances i past it by the right number of code 
units. So pairing that with a while (i < txt.length) loop, 
without the i++, might do the trick.
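
Something along these lines, as a rough sketch rather than a 
tested fix. The helper name codePoints is made up for 
illustration; the only library call is std.utf.decode, used on 
the whole string with the index passed by ref:

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical helper: collect the code points of a UTF-8 string.
dchar[] codePoints(string txt)
{
    dchar[] result;
    size_t i = 0;
    while (i < txt.length)
    {
        // decode reads one code point starting at txt[i] and
        // advances i (passed by ref) past all of its code units,
        // so there is no separate i++ here.
        result ~= decode(txt, i);
    }
    return result;
}

void main()
{
    foreach (s; ["abc", "åbc", "aåb"])
        writeln(s, " -> ", codePoints(s));
}
```

Note that the loop passes the whole string and lets decode move 
the index, instead of slicing one byte at a time and keeping a 
separate idx that is never reset.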



