utf.d codeLength asserts false on certain input

Jonathan M Davis newsgroup.d at jmdavisprog.com
Tue Mar 27 23:57:21 UTC 2018


On Tuesday, March 27, 2018 23:29:57 Anonymouse via Digitalmars-d-learn 
wrote:
> My IRC bot is suddenly seeing crashes. It reads characters from a
> Socket into a ubyte[] array, then idups parts of that (full
> lines) into strings for parsing. Parsing involves slicing such
> strings into meaningful segments: sender, event type, target
> channel/user, message content, etc. I can assume all of them to
> be char[]-compliant except for the content field.
>
> Running it in a debugger I see I'm tripping an assert in utf.d[1]
> when calling stripRight on a content slice[2].
>
> > /++
> >     Returns the number of code units that are required to encode the
> >     code point $(D c) when $(D C) is the character type used to encode it.
> >   +/
> > ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
> > if (isSomeChar!C)
> > {
> >     static if (C.sizeof == 1)
> >     {
> >         if (c <= 0x7F) return 1;
> >         if (c <= 0x7FF) return 2;
> >         if (c <= 0xFFFF) return 3;
> >         if (c <= 0x10FFFF) return 4;
> >         assert(false);  // <--
> >     }
> >     // ...
>
> This trips it:
> > import std.string;
> >
> > void main()
> > {
> >     string s = "\355\342\256 \342\245\341⮢\256\245 ᮮ\241饭\250\245".stripRight;  // <-- asserts false
> > }
>
> The real backtrace:
> > #0  _D3std3utf__T10codeLengthTaZQpFNaNbNiNfwZh (c=26663461) at
> > /usr/include/dlang/dmd/std/utf.d:2530
> > #1  0x000055555578d7aa in
> > _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi
> > (this=0x7fffffff99c0, __applyArg1=@0x7fffffff9978: 26663461,
> > __applyArg0=@0x7fffffff9970: 17) at
> > /usr/include/dlang/dmd/std/string.d:2918
> > #2  0x00007ffff7a47014 in _aApplyRcd2 () from
> > /usr/lib/libphobos2.so.0.78
> > #3  0x000055555578d731 in
> > _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...) at
> > /usr/include/dlang/dmd/std/string.d:2915
> > #4  0x00005555558e0cc7 in
> > _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv
> > (slice=..., event=...,parser=...) at
> > source/kameloso/irc.d:1184
>
> Should that not be an Exception, as it's based on input? I'm not
> sure where the character 26663461 came from. Even so, should it
> assert?
>
> I don't know what to do right now. I'd like to avoid sanitizing
> all lines. I could catch an Exception but not so much an
> AssertError.
>
>
> [1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
> [2]:
> https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184

The assertion means that codeLength requires its dchar argument to be a
valid code point, though the documentation doesn't say so. It probably
should. Presumably it was assumed that no one would pass it an invalid code
point, especially since it's usually called with well-known values rather
than with data from somewhere like a socket. Regardless, the workaround is
to call isValidDchar on the dchar before passing it to codeLength, so that
you can handle the invalid code point yourself instead of calling
codeLength on it.
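
As a rough sketch of that workaround (not code from Phobos or from the bot in
question), the check can be wrapped in a small helper. The name
safeCodeLength and the choice to return 0 for an invalid code point are
assumptions made purely for illustration; the only library calls used are
std.utf.isValidDchar and std.utf.codeLength. The value 26663461 from the
backtrace is 0x196DA25, which is above the 0x10FFFF limit that codeLength
asserts on.

```d
import std.utf : codeLength, isValidDchar;

/// Hypothetical helper (not part of Phobos): returns the number of UTF-8
/// code units needed to encode `c`, or 0 if `c` is not a valid code point.
ubyte safeCodeLength(dchar c) @safe pure nothrow @nogc
{
    // Reject surrogates and anything above 0x10FFFF before codeLength sees it.
    if (!isValidDchar(c))
        return 0;
    return codeLength!char(c);
}

void main()
{
    assert(safeCodeLength('a') == 1);           // ASCII
    assert(safeCodeLength('\u00E9') == 2);      // two-byte sequence
    assert(safeCodeLength('\u20AC') == 3);      // three-byte sequence
    assert(safeCodeLength('\U0001F600') == 4);  // four-byte sequence

    // The value from the backtrace is out of range, so the helper reports 0
    // instead of tripping the assert inside codeLength.
    assert(safeCodeLength(cast(dchar) 26663461) == 0);
}
```

Whether 0 is the right sentinel depends on the caller; throwing an exception
or substituting a replacement character would be equally reasonable choices.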

- Jonathan M Davis



