utf.d codeLength asserts false on certain input

Anonymouse asdf at asdf.net
Tue Mar 27 23:29:57 UTC 2018


My IRC bot has suddenly started crashing. It reads characters from a 
Socket into a ubyte[] array, then idups parts of that (full 
lines) into strings for parsing. Parsing involves slicing those 
strings into meaningful segments: sender, event type, target 
channel/user, message content, etc. I can assume all of them to 
be valid UTF-8 except for the content field.

Running it in a debugger I see I'm tripping an assert in utf.d[1] 
when calling stripRight on a content slice[2].

> /++
>     Returns the number of code units that are required to encode the code point
>     $(D c) when $(D C) is the character type used to encode it.
>   +/
> ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
> if (isSomeChar!C)
> {
>     static if (C.sizeof == 1)
>     {
>         if (c <= 0x7F) return 1;
>         if (c <= 0x7FF) return 2;
>         if (c <= 0xFFFF) return 3;
>         if (c <= 0x10FFFF) return 4;
>         assert(false);  // <--
>     }
>     // ...

This trips it:

> import std.string;
>
> void main()
> {
>     string s = "\355\342\256 \342\245\341⮢\256\245 ᮮ\241饭\250\245".stripRight;  // <-- asserts false
> }

The real backtrace:

> #0  _D3std3utf__T10codeLengthTaZQpFNaNbNiNfwZh (c=26663461) at /usr/include/dlang/dmd/std/utf.d:2530
> #1  0x000055555578d7aa in _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi (this=0x7fffffff99c0, __applyArg1=@0x7fffffff9978: 26663461, __applyArg0=@0x7fffffff9970: 17) at /usr/include/dlang/dmd/std/string.d:2918
> #2  0x00007ffff7a47014 in _aApplyRcd2 () from /usr/lib/libphobos2.so.0.78
> #3  0x000055555578d731 in _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...) at /usr/include/dlang/dmd/std/string.d:2915
> #4  0x00005555558e0cc7 in _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv (slice=..., event=...,parser=...) at source/kameloso/irc.d:1184


Should that not be an Exception, as it's based on input? I'm not 
sure where the code point 26663461 (0x196DA25, above the 0x10FFFF 
maximum) came from. Even so, should it assert?
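
One possible explanation for the 26663461 in the backtrace, as a sketch 
under assumptions: if the reverse foreach walks backwards over 
continuation bytes until it meets a lead byte, without checking how many 
it consumed, the tail of the string above decodes to exactly that value. 
The helper lenientReverseDecode below is hypothetical and only mimics 
that idea; whether _aApplyRcd2 really works this way is an assumption.

```d
import std.stdio : writefln;
import std.utf : isValidDchar;

/// Hypothetical helper (not druntime code): decodes the last "code point"
/// of `bytes` by walking backwards over continuation bytes (10xxxxxx)
/// until a lead byte (11xxxxxx) is found, with no sequence-length check.
uint lenientReverseDecode(const(ubyte)[] bytes) @safe pure nothrow @nogc
{
    assert(bytes.length > 0);
    size_t i = bytes.length - 1;
    uint c = bytes[i];
    if ((c & 0x80) == 0) return c;      // plain ASCII, nothing to decode

    uint d = 0;
    uint shift = 0;
    uint mask = 0x3F;
    while ((c & 0xC0) != 0xC0)
    {
        if (i == 0) return uint.max;    // no lead byte found at all
        d |= (c & 0x3F) << shift;       // accumulate 6 payload bits
        shift += 6;
        mask >>= 1;                     // lead byte carries fewer bits
        --i;
        c = bytes[i];
    }
    return d | ((c & mask) << shift);
}

void main()
{
    // Tail of the failing literal: 饭 (E9 A5 AD) followed by \250 \245.
    static immutable ubyte[] tail = [0xE9, 0xA5, 0xAD, 0xA8, 0xA5];
    immutable value = lenientReverseDecode(tail);

    assert(value == 26663461);                    // same value as in frame #0
    assert(!isValidDchar(cast(dchar) value));     // beyond 0x10FFFF
    writefln("%d (0x%X)", value, value);
}
```

If the literal is exactly as shown above, the 0xE9 lead byte sits at 
index 17, which would match __applyArg0 in frame #1.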

I don't know what to do right now. I'd like to avoid sanitizing 
all lines. I could catch an Exception but not so much an 
AssertError.
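
A minimal sketch of one way around it, assuming it is acceptable to 
validate only the content slice: call std.utf.validate first, which 
throws a catchable UTFException, and fall back to a byte-level strip 
when the slice is malformed. Both stripRightAscii and stripRightLenient 
are hypothetical names, and the fallback only handles ASCII whitespace, 
unlike stripRight.

```d
import std.string : stripRight;
import std.utf : UTFException, validate;

/// Hypothetical helper: strips trailing ASCII whitespace by indexing
/// code units directly, so no UTF decoding takes place.
string stripRightAscii(string s) @safe pure nothrow @nogc
{
    size_t end = s.length;
    while (end > 0)
    {
        immutable c = s[end - 1];
        if (c != ' ' && c != '\t' && c != '\r' && c != '\n') break;
        --end;
    }
    return s[0 .. end];
}

/// Hypothetical helper: uses stripRight on valid UTF-8 and the
/// byte-level fallback on anything validate rejects.
string stripRightLenient(string s)
{
    try
    {
        validate(s);            // throws UTFException on malformed UTF-8
        return s.stripRight;
    }
    catch (UTFException)
    {
        return stripRightAscii(s);
    }
}

void main()
{
    assert(stripRightLenient("hello \t\r\n") == "hello");
    // A lone 0xFF byte is invalid UTF-8, so stripRight is never reached.
    assert(stripRightLenient("abc\xFF  ") == "abc\xFF");
}
```

This keeps the cost to one validation pass over the content field 
rather than sanitizing every line.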


[1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
[2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184

