[Issue 16090] New: popFront generates out-of-bounds array index on corrupted utf-8 strings

Sat May 28 22:17:24 PDT 2016

https://issues.dlang.org/show_bug.cgi?id=16090

          Issue ID: 16090
           Summary: popFront generates out-of-bounds array index on
                    corrupted utf-8 strings
           Product: D
           Version: D2
          Hardware: x86
                OS: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P1
         Component: phobos
          Assignee: nobody at puremagic.com
          Reporter: jrdemail2000-dlang at yahoo.com

If a utf-8 string is chopped (terminated) in the middle of a multi-byte utf-8
character, popFront will generate an out-of-bounds array index. If compiled
with -boundscheck=on, popFront throws a core.exception.RangeError. With
-boundscheck=off, the behavior is undefined. In my tests of the program below,
the while loop kept running until it generated a bus error.

    void main(string[] args) {
        import std.stdio;
        import std.range;

        auto s = "aä";
        auto corrupted = s[0 .. $-1];
        auto n = 0;
        while (!corrupted.empty) {
            corrupted.popFront;
            n++;
        }
        writeln(n);
    }

In this program, the 'ä' character is a two-byte utf-8 sequence. Dropping the
last byte leaves an incomplete utf-8 code point.
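
To make the truncation concrete, here is a small sketch of the byte layout. It assumes the source file is saved as UTF-8; the variable names are only for illustration.

```d
import std.string : representation;

void main()
{
    string s = "aä";

    // 'a' is one byte (0x61); 'ä' (U+00E4) is encoded as 0xC3 0xA4.
    immutable(ubyte)[] whole = [0x61, 0xC3, 0xA4];
    assert(s.representation == whole);

    // Dropping the last byte leaves the lead byte 0xC3 with no
    // continuation byte after it.
    immutable(ubyte)[] chopped = [0x61, 0xC3];
    assert(s[0 .. $ - 1].representation == chopped);
}
```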

The reason this is so problematic is that string processing often involves
corrupted strings, in particular, strings read at run-time from input sources.
In the sample program above it can be said that this is a programmer error.
However, if the string is read from an outside source, the program needs to be
able to defend against corrupted strings.
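
One way a program can defend itself today is to check external input before iterating over it. The sketch below uses std.utf.validate, which throws a std.utf.UTFException on malformed input; isWellFormedUtf8 is a hypothetical helper name, not a Phobos function.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `input` is well-formed UTF-8.
bool isWellFormedUtf8(const(char)[] input)
{
    try
    {
        validate(input);   // throws UTFException on malformed UTF-8
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "aä";
    assert(isWellFormedUtf8(s));
    assert(!isWellFormedUtf8(s[0 .. $ - 1])); // truncated 'ä'
}
```

This costs an extra pass over the data, which is part of why a popFront that cannot index out of bounds would still be preferable.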

It appears this problem arises from this code in popFront (isNarrowString),
currently line 2076 in std/range/primitives.d:

    import core.bitop : bsr;
    auto msbs = 7 - bsr(~c);
    if ((msbs < 2) | (msbs > 6))
    {
        //Invalid UTF-8
        msbs = 1;
    }
    str = str[msbs .. $];

The msbs variable is holding the length of the utf-8 code point as indicated by
the first byte. The 'str[msbs .. $]' expression assumes the string is long
enough to hold the full code point.
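
As an illustration of one possible fix, the slice offset could be clamped to the remaining length. This is only a sketch, not the Phobos implementation: popFrontClamped is a hypothetical name, and it derives the sequence length with plain comparisons (at most 4 bytes) rather than the bsr computation above.

```d
/// Hypothetical helper, not part of Phobos: pops one code point from a
/// UTF-8 string without ever slicing past the end.
void popFrontClamped(ref string str)
{
    assert(str.length > 0, "popFrontClamped called on an empty string");
    immutable c = str[0];

    // Length implied by the lead byte. ASCII and invalid lead bytes
    // advance by one byte, like the "msbs = 1" fallback above.
    size_t len = 1;
    if (c >= 0xC0 && c < 0xE0) len = 2;
    else if (c >= 0xE0 && c < 0xF0) len = 3;
    else if (c >= 0xF0 && c < 0xF8) len = 4;

    // The missing check: a truncated sequence must not run off the end.
    if (len > str.length) len = str.length;
    str = str[len .. $];
}

void main()
{
    string corrupted = "aä"[0 .. $ - 1];
    size_t n = 0;
    while (corrupted.length != 0)
    {
        popFrontClamped(corrupted);
        ++n;
    }
    assert(n == 2); // 'a', then the dangling lead byte
}
```

The sketch does not verify continuation bytes; it only guarantees that the slice stays in bounds.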

Besides being problematic for practical applications, it is inconsistent with
other auto-decoding behavior. The 'front' routine will throw a
std.utf.UTFException in this situation. And, popFront itself handles the case
of an invalid first byte differently, by simply moving past it.
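
The behavior of 'front' on the same input can be checked with a short sketch like the following. It assumes the default auto-decoding behavior described in this report.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.utf : UTFException;

void main()
{
    string s = "aä";
    auto corrupted = s[0 .. $ - 1];

    // Skip the 'a' by hand so only the dangling lead byte 0xC3 is left.
    auto tail = corrupted[1 .. $];

    // front decodes the code point and reports the truncation by throwing.
    assertThrown!UTFException(tail.front);
}
```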

--