[Issue 16090] New: popFront generates out-of-bounds array index on corrupted utf-8 strings
via Digitalmars-d-bugs
digitalmars-d-bugs at puremagic.com
Sat May 28 22:17:24 PDT 2016
https://issues.dlang.org/show_bug.cgi?id=16090
Issue ID: 16090
Summary: popFront generates out-of-bounds array index on
corrupted utf-8 strings
Product: D
Version: D2
Hardware: x86
OS: Mac OS X
Status: NEW
Severity: normal
Priority: P1
Component: phobos
Assignee: nobody at puremagic.com
Reporter: jrdemail2000-dlang at yahoo.com
If a UTF-8 string is chopped (terminated) in the middle of a multi-byte UTF-8
character, popFront will generate an out-of-bounds array index. If compiled
with -boundscheck=on, popFront generates a core.exception.RangeError. With
-boundscheck=off, the behavior is undefined. In my tests of the program below,
the while loop ran forever until it generated a bus error.
void main(string[] args) {
    import std.stdio;
    import std.range;
    auto s = "aä";
    auto corrupted = s[0 .. $-1];
    auto n = 0;
    while (!corrupted.empty) {
        corrupted.popFront;
        n++;
    }
    writeln(n);
}
In this program, the 'ä' character is a two-byte UTF-8 sequence. Dropping the
last byte leaves an incomplete UTF-8 code point.
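To make the byte layout concrete, here is a minimal sketch that prints the raw code units of the string before and after the slice. It assumes the source file is saved as UTF-8, and it uses std.string.representation only to view the string as bytes.

```d
import std.stdio : writefln;
import std.string : representation;

void main()
{
    string s = "aä";
    // representation exposes the raw UTF-8 code units as immutable(ubyte)[]
    auto bytes = s.representation;
    writefln("%(%02X %)", bytes);             // prints "61 C3 A4"
    // Dropping the last byte leaves the lead byte 0xC3 without its continuation byte
    writefln("%(%02X %)", bytes[0 .. $ - 1]); // prints "61 C3"
}
```

The lead byte 0xC3 (binary 110xxxxx) announces a two-byte sequence, but only one byte remains in the sliced string.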
The reason this is so problematic is that string processing often involves
corrupted strings, in particular, strings read at run-time from input sources.
In the sample program above it can be said that this is a programmer error.
However, if the string is read from an outside source, the program needs to be
able to defend against corrupted strings.
It appears this problem arises from this code in popFront (isNarrowString),
currently line 2076 in std/range/primitives.d:
    import core.bitop : bsr;
    auto msbs = 7 - bsr(~c);
    if ((msbs < 2) | (msbs > 6))
    {
        //Invalid UTF-8
        msbs = 1;
    }
    str = str[msbs .. $];
The msbs variable is holding the length of the utf-8 code point as indicated by
the first byte. The 'str[msbs .. $]' expression assumes the string is long
enough to hold the full code point.
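One possible way to avoid the out-of-bounds slice is to clamp the advance to the remaining length. The following is only a sketch under stated assumptions, not the Phobos implementation or a proposed patch: claimedLength and clampedPopFront are hypothetical names, and the lead-byte check uses plain bit masks (limited to sequences of up to four bytes) instead of the bsr computation quoted above.

```d
import std.algorithm.comparison : min;

// Hypothetical helper (not part of Phobos): sequence length claimed by a lead byte.
size_t claimedLength(char c)
{
    if (c < 0x80) return 1;            // ASCII
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // invalid lead byte: skip one code unit
}

// Hypothetical bounds-safe variant of popFront for string.
void clampedPopFront(ref string str)
{
    assert(str.length, "clampedPopFront called on an empty string");
    // Never advance past the end, even if the lead byte claims more bytes.
    str = str[min(claimedLength(str[0]), str.length) .. $];
}

void main()
{
    import std.stdio : writeln;

    string corrupted = "aä"[0 .. $ - 1];
    size_t n;
    while (corrupted.length)
    {
        corrupted.clampedPopFront();
        ++n;
    }
    writeln(n); // prints 2: 'a', then the orphaned lead byte
}
```

With the clamp in place, the truncated sequence is consumed as a single step and the loop terminates. Whether popFront should instead throw, as front does, is a separate design question.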
Besides being problematic for practical applications, it is inconsistent with
other auto-decoding behavior. The 'front' routine will throw a
std.utf.UTFException in this situation. And, popFront itself handles the case
of an invalid first byte differently, by simply moving past it.
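The contrast with 'front' can be checked with a small sketch. frontThrows is a hypothetical helper written for this illustration; it only reports whether decoding the first code point raises a std.utf.UTFException.

```d
import std.range.primitives : front;
import std.utf : UTFException;

// Hypothetical helper: true if auto-decoding the first code point throws.
bool frontThrows(string s)
{
    try
    {
        cast(void) s.front; // auto-decoding tries to read a whole code point
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

void main()
{
    import std.stdio : writeln;

    string corrupted = "aä"[0 .. $ - 1];
    writeln(frontThrows(corrupted));         // first code point is 'a', which decodes normally
    writeln(frontThrows(corrupted[1 .. $])); // orphaned lead byte 0xC3, expected to throw as described above
}
```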