[Issue 14919] New: utf error

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Thu Aug 13 23:54:25 PDT 2015


https://issues.dlang.org/show_bug.cgi?id=14919

          Issue ID: 14919
           Summary: utf error
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: dmd
          Assignee: nobody at puremagic.com
          Reporter: code at dawg.eu

Related/Alternative to issue 14519 (see
https://issues.dlang.org/show_bug.cgi?id=14519#c24).

When I `readText` a file, a lot of time is already spent on UTF validation.
But we don't take advantage of that, and we revalidate UTF in almost every
algorithm.
The idea from issue 14519 to substitute a replacement character for invalid
chars makes the validation a little cheaper (because of the cost of dmd's EH,
see issue 12442) but still incurs a high overhead.
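
As a rough illustration of where the repeated work comes from: Phobos range
algorithms decode a narrow string to `dchar` as they go, whereas
`std.utf.byCodeUnit` walks the raw code units without decoding. The sample
string is made up for this sketch.

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

unittest
{
    // "h\u00E9llo" is 5 code points but 6 UTF-8 code units.
    string s = "h\u00E9llo";

    // Range primitives on a narrow string decode it on every pass.
    assert(s.walkLength == 5);

    // byCodeUnit opts out of decoding and walks the bytes as they are.
    assert(s.byCodeUnit.walkLength == 6);
}
```

It can be run with `dmd -unittest -main -run` on a file containing the block.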

I suggest that we make a clean distinction between unvalidated ubyte[] data
and validated strings, and treat all char[]/wchar[]/dchar[] strings as valid.
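
A minimal sketch of what such a boundary could look like, assuming validation
happens exactly once when bytes become a string. `toValidString` is a
hypothetical helper named for this example, not an existing Phobos function;
`std.utf.validate` is the existing check it relies on.

```d
import std.exception : assertThrown;
import std.utf : validate, UTFException;

/// Hypothetical helper (not a Phobos function): checks raw bytes once
/// and returns them typed as a string, without copying.
string toValidString(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw; // reinterpret the same memory
    validate(s);               // throws UTFException on malformed UTF-8
    return s;
}

unittest
{
    immutable(ubyte)[] good = [0x68, 0x69]; // "hi"
    immutable(ubyte)[] bad  = [0xFF, 0xFE]; // 0xFF never occurs in UTF-8

    assert(toValidString(good) == "hi");
    assertThrown!UTFException(toValidString(bad));
}
```

Under this scheme, code that receives a `string` could skip its own checks,
and anything still typed `ubyte[]` would be known to be unchecked.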

The compiler already checks string literals, and a few of the string-reading
functions do it as well. Unfortunately byLine and readln currently don't
validate UTF.
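
To illustrate the gap described above, the following sketch writes a small
file containing one malformed line and reads it back with `byLine`. It assumes
the behaviour reported here (lines are handed over unchecked), so the caller
has to run `std.utf.validate` itself. The file name is made up for the example.

```d
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;
import std.utf : validate, UTFException;

unittest
{
    // "ok\n" followed by a line holding the single invalid byte 0xFF.
    immutable ubyte[] data = [0x6F, 0x6B, 0x0A, 0xFF, 0x0A];
    auto path = buildPath(tempDir, "issue14919_demo.txt");
    write(path, data);
    scope (exit) remove(path);

    size_t invalid;
    foreach (line; File(path).byLine)
    {
        // byLine yields char[] without checking it, so validate here.
        try
        {
            validate(line);
        }
        catch (UTFException)
        {
            ++invalid;
        }
    }
    assert(invalid == 1);
}
```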

Validating once at the boundary could be a much more performant approach to
correct UTF handling than revalidating in every algorithm.
