Autodecode in the wild and An Awful Hack to std.regex

John Carter via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Jul 27 16:36:37 PDT 2016


Don't you just hate it when you google a problem and find a post 
from yourself asking the same question?

In 2013 I ran into the UTF-8 invalid-character autodecode
UTFException. The answer then was "use std.encoding.sanitize", and
my opinion on looking at the implementation was then, as it is
now... Eww!

Since then, I'm glad to see Walter Bright agrees that autodecode 
is problematic.

http://forum.dlang.org/thread/nh2o9i$hr0$1@digitalmars.com

After wading through 46 pages of that, and Jack Stouffer's handy 
blog entry and the longish discussion thread on it...
https://forum.dlang.org/post/eozguhavggchzzruzkwk@forum.dlang.org

Maybe I missed something.

What am I supposed to do here in 2016?

I was happily churning through 25 gigabytes of data with

     foreach( line; File( file).byLine()) {
        auto c = line.matchFirst( myRegex);
       .
       .

when I hit an invalid code point... Again.

What is an efficient (elegant) solution?

An inelegant solution was to hack the point in std.regex that 
throws the exception and "Do It Right" (for various values of Right):


diff -u /usr/include/dmd/phobos/std/regex/internal/ir.d{~,}
--- /usr/include/dmd/phobos/std/regex/internal/ir.d~	2015-12-03 14:41:31.000000000 +1300
+++ /usr/include/dmd/phobos/std/regex/internal/ir.d	2016-07-28 11:04:55.525480585 +1200
@@ -591,7 +591,7 @@
          pos = _index;
          if(_index == _origin.length)
              return false;
-        res = std.utf.decode(_origin, _index);
+        res = std.utf.decode!(UseReplacementDchar.yes)(_origin, _index);
          return true;
      }
      @property bool atEnd(){

That "Works For Me".
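
For anyone wondering what that one-line change does, here is a minimal sketch of the two decode modes side by side. The helper name decodeLossy and the sample bytes are made up for illustration; only std.utf.decode, UseReplacementDchar and UTFException come from Phobos, and I am assuming a Phobos recent enough to have the UseReplacementDchar flag.

```d
import std.exception : assertThrown;
import std.utf : decode, UseReplacementDchar, UTFException;

/// Hypothetical helper: decode every code point of `s` without throwing.
/// Invalid sequences come back as U+FFFD (the replacement character).
dchar[] decodeLossy(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode!(UseReplacementDchar.yes)(s, i); // advances i
    return result;
}

void main()
{
    // 'a', a lone continuation byte (invalid on its own), then 'b'.
    immutable(ubyte)[] bytes = [0x61, 0x80, 0x62];
    auto s = cast(string) bytes;

    // Default (strict) decoding throws when it reaches the bad byte.
    size_t i = 1;
    assertThrown!UTFException(decode(s, i));

    // Lenient decoding substitutes U+FFFD and carries on.
    assert(decodeLossy(s) == "a\uFFFDb"d);
}
```

So the patched regex engine sees U+FFFD wherever the input is broken, instead of throwing partway through a 25 gigabyte file.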

But I have a vague feeling that this template parameter needs to 
be threaded all the way up through the regex engine.
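
Short of doing that, one possible workaround that leaves Phobos untouched is to scrub each line before handing it to the regex. This is only a sketch: scrub is a hypothetical helper, not a Phobos function, and the pattern and sample bytes are invented. It costs an extra pass and an allocation per line, so it is neither the efficient nor the elegant answer I was hoping for.

```d
import std.array : appender;
import std.regex : matchFirst, regex;
import std.utf : decode, encode, UseReplacementDchar;

/// Hypothetical helper: copy `raw`, replacing each invalid UTF-8
/// sequence with U+FFFD so that later decoding cannot throw.
string scrub(const(char)[] raw)
{
    auto result = appender!string();
    size_t i = 0;
    while (i < raw.length)
    {
        // Lenient decode: returns U+FFFD on bad input and advances i.
        dchar c = decode!(UseReplacementDchar.yes)(raw, i);
        char[4] buf;
        immutable n = encode(buf, c); // re-encode as valid UTF-8
        result.put(buf[0 .. n]);
    }
    return result.data;
}

void main()
{
    // "fo", a stray continuation byte, then "o".
    immutable(ubyte)[] bytes = [0x66, 0x6F, 0x80, 0x6F];
    auto line = cast(const(char)[]) bytes;

    assert(scrub(line) == "fo\uFFFDo");

    // The regex now only ever sees valid UTF-8.
    auto m = scrub(line).matchFirst(regex("o+"));
    assert(!m.empty);
    assert(m.hit == "o");
}
```

In the loop from my original snippet this would read line.scrub.matchFirst( myRegex), with the caveat that matches against U+FFFD stand in for whatever bytes were really there.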

