Autodecode in the wild and An Awful Hack to std.regex
John Carter via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Jul 27 16:36:37 PDT 2016
Don't you just hate it when you google a problem and find a post
from yourself asking the same question?
In 2013 I ran into the UTF-8 invalid char autodecode UTFException,
and the answer then was "use std.encoding.sanitize". My opinion on
looking at the implementation was then, as it is now... Eww!
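For reference, this is roughly what the sanitize route looks like. It is a minimal sketch only: the helper name `cleaned` is made up for illustration, and the sample bytes are invented. Part of the "Eww" is visible right away: sanitize only accepts immutable input, so the buffer that byLine reuses has to be copied for every line.

```d
import std.encoding : sanitize;
import std.utf : validate;

/// Hypothetical helper (not part of Phobos): returns a valid UTF-8
/// copy of a possibly malformed line.
string cleaned(const(char)[] line)
{
    // sanitize takes immutable(E)[], so the mutable byLine buffer
    // must be duplicated first: one allocation per line.
    return sanitize(line.idup);
}

void main()
{
    // 'a', a lone 0xFF byte (never valid in UTF-8), 'b'
    char[] raw = ['a', cast(char) 0xFF, 'b'];

    string ok = cleaned(raw);

    // validate throws UTFException on malformed input, so reaching
    // the end of main means the sanitized copy is well-formed.
    validate(ok);
}
```

The sanitized string can then be handed to matchFirst without the decoder throwing, at the cost of the extra copy.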
Since then, I'm glad to see Walter Bright agrees that autodecode
is problematic.
http://forum.dlang.org/thread/nh2o9i$hr0$1@digitalmars.com
After wading through 46 pages of that, and Jack Stouffer's handy
blog entry and the longish discussion thread on it...
https://forum.dlang.org/post/eozguhavggchzzruzkwk@forum.dlang.org
Maybe I missed something.
What am I supposed to do here in 2016?
I was happily churning through 25 gigabytes of data with
    foreach( line; File( file).byLine()) {
        auto c = line.matchFirst( myRegex);
        .
        .
When I hit an invalid codepoint...Again.
What is an efficient (elegant) solution?
An inelegant solution was to hack into the point that throws the
exception and "Do It Right" (for various values of Right):
diff -u /usr/include/dmd/phobos/std/regex/internal/ir.d{~,}
--- /usr/include/dmd/phobos/std/regex/internal/ir.d~ 2015-12-03 14:41:31.000000000 +1300
+++ /usr/include/dmd/phobos/std/regex/internal/ir.d 2016-07-28 11:04:55.525480585 +1200
@@ -591,7 +591,7 @@
         pos = _index;
         if(_index == _origin.length)
             return false;
-        res = std.utf.decode(_origin, _index);
+        res = std.utf.decode!(UseReplacementDchar.yes)(_origin, _index);
         return true;
     }
     @property bool atEnd(){
That "Works For Me".
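To show what the one-line change does outside the regex engine, here is a small standalone sketch. The sample bytes are invented for illustration; the only Phobos names used are std.utf.decode, UseReplacementDchar and UTFException.

```d
import std.exception : assertThrown;
import std.utf : decode, UseReplacementDchar, UTFException;

void main()
{
    // 'a', a lone 0xFF byte (never valid in UTF-8), 'b'
    char[] raw = ['a', cast(char) 0xFF, 'b'];

    // Default behaviour: decoding at the bad byte throws.
    size_t i = 1;
    assertThrown!UTFException(decode(raw, i));

    // With the flag: the bad sequence decodes to U+FFFD instead.
    size_t j = 1;
    dchar c = decode!(UseReplacementDchar.yes)(raw, j);
    assert(c == '\uFFFD');

    // The index still advances, so a decoding loop keeps making progress.
    assert(j > 1);
}
```

No copy of the input is needed here, which is why patching the decode call is so much cheaper than sanitizing every line first.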
But it vaguely feels to me that that template parameter needs to
be threaded all the way up through the regex engine, so the caller
can choose the behaviour.