std.regex character consumption
petevik38 at yahoo.com.au
Fri Oct 8 14:13:36 PDT 2010
I've been running into a few problems with regular expressions in D. One
of the issues I've had recently is matching strings that contain
non-ASCII characters. As an example:
auto re = regex( `(.*)\.txt`, "i" );
re.printProgram();
auto m = match( "bà.txt", re );
writefln( "'%s'", m.captures[1] );
When I run this I get the following error:
dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116 120] around index 0
The output of printProgram() is:
0: REparen len=1 n=0, pc=>10
9: REanystar
10: REistring x4, '.txt'
19: REend
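To make the byte values in the error message concrete, here is a small sketch that dumps the UTF-8 code units of the input string. It assumes the source file is saved as UTF-8 and that 'à' is the precomposed character U+00E0; codeUnits is a hypothetical helper name, not part of Phobos. The sequence [160 46 116 120] corresponds to indices 2 to 5 of the input, which suggests the decoder was started on the second byte of 'à'.

```d
import std.stdio;
import std.utf : stride;

// Hypothetical helper: view a string as its raw UTF-8 code units.
immutable(ubyte)[] codeUnits(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "bà.txt";

    // All code units: 'à' (U+00E0) is encoded as the two bytes 195 160.
    writeln(codeUnits(s));          // [98, 195, 160, 46, 116, 120, 116]

    // The slice reported in the error message starts at index 2,
    // which is the continuation byte of 'à', not the start of a character.
    writeln(codeUnits(s)[2 .. 6]);  // [160, 46, 116, 120]

    // stride reports how many code units the character at index 1 occupies.
    writeln(stride(s, 1));          // 2
}
```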
While investigating the cause, I noticed that during execution of many
of the regex instructions (e.g. REanystar), the source is advanced with:
src++;
However in other cases (REanychar), it is advanced with:
src += std.utf.stride(input, src);
I found that by replacing the src++ in the REanystar case with a
stride-based advance, the code worked as expected. Although I can't claim
to have a solid understanding of the code, it seems to me that most of
the uses of src++ should be using stride instead.
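The difference between the two ways of advancing can be shown outside the regex engine. The following is only a sketch, not the std.regex code itself: decodableAt and strideIndices are hypothetical helpers, and the exception name UTFException assumes a recent Phobos. Stepping with src++ visits index 2, the middle of 'à', where decoding fails, while stepping with stride only stops at indices where a character begins.

```d
import std.stdio;
import std.utf : decode, stride, UTFException;

// Hypothetical helper: true if a code point can be decoded starting at src.
bool decodableAt(string input, size_t src)
{
    try
    {
        decode(input, src);   // src is a by-value copy inside this function
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Hypothetical helper: the indices visited when advancing with stride.
size_t[] strideIndices(string input)
{
    size_t[] result;
    for (size_t src = 0; src < input.length; src += stride(input, src))
        result ~= src;
    return result;
}

void main()
{
    string input = "bà.txt";

    // src++ visits every index, including 2, the second byte of 'à'.
    foreach (src; 0 .. input.length)
        writeln(src, " decodable: ", decodableAt(input, src));

    // stride skips index 2 and only stops where a code point begins.
    writeln(strideIndices(input));   // [0, 1, 3, 4, 5, 6]
}
```

If the matcher only ever decodes at indices produced this way, it should not see a partial sequence like the one in the error above.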
Is this correct, or have I made some silly mistake and got completely
the wrong end of the stick?