std.regex character consumption
Jonathan M Davis
jmdavisProg at gmx.com
Fri Oct 8 14:31:47 PDT 2010
On Friday, October 08, 2010 14:13:36 petevik38 at yahoo.com.au wrote:
> I've been running into a few problems with regular expressions in D. One
> of the issues I've had recently is matching strings with non-ASCII
> characters. As an example:
>
> auto re = regex( `(.*)\.txt`, "i" );
> re.printProgram();
> auto m = match( "bà.txt", re );
> writefln( "'%s'", m.captures[1] );
>
> When I run this I get the following error:
>
> dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116
> 120] around index 0
> printProgram()
> 0: REparen len=1 n=0, pc=>10
> 9: REanystar
> 10: REistring x4, '.txt'
> 19: REend
>
> While investigating the cause, I noticed that during execution of many
> of the regex instructions (e.g. REanystar), the source is advanced with:
>
> src++;
>
> However in other cases (REanychar), it is advanced with:
>
> src += std.utf.stride(input, src);
>
> I found that by replacing src++ in the REanystar code with stride, the
> code worked as expected. Although I can't claim to have a solid
> understanding of the code, it seems to me that most of the cases of src++
> should be using stride instead.
>
> Is this correct, or have I made some silly mistake and got completely
> the wrong end of the stick?
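Well, without looking at the code, I can't say for certain what's going on, but
advancing through a string of chars or wchars with ++ is wrong in virtually all
cases. stride() returns the number of code units in the code point at the given
index, so adding it to the index moves to the start of the next code point,
while ++ only moves to the next code unit, which could be in the middle of a
code point. That would fit the error you're seeing: 160 (0xA0) is the second
code unit of 'à' in UTF-8, so decode() was handed a position in the middle of
that character.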
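
To make the difference concrete, here is a minimal sketch (not taken from
std.regex) that walks the same string from the example. countCodePoints is a
hypothetical helper written only for this illustration; the only library call
assumed is std.utf.stride(str, index).

```d
import std.stdio : writefln;
import std.utf : stride;

// Hypothetical helper: counts code points by stepping with stride()
// instead of ++, so the index never stops inside a multi-unit sequence.
size_t countCodePoints(string s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}

void main()
{
    string s = "bà.txt"; // 'à' is encoded as two UTF-8 code units: 0xC3 0xA0

    // s.length counts code units, not code points.
    writefln("code units:  %s", s.length);
    writefln("code points: %s", countCodePoints(s));

    // Index 1 is the first code unit of 'à'. Using ++ from there lands on
    // index 2, the continuation unit 0xA0 (160); stride() reports how many
    // code units to skip to reach the next code point.
    size_t i = 1;
    writefln("stride at index 1:    %s", stride(s, i));
    writefln("code unit at index 2: %s", cast(ubyte) s[2]);
}
```

Any loop that steps by characters over a char[] or wchar[] wants the stride()
form; ++ is only safe when the code really does mean to work on individual
code units.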
- Jonathan M Davis