Regex and UTF-8

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Nov 18 12:40:14 PST 2011


On 18.11.2011 21:07, Andrea Fontana wrote:
> It seems related to toLower too...
>
> Here the line with exception:
>
> s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();
>
> Where s is a string with that sequence...
>
> Using dmd 2.056

You mean one of the prepackaged zips|debs|etc. from the website? Those ship the 
old regex, which, I have to admit, is not that good with Unicode. In that case 
... well, you are somewhat out of luck until the next release.

That's where the brand new regex engine is coming, provided I figure out 
a mysterious FreeBSD|OSX issue (sigh). Unfortunately, I have been very busy 
recently, though maybe this weekend I'll finally work something out.

I just tested it with my version on win32 ... well, it hits one of the 
asserts (it should have been an exception, ouch!), but the fix was easy. 
It's all about '.': inside [] it is already a plain '.' char, so it's just 
wrong to escape it inside a character class (some engines do allow this, 
though it's confusing as hell).
With the assert turned into an exception, it outputs stuff like this:
std.regex.RegexException@std\regex.d(1939): invalid escape sequence
Pattern with error: `[^"a-zA-Z0-9àòèéìù\.` <--HERE-- `]`

After changing \. --> . it does work for me with s = "Sò  ", no exceptions.
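
For reference, a minimal sketch of the corrected call. It assumes the new 
regex (a current github checkout or a later release); the helper name 
normalize is made up for illustration. The only change from the original 
line is the unescaped '.' inside the character class:

```d
import std.regex : regex, replace;
import std.stdio : writeln;
import std.string : toLower;

// Hypothetical helper (not from the original post): replaces every
// character outside the allowed set with a space, then lowercases.
string normalize(string s)
{
    // '.' is already a literal inside [], so no backslash is needed
    auto re = regex(`[^"a-zA-Z0-9àòèéìù.]`, "g");
    return replace(s, re, " ").toLower();
}

void main()
{
    writeln(normalize("Sò  "));   // 'S' is lowercased, 'ò' is in the class and kept
    writeln(normalize("a.b,c"));  // '.' is kept, ',' becomes a space
}
```

The "g" flag makes replace act on every match rather than only the first.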

Bottom line:
First, thanks: this uncovered a serious issue, i.e. a misjudged assert on 
wrong escapes in character classes.
Second, if you are on win32/linux you might want to try the fresh github version.
And stay tuned for the next release, which should fix most of the regex 
issues once and for all.
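
As for the original question of whether [83, 195, 179, 32] is well-formed 
UTF-8, a quick check along these lines can settle it. This is only a sketch 
and not what std.regex does internally; isValidUtf8 is a made-up helper 
name, while std.utf.validate is the Phobos function that throws 
UTFException on malformed input:

```d
import std.stdio : writefln, writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes form a well-formed UTF-8 string.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    ubyte[] bytes = [83, 195, 179, 32];   // sequence from the report
    writeln(isValidUtf8(bytes));          // 195 179 is a complete two-byte sequence

    // Decode and list the code points one by one
    foreach (dchar c; cast(const(char)[]) bytes)
        writefln("U+%04X", cast(uint) c);

    ubyte[] broken = [83, 195, 32];       // lead byte 195 without its continuation byte
    writeln(isValidUtf8(broken));
}
```

Note that 195 179 decodes to U+00F3 (o with acute), while the class in the 
pattern, as displayed here, lists U+00F2 (o with grave, bytes 195 178). 
With exactly these bytes the accented letter would be replaced by a space.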

>
> On Fri, 18/11/2011 at 20:33 +0400, Dmitry Olshansky wrote:
>> On 18.11.2011 17:58, Andrea Fontana wrote:
>> >  I built a data access layer in C++. This layer works with MongoDB, where
>> >  strings are always encoded using UTF-8. I've ported this layer to D using
>> >  SWIG. Strings are written correctly to the console, but when I use std.regex
>> >  it sometimes throws an exception:
>> >
>> >  core.exception.UnicodeException@src/rt/util/utf.d(290): invalid
>> >  UTF-8 sequence
>> >
>> >  Byte sequence (for better understanding) is:
>> >  [83, 195, 179, 32]
>> >
>> >  And the string was "Sò  " (with accented o and a space)
>> >
>> >  I'm not a UTF expert, so is it a wrong UTF-8 encoding or is it a bug in
>> >  utf.d?
>> >
>>
>> Which version of std.regex are you using - the one from git master or
>> the one in the latest release?
>> If it's the former then I'm willing to look into this thing on weekend,
>> if you can get a hold of a pair: string + pattern that fails like this.
>>
>>


-- 
Dmitry Olshansky

