Regex and UTF-8

Fri Nov 18 09:07:37 PST 2011

It seems to be related to toLower as well.

Here is the line that throws the exception:

s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();

where s is a string containing the byte sequence quoted below.

I'm using dmd 2.056.
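
To separate the encoding question from std.regex and toLower, here is a minimal sketch that checks the quoted byte sequence on its own. The helper name isValidUtf8 is hypothetical and not part of my code; it only wraps std.utf.validate. Note that the bytes 195, 179 (0xC3 0xB3) are the well-formed UTF-8 encoding of U+00F3, so I would expect the check to pass, which would point away from the input data itself.

```d
import std.stdio : writeln;
import std.utf : validate;

// Hypothetical helper (an assumption for illustration): returns true
// if the raw bytes are well-formed UTF-8, false otherwise.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // std.utf.validate throws if the string is not valid UTF-8.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}

void main()
{
    // Byte sequence from the report quoted below.
    const(ubyte)[] bytes = [83, 195, 179, 32];

    writeln("valid UTF-8: ", isValidUtf8(bytes));

    // Decode to code points to see what the bytes actually contain.
    if (isValidUtf8(bytes))
    {
        foreach (dchar c; cast(const(char)[]) bytes)
            writeln(cast(uint) c);
    }
}
```

If the bytes validate here but the replace/toLower line still throws, the pattern and the string together would be the pair to send along.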

On Fri, 18/11/2011 at 20:33 +0400, Dmitry Olshansky wrote:

> On 18.11.2011 17:58, Andrea Fontana wrote:
> > I built a data access layer in C++. This layer works with MongoDB, where
> > strings are always encoded as UTF-8. I've ported this layer to D using
> > SWIG. The string is written correctly to the console, but when I use
> > std.regex it sometimes throws an exception:
> >
> > core.exception.UnicodeException@src/rt/util/utf.d(290): invalid
> > UTF-8 sequence
> >
> > The byte sequence (for better understanding) is:
> > [83, 195, 179, 32]
> >
> > And the string was "Sò " (with accented o and a space)
> >
> > I'm not a UTF expert, so is this an invalid UTF-8 encoding, or is it a
> > bug in utf.d?
> >
> 
> Which version of std.regex are you using - the one from git master or 
> the one in the latest release?
> If it's the former, then I'm willing to look into this over the weekend,
> if you can get hold of a string + pattern pair that fails like this.
> 
> 