Regex and utf8

Koroskin Denis 2korden at gmail.com
Sun Jul 20 12:24:46 PDT 2008


On Sun, 20 Jul 2008 22:50:06 +0400, Roman Balitskiy  
<realis_toleroATtoleroDOTorg_fake at fake.com> wrote:

> When I try to parse Cyrillic text I get "Error: 4invalid UTF-8  
> sequence". I use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have tried  
> the upcoming gdc 0.25 with the same results.
>
> 	// Here ж is the Cyrillic letter 'je'
> 	if (auto m = std.regexp.search(`abжdef`, `[ж]`))
> 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);
>

Try removing the square brackets; the following code sample works for me:

import std.stdio;
import std.regexp;

void main()
{
     if (auto m = std.regexp.search("abжdef", "ж")) {
         writefln("%s[%s]%s", m.pre, m.match(0), m.post);
     }
}

I don't know whether the failure with the brackets is a bug, but most probably it is.

But since the Phobos console output is not Unicode-aware, you won't see  
"ab[ж]def" as expected but rather something like "ab[╨╢]def" (that is my  
output; it may differ with other locale settings).
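
To see where the two odd characters come from, it can help to dump the raw UTF-8 bytes of the match. The sketch below is only an illustration: utf8Hex is a hypothetical helper name, not part of Phobos, and the source file is assumed to be saved as UTF-8. It prints the code units of "ж" in hex. If the console interprets those bytes with a single-byte code page (code page 866 is my guess for the output above, where 0xD0 and 0xB6 are the box-drawing characters ╨ and ╢), each byte is drawn as a separate character.

```d
import std.stdio;
import std.string;

// Hypothetical helper: returns the UTF-8 code units of s as
// space-separated hex bytes.
string utf8Hex(string s)
{
    string result;
    foreach (char c; s)   // iterating by char visits code units, not code points
    {
        if (result.length)
            result ~= " ";
        result ~= format("%02X", cast(ubyte) c);
    }
    return result;
}

void main()
{
    writefln("%s", utf8Hex("ж"));   // prints "D0 B6"
}
```

So the match itself is correct; only the way the two bytes are displayed goes wrong.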

By contrast, the Tango console I/O is more Unicode-friendly:

import tango.text.Regex;
import tango.io.Stdout;

void main()
{
     foreach(m; Regex("ж").search("abжdef")) {
         Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
     }
}

Hope that helps.


