[Issue 4250] New: std.regex does not support character sets other than unicode
d-bugmail at puremagic.com
Sat May 29 07:47:01 PDT 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4250
Summary: std.regex does not support character sets other than
unicode
Product: D
Version: 2.041
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Phobos
AssignedTo: nobody at puremagic.com
ReportedBy: lio+bugzilla at lunesu.com
--- Comment #0 from Lionello Lunesu <lio+bugzilla at lunesu.com> 2010-05-29 07:46:59 PDT ---
Created an attachment (id=647)
Patch against phobos/std/regex.d in dmd.2.046.zip
I'm writing an application that works with Chinese text encoded in GBK
(http://en.wikipedia.org/wiki/GBK). I could convert all the text to UTF-8
before using regex, but it's much faster to leave the text as-is and convert
only the regular expression to GBK instead.
I suspect the following opcodes need patching:
1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex is >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during creation and execution).
Points 1 and 3 are easily solved by letting the user supply callback functions.
To avoid unnecessary indirection, these can be aliases whenever
__traits(compiles, std.utf.stride(new E[0], 0)) is true.
Attached is a proof-of-concept patch for point 1. If this is OK, I can do the
same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not
sure about that yet.)
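
For illustration only (this is not the attached patch), the alias idea for
point 1 could look roughly like the sketch below. The names gbkStride and
StrideFor are hypothetical and not part of Phobos. The sketch uses the current
alias syntax rather than the one from the 2.04x releases, and it assumes that
any GBK byte in the range 0x81-0xFE starts a two-byte character, without
validating the trail byte.

```d
import std.utf : stride;

// Hypothetical user-supplied callback: length in bytes of the GBK
// character that starts at s[i]. Assumption: a lead byte in 0x81-0xFE
// begins a two-byte sequence; everything else is a single byte. A lead
// byte at the very end of the input is treated as one byte.
size_t gbkStride(const(ubyte)[] s, size_t i)
{
    return (s[i] >= 0x81 && s[i] <= 0xFE && i + 1 < s.length) ? 2 : 1;
}

// Hypothetical selector: becomes a plain alias of std.utf.stride when
// that function accepts an E[], so no indirection is added for
// char/wchar/dchar. Otherwise it aliases the user's fallback.
template StrideFor(E, alias fallback)
{
    static if (__traits(compiles, stride((E[]).init, size_t(0))))
        alias StrideFor = stride;
    else
        alias StrideFor = fallback;
}

unittest
{
    // 0xD6 0xD0 is one two-byte GBK character, followed by ASCII 'a'.
    immutable ubyte[] gbk = [0xD6, 0xD0, 0x61];
    assert(StrideFor!(ubyte, gbkStride)(gbk, 0) == 2);
    assert(StrideFor!(ubyte, gbkStride)(gbk, 2) == 1);
}
```

With E = char the selector resolves to std.utf.stride at compile time, so
UTF-8 input would keep its current behaviour; only element types that
std.utf.stride rejects, such as ubyte here, go through the callback.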