Need to do some "dirty" UTF-8 handling
Vladimir Panteleev
vladimir at thecybershadow.net
Sat Jun 25 06:10:37 PDT 2011
On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a at a.a> wrote:
> Anyone have a good workaround? For instance, maybe a function that'll take
> in a byte array and convert *all* invalid UTF-8 sequences to a
> user-selected valid character?
I tend to do this a lot, for various reasons. In my experience, most of the
string-handling functions in Phobos work just fine on strings containing
invalid UTF-8 - you can generally use your intuition about whether a
function will need to look at the individual characters inside the string.
Note, though, that there's currently a bug in D2/Phobos (issue 6064) which
causes std.array.join (and possibly other functions) to treat strings not as
something that can be joined by plain concatenation, but to do a
character-by-character copy instead (which is both needlessly inefficient
and will choke on invalid UTF-8).
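One workaround sketch for that (untested, and joinRaw is just a name I made
up here): join the strings as raw bytes, so nothing along the way ever tries
to decode them, then reinterpret the result as a string again.

import std.array : join;

// Join strings as opaque bytes; ubyte[] carries no UTF-8 validity
// requirements, so invalid sequences pass through untouched.
string joinRaw(string[] parts, string sep)
{
    immutable(ubyte)[][] raw;
    foreach (p; parts)
        raw ~= cast(immutable(ubyte)[]) p;
    return cast(string) join(raw, cast(immutable(ubyte)[]) sep);
}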
When I really need to pass arbitrary data through string-handling
functions, I use these functions:
import std.utf : toUTF8;

/// Convert any data to valid UTF-8, so D's string functions can properly
/// work on it (each byte 0x80..0xFF becomes the code point U+0080..U+00FF).
string rawToUTF8(string s)
{
    dstring d;
    foreach (char c; s)
        d ~= c;                 // char -> dchar: byte value becomes code point
    return toUTF8(d);
}

/// Undo rawToUTF8, recovering the original bytes.
string UTF8ToRaw(string r)
{
    string s;
    foreach (dchar c; r)
    {
        assert(c < '\u0100');   // only code points that fit in one byte
        s ~= cast(char) c;      // store the raw byte, not its UTF-8 encoding
    }
    return s;
}
( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )
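A quick round-trip check of the above (untested sketch, using
std.utf.validate):

import std.utf : validate;

unittest
{
    string raw = "\xC3\x28 broken \xFF";    // deliberately invalid UTF-8 bytes
    string utf = rawToUTF8(raw);
    validate(utf);                  // would throw UTFException if utf were invalid
    assert(UTF8ToRaw(utf) == raw);  // the round trip recovers the original bytes
}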
Of course, it would be nicer if it were possible to only convert the INVALID
UTF-8 sequences. According to Wikipedia, the invalid Unicode code points
U+DC80..U+DCFF (lone surrogates) are often used for encoding invalid byte
sequences. I'd guess that a proper implementation would need to guarantee
that a round trip always returns the same data as the input, so it would
have to "escape" the code points used for escaping as well.
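Something along these lines might do it (an untested sketch of mine;
escapeInvalid/unescapeInvalid are made-up names). Returning dchar[]
sidesteps the question of whether D will re-encode lone surrogates into
UTF-8, and since std.utf.decode rejects encoded surrogates, any bytes that
would spell out U+DC80..U+DCFF in the input are themselves byte-escaped,
which is what keeps the round trip lossless:

import std.utf : decode, UTFException;

// Decode what we can; map each undecodable byte b to the lone surrogate
// 0xDC00 + b (so bytes 0x80..0xFF land in U+DC80..U+DCFF).
dchar[] escapeInvalid(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t j = i;
        try
        {
            result ~= decode(s, j);   // advances j past the valid sequence
            i = j;
        }
        catch (UTFException)
        {
            result ~= cast(dchar)(0xDC00 + cast(ubyte) s[i]);
            i++;                      // skip exactly one invalid byte
        }
    }
    return result;
}

// Inverse: surrogate escapes turn back into raw bytes; everything else is
// encoded as normal UTF-8.
string unescapeInvalid(in dchar[] d)
{
    string s;
    foreach (dchar c; d)
    {
        if (c >= 0xDC80 && c <= 0xDCFF)
            s ~= cast(char)(c - 0xDC00);
        else
            s ~= c;                   // appending a dchar UTF-8-encodes it
    }
    return s;
}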
--
Best regards,
Vladimir mailto:vladimir at thecybershadow.net