Need to do some "dirty" UTF-8 handling

Vladimir Panteleev vladimir at thecybershadow.net
Sat Jun 25 06:10:37 PDT 2011


On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a at a.a> wrote:

> Anyone have a good workaround? For instance, maybe a function that'll
> take in a byte array and convert *all* invalid UTF-8 sequences to a
> user-selected valid character?

I tend to do this a lot, for various reasons. In my experience, a large
part of the string-handling functions in Phobos work just fine on
strings containing invalid UTF-8 - you can generally use your intuition
about whether a function will need to look at individual characters
inside the string. Note, though, that there's currently a bug in
D2/Phobos (issue 6064) which causes std.array.join (and possibly other
functions) to treat strings as something that cannot simply be joined by
concatenation, falling back to a character-by-character copy - which is
both needlessly inefficient and will choke on invalid UTF-8.
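
For example, here's a minimal sketch of that intuition (the exact set of
functions that tolerate invalid UTF-8 may vary between Phobos versions):
operations on code units, such as slicing and concatenation, never
decode anything, while anything that has to iterate by dchar will throw
a UTFException:

import std.exception : assertThrown;
import std.utf : validate, UTFException;

void main()
{
	// Build a string containing a 0xFF byte, which can never occur
	// in valid UTF-8.
	auto raw = cast(ubyte[])"foo".dup;
	raw ~= cast(ubyte)0xFF;
	raw ~= cast(ubyte[])"bar".dup;
	string dirty = cast(string)raw.idup;

	// Slicing and concatenation operate on code units and never
	// decode, so they are safe on invalid UTF-8.
	assert(dirty[0 .. 3] == "foo");
	assert((dirty ~ "!").length == 8);

	// Anything that has to decode the string will throw.
	assertThrown!UTFException(validate(dirty));
}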

When I really need to pass arbitrary data through string-handling  
functions, I use these functions:

import std.utf : toUTF8;

/// Convert any byte string to valid UTF-8, so that D's string functions
/// can work on it: each byte is reinterpreted as a code point in
/// U+0000..U+00FF (i.e. Latin-1) and re-encoded.
string rawToUTF8(string s)
{
	dstring d;
	foreach (char c; s)	// iterate over raw bytes, not decoded characters
		d ~= c;		// char -> dchar just widens the byte value
	return toUTF8(d);
}

/// The inverse: convert UTF-8 whose code points all fit in one byte
/// (as produced by rawToUTF8) back to the original raw bytes.
string UTF8ToRaw(string r)
{
	string s;
	foreach (dchar c; r)	// decode one code point at a time
	{
		assert(c < '\u0100');
		s ~= cast(char)c;	// emit the raw byte; don't re-encode as UTF-8
	}
	return s;
}

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )
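
A hypothetical usage sketch (the processDirty function and the comma
separator are my own example, not part of the library above): since
rawToUTF8 maps every byte to a code point below U+0100 and UTF8ToRaw
maps them back, the round trip is lossless:

import std.array : split, join;

// Run arbitrary bytes through ordinary Phobos string code and get
// the exact same bytes back afterwards.
void processDirty(string raw)
{
	string safe = rawToUTF8(raw);	// always valid UTF-8 from here on
	auto fields = split(safe, ",");	// any string-handling code goes here
	string back = UTF8ToRaw(join(fields, ","));
	assert(back == raw);		// lossless round trip
}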

Of course, it would be nicer to convert only the INVALID UTF-8
sequences. According to Wikipedia, the (themselves invalid) Unicode code
points U+DC80..U+DCFF are often used for encoding invalid bytes. I'd
guess that a proper implementation needs to guarantee that a round trip
always returns exactly the input data, so it would also have to "escape"
the code points used for escaping.
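
Here's a rough, untested sketch of that scheme (the names
escapeInvalid/unescapeInvalid are mine; it mirrors what Python 3 calls
"surrogateescape"). Since UTF-8-encoded surrogates are themselves
invalid, std.utf.decode rejects them, so input that already looks like
the escape range gets escaped byte-by-byte - which, as far as I can
tell, is what makes the round trip lossless without a separate
"escape the escapes" pass:

import std.utf : decode, UTFException;

/// Decode UTF-8, replacing each byte that isn't part of a valid
/// sequence with the code point U+DC00 + byte (i.e. U+DC80..U+DCFF).
dstring escapeInvalid(string s)
{
	dstring result;
	size_t i = 0;
	while (i < s.length)
	{
		size_t next = i;
		dchar c;
		try
		{
			c = decode(s, next);	// advances next past one valid code point
		}
		catch (UTFException)
		{
			// Invalid byte: map it into U+DC80..U+DCFF.
			result ~= cast(dchar)(0xDC00 + cast(ubyte)s[i]);
			i++;
			continue;
		}
		result ~= c;
		i = next;
	}
	return result;
}

/// The inverse: turn the escape code points back into raw bytes and
/// re-encode everything else as UTF-8.
string unescapeInvalid(dstring d)
{
	string s;
	foreach (dchar c; d)
	{
		if (c >= 0xDC80 && c <= 0xDCFF)
			s ~= cast(char)(c - 0xDC00);	// restore the original byte
		else
			s ~= c;		// appending a dchar re-encodes it as UTF-8
	}
	return s;
}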

-- 
Best regards,
  Vladimir                            mailto:vladimir at thecybershadow.net

