Need to do some "dirty" UTF-8 handling

Jonathan M Davis jmdavisProg at gmx.com
Sat Jun 25 06:25:07 PDT 2011


On 2011-06-25 02:00, Nick Sabalausky wrote:
> Sometimes I need to bring data into a string, and need to be able to treat
> it as an actual "string", but don't actually care if the entire thing is
> technically valid UTF-8 or not, don't care if invalid bytes don't get
> preserved right, and can't have any utf exceptions being thrown regardless
> of the input. Yea, I know that's sloppy, but sometimes that's good enough
> and proper handling may be far more trouble than what's needed. (For
> example: Processing HTML from arbitrary URLs. It's pretty much guaranteed
> you'll come across stuff that's wrong or even has the encoding type
> improperly set. But it's usually more important for the process to succeed
> than for it to be perfectly accurate.)
> 
> Far as I can tell, this seems to currently be impossible with Phobos
> (unless you're *extremely* meticulous about watching what your entire
> codebase does with the data), which is a major pain when such a need
> arises.
> 
> Anyone have a good workaround? For instance, maybe a function that'll take
> in a byte array and convert *all* invalid UTF-8 sequences to a
> user-selected valid character?

Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats
it as a string instead of an array of bytes _must_ treat it as UTF-8, since it
has to decode the data to determine what the characters are. So, I don't think
that there's really any way around that. A string must be valid UTF-8. But if
you really don't care about the string's contents, then you can just cast it
to an array of ubyte, and plenty of functions will work with it. None of them
will be string-specific, of course, but I don't see how you could possibly
expect to do much that's string-specific with invalid data anyway.
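
As a rough sketch of what I mean (untested against your data, and asBytes and
countByte are just names made up for illustration, not anything in Phobos),
the cast reinterprets the same memory without decoding anything, so an invalid
byte such as 0xFF has no opportunity to trigger a UTF exception:

```d
import std.algorithm : count;

// Hypothetical helper: view a string's storage as raw bytes.
// The cast does not copy and does not decode, so no UTF validation happens.
immutable(ubyte)[] asBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

// Hypothetical helper: count occurrences of one byte value.
// This operates on bytes, not on decoded characters.
size_t countByte(const(ubyte)[] data, ubyte b)
{
    return data.count(b);
}

unittest
{
    // "abc", then a stray 0xFF (never valid in UTF-8), then "d".
    immutable(ubyte)[] raw = [0x61, 0x62, 0x63, 0xFF, 0x64];
    string dirty = cast(string) raw;

    auto bytes = asBytes(dirty);
    assert(bytes.length == 5);
    assert(countByte(bytes, 0xFF) == 1);
    assert(bytes[0 .. 3] == cast(immutable(ubyte)[]) "abc");
}
```

Generic range and array functions should work on the result the same way, as
long as nothing along the way casts it back to a string and tries to decode it.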

- Jonathan M Davis

