Need to do some "dirty" UTF-8 handling

Nick Sabalausky a at a.a
Sat Jun 25 02:00:43 PDT 2011


Sometimes I need to bring data into a string and treat it as an actual 
"string", but I don't care whether the entire thing is technically valid 
UTF-8, don't care whether invalid bytes get preserved correctly, and can't 
have any UTF exceptions thrown regardless of the input. Yeah, I know that's 
sloppy, but sometimes that's good enough, and proper handling may be far 
more trouble than it's worth. (For example: processing HTML from arbitrary 
URLs. It's pretty much guaranteed you'll come across stuff that's malformed 
or even has the encoding type improperly set. But it's usually more 
important for the process to succeed than for it to be perfectly accurate.)

As far as I can tell, this currently seems to be impossible with Phobos 
(unless you're *extremely* meticulous about watching what your entire 
codebase does with the data), which is a major pain when such a need arises.

Anyone have a good workaround? For instance, maybe a function that'll take 
in a byte array and convert *all* invalid UTF-8 sequences to a user-selected 
valid character?
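
To make the request concrete, here is a minimal sketch of what such a 
function might look like. The names sanitizeUtf8 and validSequenceLength are 
hypothetical helpers written for illustration and are not part of Phobos; 
the only Phobos calls assumed are std.utf.encode and std.utf.validate. Each 
byte that does not begin a well-formed UTF-8 sequence is replaced on its 
own, which is one reasonable policy among several.

```d
import std.stdio : writeln;
import std.utf : encode, validate;

/// Hypothetical helper (not in Phobos): returns the length of the
/// well-formed UTF-8 sequence at the start of `s`, or 0 if there is none.
size_t validSequenceLength(const(ubyte)[] s)
{
    if (s.length == 0) return 0;
    immutable ubyte b0 = s[0];
    if (b0 < 0x80) return 1;            // plain ASCII

    size_t len;
    uint lo = 0x80, hi = 0xBF;          // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF)
        len = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;      // reject overlong encodings
        else if (b0 == 0xED) hi = 0x9F; // reject UTF-16 surrogates
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;      // reject overlong encodings
        else if (b0 == 0xF4) hi = 0x8F; // reject values above U+10FFFF
    }
    else
        return 0;                       // stray continuation or invalid lead byte

    if (s.length < len) return 0;       // truncated sequence
    if (s[1] < lo || s[1] > hi) return 0;
    foreach (b; s[2 .. len])
        if ((b & 0xC0) != 0x80) return 0;
    return len;
}

/// Hypothetical helper (not in Phobos): copies `input`, replacing every byte
/// that is not part of a well-formed sequence with `replacement`.
/// Assumes `replacement` is itself a valid code point (encode throws otherwise).
string sanitizeUtf8(const(ubyte)[] input, dchar replacement = '\uFFFD')
{
    char[4] buf;
    immutable repLen = encode(buf, replacement);
    const(char)[] rep = buf[0 .. repLen];

    char[] result;
    size_t i = 0;
    while (i < input.length)
    {
        immutable n = validSequenceLength(input[i .. $]);
        if (n == 0)
        {
            result ~= rep;              // substitute one bad byte
            ++i;
        }
        else
        {
            result ~= cast(const(char)[]) input[i .. i + n];
            i += n;
        }
    }
    return result.idup;
}

void main()
{
    ubyte[] raw = [0x61, 0xFF, 0x62];   // 'a', an invalid byte, 'b'
    auto s = sanitizeUtf8(raw, '?');
    validate(s);                        // should not throw on sanitized output
    writeln(s);                         // prints "a?b"
}
```

Running the bytes through something like this once, at the point where they 
enter the program, would mean the rest of the codebase only ever sees valid 
UTF-8, at the cost of losing the original invalid bytes.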



