Need to do some "dirty" UTF-8 handling

Sat Jun 25 14:42:39 PDT 2011

"Jonathan M Davis" <jmdavisProg at gmx.com> wrote in message 
news:mailman.1214.1309008317.14074.digitalmars-d-learn at puremagic.com...
> On 2011-06-25 02:00, Nick Sabalausky wrote:
>> Sometimes I need to bring data into a string, and need to be able to treat
>> it as an actual "string", but don't actually care if the entire thing is
>> technically valid UTF-8 or not, don't care if invalid bytes don't get
>> preserved right, and can't have any utf exceptions being thrown regardless
>> of the input. Yea, I know that's sloppy, but sometimes that's good enough
>> and proper handling may be far more trouble than what's needed. (For
>> example: Processing HTML from arbitrary URLs. It's pretty much guaranteed
>> you'll come across stuff that's wrong or even has the encoding type
>> improperly set. But it's usually more important for the process to succeed
>> than for it to be perfectly accurate.)
>>
>> Far as I can tell, this seems to currently be impossible with Phobos
>> (unless you're *extremely* meticulous about watching what your entire
>> codebase does with the data), which is a major pain when such a need
>> arises.
>>
>> Anyone have a good workaround? For instance, maybe a function that'll take
>> in a byte array and convert *all* invalid UTF-8 sequences to a
>> user-selected valid character?
>
> Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats
> it as a string instead of an array of bytes _must_ treat it as UTF-8 since it
> has to decode to determine what the characters are. So, I don't think that
> there's really any way around that. A string must be valid UTF-8. But if you
> really don't care about the string's contents, then you can just cast it to an
> array of ubyte and plenty of functions will work with it - nothing terribly
> string specific of course, but I don't see how you could possibly expect to do
> much string-specific with invalid data anyway.
>

Using immutable(ubyte)[] just causes an enormous number of type-related 
problems, mostly the need to throw casts around absolutely everywhere, 
including every single time one of the byte arrays comes in contact with 
an actual string (a string literal, for instance, when comparing, 
searching, or anything else). It might be the "correct" thing, but in many 
cases (anything that doesn't need to be perfect, or can't realistically be 
perfect) it's far more trouble than it's worth.

Like I said, "For instance, maybe a function that'll take in a byte array 
and convert *all* invalid UTF-8 sequences to a user-selected valid 
character?" In such a case, *there would be no invalid data* in the actual 
string.
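
To make that concrete, here is a rough sketch of the kind of function I mean. 
The names sanitizeUtf8 and validSequenceLength are made up for illustration; 
as far as I know nothing like this is in Phobos, and the one-replacement-per-bad-byte 
policy is just one assumption about how "sloppy" should behave. The only 
Phobos pieces it leans on are std.utf.encode and std.array.appender.

```d
import std.array : appender;
import std.utf : encode;

/// Hypothetical helper (not part of Phobos): copies `input` into a string,
/// replacing every byte that is not part of a well-formed UTF-8 sequence
/// with `replacement`. One replacement is emitted per offending byte.
string sanitizeUtf8(const(ubyte)[] input, dchar replacement = '?')
{
    char[4] repBuf;
    // Encode the replacement once; this throws only if `replacement`
    // itself is not a valid code point.
    immutable repLen = encode(repBuf, replacement);
    auto result = appender!string();

    size_t i = 0;
    while (i < input.length)
    {
        immutable len = validSequenceLength(input[i .. $]);
        if (len == 0)
        {
            // Bad byte: substitute and resynchronize on the next byte.
            result.put(repBuf[0 .. repLen]);
            ++i;
        }
        else
        {
            // Well-formed sequence: copy it through unchanged.
            result.put(cast(const(char)[]) input[i .. i + len]);
            i += len;
        }
    }
    return result.data;
}

/// Hypothetical helper: length (1 to 4) of the well-formed UTF-8 sequence
/// at the start of `s`, or 0 if `s` does not start with one.
size_t validSequenceLength(const(ubyte)[] s)
{
    if (s.length == 0) return 0;
    immutable b0 = s[0];
    if (b0 < 0x80) return 1; // plain ASCII

    size_t len;
    uint cp;    // code point being assembled
    uint minCp; // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
    else return 0; // stray continuation byte or invalid lead byte

    if (s.length < len) return 0; // truncated sequence
    foreach (b; s[1 .. len])
    {
        if ((b & 0xC0) != 0x80) return 0; // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < minCp) return 0;                    // overlong encoding
    if (cp > 0x10FFFF) return 0;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // UTF-16 surrogate
    return len;
}
```

With something like that, sanitizeUtf8([0x61, 0xFF, 0x62]) gives back "a?b" as 
an ordinary string, so it can be compared against literals, searched, and so 
on without a cast in sight, and nothing downstream should ever see invalid 
UTF-8.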