Need to do some "dirty" UTF-8 handling

Sat Jun 25 16:25:47 PDT 2011

"Dmitry Olshansky" <dmitry.olsh at gmail.com> wrote in message 
news:iu5n32$2vjd$1 at digitalmars.com...
> On 26.06.2011 1:49, Nick Sabalausky wrote:
>> "Andrej Mitrovic"<andrej.mitrovich at gmail.com>  wrote in message
>> news:mailman.1215.1309019944.14074.digitalmars-d-learn at puremagic.com...
>>> I had a similar requirement some time ago. I had to copy and
>>> modify the phobos function std.utf.decode for a custom text editor
>>> because the function throws when it finds an invalid code point. This
>>> is way too slow for my needs. I'm actually displaying invalid code
>>> points with special marks (just like Scintilla), so I need decoding to
>>> work as fast as possible.
>>>
>>> The new function simply replaces throwing exceptions with flagging a
>>> boolean.
>> I think I may end up doing something like that :/
>>
>> I was hoping to be able to do something vaguely sensible like this:
>>
>> string newStr;
>> foreach(dchar dc; str)
>> {
>>      if(isValidDchar(dc))
>>          newStr ~= dc;
>>      else
>>          newStr ~= 'X';
>> }
>> str = newStr;
>>
>> But that just blows up in my face: the foreach decoding throws on the
>> first invalid sequence, before my check ever runs.
>>
>>
> std.encoding to the rescue?
> It looks like a well-established module that was forgotten for some
> reason.
>
> And here I'm wondering what a function named sanitize could do :)
>

Ahh, I didn't even notice that module.

Even if it's imperfect and goes away, it looks like it'll at least get the 
job done for me. And the encoding conversions should even give me an easy 
way to save at least some of the invalid chars (which wasn't really a 
requirement of mine, but it'll still be nice).
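
For reference, here is a rough sketch of how that might look with std.encoding, assuming sanitize, safeDecode, isValid and INVALID_SEQUENCE behave the way the module docs describe. markInvalid is just a made-up helper name for illustration (it is not in Phobos); it mirrors the foreach loop quoted above but uses safeDecode so nothing throws. The sample bytes are invented test data.

```d
import std.encoding : INVALID_SEQUENCE, isValid, safeDecode, sanitize;

// Hypothetical helper (not part of Phobos): copies str, writing `marker`
// wherever safeDecode reports a malformed sequence instead of throwing.
string markInvalid(string str, char marker = 'X')
{
    string result;
    string rest = str; // safeDecode consumes code units from the front of this slice
    while (rest.length != 0)
    {
        immutable dchar dc = safeDecode(rest);
        if (dc == INVALID_SEQUENCE)
            result ~= marker;
        else
            result ~= dc; // appending a dchar re-encodes it as UTF-8
    }
    return result;
}

unittest
{
    // 'a', a stray 0xFF byte (never legal in UTF-8), 'b'
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string dirty = cast(string) raw;
    assert(!isValid(dirty));

    // sanitize returns a string that is guaranteed to be valid
    assert(isValid(sanitize(dirty)));

    // the manual loop also yields valid UTF-8 and leaves clean input alone
    assert(isValid(markInvalid(dirty)));
    assert(markInvalid("hello") == "hello");
}
```

If a plain replacement character is good enough, sanitize alone should do the job; the manual loop is only worth it when you want your own marker or want to try rescuing the bad bytes through another encoding.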