Need to do some "dirty" UTF-8 handling

Sat Jun 25 17:29:57 PDT 2011

"Dmitry Olshansky" <dmitry.olsh at gmail.com> wrote in message 
news:iu5tan$ets$1 at digitalmars.com...
> On 26.06.2011 3:25, Nick Sabalausky wrote:
>> "Dmitry Olshansky"<dmitry.olsh at gmail.com>  wrote in message
>> news:iu5n32$2vjd$1 at digitalmars.com...
>>> On 26.06.2011 1:49, Nick Sabalausky wrote:
>>>> "Andrej Mitrovic"<andrej.mitrovich at gmail.com>   wrote in message
>>>> news:mailman.1215.1309019944.14074.digitalmars-d-learn at puremagic.com...
>>>>> I've had a similar requirement some time ago. I've had to copy and
>>>>> modify the phobos function std.utf.decode for a custom text editor
>>>>> because the function throws when it finds an invalid code point. This
>>>>> is way too slow for my needs. I'm actually displaying invalid code
>>>>> points with special marks (just like Scintilla), so I need decoding to
>>>>> work as fast as possible.
>>>>>
>>>>> The new function simply replaces throwing exceptions with flagging a
>>>>> boolean.
>>>> I think I may end up doing something like that :/
>>>>
>>>> I was hoping to be able to do something vaguely sensible like this:
>>>>
>>>> string newStr;
>>>> foreach(dchar dc; str)
>>>> {
>>>>       if(isValidDchar(dc))
>>>>           newStr ~= dc;
>>>>       else
>>>>           newStr ~= 'X';
>>>> }
>>>> str = newStr;
>>>>
>>>> But that just blows up in my face.
>>>>
>>>>
>>> std.encoding to the rescue?
>>> It looks like a well established module that was forgotten for some
>>> reason.
>>>
>>> And here I'm wondering what a function named sanitize could do :)
>>>
>> Ahh, I didn't even notice that module.
>
> Same here. It was just a couple of days(!) ago that I somehow managed to
> find decode in the wrong place (in std.encoding instead of std.utf). It
> looked useful, but I had never heard of it. Seriously, how many totally
> irrelevant old modules do we have around here? (hint: std.gregorian!)
>> Even if it's imperfect and goes away, it looks like it'll at least get the
>> job done for me. And the encoding conversions should even give me an easy
>> way to save at least some of the invalid chars (which wasn't really a
>> requirement of mine, but it'll still be nice).
>>
>>
> Yeah, given the amount of necessary work in the Phobos realm it could hang 
> around for quite some time ;)
>

Yeah, and even when it does go, I can just copy it and include it manually 
(although it'll probably need some work once typedef goes away).

This seems to get the job done well enough for me, and even manages to save 
some of the intended chars:

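// With std.utf and std.encoding imported:
string src = ...;

// validate() throws if src is not well-formed UTF-8.
bool valid = true;
try
    validate(src);
catch(UtfException e)
    valid = false;

if(!valid)
{
    // Reinterpret the bytes as Windows-1252, replace anything that is
    // invalid in that encoding, then convert the result back to UTF-8.
    auto tmpStr = sanitize( cast(Windows1252String) src );
    transcode(tmpStr, src);
}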
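
For anyone who wants this as a reusable helper, here is a minimal sketch of the same idea wrapped in a function. The name toValidUtf8 is made up for illustration, and the sketch assumes a current Phobos, where (as far as I know) the exception is spelled UTFException rather than UtfException. Treating the bad input as Windows-1252 is only a guess about where the stray bytes came from, so adjust that to taste.

```d
import std.encoding : sanitize, transcode, Windows1252String;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns src unchanged if it is already valid
/// UTF-8; otherwise reinterprets its bytes as Windows-1252 and
/// transcodes them to UTF-8.
string toValidUtf8(string src)
{
    try
    {
        validate(src);   // throws UTFException on malformed UTF-8
        return src;
    }
    catch (UTFException)
    {
        // Assumption: the invalid bytes were really Windows-1252.
        // sanitize() replaces code units that are undefined in that
        // encoding; transcode() then re-encodes the text as UTF-8.
        auto tmp = sanitize(cast(Windows1252String) src);
        string result;
        transcode(tmp, result);
        return result;
    }
}
```

Valid input passes through untouched, while something like "caf\xE9" (a lone 0xE9 byte, which is not valid UTF-8) should come back as "caf\u00E9". Note that a single bad byte causes the whole string to be reinterpreted, so any correctly encoded multi-byte characters elsewhere in it will get mangled.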