Need to do some "dirty" UTF-8 handling

Sat Jun 25 18:20:18 PDT 2011

On 2011-06-25 17:04, Dmitry Olshansky wrote:
> On 26.06.2011 3:25, Nick Sabalausky wrote:
> > "Dmitry Olshansky"<dmitry.olsh at gmail.com>  wrote in message
> > news:iu5n32$2vjd$1 at digitalmars.com...
> > 
> >> On 26.06.2011 1:49, Nick Sabalausky wrote:
> >>> "Andrej Mitrovic"<andrej.mitrovich at gmail.com>   wrote in message
> >>> news:mailman.1215.1309019944.14074.digitalmars-d-learn at puremagic.com...
> >>> 
> >>>> I've had a similar requirement some time ago. I've had to copy and
> >>>> modify the phobos function std.utf.decode for a custom text editor
> >>>> because the function throws when it finds an invalid code point. This
> >>>> is way too slow for my needs. I'm actually displaying invalid code
> >>>> points with special marks (just like Scintilla), so I need decoding to
> >>>> work as fast as possible.
> >>>> 
> >>>> The new function simply replaces throwing exceptions with flagging a
> >>>> boolean.
> >>> 
> >>> I think I may end up doing something like that :/
> >>> 
> >>> I was hoping to be able to do something vaguely sensible like this:
> >>> 
> >>> string newStr;
> >>> foreach(dchar dc; str)
> >>> {
> >>>     if(isValidDchar(dc))
> >>>         newStr ~= dc;
> >>>     else
> >>>         newStr ~= 'X';
> >>> }
> >>> str = newStr;
> >>> 
> >>> But that just blows up in my face.
> >> 
> >> std.encoding to the rescue?
> >> It looks like a well-established module that was forgotten for some
> >> reason.
> >> 
> >> And here I'm wondering what a function named sanitize could do :)
> > 
> > Ahh, I didn't even notice that module.
> 
> Same here. Just a couple of days(!) ago I somehow managed to find
> decode in the wrong place (in std.encoding instead of std.utf). It
> looked useful, but I had never heard of it. Seriously, how many totally
> irrelevant old modules do we have around here? (hint: std.gregorian!)
> 
> > Even if it's imperfect and goes away, it looks like it'll at least get
> > the job done for me. And the encoding conversions should even give me an
> > easy way to save at least some of the invalid chars (which wasn't really
> > a requirement of mine, but it'll still be nice).
> 
> Yeah, given the amount of necessary work in the Phobos realm it could
> hang around for quite some time ;)

Oh, it'll probably be around for a while. It'll take time before a replacement
is devised. After all, std.stream is still around, isn't it? And there's
actually supposedly a plan for implementing its replacement. There's no such
plan for std.encoding. I just thought that I should point out that it's likely
to be replaced at some point (hopefully with something much better).
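
In the meantime, here is a rough sketch of how the std.encoding route could
look for the loop quoted above. It assumes that sanitize and safeDecode behave
as their documentation describes (sanitize returns a valid string, and
safeDecode returns INVALID_SEQUENCE instead of throwing). replaceInvalid is a
hypothetical helper name, not something in Phobos, and the sample bytes are
made up for illustration.

```d
import std.encoding : isValid, sanitize, safeDecode, INVALID_SEQUENCE;

// Hypothetical helper (not part of Phobos): copies str, substituting
// `marker` for every sequence that safeDecode cannot decode.
string replaceInvalid(string str, dchar marker = 'X')
{
    string result;
    string rest = str; // safeDecode advances this slice as it consumes input

    while (rest.length > 0)
    {
        immutable dchar dc = safeDecode(rest);
        if (dc == INVALID_SEQUENCE)
            result ~= marker; // flag the bad sequence instead of throwing
        else
            result ~= dc;     // appending a dchar re-encodes it as UTF-8
    }
    return result;
}

void main()
{
    // Illustrative input: 'a', a stray 0xFF byte, 'b'.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string dirty = cast(string) raw;
    assert(!isValid(dirty));

    // Option 1: let std.encoding choose the replacement character.
    string cleaned = sanitize(dirty);
    assert(isValid(cleaned));

    // Option 2: pick the marker yourself, as in the quoted loop.
    string marked = replaceInvalid(dirty);
    assert(isValid(marked));
}
```

sanitize is the shorter of the two if the module's default replacement
character is acceptable. The explicit loop is only worth it if you want your
own marker or need to know where the bad sequences were.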

- Jonathan M Davis