Suppressing UTFException / Squashing Bad Codepoints?
Brad Anderson
eco at gnuk.net
Mon Dec 23 12:51:40 PST 2013
On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:
> This frustrated me in Ruby unicode too....
>
> Typically, I/O is the ultimate in "untrusted and untrustworthy"
> sources, usually coming from systems beyond my control.
>
> Likely to be corrupted, or maliciously crafted, or defective...
>
> Unfortunately, not all sequences of bytes are valid UTF-8.
>
> Thus, inevitably, in every collection of inputs there are always
> going to be around one in a million codepoints that result in a
> UTFException being thrown.
>
> Alas, I always have to do regex matches on the other 999,999 valid
> codepoints...
>
> Is there a standard recipe in stdio for squashing bad codepoints to
> some default?
>
> These days memory is very much larger than most files I want to
> scan.
>
> So if I were doing this in C I would typically mmap the file with
> PROT_READ | PROT_WRITE and MAP_PRIVATE, then run down the file
> squashing bad codepoints, and then run down it again matching
> patterns.
>
> In Ruby I have a horridly inefficient utility....
> def IO.read_utf_8(file)
>   read(file, :external_encoding => 'ASCII-8BIT').encode('UTF-8', :undef => :replace)
> end
>
> What is the idiomatic D solution to this conundrum?
The encoding schemes in std.encoding support cleaning up input
using the sanitize function.
http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize
It'd be nicer if the API were range-based, but it seems to do the
trick in my experience.
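
For instance, a minimal sketch (the file name and the use of
std.file.read to pull in the raw bytes are my own assumptions, not
something from the original question):

    import std.encoding : EncodingScheme;
    import std.file : read;

    void main()
    {
        // Read the raw bytes without any UTF validation.
        // "input.txt" is just a placeholder name.
        auto raw = cast(immutable(ubyte)[]) read("input.txt");

        // Replace any byte sequences that aren't valid UTF-8 with the
        // scheme's replacement sequence and treat the result as a string.
        auto utf8 = EncodingScheme.create("utf-8");
        string clean = cast(string) utf8.sanitize(raw);

        // clean is now valid UTF-8, so std.regex and friends won't
        // throw a UTFException on it.
    }

There is also a free-function sanitize in std.encoding that operates
directly on string/wstring/dstring, if you'd rather cast the raw bytes
to a string first.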