Suppressing UTFException / Squashing Bad Codepoints?
Brad Anderson
eco at gnuk.net
Mon Dec 23 12:51:40 PST 2013
On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:
> This frustrated me in Ruby unicode too....
>
> Typically, I/O is the ultimate in "untrusted and untrustworthy"
> sources, usually coming from systems beyond my control.
>
> Likely to be corrupted, or maliciously crafted, or defective...
>
> Unfortunately, not all sequences of bytes are valid UTF-8.
>
> Thus, inevitably, in every collection of inputs there are always
> going to be around one in a million codepoints that result in a
> UTFException being thrown.
>
> Alas, I always have to do regex matches on the other 999,999 valid
> codepoints...
>
> Is there a standard recipe in stdio for squashing bad codepoints to
> some default?
>
> These days memory is very much larger than most files I want to
> scan.
>
> So if I were doing this in C I would typically mmap the file with
> PROT_READ | PROT_WRITE and MAP_PRIVATE, then run down the file
> squashing bad codepoints, and then run down it again matching
> patterns.
>
> In Ruby I have a horridly inefficient utility....
> def IO.read_utf_8(file)
>   read(file, :external_encoding => 'ASCII-8BIT').encode('UTF-8', :undef => :replace)
> end
>
> What is the idiomatic D solution to this conundrum?
The encoding schemes in std.encoding support cleaning up input
using the sanitize function.
http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize
It'd be nicer if the API were range-based, but it seems to do the
trick in my experience.
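
For instance, a minimal sketch (the file name and the use of
std.file.read to pull in the raw bytes are my own assumptions, not
something from the original question):

    import std.encoding : EncodingScheme;
    import std.file : read;

    void main()
    {
        // Read the raw bytes without any UTF validation.
        // "input.txt" is just a placeholder name.
        auto raw = cast(immutable(ubyte)[]) read("input.txt");

        // Replace any byte sequences that aren't valid UTF-8 with the
        // scheme's replacement sequence and treat the result as a string.
        auto utf8 = EncodingScheme.create("utf-8");
        string clean = cast(string) utf8.sanitize(raw);

        // clean is now valid UTF-8, so std.regex and friends won't
        // throw a UTFException on it.
    }

There is also a free-function sanitize in std.encoding that operates
directly on string/wstring/dstring, if you'd rather cast the raw bytes
to a string first.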