<div dir="ltr"><div><div>Eww. <br><br>If I read the source correctly, sanitize mallocs a new array and runs down the original at least three times! (Four if you count peeks.)<br><br></div>Not to mention that it is not integrated with stdio at all.<br>


<br></div>Sigh! I miss the Good Old Days of 7-bit ASCII! ;-)<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Dec 24, 2013 at 9:51 AM, Brad Anderson <span dir="ltr">&lt;<a href="mailto:eco@gnuk.net" target="_blank">eco@gnuk.net</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

This frustrated me with Unicode in Ruby too...<br>

<br>

Typically I/O is the ultimate "untrusted and untrustworthy" source, usually coming from systems beyond my control.<br>

<br>

Likely to be corrupted, or maliciously crafted, or defective...<br>

<br>

Unfortunately, not all sequences of bytes are valid UTF-8.<br>

<br>

Thus, in every collection of inputs, roughly 1 in a million codepoints will inevitably result in a UTFException being thrown.<br>

<br>

Alas, I still have to run regex matches over the other 999,999 valid codepoints...<br>

<br>

Is there a standard recipe in stdio for squashing bad codepoints to some<br>

default?<br>

<br>

These days memory is very much larger than most files I want to scan.<br>

<br>

So if I were doing this in C, I would typically mmap the file with PROT_READ | PROT_WRITE and MAP_PRIVATE, run down the file squashing bad codepoints, and then run down it again matching patterns.<br>

<br>
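A minimal C sketch of the mmap approach described above, assuming each bad byte is overwritten with a one-byte '?' so the buffer length never changes. The names utf8_seq_len, squash_bad_utf8 and map_and_squash are hypothetical, not from any library:<br>

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at p,
 * or 0 if the bytes at p do not form one. `left` is the number of
 * bytes remaining in the buffer and must be at least 1. */
static size_t utf8_seq_len(const unsigned char *p, size_t left)
{
    unsigned char b = p[0], lo = 0x80, hi = 0xBF;
    size_t n;

    if (b < 0x80) {
        return 1;                      /* plain ASCII */
    } else if (b >= 0xC2 && b <= 0xDF) {
        n = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
        n = 3;
        if (b == 0xE0) lo = 0xA0;      /* reject overlong encodings */
        if (b == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        n = 4;
        if (b == 0xF0) lo = 0x90;      /* reject overlong encodings */
        if (b == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return 0;                      /* stray continuation, 0xC0, 0xC1, 0xF5.. */
    }

    if (left < n) return 0;            /* sequence truncated by end of buffer */
    if (p[1] < lo || p[1] > hi) return 0;
    for (size_t i = 2; i < n; i++)
        if (p[i] < 0x80 || p[i] > 0xBF) return 0;
    return n;
}

/* Single pass over buf: overwrite every byte that cannot start a
 * well-formed sequence with '?'. Returns the number of bytes replaced. */
size_t squash_bad_utf8(unsigned char *buf, size_t len)
{
    size_t i = 0, bad = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0) {
            buf[i++] = '?';
            bad++;
        } else {
            i += n;
        }
    }
    return bad;
}

/* Map `path` copy-on-write and squash bad bytes in the private copy.
 * Returns the mapping (release it with munmap) or NULL on error.
 * Empty files also yield NULL, since a zero-length mmap fails. */
unsigned char *map_and_squash(const char *path, size_t *len_out, size_t *bad_out)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0) return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    unsigned char *buf = mmap(NULL, (size_t)st.st_size,
                              PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping outlives the descriptor */
    if (buf == MAP_FAILED) return NULL;

    *len_out = (size_t)st.st_size;
    *bad_out = squash_bad_utf8(buf, *len_out);
    return buf;
}
```

Because the mapping is MAP_PRIVATE, the writes land in copy-on-write pages and the file on disk is left untouched; the pattern-matching pass can then run over the returned buffer.<br>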

In Ruby I have a horridly inefficient utility....<br>

      def IO.read_utf_8(file)<br>
        read(file, :external_encoding=>'ASCII-8BIT').encode('UTF-8', :undef=>:replace)<br>
      end<br>

<br>

What is the idiomatic D solution to this conundrum?<br>

</blockquote>

<br></div></div>

The encoding schemes in std.encoding support cleaning up input using the sanitize function.<br>

<br>

<a href="http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize" target="_blank">http://dlang.org/phobos/std_encoding.html#.EncodingScheme.sanitize</a><br>

<br>

It'd be nicer if the API were range-based, but it seems to do the trick in my experience.<br>

</blockquote></div><br><br clear="all"><br>-- <br><div dir="ltr">John Carter<br>Phone : (64)(3) 358 6639<br>Tait Electronics                         <br>PO Box 1645 Christchurch<br>New Zealand<br><br></div>

</div>

