dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead

jfondren julian.fondren at
Thu Nov 4 14:52:56 UTC 2021

On Thursday, 4 November 2021 at 02:26:20 UTC, Walter Bright wrote:

This doesn't throw, actually:

unittest {
     import std.stdio : writeln;
     enum invalid = "hello\247\205\257there";

     foreach (c; invalid)
         writeln(cast(ubyte) c);
}

Which is per usual in D:

@("std.utf.byUTF 2/3 (throwing)")
@safe unittest {
     import std.utf : byUTF, UTFException, UseReplacementDchar;
     import std.exception : assertThrown, assertNotThrown;
     import std.algorithm : count;

     string partial = "hello\247\205\257there";

     // byChar misses the bad UTF8 ...

     // byDchar objects to it

This does throw:

unittest {
     import std.stdio : writeln;
     enum invalid = "hello\247\205\257there";

     foreach (dchar c; invalid)
         writeln(cast(int) c);
}

but by asking for dchars from an immutable(char)[] you're asking 
for some unicode work to happen, so throwing is a reasonable 
default IMO. Emitting the replacement character is also a 
reasonable default, and objections in the thread can be answered 
the same way that objections to throwing can be: if you don't 
like it, iterate some other way:

// throw on invalid UTF
unittest {
     import std.utf : byUTF, UseReplacementDchar, UTFException;

     enum invalid = "hello\247\205\257there";

     int sum;
     try {
         foreach (dchar c; invalid.byUTF!(dchar, UseReplacementDchar.no))
             sum += cast(int) c;
         assert(sum == 197667);
     } catch (UTFException e) {
         assert(sum == 532); // only "hello" was summed before the throw
     }
}

// AssertError on invalid UTF
// (release behavior: "\247\205\257" is three dchars!)
unittest {
     import std.stdio : writeln;
     import std.encoding : codePoints;

     enum invalid = "hello\247\205\257there";

     foreach (dchar c; invalid.codePoints)
         writeln(cast(int) c);
}

// stop iterating on invalid UTF
unittest {
     import std.encoding : validLength;

     enum invalid = "hello\247\205\257there";
     char[] s;

     foreach (dchar c; invalid[0 .. invalid.validLength])
         s ~= c;
     assert(s == "hello");
}
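
The replacement-character behavior from the subject line can be had the same way, by iterating through std.utf.byUTF. What follows is only a minimal sketch: it assumes byUTF accepts UseReplacementDchar.yes as its second template argument, and decodeLossy is a hypothetical helper name, not something in Phobos. The asserts check only that U+FFFD shows up and that the valid text on either side survives, not how many replacement characters are produced.

```d
import std.algorithm.searching : canFind, endsWith, startsWith;
import std.array : array;
import std.utf : byUTF, replacementDchar, UseReplacementDchar;

// Hypothetical helper (not in Phobos): decode to dchars, substituting
// U+FFFD for invalid input instead of throwing.
dchar[] decodeLossy(string s)
{
    return s.byUTF!(dchar, UseReplacementDchar.yes).array;
}

// replace invalid UTF with U+FFFD
unittest {
    enum invalid = "hello\247\205\257there";

    auto decoded = decodeLossy(invalid);
    assert(decoded.startsWith("hello"d));
    assert(decoded.endsWith("there"d));
    assert(decoded.canFind(replacementDchar));
}
```

So both behaviors are already available through the same call, and the choice is made where the loop is written.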
