Why the hell doesn't foreach decode strings

Thu Oct 20 21:52:09 PDT 2011

On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg at gmx.com> said:

> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting
property of UTF encodings: if you want to search for a substring in a
well-formed UTF sequence, you do not need to decode the bigger string.
Comparing the UTF-x code units of the substring with the UTF-x code
units of the bigger string is enough.

Similarly, if you're searching for the 'ê' code point in a UTF-8
string, the most efficient way is to search the string for the two-byte
UTF-8 sequence you would use to encode 'ê' (in other words, convert 'ê'
to a string). Decoding the whole string is wasteful.
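
To make that concrete, here is a minimal sketch of such a search done
purely on code units. findUnits is a name I made up for illustration,
it is not a Phobos function, and it assumes both strings are
well-formed UTF-8:

```d
import std.stdio : writeln;

/// Hypothetical helper: returns the code-unit index of the first
/// occurrence of needle in haystack, or -1 if there is none.
/// Nothing is decoded; slices of char are compared byte for byte.
ptrdiff_t findUnits(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    string text = "forêt";
    string needle = "ê";            // encoded as two UTF-8 code units
    assert(needle.length == 2);
    writeln(findUnits(text, needle)); // prints 3
}
```

The match cannot start in the middle of another character, because in
well-formed UTF-8 a lead byte never looks like a continuation byte.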

> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars, but then you can explicitly give the type
> of the foreach variable as char, but normally what people care about is
> iterating over characters, not pieces of characters.

If you want to iterate over what people consider characters, then you
need to take into account combining marks that form multi-code-point
graphemes. (You'll probably want to deal with Unicode normalization
too.) Treating code points as if they were characters is a
misconception in the same way that treating UTF-16 code units as
characters is: both work most of the time, but both fail in a number of
cases.
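
A small sketch of the difference, assuming a Phobos recent enough to
have std.uni.byGrapheme and std.uni.normalize (neither may be available
in the release you are using). graphemeCount is a hypothetical helper
name:

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one
    // character, but made of two code points.
    string s = "e\u0301";

    assert(s.length == 3);           // UTF-8 code units
    assert(s.walkLength == 2);       // code points (the range decodes)
    assert(graphemeCount(s) == 1);   // what a reader would call characters

    // Normalization matters too: the NFC form is the single
    // precomposed code point U+00E9.
    assert(normalize!NFC(s) == "\u00E9");
}
```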

> So, I would expect the
> case where people _want_ to iterate over chars to be rare. In most cases,
> iterating over a string as chars is a bug - one which in many cases won't be
> quickly caught, because the programmer is English speaking and uses almost
> exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only 
solution to that problem?

> Defaulting to the
> guaranteed correct handling of characters and special casing when it's
> possible to write code more efficiently than that is definitely the way to go
> about it, and it's how Phobos generally does it.

Iterating over dchar is not guaranteed to be correct; it only has a
significantly better chance of being correct.

> The fact that foreach doesn't
> is incongruous with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.

>> I like the type deduction feature of foreach, and don't think it should be
>> removed for strings. Currently, it's consistent - T[] gets an element type
>> of T.
> 
> Sure, the type deduction of foreach is great, and it's completely consistent
> that iterating over an array of chars would iterate over chars rather than
> dchars when you don't give the type. However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters.

I note that you keep confusing characters with code points.
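
For reference, this is the behaviour under discussion, as a minimal
sketch. countCodeUnits and countCodePoints are hypothetical helper
names; the only language feature relied upon is that foreach decodes
when the loop variable is explicitly declared as dchar:

```d
/// Hypothetical helper: foreach with type deduction yields code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)          // c is deduced from the array's element type
    {
        static assert(c.sizeof == 1);   // a char-sized code unit
        ++n;
    }
    return n;
}

/// Hypothetical helper: an explicit dchar makes foreach decode UTF-8.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
    {
        static assert(c.sizeof == 4);   // a full code point
        ++n;
    }
    return n;
}

void main()
{
    string s = "ê";                     // one code point, two code units
    assert(countCodeUnits(s) == 2);
    assert(countCodePoints(s) == 1);
}
```

Neither count is a count of characters in the sense a reader would
mean once combining marks are involved.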

>> I want to reiterate that there's no way to program strings in D without
>> being cognizant of them being a multibyte representation. D is both a high
>> level and a low level language, and you can pick which to use, but you
>> still gotta pick.
> 
> I fully agree that programmers need to properly understand unicode to use
> strings in D properly. However, the problem is that the default handling of
> strings with foreach is _not_ what programmers are going to normally want, so
> the default will cause bugs.

That said, I wouldn't expect most programmers to understand Unicode.
Giving them dchars by default won't eliminate bugs related to
multi-code-point characters, but it'll likely eliminate bugs related to
multi-code-unit sequences. That could be a good start. I'd say choosing
dchar is a practical compromise between "characters by default" and
"the type of the array by default", but it is neither of those ideals.
How is that pragmatic trade-off going to fare a few years from now? I'm
a little skeptical that this is the ideal solution.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/