Why the hell doesn't foreach decode strings
Michel Fortin
michel.fortin at michelf.com
Thu Oct 20 21:52:09 PDT 2011
On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg at gmx.com> said:
> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars
It works for non-ASCII too. You're probably missing an interesting
property of UTF encodings: if you want to search for a substring in a
well-formed UTF sequence, you do not need to decode the bigger string;
comparing the UTF-x code units of the substring with the UTF-x code
units of the bigger string is enough.
Similarly, if you're searching for the 'ê' code point in a UTF-8
string, the most efficient way is to search the string for the two-byte
UTF-8 sequence you would use to encode 'ê' (in other words, convert 'ê'
to a string). Decoding the whole string is a wasteful process.
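To make that concrete, here is a minimal sketch in Python (chosen only
because it makes the raw bytes explicit; the same comparison applies to
a char array in D). The names `haystack` and `needle` and the sample
sentence are illustrative assumptions, not something from this thread.

```python
# Sketch: find a code point in UTF-8 text without decoding the text.
# `haystack` and `needle` are hypothetical names used for illustration.

# "Une fête à la forêt" as raw UTF-8 code units.
haystack = "Une f\u00eate \u00e0 la for\u00eat".encode("utf-8")

# 'ê' (U+00EA) converted to its two-byte UTF-8 form, b"\xc3\xaa".
needle = "\u00ea".encode("utf-8")

# Plain byte comparison; no code point in `haystack` is decoded here.
byte_index = haystack.find(needle)

# Lead bytes and continuation bytes use disjoint ranges in UTF-8, so a
# match on a well-formed needle cannot start inside another code point.
match = haystack[byte_index:byte_index + len(needle)]
assert match.decode("utf-8") == "\u00ea"

# Cross-check against a search on the decoded text: the bytes before
# the match decode to exactly as many code points as the decoded index.
decoded_index = haystack.decode("utf-8").find("\u00ea")
assert len(haystack[:byte_index].decode("utf-8")) == decoded_index
```

The decoded search at the end is there only to check the result; the
actual lookup never leaves code units.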
> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars, but then you can explicitly give the type
> of the foreach variable as char, but normally what people care about is
> iterating over characters, not pieces of characters.
If you want to iterate over what people consider characters, then you
need to take into account combining marks that form multi-code-point
graphemes. (You'll probably want to deal with Unicode normalization
too.) Treating code points as if they were characters is a
misconception in the same way as treating UTF-16 code units as
characters is: both work most of the time but fail in a number of
cases.
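A small sketch of the combining-mark problem, again in Python for
illustration. The helper `approx_grapheme_count` is a hypothetical name
and a deliberate simplification: it only skips combining marks and is
not a full grapheme segmentation algorithm.

```python
import unicodedata

precomposed = "\u00ea"   # 'ê' as a single code point
decomposed = "e\u0302"   # 'e' followed by COMBINING CIRCUMFLEX ACCENT

# Both are the same character to a reader, yet they differ in code
# point count and do not compare equal.
assert len(precomposed) == 1
assert len(decomposed) == 2
assert precomposed != decomposed

# Normalization reconciles the two spellings.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed


def approx_grapheme_count(text):
    """Count code points that are not combining marks.

    Illustration only: real grapheme segmentation has more rules.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Counting by code points disagrees between spellings; skipping the
# combining marks gives one character for both.
assert approx_grapheme_count(precomposed) == 1
assert approx_grapheme_count(decomposed) == 1
```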
> So, I would expect the
> case where people _want_ to iterate over chars to be rare. In most cases,
> iterating over a string as chars is a bug - one which in many cases won't be
> quickly caught, because the programmer is English speaking and uses almost
> exclusively ASCII for whatever testing that they do.
That's a real problem. But is treating everything as dchar the only
solution to that problem?
> Defaulting to the
> guaranteed correct handling of characters and special casing when it's
> possible to write code more efficiently than that is definitely the way to go
> about it, and it's how Phobos generally does it.
Iterating on dchar is not guaranteed to be correct; it only has a
significantly better chance of being correct.
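One case where per-code-point processing goes wrong, sketched in Python
for illustration (a Python string is indexed by code point, which is
roughly what iterating by dchar gives you). The variable names are
hypothetical.

```python
# "rêve" spelled with a combining circumflex after the first 'e'.
word = "re\u0302ve"

# Reversing code point by code point moves the accent along with its
# position in the sequence, not along with its base letter.
reversed_by_code_point = word[::-1]

# The circumflex now follows 'v' instead of 'e'.
assert reversed_by_code_point == "ev\u0302er"

# The precomposed spelling happens to survive the same operation,
# which is why this kind of bug is easy to miss in testing.
precomposed_word = "r\u00eave"
assert precomposed_word[::-1] == "ev\u00ear"
```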
> The fact that foreach doesn't
> is incongruous with how strings are handled in most other cases.
You could also argue that ranges are doing things the wrong way.
>> I like the type deduction feature of foreach, and don't think it should be
>> removed for strings. Currently, it's consistent - T[] gets an element type
>> of T.
>
> Sure, the type deduction of foreach is great, and it's completely consistent
> that iterating over an array of chars would iterate over chars rather than
> dchars when you don't give the type. However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters.
I note that you keep confusing characters with code points.
>> I want to reiterate that there's no way to program strings in D without
>> being cognizant of them being a multibyte representation. D is both a high
>> level and a low level language, and you can pick which to use, but you
>> still gotta pick.
>
> I fully agree that programmers need to properly understand unicode to use
> strings in D properly. However, the problem is that the default handling of
> strings with foreach is _not_ what programmers are going to normally want, so
> the default will cause bugs.
That said, I wouldn't expect most programmers to understand Unicode. Giving
them dchars by default won't eliminate bugs related to multi-code-point
characters, but it'll likely eliminate bugs relating to multi-code-unit
sequences. That could be a good start. I'd say choosing dchar is a
practical compromise between "characters by default" and "the type
of the array by default", but it is neither of those ideals. How is
that pragmatic trade-off going to fare a few years in the future? I'm a
little skeptical that this is the ideal solution.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/