Why the hell doesn't foreach decode strings

Norbert Nemec Norbert at Nemec-online.de
Mon Oct 24 12:23:14 PDT 2011


On 21.10.2011 06:06, Jonathan M Davis wrote:
> It's this very problem that leads some people to argue that string should be
> its own type which holds an array of code units (which can be accessed when
> needed) rather than doing what we do now where we try and treat a string as
> both an array of chars and a range of dchars. The result is schizophrenic.
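
That dual behaviour shows up directly in ordinary code. A minimal sketch, assuming Phobos's auto-decoding range primitives (e.g. walkLength) on top of the plain char array:

import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "weiß";               // stored as immutable(char)[], i.e. UTF-8 code units

    // Array view: length and indexing work on code units.
    writeln(s.length);               // 5 -- 'ß' takes two UTF-8 code units
    writeln(cast(ubyte) s[3]);       // 195 -- first byte of the 'ß' sequence

    // Range view: the range primitives decode to code points.
    writeln(s.walkLength);           // 4

    // foreach only decodes when the loop variable is explicitly dchar.
    size_t units, points;
    foreach (char c; s)  ++units;    // 5 code units
    foreach (dchar c; s) ++points;   // 4 code points
    writeln(units, " ", points);     // prints "5 4"
}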

Indeed - expressing strings as arrays of characters will always fall 
short of the Unicode concept in some way. A truly Unicode-compliant 
language has to handle strings as opaque objects that do not have any 
encoding. There are a number of operations that can be performed on 
these objects (concatenation, comparison, searching, etc.). Any defined 
memory representation can only be obtained by an explicit encoding 
operation.
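
In today's Phobos, the closest approximation of such an explicit encoding operation looks roughly like the sketch below (assuming std.string.representation and std.conv.to; the string is of course still not opaque, the point is only that each concrete representation is requested explicitly):

import std.stdio;
import std.conv : to;
import std.string : representation;

void main()
{
    string s = "weiß";

    // Explicitly request a concrete encoded representation:
    immutable(ubyte)[]  utf8  = s.representation;                // raw UTF-8 code units
    immutable(ushort)[] utf16 = to!wstring(s).representation;    // raw UTF-16 code units

    // Explicitly decode into code points:
    dstring decoded = to!dstring(s);

    writeln(utf8.length, " ", utf16.length, " ", decoded.length);  // prints "5 4 4"
}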

Python 3, for example, took a fundamental step by introducing exactly 
this distinction. At first it seems silly, having to think about 
encodings so often when writing trivial code. After a short while, 
though, the strict conceptual separation between unencoded "strings" 
and encoded "arrays of something" really helps avoid ugly problems.

Sure, for a performance-critical language the issue becomes a lot 
trickier. I still think such a separation is possible, and ultimately 
it is the only way to solve the tricky problems that will otherwise 
keep cropping up somewhere.

