The Case For Autodecode

ag0aep6g via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 11:42:31 PDT 2016


On 06/03/2016 07:51 PM, Patrick Schluter wrote:
> You mean that '¶' is represented internally as 1 byte 0xB6 and that it
> can be handled as such without error? This would mean that char literals
> are broken. The only valid way to represent '¶' in memory is 0xC3 0x86.
> Sorry if I misunderstood, I'm only starting to learn D.

That's right, there is no single char for '¶', and D gets that right: 
a char is a UTF-8 code unit, and '¶' needs two of them. It is not stored 
as the single byte 0xB6.

But there is a single wchar for it. wchar is a UTF-16 code unit, 2 
bytes. UTF-16 encodes '¶' as a single code unit, so that's correct.
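
To make the two encodings concrete, here is a minimal sketch that prints the code units of '¶' (U+00B6) as UTF-8 and as UTF-16. It assumes the source file is saved as UTF-8; the variable names are only for illustration.

```d
import std.stdio : writefln;
import std.string : representation;

void main()
{
    string  utf8  = "¶";    // array of char, i.e. UTF-8 code units
    wstring utf16 = "¶"w;   // array of wchar, i.e. UTF-16 code units

    // .length counts code units, not characters.
    assert(utf8.length == 2);
    assert(utf16.length == 1);

    // representation exposes the raw code units as unsigned integers.
    writefln("%(%02X %)", utf8.representation);   // C2 B6
    writefln("%(%04X %)", utf16.representation);  // 00B6
}
```

So the same character is two chars in a string but one wchar in a wstring.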

The problem is that you can accidentally search for a wchar in a range 
of chars. Every char is compared to the wchar by numeric value. But the 
numeric values of a char don't mean the same as those of a wchar, so you 
get nonsensical results.
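
A rough sketch of what such an accidental search looks like. The helper containsUnit is hypothetical and written out by hand here, and 'ö' is just a second character picked for illustration; the point is only that the comparison compiles without complaint.

```d
import std.stdio : writeln;

// Hypothetical helper: naive search that compares each UTF-8 code unit
// to a UTF-16 code unit by numeric value.
bool containsUnit(const(char)[] haystack, wchar needle)
{
    foreach (char c; haystack)   // iterates code units, no decoding
        if (c == needle)         // both sides are promoted to int
            return true;
    return false;
}

void main()
{
    // "ö" is C3 B6 in UTF-8; 'ö' is the wchar 0x00F6; '¶' is the wchar 0x00B6.
    writeln(containsUnit("ö", 'ö'));  // false: 0xF6 matches neither byte
    writeln(containsUnit("ö", '¶'));  // true: 0xB6 matches the second byte
}
```

The first call is a false negative and the second a false positive, and neither result says anything about which characters the string contains.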

A similar implicit conversion lets you search for a large number in a 
byte[]:

----
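import std.stdio : writeln;
byte[] arr = [1, 2, 3];
// x is promoted to int for the comparison; no byte equals 1000, so this never prints.
foreach (x; arr) if (x == 1000) writeln("found it!");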
----

You won't ever find 1000 in a byte[], of course. The byte type simply 
can't store the value. But you can compare a byte with an int. And that 
comparison is meaningful, unlike the comparison of a char with a wchar.

You can also produce false positives with numeric types, by mixing 
signed and unsigned types:

----
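import std.stdio : writeln;
int[] arr = [1, -1, 3];
// x is converted to uint for the comparison, so -1 compares equal to uint.max.
foreach (x; arr) if (x == uint.max) writeln("found it!");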
----

uint.max is a large number, -1 is a small number. They're considered 
equal here because x is implicitly converted from int to uint before the 
comparison, and that conversion reinterprets the bits of -1 as 
4294967295.
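
The conversion can be seen in isolation in a small sketch like the following. Widening both sides to long is used here only to show that the two values really are different numbers; it is not a recommendation from the post.

```d
void main()
{
    int  x = -1;
    uint y = x;   // implicit int -> uint conversion keeps the bit pattern

    assert(y == uint.max);   // 0xFFFF_FFFF
    assert(x == uint.max);   // the same conversion happens inside the comparison

    // Widened to long, both values keep their meaning and compare unequal.
    assert(cast(long) x != cast(long) uint.max);
}
```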

False negatives are not possible with numeric types. At least not in the 
same way as with differently sized Unicode code units.

