Regarding hex strings

foobar foo at bar.com
Fri Oct 19 11:46:06 PDT 2012


On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
> On 19/10/12 16:07, foobar wrote:
>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>>>
>>>> We can still have both (assuming the code points are 
>>>> valid...):
>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>
>>> That doesn't compile.
>>> Error: escape hex sequence has 2 hex digits instead of 4
>>
>> Come on, "assuming the code points are valid". It says so 4 
>> lines above!
>
> It isn't the same.
> Hex strings are the raw bytes, eg UTF8 code points. (ie, it 
> includes the high bits that indicate the length of each char).
> \u makes dchars.
>
> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
> non-zero bytes.

Yes, \u takes code points, not the code units of a specific UTF 
encoding, and, as you correctly point out, it takes four hex 
digits, not two.
This is a very reasonable choice to prevent/reduce Unicode 
encoding errors.
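
To make the code point vs. code unit distinction concrete, here 
is a minimal sketch. It assumes a reasonably current Phobos; 
`representation` and `walkLength` are the only library calls used, 
and the variable names are mine, purely for illustration.

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    // \u takes a code point (here U+00A1) written with four hex digits.
    string s = "\u00A1";

    // A D string is UTF-8, so this single code point is stored as
    // two code units, neither of which is zero.
    immutable(ubyte)[] bytes = s.representation;
    immutable ubyte[] expected = [0xC2, 0xA1];
    assert(bytes == expected);

    // .length counts code units; walkLength decodes and counts code points.
    assert(s.length == 2);
    assert(s.walkLength == 1);
}
```

So the escape describes the character, and the compiler picks the 
bytes for the encoding of the string type.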

http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters."

I _already_ said that I consider this a major semantic bug as it 
violates the principle of least surprise - the programmer's 
expectation that the D string types, which are Unicode according 
to the spec, will, well, actually contain _valid_ Unicode and 
_not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for 
_strings_ - you can use _valid_ code-points (not code units) in 
hex form.

I also _already_ said that IMO general purpose binary data (i.e. 
_not_ UTF encoded Unicode text) should be stored either as 
ubyte[] or, better yet, as its own type that ensures the correct 
invariants for the data, be it audio, video, or just a different 
text encoding.

In neither case is the hex string relevant IMO. For strings it 
potentially violates the type's invariant, and for binary data we 
already have array literals.
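
A rough sketch of both points, assuming the usual Phobos 
`std.utf.validate` and `std.exception.assertThrown`; the byte 
values are just the ones from the example earlier in the thread:

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // Raw binary data kept in its own type, written as an array literal.
    immutable ubyte[] data = [0xA1, 0xB2, 0xC3];

    // Forcing the same bytes into a string compiles, but the result is
    // not valid UTF-8: 0xA1 is a continuation byte and cannot start a
    // sequence, so validation throws.
    string bogus = cast(string) data;
    assertThrown!UTFException(validate(bogus));
}
```

The array literal says plainly that these are bytes, while the 
string version only reveals the problem once something tries to 
decode it.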

Using a malformed _string_ to initialize ubyte[] IMO is simply 
less readable. What did that article call such features, "WAT"?

