Regarding hex strings

foobar foo at bar.com
Fri Oct 19 11:57:31 PDT 2012


On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
> On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
>> On 19/10/12 16:07, foobar wrote:
>>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston 
>>> wrote:
>>>>>
>>>>> We can still have both (assuming the code points are 
>>>>> valid...):
>>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>>
>>>> That doesn't compile.
>>>> Error: escape hex sequence has 2 hex digits instead of 4
>>>
>>> Come on, "assuming the code points are valid". It says so 4 
>>> lines above!
>>
>> It isn't the same.
>> Hex strings are the raw bytes, e.g. UTF-8 code units (i.e. they 
>> include the high bits that indicate the length of each encoded 
>> character).
>> \u makes dchars.
>>
>> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
>> non-zero bytes.
>
> Yes, \u takes code points rather than the code units of a 
> specific UTF encoding, which, as you correctly point out, means 
> four hex digits and not two.
> This is a very reasonable choice that helps prevent Unicode 
> encoding errors.
>
> http://dlang.org/lex.html#HexString states:
> "Hex strings allow string literals to be created using hex 
> data. The hex data need not form valid UTF characters."
>
> I _already_ said that I consider this a major semantic bug, as 
> it violates the principle of least surprise - the programmer 
> expects the D string types, which are Unicode according to the 
> spec, to actually contain _valid_ Unicode and _not_ arbitrary 
> binary data.
> Given the above, the design of \u makes perfect sense for 
> _strings_ - you can specify _valid_ code points (not code units) 
> in hex form.
>
> For general-purpose binary data (i.e. _not_ UTF-encoded Unicode 
> text), I also _already_ said that IMO it should be stored either 
> as ubyte[] or, better yet, as its own type that ensures the 
> correct invariants for the data, be it audio, video, or just a 
> different text encoding.
>
> In neither case is the hex string relevant, IMO: in the former 
> it potentially violates the type's invariant, and in the latter 
> we already have array literals.
>
> Using a malformed _string_ to initialize a ubyte[] is, IMO, 
> simply less readable. What did that article call such features - 
> "WAT"?
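
To make the byte-level difference concrete, here is a minimal 
sketch (the byte values assume UTF-8, which is what D's string 
type uses; that the x"A1" line compiles at all is exactly the 
spec behaviour under discussion):

import std.stdio : writefln;
import std.string : representation;

void main()
{
    // \u00A1 names the code point U+00A1; UTF-8 encodes it as TWO bytes.
    string s = "\u00A1";
    writefln("%(%02X %)", s.representation); // prints: C2 A1

    // x"A1" is ONE raw byte, 0xA1 - not valid UTF-8 on its own,
    // yet per the current spec it is still typed as a string.
    string h = x"A1";
    writefln("%(%02X %)", h.representation); // prints: A1
}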

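And for the binary-data route, an array literal or a dedicated 
type states the intent directly; Latin1Text below is a 
hypothetical illustration, not an existing library type:

// Plain binary data: no string type involved.
immutable ubyte[] payload = [0xA1, 0xB2, 0xC3];

// Or a dedicated type that can enforce its own invariant:
struct Latin1Text
{
    immutable(ubyte)[] data; // bytes in ISO-8859-1, not UTF-8
}
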
I just re-checked, and to clarify, string literals support 
_three_ escape sequences:
\x__ - a single raw byte (two hex digits)
\u____ - a code point up to U+FFFF (four hex digits)
\U________ - any code point (eight hex digits)

So raw bytes _can_ be specified directly, and I hope the compiler 
still verifies that the string literal is valid Unicode.
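
A small example of all three, as I understand them (the 
commented-out line is the case whose rejection I am hoping for):

import std.stdio : writefln;
import std.string : representation;

void main()
{
    string a = "\xC2\xA1";   // \x: raw bytes; these two form valid UTF-8
    string b = "\u00A1";     // \u: the same character by its code point
    string c = "\U000000A1"; // \U: the same again, eight hex digits
    assert(a == b && b == c);
    writefln("%(%02X %)", a.representation); // prints: C2 A1

    // string bad = "\xA1"; // a lone 0xA1 is not valid UTF-8; whether
    //                      // the compiler rejects this is the question
}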
