Regarding hex strings
foobar
foo at bar.com
Fri Oct 19 11:57:31 PDT 2012
On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
> On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
>> On 19/10/12 16:07, foobar wrote:
>>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston
>>> wrote:
>>>>>
>>>>> We can still have both (assuming the code points are
>>>>> valid...):
>>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>>
>>>> That doesn't compile.
>>>> Error: escape hex sequence has 2 hex digits instead of 4
>>>
>>> Come on, "assuming the code points are valid". It says so 4
>>> lines above!
>>
>> It isn't the same.
>> Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they
>> include the high bits that indicate the length of each char).
>> \u makes dchars.
>>
>> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two
>> non-zero bytes.
>
> Yes, the \u requires code points and not code-units for a
> specific UTF encoding, which you are correct in pointing out
> are four hex digits and not two.
> This is a very reasonable choice to prevent/reduce Unicode
> encoding errors.
>
> http://dlang.org/lex.html#HexString states:
> "Hex strings allow string literals to be created using hex
> data. The hex data need not form valid UTF characters."
>
> I _already_ said that I consider this a major semantic bug, as
> it violates the principle of least surprise: the programmer's
> expectation that the D string types, which are Unicode according
> to the spec, will, well, actually contain _valid_ Unicode and
> _not_ arbitrary binary data.
> Given the above, the design of \u makes perfect sense for
> _strings_ - you can use _valid_ code-points (not code units) in
> hex form.
>
> For general-purpose binary data (i.e. _not_ UTF-encoded Unicode
> text), I also _already_ said that IMO it should be either stored
> as ubyte[] or, better yet, given its own type that would ensure
> the correct invariants for the data type, be it audio, video, or
> just a different text encoding.
>
> In neither case is the hex string relevant IMO. In the former
> it potentially violates the type's invariant, and in the latter
> we already have array literals.
>
> Using a malformed _string_ to initialize ubyte[] IMO is simply
> less readable. How did that article call such features, "WAT"?
I just re-checked, and to clarify: string literals support _three_
hex escape sequences:
\x__ - two hex digits: a single byte (one UTF-8 code unit)
\u____ - four hex digits: a code point up to U+FFFF
\U________ - eight hex digits: any code point
So raw bytes _can_ be directly specified, and I hope the compiler
still verifies that the string literal is valid Unicode.
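
To make the code point vs. code unit distinction concrete, here is a minimal sketch in D. It is only an illustration under my own assumptions: the helper name isValidUtf8 is hypothetical, and the check is done at run time with std.utf.validate, so it says nothing about what the compiler itself accepts or rejects.

```d
import std.string : representation;
import std.utf : validate, UTFException;

// Hypothetical helper: true when `s` holds well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // \u takes a code point (four hex digits); it is stored as UTF-8.
    string viaCodePoint = "\u00A1";
    immutable ubyte[] expected = [0xC2, 0xA1];
    assert(viaCodePoint.representation == expected);
    assert(isValidUtf8(viaCodePoint));

    // The same two code units, written one byte at a time with \x.
    string viaBytes = "\xC2\xA1";
    assert(viaBytes == viaCodePoint);

    // A lone 0xA1 is a continuation byte with no lead byte,
    // so it is not valid UTF-8 on its own.
    immutable ubyte[] raw = [0xA1];
    assert(!isValidUtf8(cast(string) raw));
}
```

This matches Don's point above: "\u00A1" ends up as the two bytes C2 A1, whereas the single byte A1 by itself is exactly the kind of data I would rather see in a ubyte[] than in a string.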