Regarding hex strings
foobar
foo at bar.com
Fri Oct 19 11:46:06 PDT 2012
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
> On 19/10/12 16:07, foobar wrote:
>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>>>
>>>> We can still have both (assuming the code points are
>>>> valid...):
>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>
>>> That doesn't compile.
>>> Error: escape hex sequence has 2 hex digits instead of 4
>>
>> Come on, "assuming the code points are valid". It says so 4
>> lines above!
>
> It isn't the same.
> Hex strings are the raw bytes, eg UTF8 code points. (ie, it
> includes the high bits that indicate the length of each char).
> \u makes dchars.
>
> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two
> non-zero bytes.
Yes, \u takes a code point, not the code units of a specific
UTF encoding, and you are correct to point out that it must be
written with four hex digits, not two.
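To make the code point vs. code unit distinction concrete, here is
a minimal D sketch. The helper name utf8Bytes is made up for
illustration, and it assumes a current compiler with Phobos; it
can be tried with the -unittest and -main switches.

```d
import std.range : walkLength;
import std.string : representation;

// Hypothetical helper (not from the thread): the UTF-8 code units
// that a D string actually stores.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

unittest
{
    // One code point, U+00A1, written with the four hex digits
    // that \u requires.
    string s = "\u00A1";

    // Stored as UTF-8 it occupies two code units, both non-zero.
    assert(utf8Bytes(s) == [0xC2, 0xA1]);

    // Decoding the string yields a single code point again.
    assert(s.walkLength == 1);
}
```

So the escape names the character, and the compiler picks the
bytes for the encoding of the string type.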
This is a very reasonable choice to prevent/reduce Unicode
encoding errors.
http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data.
The hex data need not form valid UTF characters."
I _already_ said that I consider this a major semantic bug, as it
violates the principle of least surprise: the programmer expects
the D string types, which are Unicode according to the spec, to,
well, actually contain _valid_ Unicode and _not_ arbitrary
binary data.
Given the above, the design of \u makes perfect sense for
_strings_ - you can use _valid_ code-points (not code units) in
hex form.
General-purpose binary data (i.e. _not_ UTF-encoded Unicode
text), as I also _already_ said, should IMO be stored either as
ubyte[] or, better yet, in its own type that ensures the correct
invariants for the data, be it audio, video, or just a different
text encoding.
In neither case is the hex string relevant IMO. For strings it
potentially violates the type's invariant, and for binary data we
already have array literals.
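As a rough sketch of both points, assuming a current Phobos: the
helper isValidUtf8 below is a name invented for this example,
built on std.utf.validate, and the byte values are the ones from
the earlier example in this thread.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate throws UTFException on malformed input.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // Binary data as a plain array literal; no string involved.
    immutable ubyte[] raw = [0xA1, 0xB2, 0xC3];

    // These bytes are not valid UTF-8, so putting them in a
    // string would break the "strings are Unicode" invariant.
    assert(!isValidUtf8(raw));

    // The UTF-8 encoding of U+00A1, by contrast, is well-formed.
    immutable ubyte[] encoded = [0xC2, 0xA1];
    assert(isValidUtf8(encoded));
}
```

The array literal says "bytes" in its type, which is the whole
point.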
Using a malformed _string_ to initialize a ubyte[] is IMO simply
less readable. What did that article call such features, "WAT"?