Improving D's support of code-pages

Kirk McDonald kirklin.mcdonald at gmail.com
Mon Aug 20 13:39:32 PDT 2007


Rioshin an'Harthen wrote:
> "Kirk McDonald" <kirklin.mcdonald at gmail.com> wrote in message
> news:facpkj$13ml$1 at digitalmars.com...
> 
>> Regan Heath wrote:
>>
>>> Kirk McDonald wrote:
>>> Technically 'char' in C is a signed byte, not an unsigned one 
>>> therefore byte[] is more accurate.
>>
>>
>> I don't agree with this last part. For starters, I had thought the 
>> signed-ness of 'char' in C was not defined. In any case, we're talking 
>> about chunks of arbitrary, homogeneous binary data, so I think ubyte[] 
>> is most appropriate.
> 
> 
> <ramble>
> True. The C standard does not define the signedness of the char type.
> 
> What it does require of the char type is that it guarantees that any
> character in the basic execution character set
> 
>    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
>    a b c d e f g h i j k l m n o p q r s t u v w x y z
>    0 1 2 3 4 5 6 7 8 9
>    ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
> 
> fit into a char in such a way that they are non-negative.
> 
> Quote from standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005):
> 
> 6.2.5 Types
> 3  An object declared as type char is large enough to store any member
>    of the basic execution character set. If a member of the basic
>    execution character set is stored in a char object, its value is
>    guaranteed to be nonnegative. If any other character is stored in a
>    char object, the resulting value is implementation-defined but shall
>    be within the range of values that can be represented in that type.
> 
> 5.2.4.2.1 Size of integer types <limits.h>
>    - number of bits for smallest object that is not a bit-field (byte)
>        CHAR_BIT                    8
>    - minimum value for an object of type signed char
>        SCHAR_MIN                -127 // -(2^7 - 1)
>    - maximum value for an object of type signed char
>        SCHAR_MAX                127 // 2^7 - 1
>    - maximum value for an object of type unsigned char
>        UCHAR_MAX                255 // 2^8 - 1
>    - minimum value for an object of type char
>        CHAR_MIN                    see below
>    - maximum value for an object of type char
>        CHAR_MAX                   see below
> 
> 2  If the value of an object of type char is treated as a signed integer
>    when used in an expression, the value of CHAR_MIN shall be the same as
>    that of SCHAR_MIN and the value of CHAR_MAX shall be the same
>    as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall
>    be 0 and the value of CHAR_MAX shall be the same as that of
>    UCHAR_MAX. The value of UCHAR_MAX shall equal
>    2^(CHAR_BIT) - 1.
> </ramble>
> 
> So, applying this to the discussion would suggest that either byte[] or
> ubyte[] would be appropriate. However, the most natural would be to
> handle data as raw data without signs, thus ubyte[] feels more natural to
> use as the standard type for any data whatsoever.

Although this is interesting, and it does agree with what I was saying, 
it is basically irrelevant. When passing a string to decode(), the bytes 
therein could be in any encoding, even one which has nothing to do with 
the above. (It could be in a multi-byte encoding!) None of the guarantees 
that the C standard makes about char apply to these raw bytes. 
Therefore ubyte[] is /definitely/ more appropriate.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org


