Why is string.front dchar?

Maxim Fomin maxim at maxim-fomin.ru
Mon Jan 20 03:55:54 PST 2014


On Monday, 20 January 2014 at 09:58:07 UTC, Jakob Ovrum wrote:
> On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:
>> This is wrong. String in D is de facto (by implementation, 
>> spec may say whatever is convenient for advertising D) array 
>> of single bytes which can keep UTF-8 code units. No way string 
>> type in D is always a string in a sense of code 
>> points/characters. Sometimes it happens that string type 
>> behaves like 'string', but if you put UTF-16 or UTF-32 text it 
>> would remind you what string type really is.
>
> By implementation they are also UTF strings. String literals 
> use UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, 
> foreach over narrow strings with `dchar` iterator variable type 
> does UTF decoding etc.
>
> I don't think you know what you're talking about; putting 
> UTF-16 or UTF-32 in `string` is utter madness and not trivially 
> possible. We have `wchar`/`wstring` and `dchar`/`dstring` for 
> UTF-16 and UTF-32, respectively.
>

import std.stdio;

void main()
{
	string s = "о";
	writeln(s.length);
}

This compiles and prints 2. This means that the string type is 
broken. It is broken in the way I was attempting to explain.
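To make the code-unit/code-point split explicit, a minimal sketch 
(assuming a standard D2 compiler and Phobos): `.length` counts 
UTF-8 code units, while the range primitives decode, so 
`walkLength` counts code points.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string s = "о"; // U+043E, encoded as 2 UTF-8 code units
    writeln(s.length);     // 2 -- code units (bytes)
    writeln(s.walkLength); // 1 -- code points, via decoding front/popFront
}
```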

>> This is attempt to explain problematic design as a wise action.
>
> No, it's not. Please leave crappy, unsubstantiated arguments 
> like this out of these forums.

Note that I provided examples of why the design is problematic. 
The argument isn't unsubstantiated.

>
>>> [1] http://dlang.org/type
>>
>> By the way, the link you provide says char is unsigned 8 bit 
>> type which can keep value of UTF-8 code unit.
>
> Not *can*, but *does*. Otherwise it is an error in the program. 
> The specification, compiler implementation (as shown above) and 
> standard library all treat `char` as a UTF-8 code unit. Treat 
> it otherwise at your own peril.
>

But such treatment is nonsense. It is like treating an integer or 
floating-point number as a sequence of bytes. You are essentially 
saying that treating char as a UTF-8 code unit is OK because the 
language treats char as a UTF-8 code unit.

> The only problem in the implementation here that you illustrate 
> is that `['s', 'ä', 'д']` is of type `int[]`, which is a bug. 
> It should be `dchar[]`. The length of `char[]` works as 
> intended.

You are saying that the length of char[] works as intended, which 
is true, but that shows the design is broken.

>> Problems with string type can be illustrated as possible 
>> situation in domain of integers type. Assume that user wants 
>> 'number' type which accepts both integers, floats and doubles 
>> and treats them properly. This would require either library 
>> solution or a new special type in a language which is 
>> supported by both compiler and runtime library, which performs 
>> operation at runtime on objects of number type according to 
>> their effective type.
>>
>> D designers want to support such feature (to make the language 
>> better), but as it happens in other situations, the support is 
>> only limited: compiler allows to do
>>
>> alias immutable(int)[] number;
>> number my_number = [0, 3.14, 3.14l];
>
> I don't understand this example. The compiler does *not* allow 
> that code; try it for yourself.

It does not allow it because it is nonsense. However, it does 
allow equivalent nonsense in character types.

alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l]; // does not compile

alias immutable(char)[] string;
string s = "säд"; // compiles, however "säд" should default to wstring or dstring

The same reasons that prevent a sane person from being OK with 
int[] number = [3.14l] should prevent him from being OK with 
string s = "säд".
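To spell out what the compiler actually does with that literal, a 
sketch (code-unit counts assume the string is stored as UTF-8, 
which is what `string` is): `.length` reports code units, and only 
a decoding iteration recovers the three characters.

```d
import std.stdio : write, writeln;

void main()
{
    string s = "säд"; // 's' = 1, 'ä' = 2, 'д' = 2 UTF-8 code units
    writeln(s.length);   // 5 -- code units, not characters
    foreach (dchar c; s) // foreach with a dchar variable decodes UTF-8
        write(c, ' ');   // s ä д
    writeln();
}
```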


More information about the Digitalmars-d-learn mailing list