Inconsistency

Maxim Fomin maxim at maxim-fomin.ru
Sun Oct 13 10:03:14 PDT 2013


On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
>> This is simply wrong. All strings return the number of code 
>> units. It is only in UTF-32 that a code point (~ character) 
>> happens to fit into one code unit.
>
> I do not agree:
>
>    writeln("säд".length);         => 5  bytes: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
>    writeln(std.utf.count("säд")); => 3  bytes: 5 (ibidem)
>    writeln("säд"w.length);        => 3  bytes: 6 (2 x 3, UTF-16)
>    writeln("säд"d.length);        => 3  bytes: 12 (4 x 3, UTF-32)
>
> This is not consistent - from my point of view.

This is not the only inconsistency here.

First of all, typeof("säд") yields the string type 
(immutable(char)[]), while typeof(['s', 'ä', 'д']) yields neither 
char[], nor wchar[], nor even dchar[], but int[]. In this respect 
D is close to C, which also treats character literals as an 
integer type. Secondly, character arrays are the only arrays that 
have two kinds of literals, the usual [item, item, item] and the 
special "blah", and as you can see there is no correspondence 
between them.
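
A minimal sketch reproducing this (the types shown are as 
reported in this thread with a 2013-era dmd; later compilers may 
infer differently):

    import std.stdio;

    void main()
    {
        // The string literal is typed as immutable(char)[] (alias string).
        writeln(typeof("säд").stringof);
        // The array literal of character literals is inferred as int[]
        // here, echoing C's integer-typed character literals.
        writeln(typeof(['s', 'ä', 'д']).stringof);
    }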

If you try char[] x = cast(char[])['s', 'ä', 'д'], then x.length 
is indeed 3 (but don't use this: the resulting array is broken).
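
The cast converts each int element to char, so any code point 
that needs more than one byte is truncated. A minimal sketch of 
why the result is broken (the truncation reading is mine, not 
from the original post):

    import std.stdio;

    void main()
    {
        // Each element is narrowed to one byte; 'д' (U+0434) cannot
        // fit, so the stored bytes are no longer valid UTF-8.
        char[] x = cast(char[])['s', 'ä', 'д'];
        writeln(x.length); // 3, as noted above
    }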

In D, a dynamic array is represented at the binary level as 
struct { size_t length; void *ptr; }. When you perform operations 
on dynamic arrays, the compiler implements them as calls to 
runtime functions. However, at runtime it is impossible to do 
anything useful with an array about which the only information 
is the address of its beginning and the total number of elements 
(this is a source of other problems in D). To handle this, the 
compiler generates a "TypeInfo" and passes it as a separate 
argument to the runtime functions. TypeInfo contains some data; 
the most relevant piece here is the size of the element.
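
For illustration, the element size is available through the 
standard TypeInfo.tsize property (a minimal sketch):

    import std.stdio;

    void main()
    {
        // Runtime array routines recover the element size from TypeInfo.
        writeln(typeid(char).tsize);  // 1
        writeln(typeid(wchar).tsize); // 2
        writeln(typeid(dchar).tsize); // 4
    }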

What happens is as follows. The compiler recognizes that "säд" 
should be a string literal encoded as UTF-8 
(http://dlang.org/lex.html#DoubleQuotedString), so the element 
type should be char, which requires the array to have 5 elements. 
So at runtime the object "säд" is treated as an array of 5 
elements, 1 byte each.
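
A minimal sketch showing the 5 one-byte elements druntime 
actually sees:

    import std.stdio;

    void main()
    {
        string s = "säд";
        writeln(s.length);               // 5 code units
        // UTF-8 bytes: 's', then C3 A4 ('ä'), then D0 94 ('д')
        writeln(cast(const(ubyte)[]) s); // [115, 195, 164, 208, 148]
    }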

Basically, string (and char[]) plays a dual role in the language: 
on the one hand, it is an array of elements that are strictly 1 
byte in size by definition; on the other hand, D tries to use it 
as a 'generic' UTF type whose element size is not fixed. So there 
is a contradiction: in source code the programmer views such a 
string as some abstract UTF string, but druntime views it as a 
5-byte array. In my view, the trouble begins when "säд" is 
internally converted to char elements (which is no better than 
int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is 
refused by the language, so there is a great inconsistency here.
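
The dual role shows up directly in iteration, where the 
programmer chooses which view to take (a minimal sketch):

    import std.stdio;

    void main()
    {
        string s = "säд";

        // Array view: 5 one-byte code units.
        foreach (char c; s) write(cast(ubyte) c, ' ');
        writeln();

        // UTF view: decoding yields 3 code points.
        foreach (dchar c; s) write(c, ' ');
        writeln();
    }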

By the way, the UTF definition is irrelevant here; this is purely 
an implementation issue (I think it is a design fault).

