Inconsistency
Maxim Fomin
maxim at maxim-fomin.ru
Sun Oct 13 10:03:14 PDT 2013
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
>> This is simply wrong. All strings return the number of code
>> units. And it's only UTF-32 where a code point (~ character)
>> happens to fit into one code unit.
>
> I do not agree:
>
> writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] +
> 2 [D094], UTF-8)
> writeln(std.utf.count("säд")) => 3 chars: 5 (ibidem)
> writeln("säд"w.length); => 3 chars: 6 (2 x 3, UTF-16)
> writeln("säд"d.length); => 3 chars: 12 (4 x 3, UTF-32)
>
> This is not consistent - from my point of view.
There is not just a single inconsistency here.
First of all, typeof("säд") yields the string type
(immutable(char)[]), while typeof(['s', 'ä', 'д']) yields neither
char[], nor wchar[], nor even dchar[], but int[]. In this respect D
is close to C, which also treats character literals as an integer
type. Secondly, character arrays are the only arrays that have two
kinds of literals - the usual [item, item, item] and the special
"blah" - and, as you can see, there is no correspondence between
them.
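A minimal sketch of that mismatch (the int[] result is what the
compiler reported at the time; a newer frontend may infer a
different element type):

    import std.stdio;

    void main()
    {
        // the double-quoted literal is a string, i.e. immutable(char)[]
        writeln(typeof("säд").stringof);
        // the bracketed literal of the same characters is inferred
        // as int[], not as char[], wchar[] or dchar[]
        writeln(typeof(['s', 'ä', 'д']).stringof);
    }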
If you try char[] x = cast(char[])['s', 'ä', 'д'], then the length
would indeed be 3 (but don't use it - it is broken).
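To see why it is broken (a sketch; the cast converts the literal
elementwise, so the length of 3 reported above comes with truncated
data):

    import std.stdio;

    void main()
    {
        // each int element is narrowed to one char, so 'ä' and 'д'
        // lose their upper bits and the bytes are not valid UTF-8
        char[] x = cast(char[])['s', 'ä', 'д'];
        writeln(x.length); // 3
    }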
In D, a dynamic array is represented at the binary level as struct {
size_t length; void* ptr; }. When you perform operations on dynamic
arrays, the compiler implements them as calls to runtime functions.
However, at runtime it is impossible to do anything useful with an
array for which the only available information is the address of its
beginning and the total number of elements (this is a source of
other problems in D). To handle this, the compiler generates a
"TypeInfo" and passes it as a separate argument to the runtime
functions. TypeInfo contains some data; most relevant here is the
size of the element.
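The element size that ends up in TypeInfo can be inspected with
typeid (a minimal sketch):

    import std.stdio;

    void main()
    {
        // .next is the TypeInfo of the element type; .tsize is the
        // element size the runtime functions rely on
        writeln(typeid(char[]).next.tsize);  // 1
        writeln(typeid(wchar[]).next.tsize); // 2
        writeln(typeid(dchar[]).next.tsize); // 4
    }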
What happens is as follows. The compiler recognizes that "säд"
should be a string literal encoded as UTF-8
(http://dlang.org/lex.html#DoubleQuotedString), so the element type
should be char, which requires the array to have 5 elements. So, at
runtime the object "säд" is treated as an array of 5 elements, each
1 byte in size.
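Dumping the code units shows those 5 bytes directly (a sketch):

    import std.stdio;

    void main()
    {
        // 's' is one byte, 'ä' is C3 A4, 'д' is D0 94 - five UTF-8
        // code units in total
        foreach (b; cast(immutable(ubyte)[])"säд")
            writef("%02X ", b);
        writeln(); // 73 C3 A4 D0 94
    }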
Basically, string (and char[]) plays a dual role in the language -
on the one hand, it is an array of elements that are strictly 1 byte
in size by definition; on the other hand, D tries to use it as a
'generic' UTF type whose element size is not fixed. So, there is a
contradiction - in source code the programmer views such a string as
some abstract UTF string, but druntime views it as a 5-byte array.
In my view, the trouble begins when "säд" is internally cast to
char[] (which is no better than int[] x = [3.14, 5.6]). And indeed,
char[] x = ['s', 'ä', 'д'] is refused by the language, so there is a
great inconsistency here.
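The dual role is visible with Phobos ranges (a sketch; walkLength
counts what the auto-decoding range interface yields, while .length
counts array elements):

    import std.range : walkLength;
    import std.stdio;

    void main()
    {
        string s = "säд";
        writeln(s.length);     // 5: one-byte array elements
        writeln(s.walkLength); // 3: auto-decoded code points
        // char[] x = ['s', 'ä', 'д']; // rejected by the language
    }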
By the way, the UTF definition is irrelevant here; this is purely an
implementation issue (I think it is a design fault).