Why is string.front dchar?

Maxim Fomin maxim at maxim-fomin.ru
Wed Jan 15 22:59:41 PST 2014


On Thursday, 16 January 2014 at 05:56:48 UTC, Jakob Ovrum wrote:
> On Tuesday, 14 January 2014 at 11:42:34 UTC, Maxim Fomin wrote:
>> The root of the issue is that string literals containing 
>> characters which do not fit into a single byte are still 
>> converted to a char[] array. This is, strictly speaking, not 
>> type safe, because it allows a 2- or 4-byte code unit to be 
>> reinterpreted as a sequence of 1-byte characters. The string 
>> type is in some sense problematic in D. That's why the fact 
>> that .front returns dchar is a way to correct the problem; it 
>> is not an attempt to introduce confusion.
>
> This assertion makes all the wrong assumptions.
>
> `char` is a UTF-8 code unit[1], and `string` is an array of 
> immutable UTF-8 code units. The whole point of UTF-8 is the 
> ability to encode code points that need multiple bytes (UTF-8 
> code units), so the string literal behaviour is perfectly 
> regular.

This is wrong. A string in D is de facto (by implementation; the 
spec may say whatever is convenient for advertising D) an array of 
single bytes which can hold UTF-8 code units. In no way is the 
string type in D always a string in the sense of code points or 
characters. Sometimes the string type happens to behave like a 
'string', but if you put UTF-16 or UTF-32 text into it, it will 
remind you what the string type really is.
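
To make the difference concrete, here is a minimal sketch of what 
the string type reports for the literal discussed in this thread. 
The helper names codeUnits and codePoints are made up for 
illustration; .length, walkLength and front are the usual built-in 
and std.range facilities. The literal is spelled with \u escapes 
(assumed to be the precomposed ä, U+00E4, and д, U+0434) so that 
the encoding of the source file does not matter.

```d
import std.range;

// Hypothetical helper: number of UTF-8 code units (bytes) stored.
size_t codeUnits(string s) { return s.length; }

// Hypothetical helper: number of code points seen through the range API.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    // "säд": 's' takes 1 byte, 'ä' and 'д' take 2 bytes each in UTF-8.
    string s = "s\u00E4\u0434";

    assert(codeUnits(s) == 5);   // the array holds 5 chars
    assert(codePoints(s) == 3);  // the range yields 3 code points

    // Indexing returns a single code unit, not a character.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s[1] == 0xC3);        // first byte of the two-byte sequence for 'ä'

    // front decodes, so its type is dchar rather than char.
    auto first = s.front;
    static assert(is(typeof(first) == dchar));
    assert(first == 's');
}
```

Indexing and .length work on bytes, while front and walkLength 
decode, which is exactly the split this thread is arguing about.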

> Operations on code units are rare, which is why the standard 
> library instead treats strings as ranges of code points, for 
> correctness by default. However, we must not prevent the user 
> from being able to work on arrays of code units, as many string 
> algorithms can be optimized by not doing full UTF decoding. The 
> standard library does this on many occasions, and there are 
> more to come.

This is an attempt to present a problematic design as a wise decision.

> Note that the Unicode definition of an unqualified "character" 
> is the translation of a code *point*, which is very different 
> from a *glyph*, which is what people generally associate the 
> word "character" with. Thus, `string` is not an array of 
> characters (i.e. an array where each element is a character), 
> but `dstring` can be said to be.
>
> [1] http://dlang.org/type

By the way, the link you provide says that char is an unsigned 
8-bit type which can hold the value of a UTF-8 code unit.

UTF is irrelevant because the problem is in the D implementation. 
See 
http://forum.dlang.org/thread/hoopiiobddbapybbwwoc@forum.dlang.org 
(in particular the 2nd page).

The root of the issue is that D does not provide a 'utf' type 
which would correctly handle strings and characters irrespective 
of the encoding. Instead, the language pretends to support such a 
type by allowing both literals "sad" and "säд" to be converted to 
a single-byte char array. And ['s', 'ä', 'д'] is, by the way, 
neither char[] nor wchar[], nor even dchar[], but a sequence of 
integers, which compounds the oddities in the character types.
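
As a rough illustration of the literal conversion mentioned above, 
the sketch below assigns the same literal to string, wstring and 
dstring and compares the resulting lengths. The helpers utf16Units 
and utf32Units are hypothetical names; std.conv.to and walkLength 
are existing Phobos functions. The literal again uses \u escapes 
(assumed precomposed ä and д) to stay independent of the source 
file encoding.

```d
import std.conv : to;
import std.range;

// Hypothetical helper: code units needed after transcoding to UTF-16.
size_t utf16Units(string s) { return s.to!wstring.length; }

// Hypothetical helper: code units needed after transcoding to UTF-32.
size_t utf32Units(string s) { return s.to!dstring.length; }

void main()
{
    // An unsuffixed literal converts implicitly to all three string
    // types; the compiler picks the encoding from the declared type.
    string  a = "s\u00E4\u0434";
    wstring b = "s\u00E4\u0434";
    dstring c = "s\u00E4\u0434";

    assert(a.length == 5);  // UTF-8:  1 + 2 + 2 code units
    assert(b.length == 3);  // UTF-16: one code unit per character here
    assert(c.length == 3);  // UTF-32: always one code unit per code point

    // Seen as ranges, all three yield the same number of code points.
    assert(a.walkLength == 3);
    assert(b.walkLength == 3);
    assert(c.walkLength == 3);

    // Explicit transcoding at runtime gives the same counts.
    assert(utf16Units(a) == 3);
    assert(utf32Units(a) == 3);
}
```

Only the dstring has a .length that matches the number of code 
points for every input; for the other two it depends on the text.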

The problems with the string type can be illustrated by a 
hypothetical situation in the domain of numeric types. Assume that 
the user wants a 'number' type which accepts integers, floats and 
doubles alike and treats them properly. This would require either 
a library solution or a new special type in the language, 
supported by both the compiler and the runtime library, which 
performs operations at runtime on objects of the number type 
according to their effective type.

The D designers want to support such a feature (to make the 
language better), but, as happens in other situations, the support 
is only limited: the compiler allows you to write

alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l];

but there is no support in the runtime library. On the surface, it 
looks like the language has a type which supports the wanted 
feature, but if you use it you will run into problems: 
my_number[2] would give a strange integer instead of the float 
3.14, and the length of this array is 4 instead of 3. In addition, 
this is not a type-safe approach, because it makes it possible to 
reinterpret a double as two integers.

Now, in order to fix this, there are a number of functions in the 
library which treat the number type properly. Such practice 
(limited and broken language support, an unsafe and illogical 
type, functions to correct design mistakes) is so deeply embedded 
that anyone who points out this problem in the newsgroup is 
treated as a fool and sent to study the IEEE 754 standard.

