UTF-8 issues

Mon Sep 15 13:41:48 PDT 2008

On Mon, Sep 15, 2008 at 2:38 PM, Chris R. Miller
<lordsauronthegreat at gmail.com> wrote:
> Eldar Insafutdinov wrote:
>> I faced some issues with utf-8 support in D.
>> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, and indexing and slicing don't work either.
>> But foreach works correctly, so UTF-8 support is partial. Maybe there are functions in the standard library that do this work? I checked the new D2 features and found no improvements to UTF-8 support. Am I wrong?
>
> IIRC a char array in D will compress itself for ASCII-encodable
> characters, which destroys the integrity of the length variable.  Well,
> it's still valid in terms of how long in words the array is, but in
> terms of real characters it's no longer valid.

It's called UTF-8, and it's supposed to work like that: a char[] holds
UTF-8 code units, and its length counts those, not characters.  That D
does not provide some kind of interface for dealing with multibyte
encodings (other than foreach and the encode/decode functions in
std.utf) is a failing on its part, not Unicode's.
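
To make the length/foreach/decode difference concrete, here is a
minimal sketch.  It assumes a Phobos where std.utf.decode takes the
string plus a ref index, and a source file saved as UTF-8; the sample
string is only an illustration.

```d
import std.stdio;
import std.utf : decode;

void main()
{
    string s = "привет";   // six Cyrillic letters, two UTF-8 bytes each

    // .length counts char elements (UTF-8 code units), not characters.
    assert(s.length == 12);

    // foreach with a dchar loop variable decodes one code point
    // per iteration.
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    assert(n == 6);

    // decode() does the same by hand: it returns the code point that
    // starts at index i and advances i past its code units.
    size_t i = 0;
    dchar first = decode(s, i);
    assert(first == 'п' && i == 2);

    // Slicing is by code unit too, so only cut at an index that
    // decode() reported as a boundary.
    assert(s[0 .. i] == "п");

    writeln(n, " code points in ", s.length, " code units");
}
```

So indexing and slicing are not broken as such; they just address code
units, and it is up to you to land on a code point boundary.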

(Though it could be argued that multibyte encodings are stupid as
hell, and I would agree with that.)

> If you used a wchar or dchar things would be different.
>

If he used dchar it'd be different.  wchar still has multi-element
encodings (surrogate pairs) for codepoints outside the BMP.  Those are
admittedly not that common, but they can still happen.