Inconsistency

anonymous anonymous at example.com
Sun Oct 13 10:33:38 PDT 2013


On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
>> However, it could also yield the first code unit of the umlaut 
>> diacritic, depending on how the string is represented.
>
> This is not true for UTF-8, which is not subject to "endianism".

This is not about endianness. It's "\u00E4" vs "a\u0308". The 
first is the single precomposed code point 'ä'; the second is two 
code points, 'a' followed by the combining diaeresis U+0308 (the 
umlaut dots). Both render as the same character.
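
To make the difference concrete, here is a minimal sketch that counts code units and code points for both spellings. The helper names codeUnits and codePoints are made up for this example; walkLength comes from std.range and decodes the UTF-8 as it iterates.

```d
import std.range : walkLength;

/// Number of UTF-8 code units (bytes) in s. Hypothetical helper.
size_t codeUnits(string s) { return s.length; }

/// Number of code points in s. Hypothetical helper; walkLength
/// decodes a narrow string to dchar while walking it.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string precomposed = "\u00E4";  // U+00E4, one code point
    string decomposed  = "a\u0308"; // U+0061 followed by U+0308

    assert(codeUnits(precomposed) == 2 && codePoints(precomposed) == 1);
    assert(codeUnits(decomposed)  == 3 && codePoints(decomposed)  == 2);

    // Indexing a string yields code units, so s[0] differs between
    // the two spellings of the same visible character.
    assert(precomposed[0] == 0xC3); // first byte of the UTF-8 sequence
    assert(decomposed[0] == 'a');

    // A plain comparison sees two different strings.
    assert(precomposed != decomposed);
}
```

So the result of s[0] depends on which of the two representations the string happens to use, and that is independent of byte order.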

[...]
> Well that's a point; on the other hand, D is constantly 
> creating and throwing away new strings, so this isn't quite an 
> argument. The current solution puts the programmer in charge of 
> dealing with UTF-x, where a more consistent implementation 
> would put the burden on the implementors of the libraries/core, 
> i.e. the ones who usually have a better understanding of 
> Unicode than the average programmer.
>
> Also, implementing such semantics would not per se abandon 
> byte-wise access, would it?
>
> So, how do you guys handle UTF-8 strings in D? What are your 
> solutions to the problems described? Does it all come down to 
> converting "string"s and "wstring"s to "dstring"s, manipulating 
> them, and re-converting them to "string"s? Btw, what would this 
> mean in terms of speed?
>
> There is no irony in my questions. I'm really looking for 
> solutions...

I think std.uni and std.utf are supposed to supply everything 
Unicode-related.
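
As a rough sketch of what that can look like without converting to dstring first: std.uni provides normalize and byGrapheme (assuming a Phobos version recent enough to ship them), and foreach with a dchar loop variable decodes a string on the fly. The helpers graphemes and sameText are names invented for this example, not library functions.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Number of user-perceived characters (grapheme clusters) in s.
/// Hypothetical helper built on std.uni.byGrapheme.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

/// True if a and b are equal once both are normalized to NFC.
/// Hypothetical helper built on std.uni.normalize.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E4";  // U+00E4
    string decomposed  = "a\u0308"; // U+0061 followed by U+0308

    // Both spellings are a single grapheme cluster.
    assert(graphemes(precomposed) == 1);
    assert(graphemes(decomposed) == 1);

    // Unequal as raw strings, equal after normalization.
    assert(precomposed != decomposed);
    assert(sameText(precomposed, decomposed));

    // Iterating by dchar decodes in place; no dstring copy is made.
    size_t points;
    foreach (dchar c; decomposed)
        ++points;
    assert(points == 2);
}
```

Normalizing allocates only when the input is not already in the requested form, as far as I know, so the cost mostly shows up where the data actually needs it.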

