char and string with umlauts
Jonathan M Davis
jmdavisProg at gmx.com
Thu Oct 20 10:19:12 PDT 2011
On Thursday, October 20, 2011 09:48 Jim Danley wrote:
> I have been a programmer for many years and started using D about one year
> back. Suddenly, I find myself in unfamiliar territory. I need to use
> Finnish umlauts in chars and strings, but they are not part of my usual
> American ASCII character set.
>
> Can anyone point me in the right direction? I am getting "Invalid UTF-8
> sequence" errors.
I'd have to see code to really say much about what you're doing. But char is a
UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code unit.
For UTF-8 and UTF-16, it can take multiple code units to make a single code
point, and a code point is typically what you would consider to be a character
(it's actually possible for one code point to alter another - e.g. add an
accent or superscript to it - so a true character would be what is called a
grapheme, but for the most part, you don't need to worry about that; at the
moment, D doesn't do anything special to support graphemes). So, when you're
operating on characters in D, you want to operate on dchars, not chars or
wchars, because they're not necessarily complete characters. That's why range-
based functions treat all strings as ranges of dchar, even if they're arrays
of char or wchar (e.g. front returns a dchar, not a char or wchar). It's also
why when iterating over a string with foreach, you want to specify the
iteration type, e.g.

    foreach(dchar c; str)

not

    foreach(c; str)

since iterating over the individual code units really isn't what you want.
Basically, you pretty much never want to operate on an individual char or
wchar. Always make sure that you operate on dchars when operating on
individual characters.
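
To make the difference between code units and code points concrete, here is a minimal sketch. The sample word and the helper names countCodeUnits and countCodePoints are made up for illustration; the umlauts are written as \u escapes so the result does not depend on how the source file is saved.

```d
import std.range : front, walkLength;
import std.stdio : writeln;

// Hypothetical helper: foreach infers the element type, which is char
// for a string, so this counts UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode the string,
// so this counts code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // "päivää", with each 'ä' written as the escape \u00E4.
    string s = "p\u00E4iv\u00E4\u00E4";

    writeln(s.length);           // 9: each 'ä' takes two UTF-8 code units
    writeln(s.walkLength);       // 6: the range view counts code points
    writeln(countCodeUnits(s));  // 9
    writeln(countCodePoints(s)); // 6

    // Range primitives such as front decode to dchar, not char.
    dchar first = s.front;
    writeln(first);              // p
}
```

Note that s.length and plain indexing such as s[1] still work on code units, so slicing in the middle of a multi-unit character leaves you with an invalid sequence.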
- Jonathan M Davis