char and string with umlauts

Jonathan M Davis jmdavisProg at gmx.com
Thu Oct 20 10:19:12 PDT 2011


On Thursday, October 20, 2011 09:48 Jim Danley wrote:
> I have been a programmer for many years and started using D about one year
> back. Suddenly, I find myself in unfamiliar territory. I need to use
> Finnish umlauts in chars and strings, but they are not part of my usual
> American ASCII character set.
> 
> Can anyone point me in the right direction? I am getting "Invalid UTF-8
> sequence" errors.

I'd have to see code to really say much about what you're doing. But char is a 
UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code unit. 
For UTF-8 and UTF-16, it can take multiple code units to make a single code 
point (an 'ä', for instance, takes two UTF-8 code units), and a code point is 
typically what you would consider to be a character. It's actually possible 
for one code point to alter another, e.g. by adding an accent to it, so a true 
character would be what is called a grapheme, but for the most part, you don't 
need to worry about that; at the moment, D doesn't do anything special to 
support graphemes. So, when you're operating on characters in D, you want to 
operate on dchars, not chars or wchars, because a single char or wchar is not 
necessarily a complete character. That's why range-based functions treat all 
strings as ranges of dchar, even if they're arrays of char or wchar (e.g. 
front returns a dchar, not a char or wchar). It's also 
why when iterating over a string with foreach, you want to specify the 
iteration type. e.g.

foreach(dchar c; str)

not

foreach(c; str)

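since iterating over the individual code units really isn't what you want. 
With the type left off, c is inferred as char for a string, so an 'ä' shows up 
as two separate code units rather than one character. Basically, you pretty 
much never want to operate on an individual char or wchar. Always make sure 
that you operate on dchars when operating on individual characters.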
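
To make the difference concrete, here is a minimal sketch of the two foreach 
forms side by side. The function names and the sample word are made up for 
illustration, and it assumes the source file is saved as UTF-8:

```d
import std.range : front;
import std.stdio : writeln;

// Counts UTF-8 code units: with no type given, foreach infers char.
size_t countCodeUnits(string str)
{
    size_t n = 0;
    foreach (c; str)        // c is a char, i.e. one code unit
        ++n;
    return n;
}

// Counts code points: asking for dchar makes foreach decode the UTF-8.
size_t countCodePoints(string str)
{
    size_t n = 0;
    foreach (dchar c; str)  // c is a complete code point
        ++n;
    return n;
}

void main()
{
    // Hypothetical sample word; the file itself must be saved as UTF-8.
    string str = "päivä";

    writeln(countCodeUnits(str));   // 7: each 'ä' takes two code units
    writeln(countCodePoints(str));  // 5
    writeln(str.length);            // 7: length counts code units as well

    // front on a string decodes and returns a dchar, not a char.
    dchar first = str.front;
    writeln(first);                 // p
}
```

Note that str.length agrees with the code unit count, not the code point 
count, so it shouldn't be used as "the number of characters" once umlauts are 
involved.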

- Jonathan M Davis

