std.string and unicode
Stewart Gordon
smjg_1998 at yahoo.com
Sun Dec 17 07:17:21 PST 2006
Jari-Matti Mäkelä wrote:
<snip>
> Ok, sorry - didn't mean to be rude. I just thought it might not be
> realistic to expect them all to be Unicode aware in the near future.
>
> Many of them were implemented to be pretty limited to ASCII last time I
> checked. And they haven't changed a lot in the last few years. I don't
> know, the 1.0 stable will be out on Jan 1.
At this rate, I think calling it stable is wishful thinking.
<snip>
>> But functions like find() (substring form) and replace(), for instance,
>> shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
>> anything that doesn't deal with properties of individual characters
>> should probably be fine.
>
> Unless the functions must be aware of different forms of the same
> characters. The Unicode gurus can shed more light into this. I recall
> there has been previous discussion of this also.
Eventually, we should probably have two versions of the find functions:
one that matches codepoint for codepoint and therefore byte for byte,
and one that matches character for character. See point 6 at
http://www.textpad.info/forum/viewtopic.php?t=4778
However, it does appear that that post mixes up terms a bit - I think by
"character" it means "codepoint", and by "glyph" it means "character".
But the question is: which version should be called find, etc., and which
should be called something else?
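
To make the distinction concrete, here is a minimal sketch in Python rather
than D, chosen only because its standard library exposes Unicode
normalization. The function names find_codepoints and find_canonical are
hypothetical, not std.string functions, and NFC normalization is used as a
stand-in for "character for character" matching. A fuller version would
also have to respect grapheme cluster boundaries.

```python
import unicodedata


def find_codepoints(haystack: str, needle: str) -> int:
    """Match code point for code point. Returns a code point index or -1."""
    return haystack.find(needle)


def find_canonical(haystack: str, needle: str) -> int:
    """Match after NFC normalization, so canonically equivalent
    spellings compare equal. The index refers to the normalized haystack."""
    norm_haystack = unicodedata.normalize("NFC", haystack)
    norm_needle = unicodedata.normalize("NFC", needle)
    return norm_haystack.find(norm_needle)


precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent
needle = "\u00e9"

# Code point matching agrees with byte matching on the UTF-8 encoding:
# both find the precomposed form and both miss the decomposed form.
cp_hit = find_codepoints(precomposed, needle)
byte_hit = precomposed.encode("utf-8").find(needle.encode("utf-8"))
cp_miss = find_codepoints(decomposed, needle)
byte_miss = decomposed.encode("utf-8").find(needle.encode("utf-8"))

# Matching on normalized text finds the needle in either spelling.
canon_hit = find_canonical(decomposed, needle)
```

The two searches only disagree when the text contains different spellings
of the same character, which is the case the second kind of find would
exist to handle.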
<snip>
> Also tolower & toupper have problems with characters like Ә (04D8) and Һ
> (04BA). Ok, I'm not 100% sure they are supposed to have upper and lower
> case counterparts. Then there are some characters like Σ (03A3) that are
> also used as symbols in e.g. mathematics and ordinary letters in some
> languages.
<snip>
U+03A3 is the Greek letter and U+2211 is the mathematical symbol.
Similarly, U+03A0 is the Greek letter pi and U+220F is the product sign.
Here they are:
Σ ∑
Π ∏
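
As a quick check of the above, the following sketch looks the code points
up in the Unicode character database. It is written in Python, not D, and
the helper name describe is hypothetical. It also covers the two Cyrillic
letters mentioned in the quoted text, which do have lowercase counterparts
in the database.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (Unicode name, general category, lowercase form) for one code point."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.lower())


# Greek letters, their mathematical lookalikes, and the two Cyrillic capitals.
for ch in "\u03a3\u2211\u03a0\u220f\u04d8\u04ba":
    name, category, lower = describe(ch)
    print(f"U+{ord(ch):04X} {category} {name} lower=U+{ord(lower):04X}")
```

The letters report category Lu (uppercase letter) and map to a distinct
lowercase code point, while the operators report Sm (math symbol) and are
unchanged by lowercasing.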
Stewart.