std.string and unicode

Sun Dec 17 07:17:21 PST 2006

Jari-Matti Mäkelä wrote:
<snip>
> Ok, sorry - didn't mean to be rude. I just thought it might not be
> realistic to expect them all to be Unicode aware in the near future.
> 
> Many of them were implemented to be pretty limited to ASCII last time I
> checked. And they haven't changed a lot in the last few years. I don't
> know, the 1.0 stable will be out on Jan 1.

At this rate, I think calling it stable is wishful thinking.

<snip>
>> But functions like find() (substring form) and replace(), for instance,
>> shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
>> anything that doesn't deal with properties of individual characters
>> should probably be fine.
> 
> Unless the functions must be aware of different forms of the same
> characters. The Unicode gurus can shed more light into this. I recall
> there has been previous discussion of this also.

Eventually, we should probably have two versions of the find functions: 
one that matches codepoint for codepoint and therefore byte for byte, 
and one that matches character for character.  See point 6 at

http://www.textpad.info/forum/viewtopic.php?t=4778

However, that post does appear to mix up its terms a bit - I think by 
"character" it means "codepoint", and by "glyph" it means "character". 
But the question remains: which version should be called find, etc., 
and which should be called something else?
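
To make the distinction concrete, here is a minimal sketch of the two 
kinds of find. It is written in Python rather than D purely for 
illustration, and the names find_codepoints and find_characters are 
hypothetical, not anything in std.string. The "character for character" 
version is approximated by normalizing both strings to NFC first, which 
is only one possible way to do it.

```python
import unicodedata


def find_codepoints(haystack: str, needle: str) -> int:
    """Match codepoint for codepoint; no knowledge of equivalent forms."""
    return haystack.find(needle)


def find_characters(haystack: str, needle: str) -> int:
    """Match character for character by comparing NFC-normalized forms.

    The returned index refers to the normalized haystack, not the original.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    return h.find(n)


decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"      # U+00E9, the single-codepoint form of the same character

print(find_codepoints(decomposed, precomposed))  # the two forms share no codepoints
print(find_characters(decomposed, precomposed))  # equal once both are normalized
print(find_codepoints(decomposed, "e"))          # matches the bare base letter
print(find_characters(decomposed, "e"))          # the accented letter is not a plain 'e'
```

Note that the two versions disagree in both directions: the codepoint 
version misses an equivalent form, and it also matches a bare "e" inside 
what a reader would consider a different character.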

<snip>
> Also tolower & toupper have problems with characters like Ә (04D8) and Һ
> (04BA). Ok, I'm not 100% sure they are supposed to have upper and lower
> case counterparts. Then there are some characters like Σ (03A3) that are
> also used as symbols in e.g. mathematics and ordinary letters in some
> languages.
<snip>

U+03A3 is the Greek letter and U+2211 is the mathematical summation 
sign.  Similarly, U+03A0 is the Greek letter pi and U+220F is the 
product sign.

Here they are:
Σ ∑
Π ∏
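
For anyone who wants to check this against the Unicode character 
database, a small sketch follows. Again it is Python rather than D, just 
because its standard library exposes the database directly; 
has_case_mapping is a hypothetical helper made up for this example. It 
also covers the two Cyrillic letters mentioned above.

```python
import unicodedata


def has_case_mapping(ch: str) -> bool:
    """True if the codepoint changes under lowercasing or uppercasing."""
    return ch.lower() != ch or ch.upper() != ch


# Greek letters, their mathematical lookalikes, and the two Cyrillic letters.
for ch in "\u03a3\u2211\u03a0\u220f\u04d8\u04ba":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "cased" if has_case_mapping(ch) else "uncased")
```

The letters have lowercase counterparts and the mathematical operators 
do not, so a Unicode-aware tolower can treat them differently without 
any guesswork.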

Stewart.