std.string and unicode

Sun Dec 17 08:17:35 PST 2006

Stewart Gordon wrote:
> Jari-Matti Mäkelä wrote:
>>> But functions like find() (substring form) and replace(), for instance,
>>> shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
>>> anything that doesn't deal with properties of individual characters
>>> should probably be fine.
>>
>> Unless the functions must be aware of different forms of the same
>> characters. The Unicode gurus can shed more light into this. I recall
>> there has been previous discussion of this also.
> 
> Eventually, we should probably have two versions of the find functions:
> one that matches codepoint for codepoint and therefore byte for byte,
> and one that matches character for character.  See point 6 at
> 
> http://www.textpad.info/forum/viewtopic.php?t=4778
> 
> However, it does appear that that post mixes up terms a bit - I think by
> "character" it means "codepoint", and by "glyph" it means "character".
> But the question, which version should be called find, etc. as opposed
> to something else?
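
To make the two kinds of find concrete, here is a rough sketch. It is in
Python rather than D, only because Python's standard unicodedata module
makes the normalization step short. The names find_codepoints and
find_characters are made up for this example, and normalizing both sides
to NFC is just one possible way to approximate "character for character"
matching, not a full treatment of combining sequences.

```python
import unicodedata

def find_codepoints(haystack: str, needle: str) -> int:
    """Match code point for code point (a plain substring search)."""
    return haystack.find(needle)

def find_characters(haystack: str, needle: str) -> int:
    """Approximate character-for-character matching (assumption: NFC is enough).

    Both sides are normalized to NFC first, so the returned index refers
    to the normalized haystack, not to the original one.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    return h.find(n)

# The same name written in two different forms.
composed = "M\u00e4kel\u00e4"        # a-umlaut as one code point (U+00E4)
decomposed = "Ma\u0308kela\u0308"    # 'a' followed by combining diaeresis (U+0308)

print(find_codepoints(composed, "\u00e4"))    # 1
print(find_codepoints(decomposed, "\u00e4"))  # -1, the code points differ
print(find_characters(decomposed, "\u00e4"))  # 1, same character after NFC
```

The code point version is the cheap one, and it is all that is needed
when both strings are known to be in the same form. The character
version has to pay for normalization on every call.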

I guess it's bad practice to use any Unicode-unaware algorithms today,
but sometimes it's just a lot easier to do it the old-fashioned way. And
of course, a bit faster too. D can be abused to use 8-bit ISO 8859-x code
pages, which is what some people still want. The Linux people are pretty
much comfortable with Unicode, but on Windows things are worse. I
haven't invested much time in the Windows side of things, but I think
there are some libraries to work around this now.

One friend of mine also had some problems with Unicode a while back. He
wanted a Unicode-aware file stream in the standard library. The standard
file stream just writes whatever bytes it is given to the file, so it's
pretty easy to mix up different UTF encodings on the same stream.
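
The problem is easy to reproduce. The sketch below is again Python, used
purely for illustration. EncodedWriter and is_valid_utf8 are hypothetical
names invented here, not part of any library. It shows a raw byte stream
accepting two encodings in a row, and a thin wrapper that takes only text
and always encodes it the same way.

```python
import io

class EncodedWriter:
    """Hypothetical wrapper: accepts text only and encodes it one way."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding

    def write(self, text: str) -> None:
        # Refuse pre-encoded bytes so that encodings cannot be mixed.
        if not isinstance(text, str):
            raise TypeError("expected text, not raw bytes")
        self.raw.write(text.encode(self.encoding))

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A raw stream writes whatever bytes it is handed.
raw = io.BytesIO()
raw.write("\u00e4".encode("utf-8"))      # bytes C3 A4
raw.write("\u00e4".encode("utf-16-le"))  # bytes E4 00
mixed = raw.getvalue()

# The wrapper keeps the stream in a single encoding.
safe = io.BytesIO()
writer = EncodedWriter(safe)
writer.write("\u00e4")
writer.write("\u00e4")
clean = safe.getvalue()

print(is_valid_utf8(mixed))  # False
print(is_valid_utf8(clean))  # True
```

The point of the wrapper is only that the encoding is chosen once, when
the stream is opened, instead of at every write.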

Currently D does not try to be very Unicode friendly. I really don't
know how it should be. There are several different opinions and
implementations, and this has been discussed before. To me it just seems
that the current state of D is a weird mixture of several approaches.