First Impressions
Georg Wrede
georg.wrede at nospam.org
Fri Sep 29 18:18:10 PDT 2006
Derek Parnell wrote:
> I'm pretty sure that the phobos routines for search and replace only work
> for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
> always fail to deliver the correct result. It finds the first occurrence of
> the byte value for the letter 'a' which may well be inside a Japanese
> character. It looks for byte-subsets rather than character sub-sets.
I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61)
may be found within a Japanese multibyte character? Or somewhere in a
very long Japanese text?
That is not correct.
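The designers of UTF-8 knew that this would be dangerous, and created
UTF-8 so that such _will_not_happen_. Ever. Every byte of a multibyte
sequence has its high bit set (0x80 or above), so a byte like 0x61 can
only ever be the letter 'a' itself.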
Therefore, something like std.string.find() doesn't even have to know
about it.
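To make the point concrete, here is a minimal sketch in Python (not D, and not the Phobos implementation). The sample string and the helper name ascii_bytes_in are made up for illustration. It checks that a purely Japanese string, once encoded as UTF-8, contains no bytes in the ASCII range at all, so a byte-level search for 0x61 has nothing to hit.

```python
# Illustrative sample text; not taken from the original thread.
japanese = "日本語のテキスト"
encoded = japanese.encode("utf-8")


def ascii_bytes_in(data):
    """Return the offsets of every byte below 0x80 (the ASCII range)."""
    return [i for i, b in enumerate(data) if b < 0x80]


# Every byte of a UTF-8 multibyte sequence is 0x80 or above,
# so this list is empty for text with no ASCII characters in it.
print(ascii_bytes_in(encoded))

# A plain byte-level search for 'a' (0x61) therefore finds nothing.
print(encoded.find(b"a"))
```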
Basically, std.string.find() and comparable functions only have to
receive two octet sequences and see where one of them first occurs in
the other. No need to be aware of UTF or ASCII. For all we know, the
strings may even be in EBCDIC. Still works.
If the strings themselves are valid UTF-8 (or both are in the same
single-byte encoding), then the result will also be valid: a match can
only begin on a character boundary.
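As a rough sketch of what "the result will also be valid" means for UTF-8, the Python snippet below searches raw bytes and then checks that the byte offset it gets back sits on a character boundary. The function name byte_find and the sample text are hypothetical, chosen only for this illustration; this is not how std.string.find() is written.

```python
def byte_find(haystack, needle):
    """Search the UTF-8 bytes of haystack for the UTF-8 bytes of needle.

    Returns a byte offset, or -1 if the needle does not occur.
    The search itself knows nothing about UTF-8.
    """
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


# Illustrative mixed text: three 3-byte characters, then ASCII.
text = "日本語 and a bit of ASCII"

offset = byte_find(text, "a")

# If the offset is a character boundary, the bytes before it decode cleanly.
prefix = text.encode("utf-8")[:offset].decode("utf-8")

# The number of characters in the prefix equals the character-level index.
print(offset, len(prefix), text.find("a"))
```

The byte offset and the character index differ (the Japanese characters take three bytes each), but the byte offset never lands inside a character.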
((For the sake of completeness: here I've restricted the discussion to
the versions of such functions that accept ubyte[] compatible input
(obviously including char[]). For those taking 16- or 32-bit units, and
especially if we deliberately feed input of the wrong width to any of
these, the results will of course be more complicated.))
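A small Python sketch of the "wrong width" case, under the assumption that someone feeds UTF-16 data to a byte-oriented search. The character U+3061 is picked purely because its UTF-16 code unit happens to contain the byte 0x61; the variable names are illustrative.

```python
# U+3061 is a hiragana letter; its 16-bit code unit is 0x3061.
text = "\u3061"

utf16 = text.encode("utf-16-le")  # little-endian: low byte 0x61 comes first
utf8 = text.encode("utf-8")       # three bytes, all 0x80 or above

# Searching the UTF-16 bytes one octet at a time gives a false match for 'a'.
false_hit = utf16.find(b"a")

# The same search over the UTF-8 bytes finds nothing, as it should.
no_hit = utf8.find(b"a")

print(false_hit, no_hit)
```

So the guarantee holds for UTF-8 octets searched as octets. It does not carry over to 16-bit data treated as a byte sequence.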