First Impressions

Fri Sep 29 18:18:10 PDT 2006

Derek Parnell wrote:
> I'm pretty sure that the phobos routines for search and replace only work
> for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
> always fail to deliver the correct result. It finds the first occurrence of
> the byte value for the letter 'a' which may well be inside a Japanese
> character. It looks for byte-subsets rather than character sub-sets.

I take it you mean that the bit pattern, or byte, 'a' (as in 0x61)
may be found within a Japanese multibyte glyph? Or somewhere inside a
very long Japanese text?

That is not correct.

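The designers of UTF-8 knew that this would be dangerous, and created
UTF-8 so that such _will_not_happen_. Ever. Every byte of a multibyte
UTF-8 sequence has its high bit set (0x80 to 0xFF), so a 7-bit ASCII
byte such as 0x61 can never appear inside one.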

Therefore, something like std.string.find() doesn't even have to know 
about it.
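
A quick way to check this for a given text is sketched below. It is in
Python rather than D, purely for illustration, and the sample string is
an arbitrary choice of mine, not anything from the original report.

```python
# Sketch: inspect the UTF-8 bytes of some Japanese text and confirm that
# none of them falls in the ASCII range, so the byte 0x61 ('a') cannot
# be hiding inside a multibyte character.
japanese = "日本語のテキスト"  # illustrative sample, eight characters

encoded = japanese.encode("utf-8")

# Each of these characters encodes to three bytes, all >= 0x80.
all_high = all(b >= 0x80 for b in encoded)

# A byte-level membership test for 'a' therefore finds nothing.
contains_a = b"a" in encoded

print(len(encoded), all_high, contains_a)
```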

Basically, std.string.find() and comparable functions only have to
receive two octet sequences and see where one of them first occurs in
the other. No need to be aware of UTF or ASCII. For all we know, the
strings may even be in EBCDIC. Still works.

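If the strings themselves are valid UTF-8 (or are in any single-byte
encoding), then the result will also be valid. Legacy multibyte
encodings such as Shift-JIS are a different matter, since their trail
bytes can fall in the ASCII range.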
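
To illustrate, here is a minimal sketch of an encoding-agnostic search.
The helper find_bytes is hypothetical (it is not the Phobos
implementation, and it is written in Python rather than D); it compares
raw bytes only and still lands on character boundaries when both inputs
are valid UTF-8.

```python
def find_bytes(haystack: bytes, needle: bytes) -> int:
    """Naive byte-wise search: index of the first match, or -1."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1


# Mixed Japanese and ASCII text, encoded as UTF-8 (illustrative sample).
text = "日本語 a テキスト".encode("utf-8")

pos_a = find_bytes(text, b"a")                   # ASCII needle
pos_te = find_bytes(text, "テ".encode("utf-8"))  # multibyte needle

# Both offsets fall on character boundaries, so the prefix before each
# hit decodes cleanly as UTF-8.
prefix_a = text[:pos_a].decode("utf-8")
prefix_te = text[:pos_te].decode("utf-8")

print(pos_a, pos_te, repr(prefix_a), repr(prefix_te))
```

The offsets returned are byte offsets, not character counts, which is
all a caller needs in order to slice the original array.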

((For the sake of completeness: here I've restricted the discussion to
the versions of such functions that accept ubyte[] compatible input
(obviously including char[]). With the versions taking 16- or 32-bit
units, and especially if we deliberately feed input of the wrong width
to any of these, the results will of course be more complicated.))
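
As a sketch of what "more complicated" can look like (again in Python,
with a character picked by me for the purpose): the hiragana letter
U+3061 stored as little-endian UTF-16 consists of the bytes 0x61 0x30,
so a byte-wide search for 'a' reports a hit, while a search over 16-bit
units does not.

```python
# Sketch: feeding 16-bit data to a byte-wide search gives a false hit.
chi = "ち"  # U+3061, chosen because its low byte is 0x61

as_utf16 = chi.encode("utf-16-le")  # two bytes: 0x61 0x30

# Wrong width: treat the UTF-16 data as plain bytes and look for 'a'.
false_hit = as_utf16.find(b"a")

# Right width: split into 16-bit code units and look for U+0061.
units = [
    int.from_bytes(as_utf16[i:i + 2], "little")
    for i in range(0, len(as_utf16), 2)
]
unit_hit = 0x0061 in units

print(as_utf16, false_hit, units, unit_hit)
```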