string types: const(char)[] and cstring
Regan Heath
regan at netmail.co.nz
Tue May 29 17:39:33 PDT 2007
Frits van Bommel Wrote:
> Regan Heath wrote:
> > Aziz K. Wrote:
> >> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan at netmail.co.nz>
> >> wrote:
> >>> and the result will be a correctly reversed UTF8 string. Or am I
> >>> missing something?
> >>>
> >>> Regan Heath
> >> I think your method doesn't take compound characters into account.
> >>
> >> For example:
> >> // The accented é can be represented by a single code-point. But let's
> >> assume it's a compound character (Ce`a).
> >
> > Is it a compound character in UTF32?
>
> Unicode defines multiple valid encodings for lots of accented
> characters; typically a single codepoint as well as separate codepoints
> for the accent and the "naked" character that combine when put together.
I realise that. But the important question is: what does toUTF32 do with compound UTF8 characters (or UTF16, for that matter)?
> >> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
> >> // This would print áeC
> >
> > Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.
> >
> > My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.
>
> I don't think std.utf.toUTF* combine or split accented characters, I'm
> pretty sure it just does codepoint representation conversions (keeping
> the number of codepoints constant).
This is the key issue. I was under the (perhaps mistaken) impression that it converted them to the single-codepoint version (as that was easier), which is what I based this idea on. A simple test should tell us; can you whip one up to prove it one way or the other?
I would, but I don't really use Unicode at all and I don't know any compound characters offhand. I know, I know, I could google it, but I also get the impression you know a bit more about this and would be able to devise a better test case or two.
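For what it's worth, a test along the following lines might settle the toUTF32 question. It is only a sketch, and it assumes a current Phobos (std.utf.toUTF32 and toUTF8, plus std.algorithm.mutation.reverse); reverseCodePoints is a made-up helper name, not anything from the library. It uses U+0301 (COMBINING ACUTE ACCENT) after a plain 'e' as the compound form and U+00E9 as the single-codepoint form, written with \u escapes so nothing gets mangled. The asserts encode the expectation that the conversion only re-encodes codepoints and never combines them.

```d
import std.algorithm.mutation : reverse;
import std.stdio : writefln;
import std.utf : toUTF8, toUTF32;

// Hypothetical helper: reverse a UTF-8 string one codepoint at a time
// by going through UTF-32, as proposed earlier in the thread.
string reverseCodePoints(string s)
{
    dchar[] cps = toUTF32(s).dup; // mutable copy of the codepoints
    reverse(cps);                 // in-place reversal, one dchar per element
    return toUTF8(cps);
}

void main()
{
    string decomposed  = "Ce\u0301a"; // 'e' followed by U+0301
    string precomposed = "C\u00E9a";  // single codepoint U+00E9

    // If toUTF32 combined characters, both lengths would be 3.
    writefln("decomposed:  %s codepoints", toUTF32(decomposed).length);
    writefln("precomposed: %s codepoints", toUTF32(precomposed).length);
    assert(toUTF32(decomposed).length == 4);
    assert(toUTF32(precomposed).length == 3);

    // The accent ends up after the 'a' once the codepoints are reversed.
    assert(reverseCodePoints(decomposed) == "a\u0301eC");
    // The precomposed form reverses without moving the accent.
    assert(reverseCodePoints(precomposed) == "a\u00E9C");
}
```

If the asserts hold, the reversal is only safe for text that is already in single-codepoint form.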
Ahh.. another thought. I think I may have based my assumption on the foreach behaviour, e.g.
char[] text = "<compound stuff>";
foreach(dchar d; text) { .. }
this _has_ to give the single codepoint versions, right?
I suspect foreach uses the same code as in std.utf, but I may be wrong.
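The foreach case can be checked the same way. Again this is just a sketch under assumptions: countCodePoints is a made-up helper, and the test string uses 'e' followed by U+0301 (COMBINING ACUTE ACCENT) as the compound character. As far as I can tell, foreach with a dchar loop variable decodes UTF-8 into codepoints and does nothing more, so the asserts expect the accent to arrive as its own dchar.

```d
import std.stdio : writefln;

// Hypothetical helper: count the dchars that foreach yields
// while decoding a UTF-8 string.
size_t countCodePoints(const(char)[] text)
{
    size_t n = 0;
    foreach (dchar d; text)
        ++n;
    return n;
}

void main()
{
    const(char)[] decomposed  = "Ce\u0301a"; // 'e' followed by U+0301
    const(char)[] precomposed = "C\u00E9a";  // single codepoint U+00E9

    // Print each decoded codepoint of the compound form.
    foreach (dchar d; decomposed)
        writefln("U+%04X", cast(uint) d);

    // foreach decodes but does not combine: 4 codepoints versus 3.
    assert(countCodePoints(decomposed) == 4);
    assert(countCodePoints(precomposed) == 3);
}
```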
Regan Heath