string types: const(char)[] and cstring

Regan Heath regan at netmail.co.nz
Tue May 29 17:39:33 PDT 2007


Frits van Bommel Wrote:
> Regan Heath wrote:
> > Aziz K. Wrote:
> >> On Tue, 29 May 2007 20:41:31 +0200, Regan Heath <regan at netmail.co.nz>  
> >> wrote:
> >>> and the result will be a correctly reversed UTF8 string.  Or am I  
> >>> missing something?
> >>>
> >>> Regan Heath
> >> I think your method doesn't take compound characters into account.
> >>
> >> For example:
> >> // The accented é can be represented by a single code-point. But let's  
> >> assume it's a compound character (Ce`a).
> > 
> > Is it a compound character in UTF32?
> 
> Unicode defines multiple valid encodings for lots of accented 
> characters; typically a single codepoint as well as separate codepoints 
> for the accent and the "naked" character that combine when put together.

I realise that.  But the important question is: what does toUTF32 do with compound UTF8 characters (or UTF16, for that matter)?

> >> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
> >> // This would print áeC
> > 
> > Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it) I'd like to play with it.
> > 
> > My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.
> 
> I don't think std.utf.toUTF* combine or split accented characters, I'm 
> pretty sure it just does codepoint representation conversions (keeping 
> the number of codepoints constant).

This is the key issue.  I was under the (perhaps mistaken) impression that it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really, a simple test should tell us.  Can you whip one up to prove it one way or the other?

I would, but I don't really use Unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it, but I also get the impression you know a bit more about this and would be able to devise a better test case or two.
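
Going by Aziz's example, a first stab at such a test might look like the sketch below.  It assumes that 'e' followed by U+0301 (combining acute accent) is the compound form of é, and it is written against present-day Phobos, which may not match what you have installed.  codePointCount is just a name I made up for illustration.  If toUTF32 really keeps the number of codepoints constant, the two counts should differ; if it combines, they should match.

```d
import std.stdio : writefln;
import std.utf : toUTF32;

// Hypothetical helper: how many codepoints toUTF32 produces for s.
size_t codePointCount(string s)
{
    return toUTF32(s).length;
}

void main()
{
    // é as one precomposed codepoint (U+00E9).
    string single = "C\u00E9a";
    // é as 'e' followed by U+0301 COMBINING ACUTE ACCENT (assumed compound form).
    string compound = "Ce\u0301a";

    writefln("single:   %s codepoints", codePointCount(single));
    writefln("compound: %s codepoints", codePointCount(compound));
}
```

The \u escapes keep the web interface from mangling the accented characters.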

Ahh... another thought.  I think I may have based my assumption on the foreach behaviour, e.g.:

char[] text = "<compound stuff>";
foreach(dchar d; text) { .. }

this _has_ to give the single codepoint versions, right?

I suspect foreach uses the same code as in std.utf, but I may be wrong.
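
The foreach guess could be checked with a sketch along these lines.  Again it assumes 'e' plus U+0301 as the compound form of é and a current D compiler; foreachCount is a made-up name.  If foreach only decodes UTF8 into codepoints, the compound text should take one more iteration than the single codepoint text; if it hands back single codepoint versions, both should take the same number.

```d
import std.stdio : writefln;

// Hypothetical helper: how many dchars a foreach over text yields.
size_t foreachCount(const(char)[] text)
{
    size_t n = 0;
    foreach (dchar d; text)
        ++n;    // one iteration per dchar that foreach decodes
    return n;
}

void main()
{
    // Precomposed é (U+00E9) versus 'e' + U+0301 COMBINING ACUTE ACCENT.
    writefln("single:   %s iterations", foreachCount("C\u00E9a"));
    writefln("compound: %s iterations", foreachCount("Ce\u0301a"));
}
```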

Regan Heath


