string types: const(char)[] and cstring

Wed May 30 04:32:08 PDT 2007

Regan Heath wrote:
> Frits van Bommel Wrote:
>> Regan Heath wrote:
>>> Aziz K. Wrote:
>>>> writefln( toUTF8(toUTF32("Céa").reverse) ) // would reverse to a`eC
>>>> // This would print áeC
>>> Can you code that test up (using the \U character literal syntax so that the web interface doesn't mangle it)? I'd like to play with it.
>>>
>>> My statement was based on the assumption that converting UTF8 to UTF32 would result in all the compound characters being converted/represented by a single UTF32 codepoint each and would therefore be reversible.
>> I don't think std.utf.toUTF* combine or split accented characters, I'm 
>> pretty sure it just does codepoint representation conversions (keeping 
>> the number of codepoints constant).
> 
> This is the key issue.  I was under the (perhaps mistaken) impression it converted them to the single codepoint version (as that was easier), which is what I based this idea on.  Really a simple test should tell us, can you whip one up to prove it one way or the other?  

---
import std.stdio;
import std.utf;

void main(char[][] args) {
     // Codepoint 0301 is "Combining acute accent".
     // Codepoint 00e9 is "Latin small letter e with acute"
     char[] str = "e\u0301 \u00e9";

     // This doesn't show the combined character on my console.
     // Perhaps my terminal doesn't properly support combining characters.
     // (My encoding is utf-8, so that shouldn't be the problem)
     // The precomposed character (00e9) is displayed properly.
     // When piped to a .html file and wrapped with
     // <html><body>...</body></html> firefox properly displays both.
     writefln(str);
     foreach (dchar c; str) {
         writef("%04x ", c);
     }
     writefln();

     // This produces the exact same output as above code:
     dchar[] dstr = toUTF32(str);
     writefln(dstr);
     foreach (dchar c; dstr) {
         writef("%04x ", c);
     }
     writefln();
}
---
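
To tie this back to the reversal question, here is a minimal sketch of what reversing by codepoint does to a combining sequence. It assumes a present-day D2 compiler and Phobos, where std.uni.normalize exists (it was not available when this was posted), and reverseCodepoints is a made-up helper name. Composing to NFC first happens to fix this particular string, but not every base-plus-accent combination has a precomposed codepoint, so it is not a general solution.

```d
import std.algorithm.mutation : reverse;
import std.stdio : writef, writeln;
import std.uni : NFC, normalize;
import std.utf : toUTF32;

// Hypothetical helper: reverse a string one codepoint at a time.
dstring reverseCodepoints(string s)
{
    dchar[] buf = toUTF32(s).dup; // mutable copy, one dchar per codepoint
    reverse(buf);                 // in-place reversal of the codepoints
    return buf.idup;
}

// Print each codepoint as four hex digits.
void dump(dstring s)
{
    foreach (dchar c; s)
        writef("%04x ", cast(uint) c);
    writeln();
}

void main()
{
    // 'C', 'e', combining acute accent, 'a'
    string decomposed = "Ce\u0301a";

    // Plain codepoint reversal: the accent now follows 'a' instead of 'e'.
    dump(reverseCodepoints(decomposed));

    // Compose first (assumption: NFC has a precomposed form, true for e-acute),
    // then reverse: the accented letter survives as one codepoint.
    dump(reverseCodepoints(normalize!NFC(decomposed)));
}
```

The first line of output lists the accent right after the 'a', which is the mangled result described earlier in the thread; the second keeps 00e9 intact.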

> I would, but I don't really use unicode at all and I don't know any compound characters offhand.  I know, I know, I could google it but I also get the impression you know a bit more about this and would be able to devise a better test case, or two.

I normally have little use for it as well. A few Dutch (my native 
tongue) words need accents, but I'll be damned if I know the codes. Let 
alone those of any combining characters. My usual way of typing those is 
either using the symbol map or just typing it without accents, 
right-click, select spell-check suggestion with accents :).
However, for the above test I just looked up the codes in the code charts 
on the Unicode website (unicode.org/charts for the precomposed character 
and the "symbols and punctuation" link at the top for the combining 
accent). It's pretty easy to find, actually.

> Ahh.. another thought.  I think I may have based my assumption on the foreach behaviour, eg.
> 
> char[] text = "<compound stuff>";
> foreach(dchar d; text) { .. }
> 
> this _has_ to give the single codepoint versions, right?

As demonstrated above, it doesn't. The runtime support for the 
converting foreach statements just imports std.utf and uses decode and 
toUTF*[1] (as well as some manual conversion to surrogates in the 
functions dealing with wchar). None of those do anything other than 
decode and encode single codepoints.

[1]: The apparently undocumented (buf, dchar) overloads, which don't 
allocate.
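
A rough sketch of the decoding loop described above, assuming a current D2 Phobos where std.utf.decode takes the string and an index by reference. The helper name codepoints is made up for illustration, and this only mirrors the observable behaviour, not the actual runtime source.

```d
import std.stdio : writef, writeln;
import std.utf : decode;

// Hypothetical helper: collect the codepoints of a UTF-8 string by calling
// std.utf.decode repeatedly. No composition or decomposition takes place.
dchar[] codepoints(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decodes one codepoint and advances i
    return result;
}

void main()
{
    string s = "e\u0301 \u00e9";

    // Gather what the converting foreach yields, for comparison.
    dchar[] viaForeach;
    foreach (dchar c; s)
        viaForeach ~= c;

    // Both approaches see the same four codepoints.
    assert(codepoints(s) == viaForeach);

    foreach (c; codepoints(s))
        writef("%04x ", cast(uint) c);
    writeln();
}
```

The combining sequence stays as two codepoints (0065 0301) and the precomposed character stays as one (00e9).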

> I suspect foreach uses the same code as in std.utf, but I may be wrong.

About this, you're not :P.

I suspect the reason std.utf doesn't do decomposition and/or combining 
is that it would require a lookup table, and possibly quite a big one at 
that. Generating it shouldn't be a problem, though; it can be extracted 
from the machine-readable data on the Unicode website. Just take 
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt: the fields are 
separated by semicolons, and the sixth field is the decomposition of the 
character in the first field. (It may also start with the mapping type 
between <angle brackets>.)
Note that for full decomposition this mapping needs to be applied 
recursively[2], i.e. the characters in the sixth field need to be 
decomposed as well (if possible).

[2]: See the reminder in 
http://www.unicode.org/Public/UNIDATA/UCD.html#Character_Decomposition_Mappings
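
A minimal sketch of building such a table and applying it recursively, assuming a current D2 compiler. The names sample, buildTable and decompose are made up, and the sample records are trimmed to the first six fields (real lines in UnicodeData.txt carry more). It skips tagged mappings and does not reorder combining marks, so it is not a complete normalization.

```d
import std.array : split;
import std.conv : to;
import std.stdio : writef, writeln;

// Sample records in the UnicodeData.txt layout, trimmed to six fields.
string[] sample = [
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020",
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301",
    "00EA;LATIN SMALL LETTER E WITH CIRCUMFLEX;Ll;0;L;0065 0302",
    "1EBF;LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00EA 0301",
];

// Build a one-level table: codepoint -> untagged decomposition mapping.
dchar[][dchar] buildTable(string[] lines)
{
    dchar[][dchar] table;
    foreach (line; lines)
    {
        auto fields = line.split(";");
        if (fields.length < 6 || fields[5].length == 0)
            continue; // no decomposition for this character
        if (fields[5][0] == '<')
            continue; // tagged mapping type: ignored in this sketch
        dchar[] mapping;
        foreach (hex; fields[5].split()) // split on whitespace
            mapping ~= cast(dchar) hex.to!uint(16);
        table[cast(dchar) fields[0].to!uint(16)] = mapping;
    }
    return table;
}

// Apply the mapping recursively until nothing decomposes further.
dchar[] decompose(dchar c, dchar[][dchar] table)
{
    auto p = c in table;
    if (p is null)
        return [c]; // not in the table: the character stands for itself
    dchar[] result;
    foreach (part; *p)
        result ~= decompose(part, table);
    return result;
}

void main()
{
    auto table = buildTable(sample);
    // 1EBF maps to 00EA 0301, and 00EA maps in turn to 0065 0302.
    foreach (c; decompose(0x1EBF, table))
        writef("%04x ", cast(uint) c);
    writeln();
}
```

With only these four records, 1EBF needs two rounds of lookup before it is fully decomposed, which is the recursion the reminder in [2] is about.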