std.string.reverse() for mutable array of chars

Fri Dec 9 02:41:35 PST 2011

On Friday, December 09, 2011 04:46:35 bearophile wrote:
> Reversing an array of chars/wchars is a common enough operation (mutable
> arrays often come from previous operations that have built them). Currently
> std.algorithm.reverse() can't be used:
> 
> 
> import std.algorithm;
> void main() {
>     dchar[] s1 = "hello"d.dup;
>     s1.reverse(); // OK
>     wchar[] s2 = "hello"w.dup;
>     s2.reverse(); // error
>     char[] s3 = "hello".dup;
>     s3.reverse(); // error
> }
> 
> 
> I suggest adding a char[]/wchar[] specialization to std.algorithm.reverse()
> (or adding a std.string.reverse()) to make it work on those types too.
> Generally, the std.algorithm functions don't work on UTF-8/UTF-16 because of
> the variable length of their items, but for this specific algorithm I think
> this is not a problem because:
> 
> 1) Reversing an array is an O(n) operation, and decoding UTF adds only a
> constant-factor overhead, so the computational complexity of reverse doesn't
> change.
> 
> 2) If you reverse a char[] or wchar[], the result will fit in the input
> array (is this always true? Please tell me if it isn't). It "just" needs to
> correctly swap the bytes of multi-byte chars, and to keep combining code
> points together with their base characters too.
> 
> - - - - - - - - - - - - - - - - - -
> 
> And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32):
> 
> 
> import std.algorithm: reverse;
> void main() {
>     dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
>     txt.reverse();
>     assert(txt == "\U00000042\U00000308\U00000041"d);
> }
> 
> 
> txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL
> LETTER B (see bug 7084 for more details).
> 
> A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL
> LETTER A, COMBINING DIAERESIS):
> 
> "\U00000042\U00000041\U00000308"d
> 
> 
> For some code, see:
> http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in-place
> 
> See also:
> http://d.puremagic.com/issues/show_bug.cgi?id=7085
> 
> Regarding the printing of unicode strings see also:
> http://d.puremagic.com/issues/show_bug.cgi?id=7084

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If
you want to reverse a wchar[], then cast it to ushort[] and reverse that. In
Phobos, strings are ranges of dchar, so reverse is going to reverse code
points. If you want it to reverse code units instead, then you just use the
appropriate cast. There's no reason to have it reverse the code units by
default and completely mess up Unicode strings.
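
To make the cast concrete, here is a minimal sketch. The helper name reverseCodeUnits is made up for illustration and is not part of Phobos; it only wraps the cast plus std.algorithm's reverse. The second half shows why doing this to non-ASCII text leaves you with invalid UTF-8.

```d
import std.algorithm : reverse;

// Hypothetical helper (not in Phobos): reverses the UTF-8 code units of s in place.
char[] reverseCodeUnits(char[] s)
{
    // The cast reinterprets the same memory as ubyte[]; nothing is copied.
    auto bytes = cast(ubyte[]) s;
    reverse(bytes);
    return s;
}

void main()
{
    // ASCII only: one code unit per character, so the result is still valid UTF-8.
    assert(reverseCodeUnits("hello".dup) == "olleh");

    // U+00E9 is encoded as the two code units C3 A9. Reversing them gives
    // A9 C3, which is no longer a valid UTF-8 sequence.
    char[] e = "\u00E9".dup;
    reverseCodeUnits(e);
    ubyte[] expected = [0xA9, 0xC3];
    assert(cast(ubyte[]) e == expected);
}
```

The same pattern applies to wchar[] with a cast to ushort[].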

And as I explained in bug 7085, reverse's behavior with regard to dchar[] is
completely correct. It's reversing the code points, _not_ the graphemes. If
you want to operate on graphemes, you need a range of graphemes, which Phobos
does not yet support. Once it does (or if you implement it yourself), you can
reverse a string based on graphemes if that's what you want to do. But as it
stands, ranges of code points are the most advanced Unicode construct that
Phobos currently supports, so that's what its functions are going to operate
on.
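
As a rough illustration of the difference between the two, the sketch below reverses the string from the bug report both ways. It assumes a Phobos release that ships std.uni.byGrapheme and std.uni.byCodePoint, which the Phobos discussed in this message does not have; reverseGraphemes is a made-up helper name, not a library function.

```d
import std.algorithm : reverse;
import std.array : array;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: returns a new array holding the graphemes of s in
// reverse order. Assumes std.uni.byGrapheme is available.
dchar[] reverseGraphemes(dchar[] s)
{
    return s.byGrapheme   // group code points into graphemes
            .array        // retro needs a bidirectional range
            .retro        // walk the graphemes back to front
            .byCodePoint  // flatten back into code points
            .array;
}

void main()
{
    // LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL LETTER B
    dchar[] txt = "\u0041\u0308\u0042"d.dup;

    // Code point reversal: the diaeresis now follows B.
    dchar[] byCodePoints = txt.dup;
    reverse(byCodePoints);
    assert(byCodePoints == "\u0042\u0308\u0041"d);

    // Grapheme reversal: the diaeresis stays attached to A.
    assert(reverseGraphemes(txt) == "\u0042\u0041\u0308"d);
}
```

Unlike reverse, this version allocates a new array rather than working in place.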

- Jonathan M Davis