Sorting char arrays

Tue Mar 13 10:27:21 PDT 2012

On Tuesday, March 13, 2012 10:13:23 Magnus Lie Hetland wrote:
> On 2012-03-12 17:24:56 +0000, Jonathan M Davis said:
> > The problem is that sort requires a random access range, and narrow string
> > (string and wstring) aren't, because they're variable length encoded. I'm
> > not sure that strings _can_ be efficiently sorted, because of that, but
> > maybe there's a sorting algorthm that could do it reasonably efficiently,
> I'd certainly think that'd be posible. (Might, in fact, be a nice
> problem for the students in my algorithms class ;)
> 
> I guess I'm just tripped up by the fact that char[] is treated as an
> "almost-string", and hence is "infected" by the variable-length
> encoding of string (i.e., const char[]).
> 
> This all makes sense, for sure -- at least as a practical tradeoff or
> the like. (After all, UTF-8 is a super-elegant solution to a very
> difficult problem.) It's just so easy to assume that T[] is a
> random-access range. It seems so *obvious*. And it is true for any type
> T except the variable-length chars... :)
> 
> In my case, I was just using char literals (and strings) as an easy way
> of encoding test cases for a template class, and the sorting (+ uniq)
> was just a way of removing duplicates. (Could've used a hash, of
> course.) So I was really just treating it as a ubyte[]. Slightly
> Evil[tm], I guess.
> 
> > and we could
> > special case sort for narrow strings to use that one, but it's a while
> > since I messed with sorting algorithms, so I don't remember all of their
> > characteristics off of the top of my head. Certainly, with how sort is
> > currenly implemented, it can't handle any range which isn't a random
> > access range.
> No, I get that. I was just assuming that any T[] could be treated as a
> random access range if I really wanted to ;)

If you use ubyte[] instead of cast it to ubyte[], then _that_ is a random-
access range. So, as long as you don't mind operating on code units instead of 
code points (which really only works with ASCII and nothing else for char), 
then you can use functions like sort on it. But, of course, you're screwed as 
soon as you have a non-ASCII character, so you have to be careful. Depending 
on what your requirements are (with regards to efficiency and whatnot), it might 
just be safer to either using dchar[] or to convert to dchar[] and back again 
for operations that require it (such as sort). But if you _know_ that you're 
only dealing with ASCII, it works just fine to cast to ubyte[] for operations 
that need a random-access range.

- Jonathan M Davis