Why foreach(c; someString) must yield dchar

Jonathan M Davis jmdavisprog at gmail.com
Fri Aug 20 10:22:51 PDT 2010


On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
> Rainer Deyke <rainerd at eldwood.com> wrote:
> > On 8/19/2010 03:56, Jonathan Davis wrote:
> >> The problem is that chars are not characters. They are UTF-8 code
> >> units.
> > 
> > So what?  You're acting like 'char' (and specifically 'char[]') is some
> > sort of unique special case.  In reality, it's just one case of encoded
> > data.  What about compressed data?  What about packed arrays of bits?
> > What about other containers?
> 
> First off, char, wchar, and dchar are special cases already - they're
> basically byte, short, and int, but are treated somewhat differently.
> 
> One possibility, which would make strings a less integrated part of the
> language, is to make them simple range structs, and hide UTF-8/16
> details in the implementation. If it were not for the fact that D touts
> its UTF capabilities, and that this would make it a little less true,
> and the fact that char/wchar/dchar are already treated specially, I
> would support this idea.

If you do that, you'd probably do something like

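import std.array; // popFront, popBack, and empty for arrays
import std.utf;   // decode and strideBack

struct String(C)
{
    C[] array;

    // Decode the first code point without consuming it.
    dchar front() { size_t i = 0; return decode(array, i); }
    // Back up to the start of the last code point, then decode it.
    dchar back()
    {
        size_t i = array.length - strideBack(array, array.length);
        return decode(array, i);
    }
    void popFront() { array.popFront(); }
    void popBack()  { array.popBack(); }
    bool empty()    { return array.empty; }
}

alias String!(immutable char) string;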


Naturally, there would be template constraints, the functions might be a bit
more complex, and there would probably be some other functions. You might also
have to do something fancy to get the immutable part to work, since IIRC
templates strip immutable and const so that they don't generate separate
instantiations for immutable, const, and mutable. Essentially, though, you would
wrap the various string types in a struct whose range operations are based on
dchar. You could get at the underlying array quite easily if you actually wanted
array operations, and if you wanted string operations, you would have the range
operations. Everywhere in the code where you currently have string, you'd have
String!(immutable char) instead of immutable(char)[].
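
To make the "range interface on top, array underneath" idea concrete, here is a
minimal sketch with a template constraint added. It is written against current
Phobos module names, which is an assumption relative to the 2010 library, and
StringRange is a made-up name used only to avoid clashing with the built-in
string alias.

```d
import std.range : walkLength;
import std.range.primitives : empty, popFront;
import std.traits : isSomeChar;
import std.utf : decode;

// Hypothetical wrapper; the post calls the same idea String(C).
struct StringRange(C)
    if (isSomeChar!C)
{
    C[] array; // the underlying code units stay directly accessible

    @property bool empty() { return array.empty; }
    // Decode the first code point without consuming it.
    @property dchar front() { size_t i = 0; return decode(array, i); }
    // For char and wchar arrays this drops a whole code point.
    void popFront() { array.popFront(); }
}

// Build with: dmd -unittest -main
unittest
{
    // Five characters; U+00E9 occupies two UTF-8 code units.
    auto s = StringRange!(immutable char)("h\u00E9llo");

    assert(s.walkLength == 5);   // range interface: code points
    assert(s.array.length == 6); // array interface: code units
}
```

The range primitives see decoded dchars, while anything that wants array
operations goes through the array member explicitly.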

I really don't know what all of the implications of this are. There have been 
similar suggestions before. You don't really hide the fact that they're UTF-8 
and UTF-16. Rather you just make it so that the main interface to them is 
UTF-32. Anyone who wants at the UTF-8 or UTF-16 array can get at it just fine.

I'm not sure how much this really saves you, though, nor what problems a
struct like this would cause compared to what we currently have. You'd probably
still have to special-case things, since some algorithms need to process the
underlying array rather than the dchar range in order to be properly efficient,
if they work at all. Also, without universal function call syntax, I think that
the only way to make it possible to call functions on it as if they were member
functions is to use opDispatch(), which would definitely cause bugs (opDot()
won't work, since the most that you could do at that point is pass the call
along to the internal array, and then we're right back where we started). So,
ultimately, I'm not sure that such a change would gain you much, and you're
definitely losing something big.
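
As one example of an algorithm that is better off working on the underlying
array, counting code points in UTF-8 needs no decoding at all. The sketch below
assumes valid UTF-8 input, and countCodePoints is a hypothetical helper, not a
Phobos function.

```d
// Every code point has exactly one code unit that is not a continuation
// byte (bit pattern 10xxxxxx), so counting those counts the code points.
// Assumes the input is valid UTF-8.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (char c; s) // iterate code units, no decoding
    {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Build with: dmd -unittest -main
unittest
{
    import std.range : walkLength;

    assert(countCodePoints("h\u00E9llo") == 5);
    // Agrees with the decoding range view of the same string.
    assert(countCodePoints("h\u00E9llo") == "h\u00E9llo".walkLength);
}
```

A wrapper that only exposed a range of dchar would force this kind of function
to reach for the raw array anyway, which is the special casing in question.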

Ultimately, I think that we're stuck with what we've got, though we may be able 
to make some tweaks. Fundamentally, we're trying to treat something as two 
different things without treating it as two different things. We want to treat it 
as a range of characters and an array of Unicode code units at the same time,
using it as a range of characters where appropriate and using it as an array of 
code units where appropriate without having to special case it. I just don't 
think that that's going to work.
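
A small sketch of the two views disagreeing about the same value, written
against current Phobos (walkLength from std.range is an assumption relative to
the 2010 library):

```d
import std.range : walkLength;

// Build with: dmd -unittest -main
unittest
{
    // Five characters; U+00E9 occupies two UTF-8 code units.
    immutable(char)[] s = "h\u00E9llo";

    assert(s.length == 6);     // array view: code units
    assert(s.walkLength == 5); // range view: decoded code points

    size_t units, points;
    foreach (c; s)             // element type inferred: a code unit
    {
        static assert(c.sizeof == 1);
        ++units;
    }
    foreach (dchar c; s)       // explicit dchar: foreach decodes
        ++points;
    assert(units == 6 && points == 5);
}
```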

We can improve our situation with well-designed templates and traits, along
with either making it a warning or error to iterate over a string type in
foreach without specifying the element type, or making that case default to
dchar. But ultimately, there's a fundamental disconnect here, and we have to
deal with it.
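
For illustration, this is roughly what the template-and-trait approach could
look like: a generic function that picks the code-unit path for narrow strings
and the element path for everything else. countAscii is a made-up helper, and
the imports assume current Phobos (isNarrowString in std.traits, isInputRange in
std.range.primitives).

```d
import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

// Hypothetical helper: counts occurrences of an ASCII character.
size_t countAscii(R)(R r, char needle)
    if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // An ASCII code unit never appears inside a multi-unit sequence,
        // so UTF-8 and UTF-16 can be scanned without decoding.
        foreach (i; 0 .. r.length)
        {
            if (r[i] == needle)
                ++n;
        }
    }
    else
    {
        // Everything else is processed element by element.
        foreach (e; r)
        {
            if (e == needle)
                ++n;
        }
    }
    return n;
}

// Build with: dmd -unittest -main
unittest
{
    assert(countAscii("h\u00E9llo", 'l') == 2);  // code-unit path
    assert(countAscii("h\u00E9llo"d, 'l') == 2); // element path
}
```

The special case does not go away here. It just moves into one static if that
callers never see.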

- Jonathan M Davis

