Why foreach(c; someString) must yield dchar
Jonathan M Davis
jmdavisprog at gmail.com
Fri Aug 20 10:22:51 PDT 2010
On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
> Rainer Deyke <rainerd at eldwood.com> wrote:
> > On 8/19/2010 03:56, Jonathan Davis wrote:
> >> The problem is that chars are not characters. They are UTF-8 code
> >> units.
> >
> > So what? You're acting like 'char' (and specifically 'char[]') is some
> > sort of unique special case. In reality, it's just one case of encoded
> > data. What about compressed data? What about packed arrays of bits?
> > What about other containers?
>
> First off, char, wchar, and dchar are special cases already - they're
> basically byte, short, and int, but are treated somewhat differently.
>
> One possibility, which would make strings a less integrated part of the
> language, is to make them simple range structs, and hide UTF-8/16
> details in the implementation. If it were not for the fact that D touts
> its UTF capabilities, and that this would make it a little less true,
> and the fact that char/wchar/dchar are already treated specially, I
> would support this idea.
If you do that, you'd probably do something like
import std.array, std.utf; // popFront/popBack/empty and decode

struct String(C)
{
    C[] array;

    // Decode the first code point without consuming it.
    dchar front() { size_t i = 0; return decode(array, i); }
    dchar back() { /* more complicated code */ }
    void popFront() { array.popFront(); }
    void popBack() { array.popBack(); }
    bool empty() { return array.empty; }
}

alias String!(immutable(char)) string;
Naturally, there would be template constraints, the functions might be a bit
more complex, and there would probably be some other functions (not to mention,
you might have to do something fancy to get the immutable part to work since
IIRC templates remove immutable and const so that they don't generate different
templates for immutable, const, and mutable), but essentially, you would wrap
the various string types in a struct with range operations based on dchar. You
could get at the underlying array quite easily if you actually wanted array
operations. And if you want string operations, well you have the range
operations. Everywhere in the code where you currently have string, you'd have
String!(immutable(char)) instead of immutable(char)[].
I really don't know what all of the implications of this are. There have been
similar suggestions before. You don't really hide the fact that they're UTF-8
and UTF-16. Rather you just make it so that the main interface to them is
UTF-32. Anyone who wants at the UTF-8 or UTF-16 array can get at it just fine.
I'm not sure how much this really saves you though, nor what all the problems a
struct like this would cause over what we currently have. But you'd probably
still have to special case stuff, since there are going to be algorithms that
need to process the underlying array rather than the dchar range in order to be
properly efficient, if they work at all. Also, without universal function call syntax,
I think that the only way to make it possible to call functions on it as if they
were member functions is to use opDispatch(), which would definitely cause bugs
(opDot() won't work since the most that you could do at that point is pass it
along to the internal array, and then we're right back where we started). So,
ultimately, I'm not sure that such a change would gain you much, and you're
definitely losing something big.
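To make the "two views of one array" point concrete, here is a minimal sketch. It assumes a current Phobos, where std.range.primitives treats a string as a range of dchar and std.string.representation exposes the raw code units. The helper names (countCodePoints, countCodeUnits, countAscii) are made up for illustration and are not part of any library.

```d
import std.range.primitives : ElementType, walkLength;
import std.string : representation;

// Phobos presents a string as a range of dchar...
static assert(is(ElementType!string == dchar));
// ...while indexing the same string still yields UTF-8 code units.
static assert(is(typeof(string.init[0]) == immutable(char)));

// Range view: walkLength decodes as it goes, so it counts code points.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

// Array view: representation exposes the raw immutable(ubyte)[] code units.
size_t countCodeUnits(string s)
{
    return s.representation.length;
}

// Hypothetical helper: counting an ASCII character needs no decoding,
// because every byte of a multi-byte UTF-8 sequence has its high bit set.
size_t countAscii(string s, char c)
{
    assert(c < 0x80, "only valid for ASCII characters");
    size_t n = 0;
    foreach (ubyte b; s.representation)
        if (b == c)
            ++n;
    return n;
}
```

An algorithm like countAscii is the sort of thing that would still want the underlying array even if the main interface were a dchar range, which is why some special casing probably remains either way.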
Ultimately, I think that we're stuck with what we've got, though we may be able
to make some tweaks. Fundamentally, we're trying to treat something as two
different things without treating it as two different things. We want to treat it
as a range of characters and an array of Unicode code units at the same time,
using it as a range of characters where appropriate and using it as an array of
code units where appropriate without having to special case it. I just don't
think that that's going to work.
We can improve our situation with good use of templates and traits, along with
making it a warning/error to iterate over a string type without specifying the
element type, or making such iteration default to dchar. But ultimately, there's
a fundamental disconnect going on here, and we have to deal with it.
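As a small illustration of the foreach issue, the sketch below contrasts iteration with an inferred element type against iteration that asks for dchar explicitly. The function names are hypothetical, and the behavior shown is the current one (inferring char), not the proposed default.

```d
// Inferred element type: c is a char, so each step is one UTF-8 code unit.
size_t stepsWithInferredType(string s)
{
    size_t n = 0;
    foreach (c; s)
    {
        static assert(typeof(c).sizeof == 1); // a char, not a dchar
        ++n;
    }
    return n;
}

// Explicit dchar: foreach decodes, so each step is one code point.
size_t stepsWithDchar(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
    {
        static assert(typeof(c).sizeof == 4); // a full UTF-32 code point
        ++n;
    }
    return n;
}
```

For a string such as "h\u00E9llo", where the second character takes two UTF-8 code units, the two loops run a different number of times, and that mismatch is what a warning/error or a dchar default would be guarding against.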
- Jonathan M Davis