Why foreach(c; someString) must yield dchar

Jonathan M Davis jmdavisprog at gmail.com
Thu Aug 19 12:43:02 PDT 2010


On Thursday, August 19, 2010 12:18:03 Kagamin wrote:
> Jonathan M Davis Wrote:
> > No, it doesn't hurt to have the iteration type larger than the actual
> > type, but you're not going to have overflow.
> 
> Trivial: take byte and add 256.

Except that that only happens once you do something to the element that you get
from foreach. You can read a byte just fine without any overflow problems. You
can't do the same with char or wchar. You often need several of them to get
anything meaningful, unlike with bytes. If you want to change the iteration type
to int or long or whatever when iterating over bytes so that you can change the
variable without overflow issues, you can. But each byte is meaningful by
itself. Such is not generally the case with char or wchar.
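
To make the difference concrete, here is a minimal sketch of the two iteration
types over the same string. The helper names countCodeUnits and countCodePoints
are made up for illustration; they are not Phobos functions.

```d
import std.stdio;

// Hypothetical helper: count iterations when foreach yields raw UTF-8
// code units (char). Each c may be only a fragment of a character.
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

// Hypothetical helper: count iterations when foreach decodes the string
// into whole code points (dchar).
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+00E9 (e with acute accent) takes two code units in UTF-8.
    string s = "h\u00E9llo";
    writeln(countCodeUnits(s));  // 6
    writeln(countCodePoints(s)); // 5
}
```

With char as the iteration type, two of the six iterations see only half of the
accented character, which is the sense in which a char is not meaningful by
itself.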

> > It's fine with me to use narrow strings. Much as I'd love to avoid a lot
> > of these issues, dstrings take up too much memory if you're going to be
> > doing a lot of string processing.
> 
> If you're going to take much memory, there probably won't be much
> difference between strings and dstrings, you'll take much memory in both
> cases. And don't forget that UTF-8 chars take up to 4 bytes.

For ASCII characters, a UTF-32 character takes _4_ times as much memory as a
UTF-8 character. Even if you use lots of Asian characters, as I understand it,
most won't take more than 3 bytes in UTF-8. So, even if you're using primarily
Asian characters with UTF-8, you still have 25% space savings over UTF-32. And
since, apparently, many Asian characters will fit into one wchar, if you use
UTF-16 when you have lots of Asian characters, you're getting closer to 50%
space savings over UTF-32. If you have a lot of strings, using UTF-32 wastes a
lot of memory.
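
A rough way to check those numbers is to transcode the same text into each
string type and compare sizes. This is only a sketch: the helper names are made
up, and the sample strings are arbitrary examples rather than representative
text.

```d
import std.conv : to;
import std.stdio;

// Hypothetical helpers: bytes needed to store s as string, wstring and dstring.
size_t utf8Bytes(string s)  { return s.length * char.sizeof; }
size_t utf16Bytes(string s) { return s.to!wstring.length * wchar.sizeof; }
size_t utf32Bytes(string s) { return s.to!dstring.length * dchar.sizeof; }

void main()
{
    string ascii = "hello";            // 5 ASCII characters
    string cjk = "\u65E5\u672C\u8A9E"; // 3 CJK characters from the BMP

    // ASCII: UTF-32 needs 4 times the space of UTF-8.
    writeln(utf8Bytes(ascii), " ", utf16Bytes(ascii), " ", utf32Bytes(ascii)); // 5 10 20

    // These CJK characters: 3 bytes each in UTF-8, 2 in UTF-16, 4 in UTF-32.
    writeln(utf8Bytes(cjk), " ", utf16Bytes(cjk), " ", utf32Bytes(cjk)); // 9 6 12
}
```

For the CJK sample, 9 bytes against 12 is the 25% savings and 6 against 12 is
the 50% savings mentioned above.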

> If you care about people and want to force them to use dchar ranges, you
> can do it with the library: make it refuse narrow strings - as long as the
> library is unusable with narrow strings, people will have to do something
> about it, say, use wrappers like one proposed in this thread (but
> providing forward dchar range interface).

We _can't_ force everyone to use dstring. That defeats the purpose of having
string and wstring in the first place and is incredibly inefficient space-wise.
The standard libraries _need_ to work well with all string types.
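
As a sketch of what working with all string types can look like, a function can
be templated on the string type and iterate by dchar, so that callers keep
their narrow strings and the function still sees whole code points. countChar
is a made-up name for illustration, not an existing Phobos function.

```d
import std.stdio;
import std.traits : isSomeString;

// Hypothetical helper: count occurrences of a code point in any string type.
// foreach with a dchar loop variable decodes string and wstring on the fly,
// so no dstring copy of the input is needed.
size_t countChar(S)(S s, dchar needle)
    if (isSomeString!S)
{
    size_t n = 0;
    foreach (dchar c; s)
    {
        if (c == needle)
            ++n;
    }
    return n;
}

void main()
{
    writeln(countChar("h\u00E9llo", 'l'));        // string
    writeln(countChar("h\u00E9llo"w, '\u00E9'));  // wstring
    writeln(countChar("h\u00E9llo"d, 'l'));       // dstring
}
```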

> > It makes perfect sense for general arrays. It makes perfect sense if you
> > don't really care about the contents of the array for your algorithm
> > (that is, whether they're code points or characters or just bytes in
> > memory doesn't matter for what you're doing). However, if you're
> > actually processing characters, it makes no sense at all. This mess with
> > foreach and strings is one of the big reasons why foreach tends to be
> > avoided in std.algorithm.
> 
> The problem here is that integers are not much different from characters in
> this regard.

Integers are totally different. An integer may be limited in the size of the 
number that it can hold, but it makes perfect sense to process each integer 
individually. An integer is a full value on its own. char and wchar are not. 
They're only parts of a whole.

> Conceptually number is an infinite sequence of digits with decimal point.
> What do you plan to do about this?

That's a totally different issue. The solution for that is to use a BigInt type
which combines multiple integers (or bytes or longs or whatever) together to
make larger values than primitive integral types can hold. In that case, if you
were to try to iterate over the individual ints within the BigInt, then you'd be
screwed because they don't mean anything on their own. string and wstring are
effectively BigInt for chars and wchars. You have to combine several of them to
get meaningful values. The fact that a single char or wchar can't hold a big
enough (let alone infinite) range is the whole reason that string and wstring
were created in the first place (that and the fact that making the element type
big enough on its own, i.e. dchar, wastes a lot of space).

- Jonathan M Davis

