iteration over a string

Tue May 28 01:25:52 PDT 2013

On Tuesday, May 28, 2013 00:26:03 Timothee Cour wrote:
> Questions regarding iteration over code points of a utf8 string:
> 
> In all that follows, I don't want to go through intermediate UTF32
> representation by making a copy of my string, but I want to iterate over
> its code points.
> 
> say my string is declared as:
> string a="Ωabc"; //if email reader screws this up, it's a 'Omega' followed
> by abc
> 
> A)
> this doesn't work obviously:
> foreach(i,ai; a){
>   write(i,",",ai," ");
> }
> //prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5
> elements)

Yes. I'd love it if it were a warning or error to not give an explicit 
iteration type for foreach with strings, but I don't think that Walter is 
willing to do that. He seems to think that everyone should understand Unicode 
and therefore have no problems with the fact that foreach iterates over code 
units rather than code points.

> B)
> foreach(i,dchar ai;a){
>   write(i,",",ai," ");
> }
> // prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
> But index i skips position 1, indicating the start index of code points; it
> prints [0,2,3,4]
> Is that a bug or a feature?

Feature. It's the index of the array, so it's code units, and in general is 
more useful (at least for more advanced string processing).

> C)
> writeln(a.walkLength); // prints 4
> for(size_t i;!a.empty;a.popFront,i++)
>   write(i,",",a.front," ");
> 
> // prints 0,Ω 1,a 2,b 3,c
> This seems the most correct for interpreting a string as a range over code
> points, where index i has positions [0,1,2,3]
> 
> Is there a more idiomatic way?

Not really (though maybe it could be done with zip and iota or something if 
you really wanted to), but it's also not something that you'd do normally. 
Remember that ranges don't provide indices unless they're random access, and 
narrow strings aren't random access as far as ranges are concerned. You have 
to count the elements for pretty much _any_ non-random-access range if you 
want to know which element you're on.

> D)
> How to make the standard algorithms (std.map, etc) work well with the
> iteration over code points as in method C above ?
> 
> For example this one is very confusing for me:
> string a="ΩΩab";
> auto b1=a.map!(a=>"<"d~a~">"d).array;
> writeln(b1.length);//6

For ranges, use walkLength, not length. length will _not_ be correct in the 
general case when strings are involved, because that's the number of code 
units rather than code points.

> E)
> The fact that there are 2 ways to iterate over strings is confusing:
> For example reading at docs, ForeachType is different from ElementType and
> ElementType is special cased for narrow strings;
> foreach(i;ai;a){foo(i,ai);} doesn't behave as for(size_t
> i;!a.empty;a.popFront,i++) {foo(i,a.front);}
> walkLength != length for strings
> 
> F)
> Why can't we have the following design instead:
> * no special case with isNarrowString scattered throughout phobos
> * iteration with foreach behaves as iteration with popFront/empty/front,
> and walkLength == length
> * ForeachType == ElementType (ie one is redundant)
> * require *explicit user syntax* to construct a range over code points from
> a string:
> 
> struct CodepointRange{
>  this(string a){...}
>  auto popFront(){}
>  auto empty(){}
>  auto length(){}//
> }
> 
> now the user can do:
> a.map!foo => will iterate over char
> a.CodepointRange.map!foo => will iterate over code points.
>
> Everything seems more orhogonal that way, and user has clear understanding
> of complexity of each operation.

The reason that we don't do that is mostly because it makes pretty much no 
sense to iterate over ranges of code units 99.99% of the time. Very nearly the 
_only_ time that it makes sense is when you know that all of the characters 
that you're operating on are ASCII characters. By iterating over code points, 
the situation is far more correct (the not completely correct, due to the 
difference between code points and graphemes). If we did everything on code 
units by default, D code in general would not operate at all correctly on 
Unicode most of the time.

The difference between foreach and ranges for strings is most definitely 
unfortunate, but it really isn't all that complicated ultimately. Just 
remember that foreach itself operates on code units (so you have to use dchar 
explicitly if you want to iterate over code points) and that all of the range 
functions operate on code points. So, when you're operating on narrow strings 
as ranges, don't use any operations that the range API doesn't consider them 
to have (in particular, length, random access, and slicing). In general, this 
is quite easy to do, because you write range-based functions in a generic 
fashion using traits such as hasLength and hasSlicing to determine a range's 
capabilities. So, you really don't have to do anything special for strings - 
unless you want to optimize your code for strings. And if you want to do that, 
then you need to understand how Unicode works with regards to code units and 
code points and code up your specialiazation appropriately.

Maybe there's a better way to explain this, but it really seems to me like 
you're overcomplicating it. If you want to operate on code units, use the 
built-in operations, as the language itself considers strings to be arrays of 
code units. If you want to operate on code points, use the range operations - 
and only the range operations that a given range supports (meaning don't do 
things like use length when hasLength!R is false like it is for narrow 
strings).

- Jonathan M Davis