iteration over a string

Tue May 28 00:26:03 PDT 2013

Questions regarding iteration over the code points of a UTF-8 string:

In all that follows, I want to iterate over the string's code points
without going through an intermediate UTF-32 representation, i.e. without
making a copy of my string.

Say my string is declared as:
string a="Ωabc"; // if the email reader mangles this, it's an 'Omega' followed by abc

A)
This obviously doesn't work:
foreach(i,ai; a){
  write(i,",",ai," ");
}
// prints 0,� 1,� 2,a 3,b 4,c (i.e. decomposes at the 'char' level, so 5 elements)

B)
foreach(i,dchar ai;a){
  write(i,",",ai," ");
}
// prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
But the index i skips position 1: it appears to be the start offset of each
code point within the string, so it takes the values [0,2,3,4].
Is that a bug or a feature?
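
To make the question concrete, here is a minimal sketch assuming the indices in B are offsets in UTF-8 code units. codePointOffsets is a hypothetical helper written for this post, not a Phobos function; std.utf.stride is used to get the number of code units of the code point starting at a given offset.

```d
import std.algorithm : equal;
import std.stdio : writeln;
import std.utf : stride;

// Hypothetical helper: offset (in UTF-8 code units) at which each code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    string a = "Ωabc";
    // Omega takes two code units in UTF-8, so offset 1 never appears.
    assert(equal(codePointOffsets(a), [0, 2, 3, 4]));

    foreach (i; codePointOffsets(a))
    {
        // The slice covers exactly one code point and does not copy the string.
        writeln(i, ": ", a[i .. i + stride(a, i)]);
    }
}
```

If that reading is right, the index from B is directly usable for slicing, which a code point counter would not be.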

C)
writeln(a.walkLength); // prints 4
for(size_t i;!a.empty;a.popFront,i++)
  write(i,",",a.front," ");

// prints 0,Ω 1,a 2,b 3,c
This seems the most correct for interpreting a string as a range over code
points, where index i has positions [0,1,2,3]

Is there a more idiomatic way?
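
One candidate, as a sketch only: it assumes a Phobos recent enough to provide std.range.enumerate, and renderIndexed is a hypothetical helper that just collects the output of the loop so it can be checked.

```d
import std.conv : text;
import std.range : enumerate;
import std.stdio : writeln;

// Hypothetical helper: builds the "index,codepoint " pairs that method C prints.
string renderIndexed(string s)
{
    string result;
    // A string used as a range yields dchar, so i counts code points here.
    foreach (i, c; s.enumerate)
        result ~= text(i, ",", c, " ");
    return result;
}

void main()
{
    string a = "Ωabc";
    writeln(renderIndexed(a)); // prints 0,Ω 1,a 2,b 3,c
}
```

Unlike the for loop in C, this does not consume a, because enumerate works on its own copy of the slice.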

D)
How can I make the standard algorithms (std.algorithm.map, etc.) work well
with iteration over code points, as in method C above?

For example this one is very confusing for me:
string a="ΩΩab";
auto b1=a.map!(a=>"<"d~a~">"d).array;
writeln(b1.length);//6
writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
Why are there 2 empty strings at the end? (one per Omega if you vary the
number of such symbols in the string).
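
For comparison, this is the result I would expect if map and array treated the string purely as a range of code points. It is a sketch under that assumption, written against a recent Phobos; bracket is a hypothetical helper name, and I have renamed the lambda parameter so it does not reuse the name of the string.

```d
import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper: wraps every code point of s in angle brackets.
auto bracket(string s)
{
    // c is a dchar, so each element of the result is a UTF-32 string.
    return s.map!(c => "<"d ~ c ~ ">"d).array;
}

void main()
{
    auto b1 = bracket("ΩΩab");
    assert(b1.length == 4); // one element per code point, no trailing empties
    writeln(b1);
}
```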

E)
The fact that there are 2 ways to iterate over strings is confusing:
For example, reading the docs, ForeachType is different from ElementType,
and ElementType is special-cased for narrow strings;
foreach(i,ai;a){foo(i,ai);} doesn't behave as
for(size_t i;!a.empty;a.popFront,i++) {foo(i,a.front);}
walkLength != length for strings
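
The three mismatches above can be checked directly. A minimal sketch, assuming std.traits.ForeachType and std.range.ElementType behave as their documentation describes for string:

```d
import std.range : ElementType, walkLength;
import std.traits : ForeachType;

// foreach without an explicit type sees UTF-8 code units...
static assert(is(ForeachType!string == immutable(char)));
// ...while the range primitives see decoded code points.
static assert(is(ElementType!string == dchar));

void main()
{
    string a = "Ωabc";
    assert(a.length == 5);     // code units: Omega takes two
    assert(a.walkLength == 4); // code points
}
```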

F)
Why can't we have the following design instead:
* no special case with isNarrowString scattered throughout phobos
* iteration with foreach behaves as iteration with popFront/empty/front,
and walkLength == length
* ForeachType == ElementType (ie one is redundant)
* require *explicit user syntax* to construct a range over code points from
a string:

struct CodepointRange{
 this(string a){...}
 auto front(){}    // needed as well for the range interface
 auto popFront(){}
 auto empty(){}
 auto length(){}
}

now the user can do:
a.map!foo => will iterate over char
a.CodepointRange.map!foo => will iterate over code points.

Everything seems more orthogonal that way, and the user has a clear
understanding of the complexity of each operation.
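
To show what I mean, here is a minimal runnable sketch of such a range. It is only one possible shape, not a proposal for the actual Phobos code: codepoints is a hypothetical convenience function, decoding is delegated to std.utf.decode and std.utf.stride, and I left out length because counting code points is O(n) (walkLength does that explicitly). std.utf.byCodeUnit, available in recent Phobos, stands in for the "iterate over char" side.

```d
import std.algorithm : equal;
import std.range : walkLength;
import std.utf : byCodeUnit, decode, stride;

struct CodepointRange
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    // Decodes the first code point without modifying or copying the string.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Drops as many code units as the first code point occupies.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }
}

// Hypothetical convenience function so the range can be built with UFCS.
auto codepoints(string s) { return CodepointRange(s); }

void main()
{
    string a = "Ωabc";
    assert(a.byCodeUnit.walkLength == 5); // explicit iteration over char
    assert(a.codepoints.walkLength == 4); // explicit iteration over code points
    assert(a.codepoints.equal("Ωabc"d));
}
```

With both views requested explicitly, the cost of each operation is visible at the call site.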