<div>Questions regarding iteration over code points of a UTF-8 string:</div><div><div><br></div><div>In all that follows, I don't want to go through an intermediate UTF-32 representation by making a copy of my string; I just want to iterate over its code points.</div>
</div><div><br></div><div>Say my string is declared as:</div><div><div>string a="Ωabc"; //if the email reader screws this up, it's an 'Omega' followed by abc</div><div></div></div><div><br></div><div>A)</div>
<div>This obviously doesn't work:</div><div>foreach(i,ai; a){</div><div> write(i,",",ai," ");</div><div>}</div><div>//prints 0,� 1,� 2,a 3,b 4,c (i.e. decomposes at the 'char' level, so 5 elements)</div>
<div><br></div><div>B)</div><div><div>foreach(i,dchar ai;a){</div><div> write(i,",",ai," ");</div><div>}</div></div><div>// prints 0,Ω 2,a 3,b 4,c (i.e. decomposes at the code point level, so 4 elements)</div><div>
But index i skips position 1: it is the code unit offset at which each code point starts, so it prints [0,2,3,4].</div><div>Is that a bug or a feature?</div><div><br></div><div>C)</div><div><div>writeln(a.walkLength); // prints 4</div><div>for(size_t i;!a.empty;a.popFront,i++)</div>
<div> write(i,",",a.front," ");</div></div><div><br></div><div>// prints 0,Ω 1,a 2,b 3,c</div><div>This seems the most correct way to interpret a string as a range over code points: index i takes the positions [0,1,2,3].</div>
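<div>For reference, here is a minimal sketch that reproduces A, B and C side by side, so the three index sequences can be compared directly. The function names (codeUnitIndices, codePointOffsets, codePointCounts) are only illustrative and not part of Phobos; the sketch assumes the current Phobos behaviour where front/popFront decode narrow strings.</div>

```d
import std.range.primitives : empty, popFront;

// A) No element type given: foreach visits UTF-8 code units (char).
size_t[] codeUnitIndices(string s)
{
    size_t[] idx;
    foreach (i, c; s)
        idx ~= i;
    return idx;
}

// B) dchar element type: foreach decodes, and i is the code unit
//    offset at which each code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] idx;
    foreach (i, dchar c; s)
        idx ~= i;
    return idx;
}

// C) Range primitives decode as well; the counter n is maintained by
//    hand, so it counts code points rather than code units.
size_t[] codePointCounts(string s)
{
    size_t[] idx;
    for (size_t n = 0; !s.empty; s.popFront(), n++)
        idx ~= n;
    return idx;
}

void main()
{
    import std.stdio : writeln;

    string a = "Ωabc"; // 'Ω' (U+03A9) takes two UTF-8 code units
    writeln(codeUnitIndices(a));  // [0, 1, 2, 3, 4]
    writeln(codePointOffsets(a)); // [0, 2, 3, 4]
    writeln(codePointCounts(a));  // [0, 1, 2, 3]
}
```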
<div><br></div><div>Is there a more idiomatic way?</div><div><br></div><div>D)</div><div>How can I make the standard algorithms (std.algorithm.map, etc.) work well with iteration over code points, as in method C above?</div><div><br>
</div><div>For example, this one is very confusing to me:</div><div><div>string a="ΩΩab";</div><div>auto b1=a.map!(a=>"<"d~a~">"d).array;</div><div>writeln(b1.length);//6</div><div>writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]</div>
</div><div>Why are there 2 empty strings at the end (one per Omega, if you vary the number of such symbols in the string)?</div><div><br></div><div><br></div><div>E)</div><div>The fact that there are 2 ways to iterate over strings is confusing:</div>
<div>For example, reading the docs, ForeachType is different from ElementType, and ElementType is special-cased for narrow strings;</div><div>foreach(i,ai;a){foo(i,ai);} doesn't behave the same as for(size_t i;!a.empty;a.popFront,i++) {foo(i,a.front);}</div>
<div>walkLength != length for strings</div><div><br></div><div>F)</div><div>Why can't we have the following design instead:</div><div>* no special cases with isNarrowString scattered throughout Phobos</div><div>* iteration with foreach behaves the same as iteration with popFront/empty/front, and walkLength == length</div>
<div>* ForeachType == ElementType (i.e. one is redundant)</div><div>* require <b>explicit user syntax</b> to construct a range over code points from a string:</div><div><br></div><div>struct CodepointRange{</div><div> this(string a){...}</div>
<div> auto front(){}</div><div> auto popFront(){}</div><div> auto empty(){}</div><div> auto length(){}</div><div>}</div><div><br></div><div>Now the user can do:</div><div>a.map!foo => will iterate over char</div><div>a.CodepointRange.map!foo => will iterate over code points.</div>
<div><br></div><div>Everything seems more orthogonal that way, and the user has a clear understanding of the complexity of each operation.</div><div><br></div><div><br></div>
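<div>To make the proposal in F concrete, here is a rough sketch of what such an explicit wrapper could look like, built on std.utf.decode. It is only an illustration under my own assumptions: the member bodies and the helper bracketCodePoints are hypothetical, length is deliberately left out (walkLength gives the O(n) count instead), and invalid UTF-8 is not handled beyond the exception that decode throws.</div>

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.conv : text;
import std.range.primitives : ElementType, isInputRange, walkLength;
import std.utf : decode;

// Hypothetical explicit range over the code points of a UTF-8 string.
struct CodepointRange
{
    private string s;

    this(string a) { s = a; }

    @property bool empty() { return s.length == 0; }

    // Decodes the code point at the start of s without consuming it.
    // decode throws a UTFException on malformed input.
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);
    }

    // Drops the code units of the first code point.
    void popFront()
    {
        size_t i = 0;
        decode(s, i); // advances i past one code point
        s = s[i .. $];
    }
}

static assert(isInputRange!CodepointRange);
static assert(is(ElementType!CodepointRange == dchar));

// Illustrative helper: wraps each code point in angle brackets.
string[] bracketCodePoints(string s)
{
    return CodepointRange(s).map!(c => text('<', c, '>')).array;
}

void main()
{
    string a = "ΩΩab";
    assert(a.length == 6);                     // code units
    assert(CodepointRange(a).walkLength == 4); // code points
    assert(bracketCodePoints(a) == ["<Ω>", "<Ω>", "<a>", "<b>"]);
}
```

<div>With a wrapper like this, the cost of counting code points is visible at the call site (walkLength is O(n)), while a.length stays the O(1) number of code units.</div>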