iteration over a string
Jonathan M Davis
jmdavisProg at gmx.com
Tue May 28 01:25:52 PDT 2013
On Tuesday, May 28, 2013 00:26:03 Timothee Cour wrote:
> Questions regarding iteration over code points of a utf8 string:
>
> In all that follows, I don't want to go through intermediate UTF32
> representation by making a copy of my string, but I want to iterate over
> its code points.
>
> say my string is declared as:
> string a="Ωabc"; //if email reader screws this up, it's a 'Omega' followed
> by abc
>
> A)
> this doesn't work obviously:
> foreach(i,ai; a){
> write(i,",",ai," ");
> }
> //prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5
> elements)
Yes. I'd love it if it were a warning or error to not give an explicit
iteration type for foreach with strings, but I don't think that Walter is
willing to do that. He seems to think that everyone should understand Unicode
and therefore have no problems with the fact that foreach iterates over code
units rather than code points.
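A minimal sketch of the difference, using the same "Ωabc" string. The helper names countUnits and countPoints are made up for illustration; the only thing that changes between them is the explicit iteration type:

```d
import std.stdio;

// Hypothetical helper: number of iterations foreach performs with char.
size_t countUnits(string s)
{
    size_t n;
    foreach (char c; s)   // explicit char: one iteration per UTF-8 code unit
        ++n;
    return n;
}

// Hypothetical helper: number of iterations foreach performs with dchar.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)  // explicit dchar: foreach decodes to code points
        ++n;
    return n;
}

void main()
{
    string a = "Ωabc";
    writeln(countUnits(a));  // 5
    writeln(countPoints(a)); // 4
}
```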
> B)
> foreach(i,dchar ai;a){
> write(i,",",ai," ");
> }
> // prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
> But index i skips position 1, indicating the start index of code points; it
> prints [0,2,3,4]
> Is that a bug or a feature?
Feature. i is the index into the underlying array, so it counts code units, and
in general that's more useful (at least for more advanced string processing).
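One case where the code unit index helps is slicing: because it is the offset at which the code point starts, it is a valid slice bound. A rough sketch, where fromFirst is a hypothetical helper and not anything in Phobos:

```d
import std.stdio;

// Hypothetical helper: the tail of s starting at the first occurrence of
// the code point needle, or null if needle does not occur.
string fromFirst(string s, dchar needle)
{
    foreach (i, dchar c; s)
    {
        // i is the code-unit offset where c begins, so s[i .. $]
        // never cuts a multi-byte sequence in half.
        if (c == needle)
            return s[i .. $];
    }
    return null;
}

void main()
{
    writeln(fromFirst("Ωabc", 'a')); // abc
}
```

A code point count (0, 1, 2, 3) could not be used as a slice bound like this without walking the string again.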
> C)
> writeln(a.walkLength); // prints 4
> for(size_t i;!a.empty;a.popFront,i++)
> write(i,",",a.front," ");
>
> // prints 0,Ω 1,a 2,b 3,c
> This seems the most correct for interpreting a string as a range over code
> points, where index i has positions [0,1,2,3]
>
> Is there a more idiomatic way?
Not really (though maybe it could be done with zip and iota or something if
you really wanted to), but it's also not something that you'd do normally.
Remember that ranges don't provide indices unless they're random access, and
narrow strings aren't random access as far as ranges are concerned. You have
to count the elements for pretty much _any_ non-random-access range if you
want to know which element you're on.
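If you did want the zip and iota version, it might look something like the following sketch. The function name indexed is made up for illustration. It relies on a.length (code units) being an upper bound on the number of code points, and on zip stopping at the shorter range by default:

```d
import std.conv : text;
import std.range : iota, zip;
import std.stdio;

// Hypothetical helper: pairs each code point with its position in the
// sequence of code points (not its code-unit offset).
string indexed(string s)
{
    string result;
    // iota(s.length) is at least as long as the number of code points,
    // so zip ends when the string is exhausted.
    foreach (i, c; zip(iota(s.length), s))
        result ~= text(i, ",", c, " ");
    return result;
}

void main()
{
    writeln(indexed("Ωabc")); // 0,Ω 1,a 2,b 3,c
}
```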
> D)
> How to make the standard algorithms (std.map, etc) work well with the
> iteration over code points as in method C above ?
>
> For example this one is very confusing for me:
> string a="ΩΩab";
> auto b1=a.map!(a=>"<"d~a~">"d).array;
> writeln(b1.length);//6
For ranges, use walkLength, not length. length will _not_ be correct in the
general case when strings are involved, because that's the number of code
units rather than code points.
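To see the two counts side by side on the string itself (a small sketch, nothing more):

```d
import std.range : walkLength;
import std.stdio;

void main()
{
    string a = "ΩΩab";
    writeln(a.length);     // 6: UTF-8 code units (each Ω takes two)
    writeln(a.walkLength); // 4: code points, counted by walking the range
}
```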
> E)
> The fact that there are 2 ways to iterate over strings is confusing:
> For example reading at docs, ForeachType is different from ElementType and
> ElementType is special cased for narrow strings;
> foreach(i,ai;a){foo(i,ai);} doesn't behave as for(size_t
> i;!a.empty;a.popFront,i++) {foo(i,a.front);}
> walkLength != length for strings
>
> F)
> Why can't we have the following design instead:
> * no special case with isNarrowString scattered throughout phobos
> * iteration with foreach behaves as iteration with popFront/empty/front,
> and walkLength == length
> * ForeachType == ElementType (ie one is redundant)
> * require *explicit user syntax* to construct a range over code points from
> a string:
>
> struct CodepointRange{
> this(string a){...}
> auto popFront(){}
> auto empty(){}
> auto length(){}//
> }
>
> now the user can do:
> a.map!foo => will iterate over char
> a.CodepointRange.map!foo => will iterate over code points.
>
> Everything seems more orthogonal that way, and user has clear understanding
> of complexity of each operation.
The reason that we don't do that is mostly because it makes pretty much no
sense to iterate over ranges of code units 99.99% of the time. Very nearly the
_only_ time that it makes sense is when you know that all of the characters
that you're operating on are ASCII characters. By iterating over code points,
the situation is far more correct (though not completely correct, due to the
difference between code points and graphemes). If we did everything on code
units by default, D code in general would not operate at all correctly on
Unicode most of the time.
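The three levels can be compared on a single user-visible character. This sketch assumes std.uni.byGrapheme is available in your Phobos version; the string is 'e' followed by the combining acute accent U+0301:

```d
import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme;

void main()
{
    // 'e' + U+0301 (combining acute accent): rendered as one character
    string s = "e\u0301";
    writeln(s.length);                // 3: code units
    writeln(s.walkLength);            // 2: code points
    writeln(s.byGrapheme.walkLength); // 1: grapheme
}
```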
The difference between foreach and ranges for strings is most definitely
unfortunate, but it really isn't all that complicated ultimately. Just
remember that foreach itself operates on code units (so you have to use dchar
explicitly if you want to iterate over code points) and that all of the range
functions operate on code points. So, when you're operating on narrow strings
as ranges, don't use any operations that the range API doesn't consider them
to have (in particular, length, random access, and slicing). In general, this
is quite easy to do, because you write range-based functions in a generic
fashion using traits such as hasLength and hasSlicing to determine a range's
capabilities. So, you really don't have to do anything special for strings -
unless you want to optimize your code for strings. And if you want to do that,
then you need to understand how Unicode works with regards to code units and
code points and code up your specialization appropriately.
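As a sketch of what "generic with traits" means in practice: the function below, whose name elementCount is made up for illustration, only uses length when the range API says the range has one. walkLength already does this internally, so this is purely to show the pattern:

```d
import std.range;
import std.stdio;

// Hypothetical helper: element count for any input range.
size_t elementCount(R)(R r) if (isInputRange!R)
{
    static if (hasLength!R)
        return r.length;     // O(1) when the range API exposes length
    else
        return r.walkLength; // otherwise count by walking the range
}

void main()
{
    // hasLength!string is false, so the string is walked as code points
    writeln(elementCount("ΩΩab"));    // 4
    // hasLength!(int[]) is true, so length is used directly
    writeln(elementCount([1, 2, 3])); // 3
}
```

No string-specific branch is needed; the narrow string simply takes the non-length path like any other range without length.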
Maybe there's a better way to explain this, but it really seems to me like
you're overcomplicating it. If you want to operate on code units, use the
built-in operations, as the language itself considers strings to be arrays of
code units. If you want to operate on code points, use the range operations -
and only the range operations that a given range supports (meaning don't do
things like use length when hasLength!R is false like it is for narrow
strings).
- Jonathan M Davis