iteration over a string

Timothee Cour thelastmammoth at gmail.com
Tue May 28 03:05:52 PDT 2013


2A)
Thanks for your answer;
You skipped over this one, which I don't understand:

> string a="ΩΩab";
> auto b1=a.map!(a=>"<"d~a~">"d).array;
> writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
> Why are there 2 empty strings at the end? (one per Omega if you vary the
number of such symbols in the string).

The above is just weird; is that a bug?

2B)
> The reason that we don't do that is mostly because it makes pretty much
no sense to iterate over ranges of code units 99.99% of the time

And yet "ΩΩab".map!foo seems to operate on code units and in order to
operate on code points we need "ΩΩab".stride(1).map!foo, so isn't it
inconsistent with the fact that std.algorithm.find("ΩΩab") operates on code
points ?

2C)
> Remember that ranges don't provide indices unless they're random access,

For input ranges ranges, I don't understand why the compiler can't accept
the foreach(i,ai;a) syntax:

it should behave as follows:
foreach(i , ai; a){expr}

rewritten as:
for(size_t i=0, ai=a.front; !a.empty; a.popFront;){expr}

but it doesn't compile (it only accepts foreach(ai;a){expr}




On Tue, May 28, 2013 at 1:25 AM, Jonathan M Davis <jmdavisProg at gmx.com>wrote:

> On Tuesday, May 28, 2013 00:26:03 Timothee Cour wrote:
> > Questions regarding iteration over code points of a utf8 string:
> >
> > In all that follows, I don't want to go through intermediate UTF32
> > representation by making a copy of my string, but I want to iterate over
> > its code points.
> >
> > say my string is declared as:
> > string a="Ωabc"; //if email reader screws this up, it's a 'Omega'
> followed
> > by abc
> >
> > A)
> > this doesn't work obviously:
> > foreach(i,ai; a){
> >   write(i,",",ai," ");
> > }
> > //prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5
> > elements)
>
> Yes. I'd love it if it were a warning or error to not give an explicit
> iteration type for foreach with strings, but I don't think that Walter is
> willing to do that. He seems to think that everyone should understand
> Unicode
> and therefore have no problems with the fact that foreach iterates over
> code
> units rather than code points.
>
> > B)
> > foreach(i,dchar ai;a){
> >   write(i,",",ai," ");
> > }
> > // prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
> > But index i skips position 1, indicating the start index of code points;
> it
> > prints [0,2,3,4]
> > Is that a bug or a feature?
>
> Feature. It's the index of the array, so it's code units, and in general is
> more useful (at least for more advanced string processing).
>
> > C)
> > writeln(a.walkLength); // prints 4
> > for(size_t i;!a.empty;a.popFront,i++)
> >   write(i,",",a.front," ");
> >
> > // prints 0,Ω 1,a 2,b 3,c
> > This seems the most correct for interpreting a string as a range over
> code
> > points, where index i has positions [0,1,2,3]
> >
> > Is there a more idiomatic way?
>
> Not really (though maybe it could be done with zip and iota or something if
> you really wanted to), but it's also not something that you'd do normally.
> Remember that ranges don't provide indices unless they're random access,
> and
> narrow strings aren't random access as far as ranges are concerned. You
> have
> to count the elements for pretty much _any_ non-random-access range if you
> want to know which element you're on.
>
> > D)
> > How to make the standard algorithms (std.map, etc) work well with the
> > iteration over code points as in method C above ?
> >
> > For example this one is very confusing for me:
> > string a="ΩΩab";
> > auto b1=a.map!(a=>"<"d~a~">"d).array;
> > writeln(b1.length);//6
>
> For ranges, use walkLength, not length. length will _not_ be correct in the
> general case when strings are involved, because that's the number of code
> units rather than code points.
>
> > E)
> > The fact that there are 2 ways to iterate over strings is confusing:
> > For example reading at docs, ForeachType is different from ElementType
> and
> > ElementType is special cased for narrow strings;
> > foreach(i;ai;a){foo(i,ai);} doesn't behave as for(size_t
> > i;!a.empty;a.popFront,i++) {foo(i,a.front);}
> > walkLength != length for strings
> >
> > F)
> > Why can't we have the following design instead:
> > * no special case with isNarrowString scattered throughout phobos
> > * iteration with foreach behaves as iteration with popFront/empty/front,
> > and walkLength == length
> > * ForeachType == ElementType (ie one is redundant)
> > * require *explicit user syntax* to construct a range over code points
> from
> > a string:
> >
> > struct CodepointRange{
> >  this(string a){...}
> >  auto popFront(){}
> >  auto empty(){}
> >  auto length(){}//
> > }
> >
> > now the user can do:
> > a.map!foo => will iterate over char
> > a.CodepointRange.map!foo => will iterate over code points.
> >
> > Everything seems more orhogonal that way, and user has clear
> understanding
> > of complexity of each operation.
>
> The reason that we don't do that is mostly because it makes pretty much no
> sense to iterate over ranges of code units 99.99% of the time. Very nearly
> the
> _only_ time that it makes sense is when you know that all of the characters
> that you're operating on are ASCII characters. By iterating over code
> points,
> the situation is far more correct (the not completely correct, due to the
> difference between code points and graphemes). If we did everything on code
> units by default, D code in general would not operate at all correctly on
> Unicode most of the time.
>
> The difference between foreach and ranges for strings is most definitely
> unfortunate, but it really isn't all that complicated ultimately. Just
> remember that foreach itself operates on code units (so you have to use
> dchar
> explicitly if you want to iterate over code points) and that all of the
> range
> functions operate on code points. So, when you're operating on narrow
> strings
> as ranges, don't use any operations that the range API doesn't consider
> them
> to have (in particular, length, random access, and slicing). In general,
> this
> is quite easy to do, because you write range-based functions in a generic
> fashion using traits such as hasLength and hasSlicing to determine a
> range's
> capabilities. So, you really don't have to do anything special for strings
> -
> unless you want to optimize your code for strings. And if you want to do
> that,
> then you need to understand how Unicode works with regards to code units
> and
> code points and code up your specialiazation appropriately.
>
> Maybe there's a better way to explain this, but it really seems to me like
> you're overcomplicating it. If you want to operate on code units, use the
> built-in operations, as the language itself considers strings to be arrays
> of
> code units. If you want to operate on code points, use the range
> operations -
> and only the range operations that a given range supports (meaning don't do
> things like use length when hasLength!R is false like it is for narrow
> strings).
>
> - Jonathan M Davis
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/digitalmars-d-learn/attachments/20130528/27f9d38c/attachment-0001.html>


More information about the Digitalmars-d-learn mailing list