Why the hell doesn't foreach decode strings
Jonathan M Davis
jmdavisProg at gmx.com
Mon Oct 24 09:11:17 PDT 2011
On Monday, October 24, 2011 17:58:15 Simen Kjaeraas wrote:
> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
> <schveiguy at yahoo.com> wrote:
> > On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
> > <newshound2 at digitalmars.com> wrote:
> >> On 10/22/2011 2:21 AM, Peter Alexander wrote:
> >>> Which operations do you believe would be less efficient?
> >>
> >> All of the ones that don't require decoding, such as searching, would
> >> be less efficient if decoding was done.
> >
> > Searching that does not do decoding is fundamentally incorrect. That
> > is, if you want to find a substring in a string, you cannot just compare
> > chars.
>
> Assuming both strings are valid UTF-8, you can. Continuation bytes can never
> be confused with the first byte of a code point, and the first byte always
> identifies how many continuation bytes there should be.
Yes, but when you're iterating through a string looking for a specific
character, you can't simply search for it like you would search for an integer
in an int[] unless you decode. More efficient search techniques exist in a
number of cases, as long as you understand Unicode well enough, but they're
not going to work as the default method of searching. And once you actually
care about stuff on the level of graphemes (which Phobos admittedly doesn't
handle yet), you either have to decode everything, or searching becomes much
more complicated.
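A minimal D sketch of the difference, for illustration only. The helper names
containsByCodeUnit and containsByCodePoint are hypothetical, not anything in
Phobos. The first compares raw UTF-8 code units against the character; the
second lets foreach decode to dchar first.

```d
// Hypothetical helper: compares each UTF-8 code unit against the target.
bool containsByCodeUnit(string s, dchar target)
{
    foreach (char c; s)      // no decoding: c is a single UTF-8 code unit
        if (c == target)
            return true;
    return false;
}

// Hypothetical helper: lets foreach decode the string to code points.
bool containsByCodePoint(string s, dchar target)
{
    foreach (dchar c; s)     // decoding: c is a whole code point
        if (c == target)
            return true;
    return false;
}

void main()
{
    // U+00EF is stored in UTF-8 as the two code units 0xC3 0xAF.
    string s = "na\u00EFve";
    assert(s.length == 6);                       // 6 code units, 5 code points

    assert(!containsByCodeUnit(s, '\u00EF'));    // missed: no code unit is 0xEF
    assert(containsByCodePoint(s, '\u00EF'));    // found once decoded

    // False positive: U+00C3 is not in s, but its value equals the lead byte.
    assert(containsByCodeUnit(s, '\u00C3'));
    assert(!containsByCodePoint(s, '\u00C3'));
}
```

For ASCII-only input both helpers agree, which is why code that skips decoding
can appear to work until non-ASCII text shows up.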
Really what it comes down to is that decoding by default will result in
correct but less efficient code. Not decoding by default will inevitably result
in incorrect code, except where people luck out (e.g. they're only really
dealing with ASCII) or where they know enough that they would have specifically
chosen to operate on char anyway (e.g. searching for the first code unit of a
code point) in order to gain efficiency. There are just going to be fewer bugs
if the default is correct but still lets the programmer easily use more
efficient methods when they choose to.
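As a sketch of the kind of opt-in that paragraph describes, the following
searches for a substring directly on code units, which is safe when both
strings are valid UTF-8 (the point made in the quoted text above). The helper
name containsSubstring is hypothetical, and using std.string.representation
with std.algorithm's canFind is one possible way to do it, not necessarily the
recommended one.

```d
import std.algorithm : canFind;
import std.string : representation;

// Hypothetical helper: substring search on raw UTF-8 code units, no decoding.
// Assumes both arguments are valid UTF-8.
bool containsSubstring(string haystack, string needle)
{
    // representation reinterprets the string as immutable(ubyte)[],
    // so canFind compares bytes rather than decoded code points.
    return haystack.representation.canFind(needle.representation);
}

void main()
{
    // "caf\u00E9" ends in the code units 0xC3 0xA9.
    assert(containsSubstring("na\u00EFve caf\u00E9", "caf\u00E9"));

    // U+00C3 encodes as 0xC3 0x83, so it does not match the 0xC3 0xAF
    // of U+00EF the way a single code unit comparison would.
    assert(!containsSubstring("na\u00EFve", "\u00C3"));
}
```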
- Jonathan M Davis