Why the hell doesn't foreach decode strings

Jonathan M Davis jmdavisProg at gmx.com
Mon Oct 24 09:11:17 PDT 2011


On Monday, October 24, 2011 17:58:15 Simen Kjaeraas wrote:
> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
> 
> <schveiguy at yahoo.com> wrote:
> > On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
> > 
> > <newshound2 at digitalmars.com> wrote:
> >> On 10/22/2011 2:21 AM, Peter Alexander wrote:
> >>> Which operations do you believe would be less efficient?
> >> 
> >> All of the ones that don't require decoding, such as searching, would
> >> be less efficient if decoding was done.
> > 
> > Searching that does not do decoding is fundamentally incorrect.  That
> > is, if you want to find a substring in a string, you cannot just compare
> > chars.
> 
> Assuming both strings are valid UTF-8, you can. Continuation bytes can never
> be confused with the first byte of a code point, and the first byte always
> identifies how many continuation bytes there should be.

Yes, but when you iterate through a string looking for a specific character,
you can't simply search for it the way you would search for an integer in an
int[] unless you decode. More efficient search techniques exist in a number of
cases, provided that you understand Unicode well enough, but they don't work
as the default method of searching. And once you actually care about
graphemes (which Phobos admittedly doesn't handle yet), you either have to
decode everything, or searching becomes much more complicated.
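
To make that concrete, here is a minimal sketch of the difference. The helper
names countCodeUnitMatches and countCodePointMatches are made up for
illustration and are not anything in Phobos. The first one compares raw UTF-8
code units, the way you would scan an int[]; the second asks foreach for
dchar, which makes it decode.

```d
import std.stdio : writeln;
import std.string : indexOf;

/// Hypothetical helper: counts UTF-8 code units in `s` that compare equal
/// to `target`. No decoding happens; there is one iteration per code unit.
size_t countCodeUnitMatches(string s, dchar target)
{
    size_t hits = 0;
    foreach (char c; s)
    {
        if (c == target)
            ++hits;
    }
    return hits;
}

/// Hypothetical helper: counts code points in `s` equal to `target`.
/// Asking foreach for dchar makes it decode the string.
size_t countCodePointMatches(string s, dchar target)
{
    size_t hits = 0;
    foreach (dchar c; s)
    {
        if (c == target)
            ++hits;
    }
    return hits;
}

void main()
{
    string s = "na\u00EFve caf\u00E9";   // "naïve café"
    dchar target = '\u00E9';              // 'é', encoded in UTF-8 as C3 A9

    writeln(countCodeUnitMatches(s, target));   // 0: no single code unit is 0xE9
    writeln(countCodePointMatches(s, target));  // 1

    // Substring search on code units does work for valid UTF-8, but the
    // index it returns is in code units, not characters.
    writeln(s.indexOf("\u00E9"));               // 10
}
```

The code unit version can also give false positives: U+9000 is encoded as
E9 80 80, so its first code unit compares equal to U+00E9 even though the
string contains no 'é' at all.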

What it really comes down to is that decoding by default results in correct
but less efficient code. Not decoding by default inevitably results in
incorrect code, except where people luck out (e.g. they are only really
dealing with ASCII) or where they know enough that they would have
deliberately chosen to search on char for the first code unit of a code point,
or something of that variety, in order to gain efficiency. There are simply
going to be fewer bugs if the default is correct but still makes it easy for
the programmer to use more efficient methods when they choose to.
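
As a rough illustration of the "luck out with ASCII" case, the sketch below
compares the code unit count with the code point count. The helper names
codeUnitCount and codePointCount are invented for this example; walkLength
from std.range decodes a narrow string as it walks it, while .length just
reports the number of code units.

```d
import std.range : walkLength;
import std.stdio : writeln;

/// Hypothetical helper: number of UTF-8 code units (no decoding).
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Hypothetical helper: number of code points (decodes while walking).
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    string ascii = "cafe";
    string accented = "caf\u00E9";   // "café"

    // With pure ASCII the two agree, so code that assumes "one char is
    // one character" appears to work.
    writeln(codeUnitCount(ascii), " ", codePointCount(ascii));        // 4 4

    // With one non-ASCII character the same assumption is silently wrong.
    writeln(codeUnitCount(accented), " ", codePointCount(accented));  // 5 4
}
```

Neither count is the number of graphemes in general; this only shows where
code units and code points diverge.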

- Jonathan M Davis

