The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 13:35:42 PDT 2016


On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d wrote:
> On 31-May-2016 01:00, Walter Bright wrote:
> > On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
> > > I don't agree on changing those. Indexing and slicing a char[] is
> > > really useful and actually not hard to do correctly (at least with
> > > regard to handling code units).
> > 
> > Yup. It isn't hard at all to use arrays of codeunits correctly.
> 
> Ehm, as long as all you care about is operating on substrings, I'd say.
> Working with individual characters requires either decoding or clever
> tricks like operating on the encoded UTF directly.
[...]

Working on individual characters requires byGrapheme, unless you know
beforehand that the character(s) you're working with are ASCII or fit
in a single code unit.
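
For instance, here's a rough (untested) sketch of the difference, using
a string with a combining character purely for illustration:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // 'e' followed by U+0308 (combining diaeresis) renders as one
        // user-perceived character.
        string s = "noe\u0308l";

        writeln(s.length);                 // 6 -- UTF-8 code units
        writeln(s.walkLength);             // 5 -- code points (what autodecoding iterates)
        writeln(s.byGrapheme.walkLength);  // 4 -- graphemes (user-perceived characters)
    }

Neither code units nor code points line up with what the user perceives
as a character here; only byGrapheme does.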

About "clever tricks", it's not really that hard.  I was thinking that
things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte
sequence, and then do a substring search directly on the encoded string.
This way, a large number of single-character algorithms don't even need
to decode.  The way UTF-8 is designed guarantees that there will not be
any false positives.  This will eliminate a lot of the current overhead
of autodecoding.
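
Roughly along these lines (an untested sketch of the idea using
std.utf.encode and std.string.representation; the real thing inside
Phobos would obviously be structured differently):

    import std.algorithm.searching : canFind;
    import std.stdio : writeln;
    import std.string : representation;
    import std.utf : encode;

    void main()
    {
        string s = "Шостакович";

        // Encode the needle into its UTF-8 code units once.
        char[4] buf;
        size_t len = encode(buf, 'Ш');

        // Search the raw code units; no decoding of s is needed.  The
        // disjoint lead/continuation byte ranges rule out matches at
        // misaligned offsets.
        writeln(canFind(s.representation, buf[0 .. len].representation));  // true
    }

The same encode-the-needle-once trick works for any single-dchar
needle, ASCII or not.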


T

-- 
Klein bottle for rent ... inquire within. -- Stephen Mulraney

