Proposal for fixing dchar ranges

H. S. Teoh hsteoh at quickfur.ath.cx
Mon Mar 10 12:46:56 PDT 2014


On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
> Am Mon, 10 Mar 2014 11:30:07 -0700
> schrieb Walter Bright <newshound2 at digitalmars.com>:
> 
> > On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
> > > An idea to fix the whole problems I see with char[] being treated
> > > specially by phobos: introduce an actual string type, with char[]
> > > as backing, that is a dchar range, that actually dictates the
> > > rules we want. Then, make the compiler use this type for literals.
> > 
> > Proposals to make a string class for D have come up many times. I
> > have a kneejerk dislike for it. It's a really strong feature for D
> > to have strings be an array type, and I'll go to great lengths to
> > keep it that way.

I'm on the fence about this one. The nice thing about strings being an
array type is that it's a familiar concept to C coders, and it allows
array slicing for extracting substrings, etc., which fits nicely with
the C view of strings as character arrays. As a C coder myself, I like
it this way too. But the bad thing about strings being an array type is
that it's a holdover from C, and it allows slicing for extracting
substrings -- including malformed substrings, since nothing stops a
slice from cutting through the middle of a multibyte (or, in UTF-16,
multiword) character.
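
A minimal sketch of the slicing problem, assuming a UTF-8 string and
using std.utf.validate to detect the damage. The helper isWellFormed is
a hypothetical name introduced only for this illustration:

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "h\u00E9llo";   // U+00E9 takes two UTF-8 code units

    // .length counts code units, not characters.
    assert(s.length == 6);

    // This slice ends on a code point boundary, so it is still valid.
    assert(isWellFormed(s[0 .. 3]));

    // This slice ends after the first byte of U+00E9. The slice
    // operation itself succeeds; the result is malformed UTF-8.
    assert(!isWellFormed(s[0 .. 2]));
}
```

Note that the bad slice is accepted silently; the error only shows up
later, when something tries to decode the result.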

Basically, the nice aspects of strings being arrays only apply when
you're dealing with ASCII (or mostly-ASCII) strings. These very same
"nice" aspects turn into problems when dealing with anything non-ASCII.
The only way the user can get it right using only array operations is
if they hold the whole of Unicode in their head and are willing to
reinvent Unicode algorithms every time they slice a string or do some
operation on it. Since D purportedly supports Unicode by default, it
shouldn't be this way. D should *actually* support Unicode all the way
-- use proper Unicode algorithms for substring extraction, collation,
line-breaking, normalization, etc. Being a systems language, of course,
means that D should allow you to get under the hood and do things
directly with the raw string representation -- but this shouldn't be the
*default* modus operandi.  The default should be a properly-encapsulated
string type with Unicode algorithms to operate on it (with the option of
reaching into the raw representation where necessary).
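
To make the gap between "array of code units" and "Unicode string"
concrete, here is a rough sketch using byGrapheme and normalize from
std.uni. It assumes a Phobos version that ships both, and the helper
graphemeCount is a hypothetical name used only here:

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";

    assert(s.length == 3);          // UTF-8 code units (raw array view)
    assert(s.walkLength == 2);      // code points (range-of-dchar view)
    assert(graphemeCount(s) == 1);  // what the user sees as one character

    // NFC normalization composes the pair into the single code point
    // U+00E9, so the same text now has different lengths at every
    // level below graphemes.
    string n = normalize!NFC(s);
    assert(n == "\u00E9");
    assert(n.length == 2);
}
```

None of the three counts is "the" length of the string; which one is
right depends on the operation, which is the argument for having the
default operations be Unicode-aware.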


> Question: which type T doesn't have slicing, has an ElementType of
> dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
> still satisfies isArray?
> 
> It's a string. Would you call that 'an array type'?
> 
> 	writeln(isArray!string);   //true
> 	writeln(hasSlicing!string); //false
> 	writeln(ElementType!string.stringof); //dchar
> 	writeln(ElementEncodingType!string.stringof); //char
> 
> I wouldn't call that an array. Part of the problem is that you want
> string to be arrays (fixed size elements, direct indexing) and Andrei
> doesn't want them to be arrays (operating on code points => not fixed
> size => not arrays).

Exactly. What we have right now is a Frankensteinian hybrid that's
neither fully an array, nor fully a Unicode string type. If we call the
current messy AA implementation split between compiler, aaA.d, and
object.di a design problem, then I'd call the current state of D strings
a design problem too. This underlying inconsistency is ultimately what
leads to the poor performance of strings in std.algorithm.
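
The split personality can be checked at compile time. The following is
only a sketch, assuming the std.range and std.traits templates behave
as in the Phobos I have in mind; note in particular that indexing still
yields a 1-byte char even though the range element type is a 4-byte
dchar:

```d
import std.range : ElementType, ElementEncodingType, hasSlicing;
import std.traits : isArray;

// The language view: string is an array of immutable(char).
static assert(isArray!string);
static assert(typeof(string.init[0]).sizeof == 1);
static assert(is(ElementEncodingType!string == immutable(char)));

// The Phobos range view: string is a range of dchar without slicing.
static assert(is(ElementType!string == dchar));
static assert(ElementType!string.sizeof == 4);
static assert(!hasSlicing!string);

void main()
{
    // Nothing to do at run time; if this compiles, all of the
    // assertions above hold.
}
```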

It's precisely because of this that I've given up on using std.algorithm
for strings altogether -- std.regex is far better: more flexible, more
expressive, and more performant, and specifically designed to operate on
strings. Nowadays I only use std.algorithm for non-string ranges
(because then the behaviour is actually consistent!!).


T

-- 
MS Windows: 64-bit overhaul of 32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit operating system originally coded for a 4-bit microprocessor, written by a 2-bit company that can't stand 1-bit of competition.

