[review] new string type

Jonathan M Davis jmdavisProg at gmx.com
Tue Nov 30 15:34:56 PST 2010


On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg at gmx.com>
> 
> wrote:
> > 1. At least until universal function syntax is in the language, you use
> > the
> > ability to do str.func().
> 
> Yes, but first, this is not a problem with the string type, it's a problem
> with the language.  Second, any string-specific functions can be added, it
> is a struct after all, not a builtin.

I really don't think that we want to have to be adding string functions to a 
string struct. Having them external is far better, particularly when there are 
so many of them which aren't directly related to strings but use them heavily. 
Even if all of std.string got tacked onto a struct string type, that would leave 
out the rest of Phobos and any user code. I really think that universal function 
syntax is a necessity for a struct solution to be acceptable.

I had considered a solution similar to this one a few months back, and the lack 
of universal function syntax is one of the reasons why I decided that it wasn't 
really an improvement. Honestly, without uniform function call syntax, I would 
consider a struct solution to be DOA. The loss in useability would just be too 
great.

> > 2. Functions that would work just fine treating strings as arrays of
> > code units
> > (presumably because they don't care about what the actual data is) lose
> > efficiency, because now a string isn't an array.
> 
> Which functions are those?  They can be allowed via wrappers.

It is my understanding that there are various functions in std.algorithm which 
are able to treat strings as arrays and therefore process them more efficiently. I 
haven't looked into which ones are in that camp. I would think that find() might 
be, but I'd have to look.

> > 3. You have no access to the underlying array unless you're dealing with
> > an
> > actual array of dchar.
> 
> I thought of adding some kind of access.  I wasn't sure the best way.
> 
> I was thinking of allowing direct access via opCast, because I think
> casting might be a sufficient red flag to let you know you are crossing
> into dangerous waters.
> 
> But it could just be as easy as making the array itself public.

Something like that would need to be done. It really should be a public property 
of some kind if not an outright public variable.

> > 4. Indexing is no longer O(1), which violates the guarantees of the index
> > operator.
> 
> Indexing is still O(1).
> 
> > 5. Slicing (other than a full slice) is no longer O(1), which violates
> > the
> > guarantees of the slicing operator.
> 
> Slicing is still O(1).

You're right. I misread what _charStart() did. However, if I understand it 
correctly now, you give it a code unit index and yet get a code point back. That 
worries me. It means that there is no relation between the length of the string 
and the indices that you'd use to index it. It _is_ related to the number of 
code units, which you can get with codeUnits(), but that disjoint seems like it 
could cause problems. It might be the correct solution, but I think that it 
merits some examination. Returning a code unit would be wrong since that totally 
breaks with the rest of the API dealing in dchar/code units, and it would be 
wrong to index by code point, since then indexing and slicing are no longer 
O(1). So, it's probably a choice between indexing by one thing and returning 
another or not having indexing and slicing, which would definitely not be good. 
So, maybe your solution is the best one in this respect, but it worries me. The 
exact ramifications of that need to be looked into.

> > What you're doing here is forcing the view of a string as a range of
> > dchar in
> > all cases. Granted, that's what you want in most cases, but it can
> > degrade
> > efficiency, and the fact that some operations (in particular indexing
> > and slicing)
> > are not O(1) like they're supposed to be means that algorithms which
> > rely on
> > O(1) behavior from them could increase their cost by an order of
> > magnitude. All
> > the cases where treating a string as an actual array which are currently
> > valid
> > are left out to dry
> 
> You can still use char[] and wchar[].

Except that what if you need to do both with the same type? Right now, you could 
have a function which treats a string as a range of dchar while another one 
which can get away with treating it as a range of code units can treat it as an 
array. You can pass the same string to both, and it works. That should still 
work if we go for a struct solution. Special-casing on strings and specifically 
using the internal array instead of the struct for them could fix the problem, 
but it still needs to be possible.

> > The inherent problem with strings is that we _want_ to be able to view
> > them as
> > both arrays of code units and as ranges of dchar. Trying to view them
> > only as
> > ranges of dchar is inefficient in a number of cases. Even something as
> > simple as
> > getting a substring is less efficient with this code. Only in cases
> > where you
> > don't actually know the length of the substring that you want is this
> > essentially as efficient as what we have now. You're ignoring all cases
> > where
> > viewing a string as an array of code units is correct and desirable.
> > You're only
> > solving half of the problem (albeit the more prevalent half).
> 
> I'm not planning on eliminating char and wchar based arrays.
> 
> In other words, you should be able to get access to the array, but it
> should not be the default, and it should be considered unsafe.
> 
> > Now, making it possible to have a wrapper struct like this which works
> > in many
> > cases where you'd use strings could reduce errors in code in many
> > situations, so
> > giving the programmer the ability to do that could be interesting. But
> > it seems
> > to me that this solution is too limiting to be a viable replacement for
> > strings
> > as they are.
> 
> Hopefully you can see that I'm not eliminating the functionality you are
> looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar by 
default. But for the most part, that's the case already thanks to how std.array 
is defined. The only place where that runs into trouble is if you use foreach. It 
still treats them as arrays of code units unless you tell it to iterate over 
dchar. Either making foreach over character arrays iterate over dchar by default 
or making it a warning or error to use foreach with a char or wchar array of any 
kind without specifying the type to iterate over would fix that problem.

I don't think that the basic idea of what you present is necessarily a bad idea 
(I've considered proposing similar solutions in the past), but I'm not at all 
convinced that it's an improvement. What we have works quite well overall, with 
the notable exception of foreach. I'm also worried that trying to bury the fact 
that strings are arrays of code units is going to cause a leaky abstraction 
which will case problems. It's not likely to be as bad as the situation with 
char literals and std::string in C++, but I think that we need to look at this 
very carefully and examine what all of the ramifications are before we even 
consider going for it.

There is an inherent but necessary disjoint between having strings be arrays of 
code units and ranges of dchar. Sometimes they need to be treated as one, 
sometimes as the other. Ideally, the default - whichever it is - would be the 
one which leads to fewer bugs. But they both need to be there. A struct solution 
is essentially an attempt to make strings treated as ranges of dchar in more 
situations by default than is currenly the case. As such, it could be better 
than what we have now, but I'm honestly not convinced that it is. Aside from the 
foreach problem (which could be fixed with an appropriate warning or error - 
preferrably error), what we have works quite well.

- Jonathan M Davis


More information about the Digitalmars-d mailing list