[review] new string type

Steven Schveighoffer schveiguy at yahoo.com
Wed Dec 1 14:09:27 PST 2010


On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <jmdavisProg at gmx.com>  
wrote:

> On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
>> On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis  
>> <jmdavisProg at gmx.com>
>>
>> wrote:
>> > 1. At least until universal function syntax is in the language, you  
>> use
>> > the
>> > ability to do str.func().
>>
>> Yes, but first, this is not a problem with the string type, it's a  
>> problem
>> with the language.  Second, any string-specific functions can be added,  
>> it
>> is a struct after all, not a builtin.
>
> I really don't think that we want to have to be adding string functions  
> to a
> string struct. Having them external is far better, particularly when  
> there are
> so many of them which aren't directly related to strings but use them  
> heavily.
> Even if all of std.string got tacked onto a struct string type, that  
> would leave
> out the rest of Phobos and any user code. I really think that universal  
> function
> syntax is a necessity for a struct solution to be acceptable.

Why?  It's a choice between adding them to the struct, and doing func(str)  
instead.  You already have to do func(R) for arbitrary ranges, I don't see  
why strings should be specialized.  For things that should be considered  
'built-in' for strings, they can be included as members of the struct.

In other words, I think it's worth fixing strings now, even if it means we  
cannot call str.func() until the universal syntax is introduced.

> I had considered a solution similar to this one a few months back, and  
> the lack
> of universal function syntax is one of the reasons why I decided that it  
> wasn't
> really an improvement. Honestly, without uniform function call syntax, I  
> would
> consider a struct solution to be DOA. The loss in useability would just  
> be too
> great.

But you can still call the function, it's just func(str).  There is no  
loss of functionality.  Usability isn't even really lost.

>> > 2. Functions that would work just fine treating strings as arrays of
>> > code units
>> > (presumably because they don't care about what the actual data is)  
>> lose
>> > efficiency, because now a string isn't an array.
>>
>> Which functions are those?  They can be allowed via wrappers.
>
> It is my understanding that there are various functions in std.algorithm  
> which
> are able to treat strings as arrays and therefore process them more  
> efficiently. I
> haven't looked into which ones are in that camp. I would think that  
> find() might
> be, but I'd have to look.

Since std.algorithm works exclusively with ranges, those must be special  
cases for strings because strings are bi-directional ranges of dchar  
according to phobos.  So those can continue to be special cases for the  
new string types.

>> > 4. Indexing is no longer O(1), which violates the guarantees of the  
>> index
>> > operator.
>>
>> Indexing is still O(1).
>>
>> > 5. Slicing (other than a full slice) is no longer O(1), which violates
>> > the
>> > guarantees of the slicing operator.
>>
>> Slicing is still O(1).
>
> You're right. I misread what _charStart() did. However, if I understand  
> it
> correctly now, you give it a code unit index and yet get a code point  
> back. That
> worries me. It means that there is no relation between the length of the  
> string
> and the indices that you'd use to index it. It _is_ related to the  
> number of
> code units, which you can get with codeUnits(), but that disjoint seems  
> like it
> could cause problems. It might be the correct solution, but I think that  
> it
> merits some examination. Returning a code unit would be wrong since that  
> totally
> breaks with the rest of the API dealing in dchar/code units, and it  
> would be
> wrong to index by code point, since then indexing and slicing are no  
> longer
> O(1). So, it's probably a choice between indexing by one thing and  
> returning
> another or not having indexing and slicing, which would definitely not  
> be good.
> So, maybe your solution is the best one in this respect, but it worries  
> me. The
> exact ramifications of that need to be looked into.

How many times do you use a hard-coded index into a string without knowing  
the encoding of the string (i.e. I know this string is ascii)?  How many  
times do you iterate the characters of a string via an incrementing index?

It's not common to use the indexing operation with values that are not  
known or computed to be valid starts to code-points.  In fact, the  
language depends on that (otherwise you'd see sliced strings everywhere  
with invalid data).

So while it looks strange, it shouldn't be a common need.

That being said, I think Lars pointed out that the strangeness of  
returning the code point even if you point in the middle would be  
surprising in some cases, so I think the better solution is to throw an  
exception.

>> > What you're doing here is forcing the view of a string as a range of
>> > dchar in
>> > all cases. Granted, that's what you want in most cases, but it can
>> > degrade
>> > efficiency, and the fact that some operations (in particular indexing
>> > and slicing)
>> > are not O(1) like they're supposed to be means that algorithms which
>> > rely on
>> > O(1) behavior from them could increase their cost by an order of
>> > magnitude. All
>> > the cases where treating a string as an actual array which are  
>> currently
>> > valid
>> > are left out to dry
>>
>> You can still use char[] and wchar[].
>
> Except that what if you need to do both with the same type? Right now,  
> you could
> have a function which treats a string as a range of dchar while another  
> one
> which can get away with treating it as a range of code units can treat  
> it as an
> array. You can pass the same string to both, and it works. That should  
> still
> work if we go for a struct solution. Special-casing on strings and  
> specifically
> using the internal array instead of the struct for them could fix the  
> problem,
> but it still needs to be possible.

Yeah, it's definitely needed.  I'll add access to the data member in the  
next version.

>> Hopefully you can see that I'm not eliminating the functionality you are
>> looking for, just making it not the default.
>
> There is definitely some value in making strings treated as ranges of  
> dchar by
> default. But for the most part, that's the case already thanks to how  
> std.array
> is defined. The only place where that runs into trouble is if you use  
> foreach.

or indexing.  This is a huge problem.

> It
> still treats them as arrays of code units unless you tell it to iterate  
> over
> dchar. Either making foreach over character arrays iterate over dchar by  
> default
> or making it a warning or error to use foreach with a char or wchar  
> array of any
> kind without specifying the type to iterate over would fix that problem.

I agree that would be ideal, but it still doesn't solve the indexing  
problem.

> There is an inherent but necessary disjoint between having strings be  
> arrays of
> code units and ranges of dchar. Sometimes they need to be treated as one,
> sometimes as the other. Ideally, the default - whichever it is - would  
> be the
> one which leads to fewer bugs. But they both need to be there. A struct  
> solution
> is essentially an attempt to make strings treated as ranges of dchar in  
> more
> situations by default than is currenly the case. As such, it could be  
> better
> than what we have now, but I'm honestly not convinced that it is. Aside  
> from the
> foreach problem (which could be fixed with an appropriate warning or  
> error -
> preferrably error), what we have works quite well.

The thing is, the most common use of strings is as a string, not as an  
array of code-units.  The common case is to print, slice, find, etc. on a  
*string*.  When dealing with the string as a whole, either using an array  
or a specialized type works equally well.

The uncommon case is to extract individual characters from the string.  In  
this case, the default needs to be the most common need in that area --  
extracting a dchar, not a code-unit.  Having the default index operation  
extract a code-unit is very incorrect.

-Steve


More information about the Digitalmars-d mailing list