string is rarely useful as a function argument

Wed Dec 28 13:17:49 PST 2011

Apparently my previous post was lost. Apologies if this comes out twice.

On 12/28/2011 09:39 PM, Jonathan M Davis wrote:
> On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
>> Why? char and wchar are Unicode code units, ubyte/ushort are unsigned
>> integrals. It is clear that char/wchar are a better match.
>
> It's an issue of the correct usage being the easy path. As it stands, it's
> incredibly easy to use narrow strings incorrectly. By forcing any array of
> char or wchar to use .rep.length instead of .length, the relatively automatic
> (and generally incorrect) usage of .length on a string wouldn't immediately
> work. It would force you to work more at doing the wrong thing. Unfortunately,
> walkLength isn't necessarily any easier than .rep.length, but it does force
> people to look into why they can't do .length, which will generally better
> educate them and will hopefully reduce the misuse of narrow strings.
>

I was educated enough not to make that mistake, because I read the 
entire language specification before deciding the language was awesome 
and downloading the compiler. I find it strange that the product should 
be made less usable because we do not expect users to read the manual. 
But it is of course a valid point.
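
For readers following along, here is a minimal sketch of the distinction 
being argued about. The .rep property in this thread is only a proposal; 
the sketch assumes std.string.representation from later Phobos releases 
as a stand-in for it, and the sample string is made up for illustration.

```d
import std.range : walkLength;
import std.string : representation;

void main()
{
    // '\u00E9' (e with acute accent) is one code point but two UTF-8 code units.
    string s = "h\u00E9llo";

    // .length on a narrow string counts code units, not characters.
    assert(s.length == 6);

    // The representation is the same data typed as immutable(ubyte)[],
    // so its length is also the code unit count.
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));
    assert(s.representation.length == 6);

    // walkLength decodes the string and counts code points, which is O(n).
    assert(s.walkLength == 5);
}
```

Whether .length giving 6 here is a trap or simply the documented behaviour 
of an array of code units is the point under dispute.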

> If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then
> we reinforce the fact that you shouldn't operate on chars or wchars.

There is nothing wrong with operating at the code unit level. Efficient 
slicing is very desirable.
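
A small sketch of what I mean, assuming only std.string.indexOf, which 
returns a code unit index. The "key=value" input is a made-up example.

```d
import std.string : indexOf;

void main()
{
    string s = "key=h\u00E9llo";

    // indexOf reports a code unit offset, so the result is always a
    // valid slice boundary even though the string contains non-ASCII text.
    auto i = s.indexOf('=');
    assert(i == 3);

    // Both slices are O(1): no decoding and no copying takes place.
    string key = s[0 .. i];
    string value = s[i + 1 .. $];

    assert(key == "key");
    assert(value == "h\u00E9llo");
    assert(value.length == 6); // still measured in code units
}
```

As long as the indices come from a search over the same string, the slices 
cannot cut a code point in half.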

> It also
> makes it simple for the compiler to never allow you to use length on char[] or
> wchar[], since it doesn't have to worry about whether you got that char[] or
> wchar[] from a rep property or not.
>
> Now, I don't know if this is really a good move at this point. If we were to
> really do this right, we'd need to disallow indexing and slicing of the char[]
> and wchar[] as well, which would break that much more code. It also pretty
> quickly makes it look like string should be its own type rather than an array,
> since it's acting less and less like an array.

Exactly. It is acting less and less like an array of code units. But it 
*is* an array of code units. If the general consensus is that we need a 
string data type that acts at a different abstraction level by default 
(with which I'd disagree, but apparently I don't have a popular opinion 
here), then we need a string type in the standard library to do that. 
Changing the language so that an array of code units stops behaving like 
an array of code units is not a solution.

> Not to mention, even the
> correct usage of .rep would become rather irritating (e.g. slicing it when you
> know that the indices that you're dealing with aren't going to cut into any
> code points), because you'd have to cast from ubyte[] to char[] whenever you
> did that.
>
> So, I think that the general sentiment behind this is a good one, but I don't
> know if the exact idea is ultimately a good one - particularly at this stage
> in the game. If we're going to make a change like this which would break as
> much code as this would, we'd need to be _very_ certain that it's what we want
> to do.
>
> - Jonathan M Davis

I agree.