string is rarely useful as a function argument

Chad J chadjoan at __spam.is.bad__gmail.com
Sat Dec 31 10:22:02 PST 2011


On 12/30/2011 02:55 PM, Timon Gehr wrote:
> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>>> On 12/29/11 12:28 PM, Don wrote:
>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>> Oh, one more thing - one good thing that could come out of this thread
>>>>> is abolition (through however slow a deprecation path) of s.length and
>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length
>>>>> and
>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>>> char/wchar.
>>>>> Then, people would access the decoding routines on the needed
>>>>> occasions,
>>>>> or would consciously use the representation.
>>>>>
>>>>> Yum.
>>>>
>>>>
>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>> just means, "I know what I'm doing", and there's no change to existing
>>>> semantics, purely a syntax change.
>>>
>>> Exactly!
>>>
>>>> If you change s[i] into s.rep[i], it does the same thing as now.
>>>> There's
>>>> no loss of functionality -- it just stops you from accidentally doing
>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>> write:
>>>> ubyte [] u = s.rep;
>>>> and use u from then on.
>>>>
>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>> Apart from that, I think this would be perfect.
>>>
>>> Yes, I mean "rep" as a short for "representation" but upon first sight
>>> the connection is tenuous. "raw" sounds great.
>>>
>>> Now I'm twice sorry this will not happen...
>>>
>>
>> Maybe it could happen if we
>> 1. make dstring the default string type --
> 
> Inefficient.
> 

But correct (enough).
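
To be concrete about the "(enough)": with dstring, indexing can never
split a character's encoding, though a combining sequence still spans
multiple elements.  A quick sketch, runnable in today's D:

```d
void main()
{
    // A dchar is one code point, so dstring indexing never slices
    // through the middle of a character's encoding...
    dstring t = "e\u0301"d; // 'e' + combining acute accent: "é"

    // ...but it is only correct "enough": this is one visible
    // character, yet still two code points.
    assert(t.length == 2);
}
```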

>> code units and characters would be the same
> 
> Wrong.
> 

*sigh*, FINE.  Code units and /code points/ would be the same.
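
A minimal illustration of the difference, using today's string (UTF-8)
and dstring (UTF-32):

```d
void main()
{
    string  s8  = "héllo";   // UTF-8: 'é' encodes as two code units
    dstring s32 = "héllo"d;  // UTF-32: one code unit per code point

    assert(s8.length  == 6); // counts bytes, not characters
    assert(s32.length == 5); // counts code points
    assert(s32[1] == 'é');   // indexing by code point is safe
}
```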

>> or 2. forward string.length to std.utf.count and opIndex to
>> std.utf.toUTFindex
> 
> Inconsistent and inefficient (it blows up the algorithmic complexity).
> 

Inconsistent?  How?

Inefficiency is a lot easier to deal with than incorrectness.  If
something is inefficient, then in the right places I will NOTICE.  If
something is incorrect, it can hide for years until that one person (or
country, in this case) with a different usage pattern than the others
uncovers it.
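
For reference, here is roughly what option 2 would forward to.  Both
std.utf.count and std.utf.toUTFindex walk the string from the start, so
every forwarded .length or s[i] would be O(n), and a naive indexing loop
O(n^2) -- that is the complexity blowup being objected to:

```d
import std.utf : count, toUTFindex;

void main()
{
    string s = "héllo";

    // O(n): walks the string counting code points, not bytes.
    assert(count(s) == 5);

    // O(n): finds the byte offset where code point #2 begins
    // ('h' = byte 0, 'é' = bytes 1-2, 'l' = byte 3).
    assert(toUTFindex(s, 2) == 3);
}
```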

>>
>> so programmers could use the slices/indexing/length (no laziness
>> problems), and if they really want codeunits use .raw/.rep (or better
>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)
>>
> 
> Anyone who intends to write efficient string processing code needs this.
> Anyone who does not want to write string processing code will not need
> to index into a string -- standard library functions will suffice.
> 

What about people who want to write correct string processing code AND
want to use this handy slicing feature?  Because I totally want both of
these.  Slicing is super useful for script-like coding.
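
Something like this is what I mean -- decoded iteration when I want
correctness, and std.string.representation (roughly what the proposed
.rep/.raw would spell) when I consciously want bytes:

```d
import std.string : representation;

void main()
{
    string s = "héllo";

    // Correct per-character processing: foreach over dchar decodes
    // the UTF-8 for you.
    size_t chars;
    foreach (dchar c; s)
        ++chars;
    assert(chars == 5);

    // Explicit raw access when bytes are really what I mean.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 6);
    assert(raw[1] == 0xC3); // first byte of UTF-8 'é'
}
```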

>> But generally I liked the idea of just having an alias for strings...
> 
> Me too. I think the way we have it now is optimal. The only reason we
> are discussing this is because of fear that uneducated users will write
> code that does not take into account Unicode characters above code point
> 0x80. But what is the worst thing that can happen?
> 
> 1. They don't notice. Then it is not a problem, because they are
> obviously only using ASCII characters and it is perfectly reasonable to
> assume that code units and characters are the same thing.
> 

How do you know they are only working with ASCII?  They might be /now/.
But what if someone else uses the program a couple of years later, when
the original author is no longer maintaining that chunk of code?

> 2. They get screwed up string output, look for the reason, patch up
> their code with some functions from std.utf and will never make the same
> mistakes again.
> 

Except they don't, because a lot of programmers will never put
non-ASCII strings in to begin with.  But that has nothing to do with
whether the /users/ or /maintainers/ of that code will put non-ASCII
strings in.  This could make some messes.
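
A sketch of how such a mess hides.  The helper function here is
hypothetical, but the pattern is everywhere -- it passes every
ASCII-only test its author would think to write:

```d
// Looks reasonable, and every ASCII test says it works.
char charAfter(string s, size_t i)
{
    return s[i + 1]; // actually: the (i+1)'th BYTE
}

void main()
{
    assert(charAfter("naive", 1) == 'i'); // ASCII: fine

    // A later user types "naïve": bytes 2 and 3 are the two
    // UTF-8 code units of 'ï', so the returned "character" is
    // half of an encoding -- silently wrong, not an error.
    char c = charAfter("naïve", 1);
    assert(c != 'i');
    assert(c == 0xC3); // first byte of UTF-8 'ï'
}
```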

> 
> I have *never* seen a user in D.learn complain about it. There might
> have been some I missed, but it is certainly not a prevalent problem.
> Also, just because a user can type .rep does not mean he understands
> Unicode: he is able to make just the same mistakes as before, even more
> so, as the array he is getting back has the _wrong element type_.
> 

You know, here in America (Amurica?) we don't know that other countries
exist.  I think there is a large population of programmers here who
don't even know how to enter non-Latin characters, much less would think
to include such characters in their test cases.  These programmers won't
necessarily be found on the internet much, but they will be found in
cubicles all around, doing their 9-to-5 and writing mediocre code that
the rest of us have to put up with.  Their code will pass peer review
(their peers are also from America) and continue working just fine until
someone from one of those confusing other places decides to type in the
characters they feel comfortable typing.  No, there will not be
/tests/ for code points greater than 0x80, because there is no one
around to write them.  I'd feel a little better if D herded people into
writing correct code to begin with, because they won't otherwise.

...

There's another issue at play here too: efficiency vs correctness as a
default.

Here's the tradeoff --

Option A:
s[i] (for a char[] s) returns the i'th code unit (byte) of the string
as a (char) type.
Consequences:
(1) Code is efficient and INcorrect.
(2) It requires extra effort to write correct code.
(3) Detecting the incorrect code may take years, as these errors can
hide easily.

Option B:
s[i] returns the i'th code point of the string as a (dchar) type.
Consequences:
(1) Code is INefficient and correct.
(2) It requires extra effort to write efficient code.
(3) Detecting the inefficient code happens in minutes.  It is VERY
noticeable when your program runs too slowly.


This is how I see it.

And I really like my correct code.  If it's too slow, and I'll /know/
when it's too slow, then I'll profile->tweak->profile->etc until the
slowness goes away.  I'm totally digging option B.
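
A sketch of the Option B workflow I have in mind -- the decode-up-front
step is the one spot I'd profile and tune later if it ever shows up:

```d
import std.conv : to;

void main()
{
    string input = "héllo wörld";

    // Correct-first: pay the decoding cost once, up front...
    dstring text = to!dstring(input);

    // ...then index and slice by code point with no surprises.
    assert(text[1] == 'é');
    assert(text[0 .. 5] == "héllo"d);

    // If this is ever too slow, only the conversion above needs
    // tuning -- the slicing logic below it stays correct.
}
```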

