string is rarely useful as a function argument

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sat Dec 31 07:03:13 PST 2011


On 12/31/11 8:17 CST, Michel Fortin wrote:
> On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 12/31/11 2:04 AM, Walter Bright wrote:
>>
>>> We're chasing phantoms here, and I worry a lot about over-engineering
>>> trivia.
>>
>> I disagree. I understand this seems like trivia to you, but that
>> doesn't make your opinion any less wrong, not to mention provincial
>> in its insistence that it applies beyond a small team of experts.
>> Again: I know no other person - I literally mean not one - who
>> writes string code the way you do (myself excepted, after learning
>> it from you); the current system is adequate; the proposed system is
>> perfect - save for breaking backwards compatibility, which makes the
>> discussion moot. But its being moot does not mean I must concede the
>> point. I am right.
>
> Perfect?

Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".

> At one time Java and other frameworks started treating UTF-16 code
> units as if they were characters, and that turned out badly for them.
> Now we know that not even code points should be considered
> characters, thanks to characters spanning multiple code points. You
> might call it perfect, but for that you have made two assumptions:
>
> 1. treating code points as characters is good enough, and
> 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.
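For readers following along, Michel's point about multi-code-point characters is easy to demonstrate. The sketch below uses Python rather than D only because the issue is language-neutral; the same distinction applies to D's char[] (code units) versus dchar (code points):

```python
import unicodedata

# "e-acute" in two canonically equivalent forms:
nfc = "\u00e9"      # precomposed: one code point
nfd = "e\u0301"     # decomposed: 'e' + combining acute accent

# The same logical character after normalization...
assert unicodedata.normalize("NFC", nfd) == nfc

# ...but different code-point counts:
assert len(nfc) == 1
assert len(nfd) == 2

# ...and different code-unit counts once encoded as UTF-8:
assert len(nfc.encode("utf-8")) == 2
assert len(nfd.encode("utf-8")) == 3
```

So "one character" can be one or two code points and two or three code units, which is exactly why equating any single level with "characters" is a design decision, not a given.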

> Ranges of code points might be perfect for you, but it's a tradeoff
> that won't work in every situation.

Ranges can be defined over logical glyphs that span multiple code points.
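As a rough sketch of what such a range would do (in Python for illustration; the function name `graphemes` is mine, and full segmentation is specified by Unicode's UAX #29 - this version only handles combining marks, the case under discussion):

```python
import unicodedata

def graphemes(s):
    """Very rough grapheme grouping: attach combining marks to their
    base character, yielding one string per logical glyph."""
    cluster = ""
    for cp in s:
        if cluster and unicodedata.combining(cp):
            cluster += cp          # extend the current logical glyph
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster

# 'e' + combining acute comes out as a single logical glyph:
assert list(graphemes("e\u0301tude")) == ["e\u0301", "t", "u", "d", "e"]
```

A D range with the same front/popFront behavior would present one glyph at a time while still being backed by the raw code units.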

> The whole concept of generic algorithms working on strings efficiently
> doesn't work.

Apparently std.algorithm does.

> Applying generic algorithms to strings by treating them as
> a range of code points is both wasteful (because it forces you to decode
> everything) and incomplete (because of multi-code-point characters) and
> it should be avoided.

An algorithm that gains by accessing the encoding can do so - and indeed 
some do. Spanning multi-code-point characters is a matter of defining 
the range appropriately; it doesn't break the abstraction.
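One example of the kind of encoding-aware shortcut meant here (sketched in Python; the helper name `find_ascii` is mine): searching for an ASCII delimiter can run directly over the raw UTF-8 code units, with no decoding at all, because UTF-8 guarantees that bytes below 0x80 never occur inside a multi-byte sequence.

```python
def find_ascii(utf8_bytes, ch):
    """Find an ASCII character in UTF-8 text without decoding.
    Safe because all bytes of a multi-byte UTF-8 sequence are >= 0x80,
    so an ASCII byte can never be a false match."""
    assert ord(ch) < 0x80, "only valid for ASCII needles"
    return utf8_bytes.find(ch.encode("ascii"))

text = "na\u00efve: d\u00e9j\u00e0 vu".encode("utf-8")
i = find_ascii(text, ":")
assert text[:i].decode("utf-8") == "na\u00efve"
```

This is the pattern a specialized std.algorithm overload for char[] can exploit while the generic code-point view remains available for everything else.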

> Algorithms working on Unicode strings should be
> designed with Unicode in mind. And the best way to design efficient
> Unicode algorithms is to access the array of code units directly and
> read each character at the level of abstraction required and know what
> you're doing.

As I said, that's happening already.

> I'm not against making strings more opaque to encourage people to use
> the Unicode algorithms from the standard library instead of rolling
> their own.

I'd say we're discussing making the two kinds of manipulation (encoded 
sequence of logical characters vs. array of code units) more clearly 
distinguished from each other. That's a Good Thing(tm).

> But I doubt the current approach of using .raw alone will
> prevent many from doing dumb things.

I agree. But I think it would be a sensible improvement over the status 
quo, in which you get to do a ton of dumb things with much more ease.

> On the other side I'm sure it'll
> make it more complicated to write Unicode algorithms, because
> accessing and especially slicing the raw content of char[] will
> become tiresome. I'm not convinced it's a net win.

Many Unicode algorithms don't need slicing. Those that do carefully mix 
manipulation of code points with manipulation of representation. It is a 
net win that the two operations are explicitly distinguished.
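To make the "careful mix" concrete (again in Python; `byte_slice` is an invented name): such algorithms compute positions at the code-point level, translate them to code-unit offsets explicitly, and only then touch the representation.

```python
def byte_slice(s, cp_start, cp_end):
    """Translate code-point indices into code-unit (byte) offsets,
    then slice the raw UTF-8 representation at those offsets."""
    raw = s.encode("utf-8")
    start = len(s[:cp_start].encode("utf-8"))
    end = len(s[:cp_end].encode("utf-8"))
    return raw[start:end]

# Code points 5..7 of "déjà vu" are "vu", at byte offsets 7..9:
assert byte_slice("d\u00e9j\u00e0 vu", 5, 7) == b"vu"
```

Keeping the two index spaces explicit, rather than letting one silently stand in for the other, is precisely the distinction being argued for.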

> As for Walter being the only one coding by looking at code units
> directly, that's not true. All my parser code looks at code units
> directly and only decodes to code points where necessary (just look at
> the XML parsing code I posted a while ago to get an idea of how it can
> apply to ranges). And I don't think it's because I've seen Walter's
> code before; I think it's because I know how Unicode works and I want
> to make my parser efficient. I did the same for a parser in C++ a
> while ago. I can hardly imagine I'm the only one (besides Walter and
> you). I think this is how efficient algorithms dealing with Unicode
> should be written.

Congratulations.


Andrei


More information about the Digitalmars-d mailing list