string is rarely useful as a function argument

Michel Fortin michel.fortin at michelf.com
Sat Dec 31 06:17:34 PST 2011


On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 12/31/11 2:04 AM, Walter Bright wrote:
> 
>> We're chasing phantoms here, and I worry a lot about over-engineering
>> trivia.
> 
> I disagree. I understand that seems trivia to you, but that doesn't 
> make your opinion any less wrong, not to mention provincial through 
> insistence it's applicable beyond a small team of experts. Again: I 
> know no other - I literally mean not one - person who writes string 
> code like you do (and myself after learning it from you); the current 
> system is adequate; the proposed system is perfect - save for breaking 
> backwards compatibility, which makes the discussion moot. But it being 
> moot does not afford me to concede this point. I am right.

Perfect? At one time Java and other frameworks started treating UTF-16 
code units as if they were characters, and that turned out badly for 
them. Now we know that not even code points should be considered 
characters, because a single character can span multiple code points. 
You might call it perfect, but to do so you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff 
that won't work in every situation.
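
To make assumption 1 concrete, here is a small D snippet (not from 
this thread, purely illustrative):

    import std.range : walkLength;

    void main()
    {
        // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
        // one user-perceived character, two code points, and three
        // UTF-8 code units.
        string s = "e\u0301";
        assert(s.length == 3);      // code units
        assert(s.walkLength == 2);  // code points (range of dchar)
    }

A range of dchar hands you those two code points one at a time; 
nothing at that level of abstraction tells you they form a single 
character.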

The whole concept of generic algorithms working efficiently on strings 
doesn't hold up. Applying generic algorithms to strings by treating 
them as ranges of code points is both wasteful (because it forces you 
to decode everything) and incomplete (because of multi-code-point 
characters), so it should be avoided. Algorithms working on Unicode 
strings should be designed with Unicode in mind. And the best way to 
design efficient Unicode algorithms is to access the array of code 
units directly, to read each character at the level of abstraction 
required, and to know what you're doing.
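
For instance, here is a minimal sketch (written for this message, not 
anyone's production code) of a scan that touches code units directly 
and pays the decoding cost only when a multi-byte sequence actually 
shows up:

    import std.utf : decode;

    // Return the first non-ASCII code point in a UTF-8 string, or
    // dchar.init if there is none. ASCII bytes are checked without
    // ever calling decode.
    dchar firstNonAscii(const(char)[] s)
    {
        size_t i = 0;
        while (i < s.length)
        {
            if (s[i] < 0x80)     // ASCII fast path: one code unit
            {
                ++i;
                continue;
            }
            return decode(s, i); // decode just this one sequence
        }
        return dchar.init;
    }

On mostly-ASCII input this does one byte comparison per code unit, 
whereas a range-of-dchar formulation would decode at every position.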

I'm not against making strings more opaque to encourage people to use 
the Unicode algorithms from the standard library instead of rolling 
their own. But I doubt the current approach of using .raw alone will 
prevent many people from doing dumb things. On the other hand, I'm sure 
it'll make it more complicated to write Unicode algorithms, because 
accessing and especially slicing the raw content of char[] will become 
tiresome. I'm not convinced it's a net win.
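
To illustrate the tiresome part (speculative, since the exact shape of 
.raw isn't settled; I'm assuming it yields the code units as ubyte[]):

    // Today a parser slices code units directly:
    string sliceToken(string s, size_t start, size_t end)
    {
        return s[start .. end];  // indices are code-unit offsets
    }

    // Under the proposal this would presumably become something like
    //     auto bytes = s.raw;                      // hypothetical
    //     return cast(string) bytes[start .. end];
    // that is, an extra hop and a cast for every slice a parser takes.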

As for Walter being the only one coding by looking at the code units 
directly, that's not true. All my parser code looks at code units 
directly and only decodes to code points where necessary (just look at 
the XML parsing code I posted a while ago to get an idea of how it can 
apply to ranges). And I don't think it's because I've seen Walter's 
code before; I think it's because I know how Unicode works and I want 
to make my parser efficient. I did the same for a parser in C++ a 
while ago. I can hardly imagine I'm the only one (besides Walter and 
you). I think this is how efficient algorithms dealing with Unicode 
should be written.
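
The pattern looks roughly like this (a sketch in the same spirit, not 
the actual code I posted): match ASCII code units byte by byte and 
decode a full code point only when a byte >= 0x80 turns up:

    import std.uni : isAlpha;
    import std.utf : decode;

    // Length, in code units, of the leading name in s. ASCII name
    // characters are matched directly; only non-ASCII bytes force a
    // decode to a code point.
    size_t scanName(const(char)[] s)
    {
        size_t i = 0;
        while (i < s.length)
        {
            immutable char c = s[i];
            if (c < 0x80)
            {
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_' || c == '-')
                    ++i;
                else
                    break;
            }
            else
            {
                size_t j = i;
                if (!isAlpha(decode(s, j)))
                    break;
                i = j;           // accept the whole sequence
            }
        }
        return i;
    }

Nothing here decodes an ASCII byte, yet non-ASCII name characters are 
still handled correctly at the code point level.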

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


