On the meaning of string.length

Adam D. Ruppe via Digitalmars-d-announce digitalmars-d-announce at puremagic.com
Wed Nov 19 06:33:03 PST 2014


I answered a random C# Stack Overflow question about why 
string.length returns the value it does, with some rationale 
defending code units instead of "characters". Basically, I typed 
up a defense of D's string-as-array behavior.

To my surprise, my answer got an enormous number of votes* so I 
decided to post it to reddit too.

http://www.reddit.com/r/programming/comments/2mqghp/why_does_stringlength_count_code_units_instead_of/

I find it really encouraging that there's been such a positive 
response. The question comes up here every so often too, with 
people saying string.length should give the number of characters, 
and of course the automatic UTF decoding done in Phobos comes up 
from time to time as well.

It looks like D, the language, made the right decisions here.

This reddit comment applies to the Phobos autodecoding issue though:

"Most people like to pick on surrogate pairs here, and decry 
languages which don't handle them "properly", but I think it's 
important to point out that handling surrogate pairs as a single 
character doesn't in any way fix the underlying issue -- many 
multiple-codepoint sequences are a single logical glyph even if 
you use 32 bit wide chars."
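
To make that concrete, here is a rough sketch in Python (not D, and 
not from the reddit thread) that counts the same text three ways. 
The helper names are mine, made up for illustration. It shows that 
even when every code point is counted as one unit, as with 32-bit 
chars, "e" plus a combining accent still comes out as two.

```python
# Hypothetical helpers for illustration only; none of these names
# come from D, Phobos, or C#.

def utf8_units(s):
    """Number of UTF-8 code units (bytes), like D's string.length."""
    return len(s.encode("utf-8"))

def utf16_units(s):
    """Number of UTF-16 code units, like C#'s string.Length."""
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    """Number of code points, i.e. the length with 32-bit wide chars."""
    return len(s)

emoji = "\U0001F600"   # one code point outside the BMP (surrogate pair in UTF-16)
accented = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, shown as one glyph

for label, text in (("emoji", emoji), ("accented", accented)):
    print(label, utf8_units(text), utf16_units(text), code_points(text))
```

The emoji is 4, 2, and 1 units respectively, so decoding surrogate 
pairs "fixes" it. The accented letter is 3, 2, and 2, so no choice 
of code unit width gives the answer "one character".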


I know this has been said a lot of times... but I think the auto 
decoding in Phobos was and is a mistake. The bigger question is 
what I posited on Stack Overflow: "Moreover, what's the point? Why 
does these metrics matter?" Similarly with std.algorithm on 
strings: why would you ever want to call sort on a string? Well, 
I can think of a few reasons, like checking the frequency of 
letters, but I think we should see what happens if Phobos changes 
from autodecoding to a compile error wherever autodecoding would 
occur. Then we can fix each site by calling .representation or 
whatever to work with code units, or by manually adding a 
.utfDecode to work with dchars, making the decision explicit.
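
As a rough illustration of what "making the decision explicit" buys 
you, here is a sketch in Python rather than D. The function names 
are hypothetical stand-ins for the two views (code units versus 
decoded code points); they are not Phobos APIs. The caller has to 
pick one view before sorting or counting, and the two choices give 
visibly different results.

```python
from collections import Counter

# Hypothetical names for illustration; not Phobos functions.

def code_units(s):
    """Explicit code-unit view: the raw UTF-8 bytes."""
    return s.encode("utf-8")

def code_points(s):
    """Explicit decoded view: one element per code point."""
    return list(s)

def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "\u00e9t\u00e9"          # "été": é, t, é

units = code_units(text)        # 5 bytes: C3 A9 74 C3 A9
points = code_points(text)      # 3 code points

# Sorting code units tears the two-byte sequences apart.
sorted_units = bytes(sorted(units))
print(is_valid_utf8(sorted_units))   # False

# Sorting code points keeps each é intact.
print(sorted(points))

# Letter frequency only makes sense on the decoded view.
print(Counter(points)["\u00e9"])     # 2
```

Neither view is wrong on its own; the point is that the code says 
which one it wants instead of a library picking silently.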

That'd offer a way forward and I suspect would break less code 
than we might think.


* Stack Overflow votes are a silly thing: a somewhat easy answer 
like this gets a bazillion, whereas difficult questions with 
difficult answers get me one, maybe two votes. Oh well.

