Making all strings UTF ranges has some risk of WTF
Steven Schveighoffer
schveiguy at yahoo.com
Mon Feb 8 07:41:02 PST 2010
On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
> dsimcha wrote:
>> I personally would find this extremely annoying because most of the
>> code I write that involves strings is scientific computing code that
>> will never be internationalized, let alone released to the general
>> public. I basically just use ASCII because it's all I need and if your
>> UTF-8 string contains only ASCII characters, it can be treated as
>> random-access. I don't know how many people out there are in similar
>> situations, but I doubt they'll be too happy.
>>
>> On the other hand, I guess it wouldn't be hard to write a simple
>> wrapper struct on top of immutable(ubyte)[] and call it AsciiString.
>> Once alias this gets fully debugged, I could even make it implicitly
>> convert to immutable(char)[].
>
> It's definitely going to be easy to use all sensible algorithms with
> immutable(ubyte)[]. But even if you go with string, there should be no
> problem at all. Remember, telling ASCII from UTF is one mask and one
> test away, and the way Walter and I wrote virtually all related routines
> was to special-case ASCII. In most cases I don't think you'll notice a
> decrease in performance.
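For reference, the mask-and-test Andrei mentions is just checking the
high bit of each code unit: ASCII code units have the high bit clear, so
a decoder can take a single-byte fast path before falling back to full
UTF-8 decoding. A minimal sketch (the function name is mine; the slow
path assumes Phobos' std.utf):

import std.utf;

// Decode the code point starting at s[i], advancing i past it.
dchar decodeNext(string s, ref size_t i)
{
    if ((s[i] & 0x80) == 0)       // one mask, one test: ASCII fast path
        return s[i++];
    return std.utf.decode(s, i);  // multi-byte sequence: full decoding
}
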
I'm in the same camp as dsimcha; I generally write all my apps assuming
ASCII strings (most are internal tools anyway).
Can the compiler help make ASCII strings easier to use? For example, this
already works:
wstring s = "hello"; // converts to immutable(wchar)[]
What about this?
asciistring a = "hello";  // converts to immutable(ubyte)[]
                          // (or immutable(ASCIIChar)[])
asciistring a = "\uFBCD"; // error, requires a cast
The only issue that remains to be resolved, then, is the upgradability
that ASCII characters currently enjoy with UTF-8: I can call any
UTF-8-accepting function with an ASCII string, but not an ASCII-accepting
function with UTF-8 data.
Ideally, there should be a 7-bit ASCII character type that implicitly
upconverts to char, and can be initialized with a string literal.
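A rough sketch of what that could look like, building on the
wrapper-struct idea dsimcha mentions (the names AsciiChar and asciistring
are hypothetical, not anything in Phobos):

struct AsciiChar
{
    private ubyte value;

    // implicit upconversion: an AsciiChar can be passed anywhere a
    // char is expected
    @property char toChar() const { return cast(char) value; }
    alias toChar this;
}

alias immutable(AsciiChar)[] asciistring;

Initializing an asciistring from a string literal (and rejecting
non-ASCII literals at compile time) is the part that would need compiler
help; the struct alone can't do that.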
In addition, you are putting D's UTF-8 char even further away from C's
ASCII char. It would be nice to separate C-compatible strings from D
strings. At some point, I should be able to designate that a function
(even a C function) takes only ASCII data, and the compiler should
disallow passing general UTF-8 data into it. This involves either
renaming D's char to keep source closer to C, or rewriting C function
signatures to reflect the difference.
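With a distinct asciistring type (hypothetical, as above), that contract
could live in the signature rather than in the documentation:

void setWindowTitle(asciistring title);  // hypothetical ASCII-only API

void demo(string utf8, asciistring ascii)
{
    setWindowTitle(ascii);  // fine
    setWindowTitle(utf8);   // error: string does not implicitly convert
                            // to asciistring; explicit validation or a
                            // cast would be required
}

For a C function the same idea applies, except the parameter would be a
zero-terminated pointer (some ASCII analogue of toStringz) rather than a
D array.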
-Steve