String implementations

Mon Jan 21 00:11:36 PST 2008

On Jan 20, 2008 11:01 PM, James Dennett <jdennett at acm.org> wrote:
> That view has led to many security issues, where different
> software reacts differently to byte strings which are not
> valid UTF-8 in places where UTF-8 is expected.

Such input should always be rejected. D will throw an exception, which
is the right thing to do. If a programmer wants to be more flexible,
they can always catch the exception and delete invalid sequences.
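
To make the "reject by default, be flexible only on request" pattern
concrete, here is a minimal sketch. It is written in Python rather than
D, purely as an illustration; the function names decode_strict and
decode_lenient are hypothetical and are not part of any D library.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any invalid sequence."""
    # errors="strict" is Python's default; it is spelled out for emphasis.
    return data.decode("utf-8", errors="strict")


def decode_lenient(data: bytes) -> str:
    """Try strict decoding first; only on failure, drop the invalid bytes."""
    try:
        return decode_strict(data)
    except UnicodeDecodeError:
        # The caller has explicitly opted in to deleting invalid sequences.
        return data.decode("utf-8", errors="ignore")
```

The design point is that leniency is something the caller has to ask
for; the default path never silently accepts malformed input.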

Security issues have arisen as a result of what are called
non-shortest-form sequences (also known as overlong encodings). For
example, the slash character is represented in UTF-8 as the single
byte 2F. Some attackers have tried to get past certain filters by
encoding the slash as the two bytes C0 AF. This is not valid UTF-8
(because UTF-8 forbids non-shortest sequences), but a buggy
implementation might get that wrong and interpret it as '\u002F'. The
important point that I want to make here is that *D GETS IT RIGHT*.
D's implementation will throw an exception on all invalid UTF-8
sequences, and this will block all such security issues. The only way
they can resurface is if you hand-code your own UTF handling. So long
as you stick to the built-in UTF handling which D provides, you will
not encounter these security issues.
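
The C0 AF case can be worked through by hand. The sketch below, again
in Python and only as an illustration, contrasts a deliberately buggy
two-byte decoder with a conforming one. Both function names are
hypothetical; the buggy one is my own construction of the kind of
mistake described above, not code taken from any real filter.

```python
def naive_decode_two_bytes(b0: int, b1: int) -> str:
    """Deliberately buggy: merges the payload bits of a two-byte
    sequence without checking that the result is at least 0x80."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if a conforming decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The buggy decoder turns the overlong pair C0 AF into a slash ...
smuggled = naive_decode_two_bytes(0xC0, 0xAF)

# ... while a conforming decoder accepts 2F and rejects C0 AF.
shortest_ok = is_valid_utf8(b"\x2f")
overlong_ok = is_valid_utf8(b"\xc0\xaf")
```

A filter that looks for the byte 2F before decoding never sees the
slash that the naive decoder produces afterwards, which is exactly the
hole that rejecting the sequence outright closes.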

Other security issues arise as a result of Unicode itself, not UTF-8.
This is because Unicode is such a large character set - which makes it
really good for phishing attacks. After all, if you spell "amazon"
with the Greek lowercase letter omicron instead of the Latin lowercase
o, who's going to notice? However, this is not D's problem - it's a
problem for browser writers, and one they will encounter regardless of
what programming language they use. (Similar issues arise if browsers
fail to convert URLs to Normalisation Form C, but again, that would be
a browser problem, not a D problem. It would also be a bug).
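
Both points can be seen in a few lines. This is a small Python sketch
using the standard unicodedata module; the helper same_after_nfc and
the sample strings are my own, chosen only for illustration.

```python
import unicodedata

latin = "amazon"
spoof = "amaz\u03bfn"  # fifth letter is U+03BF, not Latin "o"

# Both strings are perfectly valid Unicode; they are simply different.
omicron_name = unicodedata.name(spoof[4])
looks_equal = (latin == spoof)


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after converting both to Normalisation Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed e-acute versus "e" followed by a combining acute accent:
# different code points, but the same text once normalised.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
```

Note that normalisation unifies the two spellings of "café" but does
nothing about the omicron: the two problems are separate, and neither
is something a UTF-8 decoder can detect.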

> (char doesn't represent a character in D.  Not great
> naming?

It's reasonable naming, given that UTF-8 code units in the range 00 to
7F do, in fact, correspond to (ASCII) characters. OK, so it's
inappropriately named for holding values 80 to FF, but alternatives
such as codeunit, or utf8, would probably not catch on so easily.
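
The distinction between a code unit and a character is easy to show.
In this Python sketch, which is illustrative only, the helper
code_units is a hypothetical name and each byte of the UTF-8 encoding
plays the role of one D char.

```python
def code_units(text: str) -> list:
    """Return the UTF-8 code units (bytes) of text as integers."""
    return list(text.encode("utf-8"))


# An ASCII character is a single code unit in the range 00 to 7F.
ascii_units = code_units("A")

# A non-ASCII character needs several code units, each in 80 to FF,
# and none of them is a character on its own.
accented_units = code_units("\u00e9")  # e with acute accent
```

Here ascii_units is [0x41] and accented_units is [0xC3, 0xA9]: one
character, two code units.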

> But let's get more concrete:
> suppose D code finds that an alleged char[] passed to it is, in
> fact, broken (i.e., violates the UTF8 invariants).  What should
> it do -- abort, throw an exception, offer a policy for handling
> such bugs, other?

It should, and does, throw an exception. Your program may catch the
exception, but it should reject the input.