Rename std.ctype to std.ascii?

Thu Jun 16 13:43:59 PDT 2011

On 2011-06-16 12:51, Jouko Koski wrote:
> "Jonathan M Davis" <jmdavisProg at gmx.com> wrote:
> > On 2011-06-14 11:53, Jouko Koski wrote:
> >> I would not consider it being good idea to include this kind of
> >> ascii-only utilities in the standard-ish library.
> > 
> > For some classes of operations, it makes perfect sense to be checking for
> > ASCII characters only. For others, it's just people not worrying about
> > internationalization like they should be. For instance, format strings
> > don't care about unicode as far as their escape sequences go. %a, %d,
> > etc. are all pure ASCII.
> 
> Do we really need a common library utility for such a bounded domain? I
> would vote dropping ascii-only std.ctype altogether. Those who know and
> ensure that they are dealing with ascii-only, ebcdic-only or whatever-only
> representations can easily write their own utilities to their particular
> domains - maybe even better optimized than std.ctype because the domain may
> be even more restricted. A common use ascii-only utility will be used
> inevitably in places where it shouldn't.
> 
> > std.ctype/std.ascii deals with ASCII for those situations where you
> > really do only care about ASCII. It deals with unicode characters, but it
> > returns false for everything with them which returns a bool, and it never
> > tries to change their case. std.uni actually deals with unicode and
> > worries about things like whether a unicode character is uppercase or
> > not.
> 
> That is what <ctype.h> (or <wctype.h>) utilities do when the default locale
> setting is in effect. Some other posters seem to suggest that a more
> generalized library module does this, too, without losing performance. 

You actually do get a performance loss for a number of functions. The
unicode-aware versions do tend to shortcut on ASCII in many cases, but they
tend to become too large to be inlined, and if all you care about is ASCII,
even if there are unicode
characters in the string (which is common enough in domains that have nothing 
to do with English - e.g. regular expressions), you take a performance hit for 
all characters which aren't ASCII. There are also a number of functions which 
arguably don't make much sense to try and turn into unicode functions (e.g. 
isDigit) but are heavily used. Another fun one is isWhite vs isUniWhite. In 
most cases, you _don't_ care about unicode whitespace, and it is definitely 
more expensive to call isUniWhite than isWhite, because there are a _lot_ of 
extraneous whitespace characters in unicode.
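
To make the isWhite vs isUniWhite cost difference concrete, here is a minimal
sketch in C. It is not the Phobos implementation: the names is_ascii_white and
is_unicode_white are made up for this example, and the non-ASCII list assumes
the Unicode White_Space property, which may not match exactly what isUniWhite
checks. The point is only that the ASCII test is a couple of comparisons,
while the unicode test has to cover many more code points.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASCII-only whitespace test (hypothetical name). Covers space plus the
 * contiguous range tab, newline, vertical tab, form feed, carriage return
 * (0x09..0x0D). Small enough that a compiler can easily inline it. */
bool is_ascii_white(uint32_t c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Unicode-aware whitespace test (hypothetical name). Assumes the set of
 * code points carrying the Unicode White_Space property. It shortcuts on
 * ASCII, but every non-ASCII character pays for the extra checks. */
bool is_unicode_white(uint32_t c)
{
    if (c < 0x80)
        return is_ascii_white(c);

    switch (c) {
    case 0x0085: /* next line */
    case 0x00A0: /* no-break space */
    case 0x1680: /* ogham space mark */
    case 0x2028: /* line separator */
    case 0x2029: /* paragraph separator */
    case 0x202F: /* narrow no-break space */
    case 0x205F: /* medium mathematical space */
    case 0x3000: /* ideographic space */
        return true;
    default:
        /* en quad through hair space */
        return c >= 0x2000 && c <= 0x200A;
    }
}
```

With these definitions, a no-break space (0x00A0) is whitespace according to
is_unicode_white but not according to is_ascii_white, and a zero width space
(0x200B) is whitespace according to neither.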

std.ctype/std.ascii is _not_ going away. Too many people find those functions 
to be useful. I grant you that too many programmers don't worry about unicode 
when they should, but there are so many issues surrounding the proper handling 
of unicode that programmers aren't going to get it right unless they're 
actually trying to get it right. D provides a lot of the tools to make unicode 
mostly work correctly out of the box, but it's still complicated enough that 
you can't expect it to "just work" without programmers having some clue of 
what they're doing. And forcing people to come up with their own functions for 
basic ASCII operations (which pretty much every other programming language 
has) isn't going to help any.

- Jonathan M Davis