RFC: Case-Insensitive Strings (And usually they really do *have* case)
Nick Sabalausky
a at a.a
Sun Jan 9 23:10:28 PST 2011
"Jonathan M Davis" <jmdavisProg at gmx.com> wrote in message
news:mailman.529.1294620116.4748.digitalmars-d at puremagic.com...
> On Sunday 09 January 2011 13:52:53 Jim wrote:
>> I'm a firm believer of alternative B: Store the string with its original
>> case, unless it's particularly important to do otherwise.
>>
>> The cost of case-insensitive comparison is REALLY small. Anytime you are to
>> compare two strings ask yourself whether case-sensitive or
>> case-insensitive is what you need. Have no inclination to prefer one type
>> of comparison to the other. Problem solved. Bloat avoided.
>>
>>
>> Creating specific types of strings that carry with them data on how they
>> are to be interpreted is over-engineering, solving a problem that doesn't
>> exist.
The problem certainly cropped up for me. See below.
>
> I don't know that it's over-engineering. I expect that there _are_ cases where
> it makes perfect sense. However, in the general case, I do think that it's
> overkill. std.string.icmp() deals with most cases where you need
> case-insensitive comparison, but what if you really do need it everywhere as in
> Nick's case? Or what about cases like associative arrays, which you can't give a
> comparison function to (it has to be built into the type)? I don't think that
> the cost of the comparison here is really the issue. If that's all you need,
> then there's icmp(). It's when you need the same comparison _everywhere_ that it
> matters.
>
Right. FWIW, this is the scenario that originally inspired it:
I was working on some code that processes a grammar definition
(specifically, the BNF-style language that GOLD uses). The grammar
definition language includes various pre-defined character sets, and allows
user-defined character sets. These character sets are referred to by name
(such as "AlphaNumeric", "Whitespace", or "Cyrillic Supplementary"). But
those character set names are defined by the language as being
case-insensitive. (Come to think of it, all the names of everything are
case-insensitive: tokens, char sets, meta-data, etc.)
Due to the usage patterns in my program, it made sense to store the
character sets as an associative array where the keys were the names of the
character sets and the values were the data describing what characters were
included in the set. And there were plenty of other AAs for other things
that were all indexed by case-insensitive names. Obviously, I needed to
ensure that *all* comparisons involving these names were done insensitively
(to do otherwise would be a bug). And there were also times when I needed to
display one of the character set names (error messages, for instance), and
it would be awkward not to show the original capitalization. So I had to
follow the convention of always creating lower-case versions to insert into
and look up in the AAs, and also maintain the original names (and be very
careful about all of it). This quickly became an awful mess. But as soon as
I wrote and started using the "Insensitive" type, the whole thing was
simplified enormously.
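
As a rough illustration of the idea (written here in Python rather than D, and not the actual "Insensitive" implementation, which isn't shown in this message), a wrapper type can keep the original spelling for display while comparing and hashing on a case-folded copy. The class name is borrowed from the text above; the field names and the contents of the `char_sets` table are made up for the example.

```python
class Insensitive:
    """String wrapper that compares and hashes without regard to case.

    Hypothetical sketch: keeps the original spelling for display and a
    case-folded copy for every comparison.
    """

    __slots__ = ("original", "_folded")

    def __init__(self, text):
        self.original = text            # what gets shown in error messages
        self._folded = text.casefold()  # what gets compared and hashed

    def __eq__(self, other):
        if isinstance(other, Insensitive):
            return self._folded == other._folded
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__: equal keys need equal hashes.
        return hash(self._folded)

    def __str__(self):
        return self.original


# Made-up character-set table keyed by case-insensitive names.
char_sets = {
    Insensitive("AlphaNumeric"): "A-Za-z0-9",
    Insensitive("Whitespace"): " \t\r\n",
}

# Lookup ignores case; no manual lower-casing at the call site.
found = char_sets[Insensitive("ALPHANUMERIC")]

# Each key still remembers the spelling it was defined with.
names = sorted(str(key) for key in char_sets)
```

Because the comparison rule lives in the type, every table keyed by `Insensitive` gets the right behaviour without any convention to remember at each call site.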
While writing and dealing with all that code I realized something: While
programmers are usually heavily conditioned to think of case-sensitivity as
an attribute of the comparison, very often the deciding factor
in which comparison to use is *not* the comparison itself but *what* gets
compared. And in those cases, you have to use the awful strategy of "relying
on convention" to make sure you get it right in *every* place that
particular data gets compared.
It's analogous to how Asm has separate operators for signed-integer,
unsigned-integer and floating-point math: Many times a specific memory
location is *supposed* to be treated as either signed, unsigned or float in
*all* operations it participates in. Handling this with separate operators
that behave differently is notoriously tedious and error-prone. That's why
non-asm languages, even ones as low-level as C, employ a type system which
allows the programmer to *force* a variable to always, and automatically, be
used with the proper version of the given operator. Heck, it all goes back
to the whole original point of a type-system.
>
> Now, I do wonder if perhaps this idea should be generalized to any type and/or a
> given binary predicate to test for equality rather than making it specific to
> strings and case-insensitive comparison. The issue here (in the general sense)
> is that you want to wrap a type so that it will use a specialized comparison
> function everywhere, and that seems like it should be highly generalizable,
> though doing it right may require alias this, which _is_ rather buggy at the
> moment. Still, it would seem to me to be worthwhile to consider how it could
> and/or should be generalized.
>
That's a very good thought. Have to say I'm not really sure offhand how I'd
do that though.
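
One possible direction, sketched in Python rather than D and offered only as a guess at what such a generalization might look like: parameterize the wrapper on a normalizing key function instead of a raw equality predicate, since a hash table also needs a hash that agrees with equality. The names `compared_by` and `Wrapped` are hypothetical.

```python
def compared_by(key):
    """Build a wrapper class whose equality and hash use key(value).

    Hypothetical helper: `key` maps a value to the canonical form that
    should be compared, e.g. str.casefold for case-insensitive strings.
    """

    class Wrapped:
        __slots__ = ("value", "_canon")

        def __init__(self, value):
            self.value = value        # original, untouched value
            self._canon = key(value)  # canonical form used for comparisons

        def __eq__(self, other):
            # Only instances built from the same key function are comparable.
            if isinstance(other, Wrapped):
                return self._canon == other._canon
            return NotImplemented

        def __hash__(self):
            return hash(self._canon)

        def __repr__(self):
            return f"Wrapped({self.value!r})"

    return Wrapped


# Case-insensitive strings become one instance of the general pattern.
CaseInsensitive = compared_by(str.casefold)

# The same helper covers unrelated rules, e.g. integers compared modulo 12.
Mod12 = compared_by(lambda n: n % 12)

same_name = CaseInsensitive("Whitespace") == CaseInsensitive("WHITESPACE")
same_hour = Mod12(1) == Mod12(13)
different_kinds = CaseInsensitive("a") == Mod12(1)
```

Each call to `compared_by` yields a distinct type, so values wrapped under different comparison rules never compare equal to each other by accident.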