RFC: Case-Insensitive Strings (And usually they really do *have* case)

Nick Sabalausky a at a.a
Sun Jan 9 23:10:28 PST 2011


"Jonathan M Davis" <jmdavisProg at gmx.com> wrote in message 
news:mailman.529.1294620116.4748.digitalmars-d at puremagic.com...
> On Sunday 09 January 2011 13:52:53 Jim wrote:
>> I'm a firm believer of alternative B: Store the string with its original
>> case, unless it's particularly important to do otherwise.
>>
>> The cost of case-insensitive comparison is REALLY small. Anytime you are
>> to compare two strings ask yourself whether case-sensitive or
>> case-insensitive is what you need. Have no inclination to prefer one type
>> of comparison to the other. Problem solved. Bloat avoided.
>>
>>
>> Creating specific types of strings that carry with them data on how they
>> are to be interpreted is over-engineering, solving a problem that doesn't
>> exist.

The problem certainly cropped up for me. See below.

>
> I don't know that it's over-engineering. I expect that there _are_ cases
> where it makes perfect sense. However, in the general case, I do think
> that it's overkill. std.string.icmp() deals with most cases where you need
> case-insensitive comparison, but what if you really do need it everywhere
> as in Nick's case? Or what about cases like associative arrays, which you
> can't give a comparison function to (it has to be built into the type)? I
> don't think that the cost of the comparison here is really the issue. If
> that's all you need, then there's icmp(). It's when you need the same
> comparison _everywhere_ that it matters.
>

Right. FWIW, this is the scenario that originally inspired it:

I was working on some code that processes a grammar definition 
(specifically, the BNF-style language that GOLD uses). The grammar 
definition language includes various pre-defined character sets, and allows 
user-defined character sets. These character sets are referred to by name 
(such as "AlphaNumeric", "Whitespace", or "Cyrillic Supplementary"). But 
those character set names are defined by the language as being 
case-insensitive. (Come to think of it, all the names of everything are 
case-insensitive: tokens, char sets, meta-data, etc.)

Due to the usage patterns in my program, it made sense to store the 
character sets as an associative array where the keys were the names of the 
character sets and the values were the data describing what characters were 
included in the set. And there were plenty of other AAs for other things 
that were all indexed by case-insensitive names. Obviously, I needed to 
ensure that *all* comparisons involving these names were done insensitively 
(to do otherwise would be a bug). And there were also times when I needed to 
display one of the character set names (error messages, for instance), and 
it would be awkward not to show the original capitalization. So I had to 
follow the convention of always creating lower-case versions to insert into 
and look up in the AAs, while also maintaining the original names (and being 
very careful about all of it). This quickly became an awful mess. But as soon as 
I wrote and started using the "Insensitive" type, the whole thing was 
simplified enormously.
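
For illustration, here is a minimal sketch of how a wrapper along these lines could look in D. It is not the actual "Insensitive" implementation discussed in this thread: the field names, the choice to cache a lower-cased copy, and the use of std.uni.toLower (current Phobos naming) are all assumptions made for the example, and lower-casing is only an approximation of full Unicode case folding.

```d
import std.uni : toLower;

/// Hypothetical sketch: a string that compares and hashes without regard
/// to case, but still remembers how it was originally written.
struct Insensitive
{
    string original;        // capitalization as written, for display
    private string folded;  // lower-cased copy used for comparison and hashing

    this(string s)
    {
        original = s;
        folded = toLower(s);
    }

    // Equality ignores case because only the folded copies are compared.
    bool opEquals(const Insensitive rhs) const
    {
        return folded == rhs.folded;
    }

    // The hash must agree with opEquals, so it is also based on `folded`.
    size_t toHash() const nothrow @safe
    {
        return hashOf(folded);
    }

    string toString() const
    {
        return original;
    }
}

void main()
{
    // The AA key type now carries the comparison rule with it.
    string[Insensitive] charSets;
    charSets[Insensitive("AlphaNumeric")] = "letters and digits";

    auto hit = Insensitive("ALPHANUMERIC") in charSets;
    assert(hit !is null);
    assert(*hit == "letters and digits");

    // The original capitalization is still available for error messages.
    assert(Insensitive("AlphaNumeric").toString() == "AlphaNumeric");
}
```

With a key type like this, there is no separate lower-casing step to remember at each insertion or lookup; the rule lives in the type.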

While writing and dealing with all that code I realized something: While 
programmers are usually heavily conditioned to think of case-sensitivity as 
an attribute of the comparison, it's very frequent that the deciding factor 
in which comparison to use is *not* the comparison itself but *what* gets 
compared. And in those cases, you have to use the awful strategy of "relying 
on convention" to make sure you get it right in *every* place that 
particular data gets compared.

It's analogous to how Asm has separate operators for signed-integer, 
unsigned-integer and floating-point math: Many times a specific memory 
location is *supposed* to be treated as either signed, unsigned or float in 
*all* operations it participates in. Handling this with separate operators 
that behave differently is notoriously tedious and error-prone. That's why 
non-asm languages, even ones as low-level as C, employ a type system which 
allows the programmer to *force* a variable to always, and automatically, be 
used with the proper version of the given operator. Heck, it all goes back 
to the whole original point of a type-system.
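
A small sketch of that point in D (the variable names are made up for the example): the same 32 bits divide differently depending only on the declared type, so the choice of signed or unsigned division is made once, at the declaration, rather than at every use.

```d
void main()
{
    int  s = -6;
    uint u = cast(uint) s;  // same bit pattern, now typed as unsigned

    // The type of the operand selects the division, not the call site.
    assert(s / 2 == -3);             // signed division
    assert(u / 2 == 2_147_483_645);  // unsigned: 4_294_967_290 / 2
}
```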

>
> Now, I do wonder if perhaps this idea should be generalized to any type
> and/or a given binary predicate to test for equality rather than making it
> specific to strings and case-insensitive comparison. The issue here (in the
> general sense) is that you want to wrap a type so that it will use a
> specialized comparison function everywhere, and that seems like it should
> be highly generalizable, though doing it right may require alias this,
> which _is_ rather buggy at the moment. Still, it would seem to me to be
> worthwhile to consider how it could and/or should be generalized.
>

That's a very good thought. Have to say I'm not really sure offhand how I'd 
do that though.
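
One possible direction, offered only as a rough sketch and not as a worked-out design: instead of taking a binary equality predicate, the wrapper could take a function that maps a value to a "normalized" key, and base both equality and hashing on that key. That keeps the hash consistent with equality for free. The names `KeyedBy`, `lowered` and `InsensitiveString` below are hypothetical, and the sketch leaves out `alias this` and ordering entirely.

```d
import std.uni : toLower;

/// Hypothetical generic wrapper: `normalize` maps a value to the form
/// that equality and hashing should be based on.
struct KeyedBy(T, alias normalize)
{
    T original;                               // value as supplied
    private typeof(normalize(T.init)) key;    // normalized form

    this(T value)
    {
        original = value;
        key = normalize(value);
    }

    // Two wrapped values are equal when their normalized keys are equal.
    bool opEquals(const KeyedBy rhs) const
    {
        return key == rhs.key;
    }

    // Hashing the same key keeps toHash consistent with opEquals.
    size_t toHash() const nothrow @safe
    {
        return hashOf(key);
    }
}

// Case-insensitive strings become one instantiation among many.
string lowered(string s) { return toLower(s); }
alias InsensitiveString = KeyedBy!(string, lowered);

void main()
{
    int[InsensitiveString] counts;
    counts[InsensitiveString("Whitespace")] = 1;

    assert((InsensitiveString("WHITESPACE") in counts) !is null);
    assert(InsensitiveString("Whitespace").original == "Whitespace");
}
```

A normalizing function is less general than an arbitrary predicate (not every equivalence relation has a convenient canonical form), so this covers only part of what the quoted suggestion asks for.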




