RFC: Case-Insensitive Strings (And usually they really do have case)

Sun Jan 9 13:09:08 PST 2011

Imagine things like Windows filenames or tokenized BASIC code. (There are 
probably plenty of other examples too. For instance, I came across this 
issue when dealing with GOLD's pre-defined character set names: 
http://www.devincook.com/goldparser/doc/grammars/character-sets.htm )

These things *do* have a specific case, but the vast majority of the time 
their *comparisons* should be done insensitively. This can lead to awkwardness. 
For instance, how do you store it?

A. If you store it as lower-case: You lose the information of the original 
case. The casing information may not be important for comparisons, but it 
*is* an inherent part of the data. For instance, if the string gets output 
in some way, it's no longer what it originally was.

B. If you store it preserving case: You have to make sure to remember to 
convert to lower-case on most comparisons. Which has three problems:
    1. Easy to forget and cause a bug.
    2. May end up converting to lower-case more often than really needed.
    3. What if the comparison occurs inside a routine that has no idea the 
string is supposed to be case-insensitive? You'd have to remember to convert 
to lower-case when passing to that routine. And then within the routine 
you're back to the problems of strategy "A".

C. If you store both the original and lower-case versions: You have to take 
special care to maintain two separate variables and keep them both in-sync.

All three options are kinda sucky. And in any case, how do you distinguish 
between variables that are supposed to be case-insensitive and ones that 
aren't? Naming convention?

I think treating case-sensitivity as an attribute of the data rather than 
the comparison, and utilizing the type-system, cleans things up 
considerably. So I've created a struct to represent a "case-insensitive 
string" (templated for string, wstring and dstring). I think it would be a 
good thing to have in phobos, and would like to submit it for review.

It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
now that maybe it should be renamed to something like 
istring/iwstring/idstring. They're used like this:

// Create
auto normalString = "Hello";
auto insensitiveString = Insensitive("Hello");

// Convert
normalString = insensitiveString.toString();
insensitiveString = Insensitive(normalString);

// Compare
assert(Insensitive("Hello") == Insensitive("hELLo"));

// Mixed-comparisons are deliberately disallowed at
// compile-time since the intent is ambiguous:
assert(normalString == insensitiveString); // Compile Error!

// To disambiguate:
assert(Insensitive(normalString) == insensitiveString); // Case-Insensitive
assert(normalString == insensitiveString.toString()); // Case-Sensitive

// Preserves original case:
writeln(Insensitive("hELLo")); // Output: hELLo

It also works as expected as the key of an assoc array (which I've 
personally found useful). Also, it doesn't convert to lower-case until it 
actually needs to, and once it does, it caches it.
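
To make the idea concrete, here is a minimal sketch of how such a type can 
work as an associative array key while preserving the original case. It is 
NOT the code from the link below: the name CaseInsensitive and its fields are 
made up for illustration, it handles only string, and for simplicity it 
lower-cases eagerly in the constructor rather than lazily with caching.

```d
import std.stdio : writeln;
import std.uni : toLower;

// Hypothetical illustration only; not the semitwist implementation.
struct CaseInsensitive
{
    private string original; // exactly as supplied
    private string folded;   // lower-cased once, up front (the real one is lazy)

    this(string s)
    {
        original = s;
        folded = s.toLower();
    }

    // Output preserves the original case.
    string toString() const { return original; }

    // Equality and hashing both use the lower-cased form, so the two
    // stay consistent, which associative array keys require.
    bool opEquals(const CaseInsensitive other) const
    {
        return folded == other.folded;
    }

    size_t toHash() const nothrow @safe
    {
        return hashOf(folded);
    }
}

void main()
{
    int[CaseInsensitive] sizes;
    sizes[CaseInsensitive("README.TXT")] = 42;

    // Lookup succeeds regardless of the case used for the key.
    assert((CaseInsensitive("readme.txt") in sizes) !is null);

    // There is no opEquals overload taking a plain string, so a mixed
    // comparison does not compile.
    static assert(!__traits(compiles, "readme.txt" == CaseInsensitive("x")));

    writeln(CaseInsensitive("hELLo")); // Output: hELLo
}
```

Leaving out any conversion to or from plain string is what forces the caller 
to say which kind of comparison is meant.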

Here is the source:
http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758

While it is in the middle of my big grab-bag library, I'm pretty sure it 
doesn't rely on anything that's not in phobos (unless I've missed something, 
in which case I can correct it). The unittests for it do use my own 
unittesting routines, but it should be trivial to see how they'd convert to 
unittest{}, assert() and assertPred.
