RFC: Case-Insensitive Strings (And usually they really do have case)

Sun Jan 9 13:52:53 PST 2011

I'm a firm believer in alternative B: Store the string with its original case, unless it's particularly important to do otherwise.

The cost of a case-insensitive comparison is REALLY small. Any time you compare two strings, ask yourself whether a case-sensitive or a case-insensitive comparison is what you need, and have no built-in preference for one over the other. Problem solved. Bloat avoided.
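
A minimal sketch of what I mean, assuming a current Phobos where the case-insensitive comparison is std.uni.icmp. The helper name equalsIgnoreCase and the sample file name are made up for illustration; neither comes from Phobos nor from Nick's code:

```d
import std.uni : icmp;

/// Hypothetical helper: case-insensitive equality, chosen at the call site.
/// icmp returns 0 when the two strings are equal ignoring case.
bool equalsIgnoreCase(const(char)[] a, const(char)[] b)
{
    return icmp(a, b) == 0;
}

void main()
{
    // The string is stored once, in its original case (alternative B).
    string stored = "ReadMe.TXT";

    // Case-sensitive comparison: the two strings differ.
    assert(stored != "readme.txt");

    // Case-insensitive comparison: the two strings match.
    assert(equalsIgnoreCase(stored, "readme.txt"));
}
```

The string keeps its original case for output, and each comparison says which kind it is, with no extra type involved.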

Creating specific types of strings that carry with them data on how they are to be interpreted is over-engineering, solving a problem that doesn't exist.

Nick Sabalausky Wrote:

> Imagine things like Windows filenames or tokenized BASIC code (There's 
> probably plenty of other examples too, for instance, I came across this 
> issue when dealing with GOLD's pre-defined character set names: 
> http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ).
> 
> These things *do* have a specific case, but the vast majority of time their 
> *comparisons* should be done insensitively. This can lead to awkwardness. 
> For instance, how do you store it?
> 
> A. If you store it as lower-case: You lose the information of the original 
> case. The casing information may not be important for comparisons, but it 
> *is* an inherent part of the data. For instance, if the string gets output 
> in some way, it's no longer what it originally was.
> 
> B. If you store it preserving case: You have to make sure to remember to 
> convert to lower-case on most comparisons. Which has three problems:
>     1. Easy to forget and cause a bug.
>     2. May end up converting lower-case more often than really needed.
>     3. What if the comparison occurs inside a routine that has no idea what 
> it's supposed to be? You'd have to remember to convert to lower-case when 
> passing to that routine. And then within the routines you're back to the 
> problems of strategy "A".
> 
> C. If you store both the original and lower-case versions: You have to take 
> special care to maintain two separate variables and keep them both in-sync.
> 
> All three options are kinda sucky. And in any case, how do you distinguish 
> between variables that are supposed to be case-insensitive and ones that 
> aren't? Naming convention?
> 
> I think treating case-sensitivity as an attribute of the data rather than 
> the comparison, and utilizing the type-system, cleans things up 
> considerably. So I've created a struct to represent a "case-insensitive 
> string" (templated for string, wstring and dstring). I think it would be a 
> good thing to have in phobos, and would like to submit it for review.
> 
> It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
> now that maybe it should be renamed something like 
> istring/iwstring/idstring. They're used like this:
> 
> // Create
> auto normalString = "Hello";
> auto insensitiveString = Insensitive("Hello");
> 
> // Convert
> normalString = insensitiveString.toString();
> insensitiveString = Insensitive(normalString);
> 
> // Compare
> assert(Insensitive("Hello") == Insensitive("hELLo"));
> 
> // Mixed-comparisons are deliberately disallowed at
> // compile-time since the intent is ambiguous:
> assert(normalString == insensitiveString); // Compile Error!
> 
> // To disambiguate:
> assert(Insensitive(normalString) == insensitiveString); // Case-Insensitive
> assert(normalString == insensitiveString.toString()); // Case-Sensitive
> 
> // Preserves original case:
> writeln(Insensitive("hELLo")); // Output: hELLo
> 
> It also works as expected as the key of an assoc array (which I've 
> personally found useful). Also, it doesn't convert to lower-case until it 
> actually needs to, and once it does, it caches it.
> 
> Here is the source:
> http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758
> 
> While it is in the middle of my big grab-bag library, I'm pretty sure it 
> doesn't rely on anything that's not in phobos (unless I've missed something, 
> in which case, I can correct). The unittests for it do use my own 
> unittesting routines, but it should be trivial to see how they'd convert to 
> unittest{}, assert() and assertPred.
> 
> 
