RFC: Case-Insensitive Strings (And usually they really do *have* case)
Nick Sabalausky
a at a.a
Sun Jan 9 13:09:08 PST 2011
Imagine things like Windows filenames or tokenized BASIC code. (There are
probably plenty of other examples too; for instance, I came across this
issue when dealing with GOLD's pre-defined character set names:
http://www.devincook.com/goldparser/doc/grammars/character-sets.htm )
These things *do* have a specific case, but the vast majority of the time
their *comparisons* should be done insensitively. This can lead to
awkwardness.
For instance, how do you store it?
A. If you store it as lower-case: You lose the information of the original
case. The casing information may not be important for comparisons, but it
*is* an inherent part of the data. For instance, if the string gets output
in some way, it's no longer what it originally was.
B. If you store it preserving case: You have to remember to convert to
lower-case on most comparisons. That has three problems:
1. It's easy to forget, which causes a bug.
2. You may end up converting to lower-case more often than really needed.
3. What if the comparison occurs inside a routine that has no idea what
it's supposed to be? You'd have to remember to convert to lower-case when
passing to that routine. And then within that routine you're back to the
problems of strategy "A".
C. If you store both the original and lower-case versions: You have to take
special care to maintain two separate variables and keep them both in-sync.
All three options are kinda sucky. And in any case, how do you distinguish
between variables that are supposed to be case-insensitive and ones that
aren't? Naming convention?
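To make problem B.1 concrete, here is a minimal sketch. The function names and the filename are hypothetical, chosen only for illustration:

```d
import std.uni : toLower;

// Strategy B with the conversion forgotten: a plain comparison.
bool naiveMatch(string stored, string query)
{
    return stored == query;
}

// Strategy B done by hand: fold both sides on every comparison.
bool foldedMatch(string stored, string query)
{
    return stored.toLower() == query.toLower();
}

unittest
{
    // Same Windows-style filename, different casing.
    assert(!naiveMatch("ReadMe.TXT", "readme.txt"));  // silently fails to match
    assert(foldedMatch("ReadMe.TXT", "readme.txt"));  // what was intended
}
```

Nothing in the types tells the compiler, or the next reader, which of the two comparisons a given string needs.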
I think treating case-sensitivity as an attribute of the data rather than
the comparison, and utilizing the type-system, cleans things up
considerably. So I've created a struct to represent a "case-insensitive
string" (templated for string, wstring and dstring). I think it would be a
good thing to have in phobos, and would like to submit it for review.
It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking
now that maybe it should be renamed something like
istring/iwstring/idstring. They're used like this:
// Create
auto normalString = "Hello";
auto insensitiveString = Insensitive("Hello");
// Convert
normalString = insensitiveString.toString();
insensitiveString = Insensitive(normalString);
// Compare
assert(Insensitive("Hello") == Insensitive("hELLo"));
// Mixed-comparisons are deliberately disallowed at
// compile-time since the intent is ambiguous:
assert(normalString == insensitiveString); // Compile Error!
// To disambiguate:
assert(Insensitive(normalString) == insensitiveString); // Case-Insensitive
assert(normalString == insensitiveString.toString()); // Case-Sensitive
// Preserves original case:
writeln(Insensitive("hELLo")); // Output: hELLo
It also works as expected as the key of an assoc array (which I've
personally found useful). Also, it doesn't convert to lower-case until it
actually needs to, and once it does, it caches it.
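For readers who don't want to open the link, the lazy lower-casing idea can be sketched roughly as follows. This is not the semitwist source: the struct name, field names and layout are assumptions made for illustration, it handles only string, and it leaves out the hashing needed for associative-array keys.

```d
import std.uni : toLower;

// Hypothetical sketch; names and layout are assumptions.
struct CaseInsensitive
{
    private string original; // exactly as supplied, case preserved
    private string lowered;  // lower-case copy, filled in on first use
    private bool cached;     // true once `lowered` is valid

    this(string s) { original = s; }

    // Returns the text with its original casing.
    string toString() const { return original; }

    // Lower-cases on first use, then reuses the stored result.
    private string folded()
    {
        if (!cached)
        {
            lowered = original.toLower();
            cached = true;
        }
        return lowered;
    }

    // Deliberately non-const so the cache can be filled in. Only another
    // CaseInsensitive is accepted; there is no overload for plain strings.
    // rhs is a copy here, so its cache is filled only for this call.
    bool opEquals(CaseInsensitive rhs)
    {
        return folded() == rhs.folded();
    }
}
```

The non-const opEquals is the price of caching in this sketch; a const-correct variant would probably have to lower-case eagerly in the constructor or on every comparison.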
Here is the source:
http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758
While it is in the middle of my big grab-bag library, I'm pretty sure it
doesn't rely on anything that's not in phobos (unless I've missed something,
in which case, I can correct). The unittests for it do use my own
unittesting routines, but it should be trivial to see how they'd convert to
unittest{}, assert() and assertPred.