RFC: Case-Insensitive Strings (And usually they really do *have* case)

Nick Sabalausky a at a.a
Mon Jan 10 13:14:22 PST 2011


"Nick Sabalausky" <a at a.a> wrote in message 
news:igfq0n$22u8$1 at digitalmars.com...
> "Jonathan M Davis" <jmdavisProg at gmx.com> wrote in message 
> news:mailman.538.1294690510.4748.digitalmars-d at puremagic.com...
>> On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
>>> "Jim" <bitcirkel at yahoo.com> wrote in message
>>> news:igfado$11g3$1 at digitalmars.com...
>>>
>>> >> While writing and dealing with all that code I realized something:
>>> >> While programmers are usually heavily conditioned to think of
>>> >> case-sensitivity as an attribute of the comparison, it's very
>>> >> frequent that the deciding factor in which comparison to use is
>>> >> *not* the comparison itself but *what* gets compared. And in those
>>> >> cases, you have to use the awful strategy of "relying on convention"
>>> >> to make sure you get it right in *every* place that particular data
>>> >> gets compared.
>>> >
>>> > You have a point. Your case-sensitivity-aware string types will
>>> > guarantee correctness in a large and complex program. I like that.
>>> > Ideally though, they would only be compile-time constraints (i.e. not
>>> > carrying any other data).
>>>
>>> Not carrying any other data means not caching the lowercase version,
>>> which means recreating the lowercase version more than necessary. So
>>> it's the classic speed vs. space tradeoff. I would think there would be
>>> cases where they get compared enough for that to make a difference,
>>> although I suppose we'd really need benchmarks to see. OTOH, there are
>>> certainly cases (such as my original motivating case) where the extra
>>> space is not an issue at all.
>>
>> Why is caching necessary? Shouldn't you just be able to use
>> std.string.icmp() for comparisons internally, avoiding any copying or
>> caching? That shouldn't need to duplicate anything. Or do you need to
>> cache the lower-case version for something other than comparison?
>>
>
> Anything involving toHash (such as using Insensitive as an AA key) 
> requires the use of a lower-case version. For anything else, you're 
> probably right, icmp should be fine (Although I'd like to do a benchmark 
> of icmp vs regular string comparison).
>

The other (smaller) thing I'm concerned about, and the reason I keep
mentioning I want to do a benchmark, is that case-sensitive comparisons can
use the max-speed memcmp, whereas icmp not only has to avoid memcmp but also
has to put extra logic in the middle of the loop (and currently that also
includes calls to utf.decode). Without measuring, I can't rule out that
multiple calls to icmp would be slower than just making a lower-case version
once and using case-sensitive comparisons on that.
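
To make the trade-off concrete, here is a minimal sketch of the caching
approach, assuming a current Phobos where std.uni.toLower is available. The
name Insensitive comes from the discussion above, but the layout and the
field names `original` and `lowered` are assumptions made for illustration,
not the actual implementation. The string is lowered once at construction,
and both toHash and opEquals then work on the cached copy, so equality is an
ordinary case-sensitive array comparison rather than a call to icmp:

```d
import std.uni : toLower;

/// Hypothetical sketch of a case-insensitive string wrapper.
/// Field names and layout are illustrative assumptions.
struct Insensitive
{
    string original;        // the text as it was written
    private string lowered; // lower-case copy, computed once and cached

    this(string s)
    {
        original = s;
        lowered = s.toLower();
    }

    // Hash the cached lower-case form so that "Foo" and "FOO"
    // produce the same hash when used as associative array keys.
    size_t toHash() const @safe pure nothrow
    {
        return hashOf(lowered);
    }

    // Case-sensitive comparison of the two cached copies.
    bool opEquals(const Insensitive rhs) const @safe pure nothrow
    {
        return lowered == rhs.lowered;
    }
}

// Build with: dmd -unittest -main <file>
unittest
{
    int[Insensitive] counts;
    counts[Insensitive("Hello")] = 1;
    counts[Insensitive("HELLO")] += 1; // same key as "Hello"

    assert(counts.length == 1);
    assert(counts[Insensitive("hello")] == 2);
}
```

The cost is one extra allocation per value, which is the space side of the
trade-off. Whether this actually beats repeated icmp calls is exactly what
the benchmark would have to show.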



