Why is std.regex slow, well here is one reason!

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Mar 3 06:44:26 UTC 2023


On Thursday, 2 March 2023 at 20:11:14 UTC, Walter Bright wrote:
> On 3/1/2023 11:49 PM, Dmitry Olshansky wrote:
>> I would insist that there are times when “looks the same” is 
>> not a good option. Typically programs do not have the context, 
>> that we as humans use to disambiguate.
>
> Programs can't tell if "die" means "the" or "expire" without 
> context, either.
>

We are talking about characters. Yes, we can’t tell the meaning, 
but we can upper/lowercase or word-break it with ease.

> The point is, once invisible semantic meaning is added, an 
> infinite number of Unicode code points is required.

> > You’d be surprised
>
> Not at all. People use different fonts to assert different 
> meanings all the time.
>
> > but there are typesets where Cyrillic A is visually different
> > from ASCII A.
>
> Yes, and there are italic fonts, and people embed them in text 
> using markup, not different code points.

Let’s see another example. The Cyrillic letter ‘В’ looks the same as 
ASCII ‘B’ when capitalized, hence by your reasoning it’s the same 
codepoint. Now lowercase ‘в’ and ‘b’ don’t look the same, hence 
different codepoints. Voilà, you just made 
lowercasing/uppercasing impossible without some external context, 
so <cyrillic>В</cyrillic>?
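
A minimal sketch of this argument, in Python purely for illustration. 
It uses only the standard str.lower; unify_lookalikes is a 
hypothetical helper that models the "looks the same, so same 
codepoint" proposal and is not part of any real library.

```python
LATIN_B = "B"            # U+0042 LATIN CAPITAL LETTER B
CYRILLIC_VE = "\u0412"   # U+0412 CYRILLIC CAPITAL LETTER VE, 'В'

# With separate codepoints, case mapping needs no context at all.
assert LATIN_B.lower() == "b"
assert CYRILLIC_VE.lower() == "\u0432"  # 'в'


def unify_lookalikes(text):
    """Hypothetical: fold Cyrillic capital Ve into Latin B because
    the two glyphs look alike when capitalized."""
    return text.replace(CYRILLIC_VE, LATIN_B)


word = "\u0412\u041e\u0414\u0410"  # 'ВОДА' ("water")

# Distinct codepoints: lowercasing gives the correct Cyrillic word.
print(word.lower())

# Unified codepoint: the first letter lowercases to Latin 'b',
# because the information about the alphabet has been lost.
print(unify_lookalikes(word).lower())
```

Once the two capitals share a codepoint, the lowercase result depends 
on information that is no longer in the text.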

I’d rather live in a world where codepoints represent a particular 
alphabet, allowing us to generically manipulate text according to 
the language standards even if we do not know the semantics of the 
words. Context, if required, is for high-level meaning.
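
As a rough illustration of such generic manipulation, the sketch 
below guesses the alphabet of each letter from the codepoint alone. 
It is a heuristic under stated assumptions: Python’s standard 
library exposes no Script property, so script_of (a hypothetical 
helper) takes the first word of the character name returned by 
unicodedata.name.

```python
import unicodedata


def script_of(ch):
    """Heuristic: the first word of the Unicode character name,
    e.g. 'LATIN' or 'CYRILLIC'. Not the real Script property."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text):
    """Set of guessed scripts for the alphabetic characters in text."""
    return {script_of(ch) for ch in text if ch.isalpha()}


# 'Вob' with a Cyrillic first letter looks like Latin 'Bob',
# but the codepoints alone reveal the mixed alphabets.
spoofed = "\u0412ob"
print(sorted(scripts_in(spoofed)))
print(sorted(scripts_in("Bob")))
```

No knowledge of what the words mean is needed; the alphabet is 
recoverable from the codepoints themselves.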
--
Dmitry Olshansky




