Why is std.regex slow, well here is one reason!

Walter Bright newshound2 at digitalmars.com
Fri Feb 24 20:44:17 UTC 2023


On 2/24/2023 12:05 PM, Max Samukha wrote:
> On Friday, 24 February 2023 at 18:34:42 UTC, Walter Bright wrote:
> 
>> Let's say I write "x". Is that the letter x, or the math symbol x? I know 
>> which it is from the context. But in Unicode, there's a letter x and the math 
>> symbol x, although they look identical.
> 
> Same as 'A' in KOI8 or Windows-1251? Latin and Cyrillic 'A' look identical but 
> have different codes. Not that I disagree with you, but Unicode just upheld the 
> tradition.

Is 'A' in German different from the 'A' in English? Yes. Do they have different 
keys on the keyboard? No. Do they have different Unicode code points? No. How do 
you tell a German 'A' from an English 'A'? By the context.

The same for the word "die". Is it the German "the"? Or is it the English 
"expire"? Should we embed this in the letters themselves? Of course not.

> Not that I disagree with you, but Unicode just upheld the
> tradition.

Inventing a new character encoding doesn't require following tradition, let alone 
taking tradition to such an extreme that it makes everyone who uses Unicode miserable.


>> There is no end to semantic meanings for "x", and so any attempt to encode 
>> semantics into Unicode is doomed from the outset.
> 
> That is similar to attempts to encode semantics in, say, binary operators - they 
> are nothing but functions, but...

We know the meaning by context.


> The meaning of a code point can be inferred without the need to keep track of 
> the context.

A character set simply should not encode any meaning beyond visual appearance.


> Is Latin 'A' the 
> same character as Cyrillic 'A'? Should they have the same code?

It's the same glyph, and so should have the same code. The definitive test is, 
when printed out or displayed, can you see a difference? If the answer is "no" 
then they should be the same code.
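
For anyone who wants to see the current state of affairs, here is a minimal 
sketch in Python using only the standard unicodedata module. The choice of 
U+1D465 as "the math symbol x" is an assumption made for illustration; Unicode 
has other x-like characters as well.

```python
import unicodedata

latin_a = "\u0041"      # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A, usually rendered identically
latin_x = "x"           # LATIN SMALL LETTER X
math_x = "\U0001D465"   # MATHEMATICAL ITALIC SMALL X (assumed "math x")


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in (latin_a, cyrillic_a, latin_x, math_x):
    print(describe(ch))

# Lookalike characters compare unequal because their code points differ.
print(latin_a == cyrillic_a)

# NFKC compatibility normalization folds the math x back to the plain letter,
# but it leaves the Cyrillic letter alone: the two are treated as different
# characters, not as styled variants of one character.
print(unicodedata.normalize("NFKC", math_x) == latin_x)
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)
```

So code that wants to compare text by appearance has to do extra work on top 
of comparing code points, and normalization alone covers only some of the 
cases.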

It's fine if one wishes to develop another layer over Unicode which encodes 
semantics, style, font, language, emphasis, bold face, italics, etc. But these 
just do not belong in Unicode. They belong in a separate markup language.


