Why is std.regex slow, well here is one reason!

H. S. Teoh hsteoh at qfbox.info
Fri Feb 24 22:57:27 UTC 2023


On Fri, Feb 24, 2023 at 10:34:42AM -0800, Walter Bright via Digitalmars-d wrote:
> On 2/23/2023 11:28 PM, Max Samukha wrote:
> > On Thursday, 23 February 2023 at 23:11:56 UTC, Walter Bright wrote:
> > > Unicode is a brilliant idea, but its doom comes from the execrable
> > > decision to apply semantic meaning to glyphs.
> > 
> > Unicode did not start that. For example, all Cyrillic encodings
> > encode Latin А, K, H, etc. differently than the similarly looking
> > Cyrillic counterparts. Whether that decision was execrable is highly
> > debatable.
> 
> Let's say I write "x". Is that the letter x, or the math symbol x? I
> know which it is from the context. But in Unicode, there's a letter x
> and the math symbol x, although they look identical.

Actually, x and × are *not* identical if you're using a sane font. They
have different glyph shapes (very similar, but genuinely different -- ×,
for example, will never have serifs even in a serif font), and different
font metrics (× has more space around it on either side; x may be kerned
against an adjacent letter). If you print them, they produce a different
pattern of dots on the paper, even if the difference is fine enough that
you don't notice it.
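
To make that concrete, here's a minimal D sketch (mine, not from the
original post) printing the codepoints behind the two lookalikes; the
letter x and the multiplication sign × live at different codepoints even
when a font renders them almost identically:

	import std.stdio : writefln;

	void main()
	{
	    // Visually similar, but distinct codepoints with distinct meanings.
	    dchar letterX   = 'x';  // LATIN SMALL LETTER X
	    dchar timesSign = '×';  // MULTIPLICATION SIGN
	    writefln("letter x:            U+%04X", cast(uint) letterX);   // U+0078
	    writefln("multiplication sign: U+%04X", cast(uint) timesSign); // U+00D7
	}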

With all due respect, writing systems aren't as simple as you think.
Sometimes what to you seems like a lookalike glyph may be something
completely different. For example, in English if you see:

	m

you can immediately tell that it's a lowercase M.  So it makes sense to
have just one Unicode codepoint to encode this, right?

Now take the lowercase Cyrillic letter т.  Completely different glyph,
so completely different Unicode codepoint, right?  The problem is, the
*cursive* version of this letter looks like this:

	m

According to your logic, we should encode this exactly the same way you
encode the English lowercase M.  But then the upright and cursive forms
of the same Cyrillic letter end up as two completely different
codepoints, which makes no sense: it implies that changing the display
font (from upright to cursive) requires re-encoding your string.

This isn't the only instance of this. Another example is lowercase
Cyrillic П, which looks like this in upright font:

	п

but in cursive:

	n

Again, you have the same problem.
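
For the record, here's another small D sketch (again mine, not part of
the original exchange) showing that these Cyrillic letters really do
live at their own codepoints, nowhere near the Latin letters their
cursive forms resemble:

	import std.stdio : writefln;

	void main()
	{
	    // Each pair looks alike (in a cursive font, for the Cyrillic letter),
	    // yet encodes a different logical letter at a different codepoint.
	    dchar latinM     = 'm';  // LATIN SMALL LETTER M,     U+006D
	    dchar cyrillicTe = 'т';  // CYRILLIC SMALL LETTER TE, U+0442
	    dchar latinN     = 'n';  // LATIN SMALL LETTER N,     U+006E
	    dchar cyrillicPe = 'п';  // CYRILLIC SMALL LETTER PE, U+043F

	    writefln("m U+%04X  vs. cursive т U+%04X",
	             cast(uint) latinM, cast(uint) cyrillicTe);
	    writefln("n U+%04X  vs. cursive п U+%04X",
	             cast(uint) latinN, cast(uint) cyrillicPe);
	}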

It's not reasonable to expect that changing your display font requires
reencoding the string. But then you must admit that the English
lowercase n must be encoded differently from the Cyrillic cursive n.

Which means that you must encode the *logical* symbol rather than its
physical representation.  I.e., semantics.
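
One concrete consequence (my illustration, not from the post): any text
processing that cares about what a letter *is*, such as case conversion,
only works because the logical letters are encoded distinctly.  In D,
std.uni.toUpper maps each of these lookalikes to a different uppercase
letter:

	import std.stdio : writefln;
	import std.uni : toUpper;

	void main()
	{
	    // Same-looking letters uppercase differently, because they are
	    // different logical letters with different codepoints.
	    writefln("toUpper('n') = %s", toUpper('n'));  // N (Latin)
	    writefln("toUpper('п') = %s", toUpper('п'));  // П (Cyrillic)
	    writefln("toUpper('т') = %s", toUpper('т'));  // Т (Cyrillic)
	}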


> There is no end to semantic meanings for "x", and so any attempt to
> encode semantics into Unicode is doomed from the outset.

If we were to take your suggestion that "x" and "×" should be encoded
identically, we would quickly run into readability problems with English
text that contains mathematical fragments -- say, text that talks about
3×3 matrices.  How will your email reader render the ×?  Not knowing any
better, it sees the exact same codepoint as x and prints it as an
English letter x, say in a serif font, which looks out of place in a
mathematical expression.  To fix that, you have to explicitly switch to
a different font to get a nicer symbol.  The computer can't do this for
you because, as you said, the interpretation of a symbol is
context-dependent --- and computers are bad at context-dependent stuff.
So you'll need complex information outside of the text itself (e.g. HTML
or some other markup) to tell the computer which meaning of "x" is
intended here.  The *exact same kind of complex information* that
Unicode currently deals with.

So you're not really solving anything, just pushing the complexity from
one place to another.  And not having this information directly encoded
in the string means that you're now going back to the bad old days when
there was no standard for marking semantics in a piece of text:
everybody does it differently, and copy-n-pasting text from one program
to another almost guarantees losing this information (which you then
have to re-enter in the target software).


[...]
> Implementing all this stuff is hopelessly complex, which is why
> Unicode had to introduce "levels" of Unicode support.

Human writing systems are hopelessly complex.  It's just par for the
course. :-D


T

-- 
You have to expect the unexpected. -- RL

