Fix Phobos dependencies on autodecoding

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Aug 16 21:57:13 UTC 2019


On Fri, Aug 16, 2019 at 01:44:20PM -0700, Walter Bright via Digitalmars-d wrote:
[...]
> Google translate can (and does) figure it out from the context, just
> like a human reader would.

Ha!  Actually, IME, randomly substituting lookalike characters from
other languages in the input to Google Translate often transmutes the
result from passably-understandable to outright hilarious (and
ridiculous).  Or the poor befuddled software just gives up and spits the
input back at you verbatim.


[...]
> And frankly, if data processing software is totally reliant on using
> the correct language-specific glyph, it will fail, because people will
> not type in the correct one, and visually they cannot proof it for
> correctness.  Anything that does OCR is going to completely fail at
> this.
> 
> Robust data processing software is going to be forced to accept and
> allow for multiple encodings of the same glyph, pretty much rendering
> the semantic difference meaningless.

It's not a hard problem. You just need a preprocessing stage to
normalize such stray glyphs into the correct language-specific code
points, and all subsequent stages in your software pipeline will Just
Work(tm). Think of it as a rudimentary "OCR" stage to sanitize your
inputs.

This option would be unavailable if you used an encoding scheme that
*cannot* encode language as part of the string.
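
A minimal sketch of such a preprocessing stage, in Python for brevity. It
rests on assumptions that are not from this thread: the lookalike table is a
tiny hypothetical Latin/Cyrillic sample, the script of a character is guessed
from the first word of its Unicode name, and the names `normalize_word` and
`normalize_text` are made up for illustration. Each word is mapped to
whichever script most of its letters already belong to:

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny lookalike table: (Latin, Cyrillic) pairs.
# A real table would be far larger.
LOOKALIKES = [
    ("a", "\u0430"), ("c", "\u0441"), ("e", "\u0435"),
    ("o", "\u043e"), ("p", "\u0440"), ("x", "\u0445"),
]

# For each target script, map the foreign lookalike to the native letter.
TO_SCRIPT = {
    "LATIN": {cyr: lat for lat, cyr in LOOKALIKES},
    "CYRILLIC": {lat: cyr for lat, cyr in LOOKALIKES},
}


def script_of(ch):
    """Guess the script from the Unicode name, e.g. 'LATIN' or 'CYRILLIC'."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "").split(" ")[0] or None


def normalize_word(word):
    """Rewrite stray lookalikes into the word's dominant script."""
    counts = Counter(s for s in map(script_of, word) if s)
    if not counts:
        return word  # no letters, nothing to decide
    dominant = counts.most_common(1)[0][0]
    table = TO_SCRIPT.get(dominant, {})
    return "".join(table.get(ch, ch) for ch in word)


def normalize_text(text):
    """Normalize each space-separated word independently."""
    return " ".join(normalize_word(w) for w in text.split(" "))
```

Deciding per word by majority vote is only one possible heuristic; anything
smarter (dictionaries, surrounding context) would slot into the same place in
the pipeline, and later stages would still see clean input.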


> I bet in 10 or 20 years of being clobbered by experience you'll
> reluctantly agree with me that assigning semantics to individual code
> points was a mistake. :-)

That remains to be seen. :-)


> BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
[...]
> I am indeed aware of the problems with confusing O0l1|. D does take
> steps to be more tolerant of bad fonts, such as 10l being allowed in
> C, but not D. I seriously considered banning the identifiers l and O.
> Perhaps I should have.  | is not a problem because the grammar (i.e.
> the context) detects errors with it.

I also won an IOCCC award once, albeit anonymously (see 2005/anon)...
though it had nothing to do with lookalike characters, but more to do
with what I call M.A.S.S. (Memory Allocated by Stack-Smashing), in which
the program neither declares any variables (besides the two parameters
to main()) nor calls any memory allocation functions, but happily
manipulates arrays of data. :-D


T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5

