Fix Phobos dependencies on autodecoding
H. S. Teoh
hsteoh at quickfur.ath.cx
Fri Aug 16 21:57:13 UTC 2019
On Fri, Aug 16, 2019 at 01:44:20PM -0700, Walter Bright via Digitalmars-d wrote:
[...]
> Google translate can (and does) figure it out from the context, just
> like a human reader would.
Ha! Actually, IME, randomly substituting lookalike characters from
other languages in the input to Google Translate often transmutes the
result from passably-understandable to outright hilarious (and
ridiculous). Or the poor befuddled software just gives up and spits the
input back at you verbatim.
[...]
> And frankly, if data processing software is totally reliant on using
> the correct language-specific glyph, it will fail, because people will
> not type in the correct one, and visually they cannot proof it for
> correctness. Anything that does OCR is going to completely fail at
> this.
>
> Robust data processing software is going to be forced to accept and
> allow for multiple encodings of the same glyph, pretty much rendering
> the semantic difference meaningless.
It's not a hard problem. You just need a preprocessing stage to
normalize such stray glyphs into the correct language-specific code
points, and all subsequent stages in your software pipeline will Just
Work(tm). Think of it as a rudimentary "OCR" stage to sanitize your
inputs.
This option would be unavailable if you used an encoding scheme that
*cannot* encode language as part of the string.
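To make the preprocessing idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production normalizer: the lookalike table is a tiny hand-picked Latin/Cyrillic sample, the helper names (script_of, normalize_word, normalize_text) are hypothetical, and "majority script within a word" stands in for whatever contextual cue a real pipeline would use.

```python
import unicodedata

# Tiny illustrative table of Latin -> Cyrillic lookalikes (an assumption;
# a real system would use a much larger confusables data set).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script_of(ch):
    """Return 'LATIN', 'CYRILLIC', or None, judged by the Unicode name."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script
    return None


def normalize_word(word):
    """Rewrite stray lookalikes into the word's majority script."""
    counts = {"LATIN": 0, "CYRILLIC": 0}
    for ch in word:
        script = script_of(ch)
        if script is not None:
            counts[script] += 1
    # Leave the word alone if it is not mixed, or if there is no clear majority.
    if counts["LATIN"] == 0 or counts["CYRILLIC"] == 0:
        return word
    if counts["LATIN"] == counts["CYRILLIC"]:
        return word
    if counts["CYRILLIC"] > counts["LATIN"]:
        table = LATIN_TO_CYRILLIC
    else:
        table = CYRILLIC_TO_LATIN
    # Characters without a known lookalike pass through unchanged.
    return "".join(table.get(ch, ch) for ch in word)


def normalize_text(text):
    """Normalize each space-separated word independently."""
    return " ".join(normalize_word(w) for w in text.split(" "))
```

For example, normalize_word("hell\u043e") (Latin "hell" followed by a Cyrillic o) yields the all-Latin "hello", and a mostly Cyrillic word with a stray Latin "p" or "e" comes back all-Cyrillic. Later stages can then rely on each word being in a single script.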
> I bet in 10 or 20 years of being clobbered by experience you'll
> reluctantly agree with me that assigning semantics to individual code
> points was a mistake. :-)
That remains to be seen. :-)
> BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
[...]
> I am indeed aware of the problems with confusing O0l1|. D does take
> steps to be more tolerant of bad fonts, such as 10l being allowed in
> C, but not D. I seriously considered banning the identifiers l and O.
> Perhaps I should have. | is not a problem because the grammar (i.e.
> the context) detects errors with it.
I also won an IOCCC award once, albeit anonymously (see 2005/anon)...
though it had nothing to do with lookalike characters, but more to do
with what I call M.A.S.S. (Memory Allocated by Stack-Smashing), in which
the program neither declares any variables (besides the two parameters
to main()) nor calls any memory allocation functions, but happily
manipulates arrays of data. :-D
T
--
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5