List of Phobos functions that allocate memory?
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Feb 18 00:14:58 PST 2014
17-Feb-2014 06:19, Marco Leise wrote:
> On Sun, 09 Feb 2014 12:18:41 +0400
> Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>> 09-Feb-2014 09:35, Marco Leise wrote:
>>> That's neither an improvement over calling "validate" nor does
>>> that deal with distinguishing between invalid UTF and
>>
>> Means text is broken but wasn't ever read...
>>> \uFFFD
>>> in the input.
>> ...means text was broken sometime before.
>>
>> Hardly makes any difference to most applications.
>> Normal text doesn't contain \uFFFD.
>
> Of course it does. It is a valid symbol and a lot of websites
> describing the "Specials" Unicode block make use of it, like
> the one on Wikipedia:
> http://en.wikipedia.org/wiki/Specials_(Unicode_block)
>
> With your definition, pulling such a document from the web and
> parsing it in D would mean playing on broken strings.
In a sense, \uFFFD means broken encoding. What about lone surrogates?
Private use symbols that must not occur in transmission? They are all
displayed in various Unicode listings. As for 'playing on broken strings':
carrying on past broken or partially broken strings is, I think, exactly
what most users and use cases want.
A more useful and sensible default for decoding is to substitute on
broken encoding. It's standard procedure, and it's particularly well
suited to displaying text.
As a reminder: since it's only a decode, you are still in control of the
original text. In fact, you may re-test what bytes are there IF you want.
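
To illustrate re-testing the original bytes, here is a minimal sketch. It uses std.utf.validate, which throws a UTFException on malformed input; the helper name isWellFormed is hypothetical and not part of Phobos.

```d
import std.utf : validate, UTFException;

// Hypothetical helper (not in Phobos): checks the untouched input
// buffer, independently of how it was decoded for display.
bool isWellFormed(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A caller that cares about the difference can run this on the raw input; everyone else simply never calls it.
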
The "throw on bad encoding" approach could be useful, but I hardly see it
as what you want by default.
I'm wary of breaking code that relies on throwing. For the moment I
think the best course of action would be to introduce xdecode or some
such that does substitution on failure, see how it fares, and then
change ranges/foreach etc. to use xdecode.
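
A rough sketch of what such an xdecode might look like, built on the existing std.utf.decode. Only the name comes from the proposal; the signature, the skip-one-code-unit recovery policy and the decodeAll helper are assumptions made for illustration, not an actual Phobos API.

```d
import std.utf : decode, UTFException;

// Hypothetical: substitutes U+FFFD instead of throwing.
dchar xdecode(const(char)[] str, ref size_t index)
{
    immutable start = index;
    try
    {
        return decode(str, index); // advances index past the code point
    }
    catch (UTFException)
    {
        index = start + 1; // assumption: skip exactly one code unit
        return '\uFFFD';   // U+FFFD REPLACEMENT CHARACTER
    }
}

// Hypothetical convenience wrapper: decodes a whole buffer.
dstring decodeAll(const(char)[] str)
{
    dstring result;
    for (size_t i = 0; i < str.length; )
        result ~= xdecode(str, i);
    return result;
}
```

Note that str itself is never modified, so the throwing decode stays available to code that relies on it.
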
>>>> [...]
>>>> Every single text editor out there seems to disagree with you: they do
>>>> show you partially substituted text, not a dialog box "My bad, it's
>>>> broken UTF-8, I'm giving up!".
>
>>> gedit does in fact throw an error message at you
>>> saying "My bad, it's broken UTF-8, I'm giving up!".
>
>> I know and it's a piece of junk :)
>> Seriously, it doesn't even have regular expressions for search and replace!
>
> https://yourlogicalfallacyis.com/no-true-scotsman :p
Well, gedit is a nice example of why just throwing an exception is not good
enough for many apps (editors in particular). The fact that it's a piece
of junk might be irrelevant ;)
--
Dmitry Olshansky