List of Phobos functions that allocate memory?

Dmitry Olshansky dmitry.olsh at gmail.com
Sun Feb 9 00:18:41 PST 2014


09-Feb-2014 09:35, Marco Leise wrote:
> On Sat, 08 Feb 2014 15:21:26 +0400,
> Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>> 08-Feb-2014 02:57, Jonathan M Davis wrote:
>>> On Friday, February 07, 2014 20:43:38 Dmitry Olshansky wrote:
>>>> 07-Feb-2014 20:29, Andrej Mitrovic wrote:
>>>>> On Friday, 7 February 2014 at 16:27:35 UTC, Andrei Alexandrescu wrote:
>>>>>> Add a bugzilla and let's define isValid that returns bool!
>>>>>
>>>>> Add std.utf.decode() to that as well. IOW, it should have an overload
>>>>> which returns a status code
>>>>
>>>> Much simpler - it returns a special dchar to designate bad encoding. And
>>>> there is one defined by Unicode spec.
>>>
>>> Isn't that actually worse?
>>
>> No, it's better and more flexible for those who care to repair broken
>> text in case it's broken. We currently have ZERO facilities to work with
>> partly broken UTF, and it's not that rare a thing to have.
>
> Your argument is unsubstantiated, since we have this already:
> http://dlang.org/phobos/std_encoding.html#.sanitize

Does it work with ranges of dchar? Nobody is taking eager validation out
of your hands anyway.

>
>>> Unless you're suggesting that we stop throwing on
>>> decode errors,
>>
>> That is exactly what I suggest.
>>
>>> then functions like std.array.front will have to check the
>>> result on every call to see whether it was valid or not and thus whether they
>>> should throw, which would mean extra overhead over simply having decode throw
>>> on decode errors.
>>
>> Why the heck? It will not throw either. In the very end bad encoding is
>> handled by displaying the 'substituted' (typically '?') character in
>> places where it broke not by throwing up hands in the air and spitting
>> "UTF Exception: offset 4302 bad UTF sequence". This is not good enough
>> (in case somebody thought that it is).
>>
>> Those who care about throwing add a trivial map!(x => x != '\uFFFD' ||
>> die()) over a string, where die function throws an exception.
>
> That's neither an improvement over calling "validate" nor does
> that deal with distinguishing between invalid UTF and

The former means the text is broken but was never read...
> \uFFFD
> in the input.
...and the latter means the text was broken sometime before.

That hardly makes any difference to most applications.
Normal text doesn't contain \uFFFD.

And you can still test a string with a proper 'validate'; it's just that
the default while decoding is to substitute.
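
A minimal D sketch of the two modes discussed here, assuming a Phobos release that provides std.utf.byDchar and std.utf.replacementDchar (both, as far as I can tell, newer than this thread). The names brokenSample and orDie are hypothetical and used only for illustration; orDie plays the role of the die() mapping from the quoted text.

```d
import std.algorithm.iteration : map;
import std.algorithm.searching : canFind;
import std.array : array;
import std.exception : assertThrown, enforce;
import std.utf : byDchar, replacementDchar, validate, UTFException;

// Hypothetical sample input: 'a', a lone 0xFF byte (never valid in UTF-8), 'b'.
string brokenSample()
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    return cast(string) raw;
}

// Hypothetical helper: turns a substituted character back into an exception
// for callers that prefer to fail fast.
dchar orDie(dchar c)
{
    enforce(c != replacementDchar, "invalid UTF sequence in input");
    return c;
}

void main()
{
    auto text = brokenSample();

    // Lenient decoding: the bad byte shows up as U+FFFD instead of throwing.
    assert(text.byDchar.canFind(replacementDchar));

    // Eager validation is still available and still throws.
    assertThrown!UTFException(validate(text));

    // Opting back into exceptions on top of the lenient range.
    assertThrown(text.byDchar.map!orDie.array);
}
```

Note that the strict variant cannot tell a freshly substituted character from a U+FFFD that was already present in the input, which is the distinction raised in the quoted objection.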

>>> validate has no business throwing, and we definitely should
>>> add isValidUnicode (or isValid or whatever you want to call it) for validation
>>> purposes. Code can then call that to validate that a string is valid and not
>>> worry about any UTFExceptions being thrown as long as it doesn't manipulate
>>> the string in a way that could result in its Unicode becoming invalid.
>>
>> Yet later down the road decode will triple check that anyway. Just
>> saying. BTW if the string was checked beforehand there is no difference
>> between 2 approaches at all (don't have to check).
>>
>>> However, I would argue that assuming that everyone is going to validate their
>>> strings and that pretty much all string-related functions shouldn't ever have
>>> to worry about invalid Unicode is just begging for subtle bugs all over the
>>> place IMHO. You're essentially dealing with error codes at that point, and I
>>> think that experience has shown quite clearly that error codes are generally a
>>> bad way to go. Almost no one checks them unless they have to. I think that
>>> having decode throw on invalid Unicode is exactly what it should be doing. The
>>> problem is that validate shouldn't.
>>
>> Every single text editor out there seems to disagree with you: they do
>> show you partially substituted text, not a dialog box "My bad, it's
>> broken UTF-8, I'm giving up!".
>
> Editors do different things. They often try to detect the
> encoding with a fall back to Latin1. If you open a file
> explicitly as UTF-8 they may display a substitution char or
> detect the error and use the fall back, as is the case with
> Geany and

Throwing an exception here is not useful in 90% of cases.
Requiring everybody to call sanitize on every string from the outside
smells like the wrong default to me.
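
For comparison, a rough sketch of what "call sanitize on every string from the outside" looks like with std.encoding. The helper name fromOutside is hypothetical, and this assumes input arrives as raw bytes that are meant to be UTF-8.

```d
import std.encoding : isValid, sanitize;

// Hypothetical boundary helper: every caller that accepts external text
// has to remember to route it through something like this.
string fromOutside(immutable(ubyte)[] raw)
{
    // sanitize replaces invalid sequences with the replacement character.
    return sanitize(cast(string) raw);
}

void main()
{
    // 'a', a lone 0xFF byte (never valid in UTF-8), 'b'.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];

    assert(!isValid(cast(string) raw));   // the raw input is broken
    assert(isValid(fromOutside(raw)));    // the sanitized copy is well-formed
}
```

The end result is the same substitution, but as an extra eager pass that each call site must opt into rather than the decoder's default.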

> gedit does in fact throw an error message at you
> saying "My bad, it's broken UTF-8, I'm giving up!".

I know, and it's a piece of junk :)
Seriously, it doesn't even have regular expressions for search and replace!

-- 
Dmitry Olshansky