List of Phobos functions that allocate memory?
Jonathan M Davis
jmdavisProg at gmx.com
Fri Feb 7 21:45:18 PST 2014
On Friday, February 07, 2014 21:04:08 Jonathan M Davis wrote:
> On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> > I guess we just have two use cases here. One where invalid
> > encoding is not an error (e.g. for sanitizing purposes) and
> > one where you don't want to lose information and have to
> > enforce correct encoding.
> > Name the first one "decodeSubst" maybe and have decode call
> > that and check for 0xFFFD?
>
> I think that that would call for us to have 3 related but distinct
> functions:
>
> 1. decode, which throws on invalid Unicode. We already have this.
>
> 2. isValidUnicode, which returns whether the string is valid Unicode and
> does not throw. We don't yet have this. Rather, we have validate which does
> the same job and then throws instead of returning bool.
>
> 3. sanitizeUnicode (or whatever would be a good name for it), which replaces
> invalid Unicode with 0xFFFD (or whatever the appropriate character is) so
> that it can be operated on without causing decode to throw in spite of the
> fact that it was invalid Unicode. We don't have anything like this yet.

Actually, thinking this through some more, if we can replace invalid Unicode
with 0xFFFD, and have all algorithms work with that and consider it valid
Unicode (rather than getting weird bugs due to invalid Unicode), then if
decode returned that on error rather than throwing, we wouldn't actually need
to check the return value. It wouldn't matter that the Unicode was invalid.
So, we wouldn't even need to _care_ that the Unicode was invalid. Anyone who
_did_ care could call isValidUnicode to validate the Unicode first, and those
who didn't wouldn't need to worry about UTFException being thrown, because
everything would still work even if the string was invalid Unicode.
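
For what it's worth, a non-throwing validity check can already be approximated on top of what std.utf has today. The following is only a minimal sketch: isValidUnicode is the hypothetical name from this thread, not an existing Phobos function, and wrapping validate in a try/catch is just one possible way to get a bool out of it.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper (not part of Phobos): returns true if `s` is
/// well-formed UTF-8, false otherwise, without letting an exception escape.
bool isValidUnicode(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on the first malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    assert(isValidUnicode("hello"));

    // 0xFF can never appear in well-formed UTF-8.
    const(char)[] bad = ['a', cast(char) 0xFF, 'b'];
    assert(!isValidUnicode(bad));
}
```

A real implementation would presumably avoid the exception entirely rather than catching it, since the point of the function is to skip that cost.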

So, if that's indeed what 0xFFFD does, and that's what Dmitry meant by
proposing that we return that rather than throwing, then I rescind my
assessment that throwing was the best way to go and have to agree that
returning 0xFFFD would be better. I was responding under the assumption that
you had to check for 0xFFFD and respond to it in order to avoid having your
code be buggy, in which case throwing would be far better. But if 0xFFFD is
considered valid Unicode, then returning that would be a fantastic solution.

And if that's the case, we only need two functions, not three:

1. decode, which returns 0xFFFD on decode failure

2. isValidUnicode, which returns whether the string is valid
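
To illustrate what the first of those two functions might look like from the caller's side, here is a rough sketch layered on the current throwing decode. The names decodeOrReplace and decodeAll are made up for this example, and skipping exactly one code unit after a failure is an assumption about how to resynchronise, not something settled in this discussion.

```d
import std.utf : UTFException, decode;

/// U+FFFD REPLACEMENT CHARACTER.
enum dchar replacement = '\uFFFD';

/// Hypothetical wrapper: decodes one code point starting at `index`.
/// On malformed input it returns U+FFFD instead of throwing.
dchar decodeOrReplace(const(char)[] s, ref size_t index)
{
    immutable start = index;
    try
    {
        return decode(s, index); // advances index past the code point
    }
    catch (UTFException)
    {
        // Assumption: resynchronise by skipping a single code unit.
        index = start + 1;
        return replacement;
    }
}

/// Hypothetical helper: decodes a whole string, never throwing on bad input.
dstring decodeAll(const(char)[] s)
{
    dstring result;
    size_t i = 0;
    while (i < s.length)
        result ~= decodeOrReplace(s, i);
    return result;
}

unittest
{
    const(char)[] bad = ['a', cast(char) 0xFF, 'b'];
    assert(decodeAll(bad) == "a\uFFFDb"d);
}
```

Code written against something like decodeAll never has to handle UTFException, and the result contains only valid code points, so later algorithms can treat it as ordinary Unicode.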

And I actually really like the idea that we could just operate on invalid
Unicode as valid Unicode this way, making it so that most code doesn't need to
care, and code that _does_ need to care can validate the strings first. Right
now, pretty much all string code needs to care in order to avoid processing
invalid Unicode, which is much messier.

- Jonathan M Davis