List of Phobos functions that allocate memory?

Jonathan M Davis jmdavisProg at gmx.com
Fri Feb 7 21:45:18 PST 2014


On Friday, February 07, 2014 21:04:08 Jonathan M Davis wrote:
> On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> > I guess we just have two use cases here. One where invalid
> > encoding is not an error (e.g. for sanitizing purposes) and
> > one where you don't want to lose information and have to
> > enforce correct encoding.
> > Name the first one "decodeSubst" maybe and have decode call
> > that and check for 0xFFFD?
> 
> I think that that would call for us to have 3 related but distinct
> functions:
> 
> 1. decode, which throws on invalid Unicode. We already have this.
> 
> 2. isValidUnicode, which returns whether the string is valid Unicode and
> does not throw. We don't yet have this. Rather, we have validate which does
> the same job and then throws instead of returning bool.
> 
> 3. sanitizeUnicode (or whatever would be a good name for it), which replaces
> invalid Unicode with 0xFFFD (or whatever the appropriate character is) so
> that it can be operated on without causing decode to throw in spite of the
> fact that it was invalid Unicode. We don't have anything like this yet.

Actually, thinking this through some more, if we can replace invalid Unicode 
with 0xFFFD, and have all algorithms work with that and consider it valid 
Unicode (rather than getting weird bugs due to invalid Unicode), then if 
decode returned that on error rather than throwing, we wouldn't actually need 
to check the return value. It wouldn't matter that the Unicode was invalid. 
So, we wouldn't even need to _care_ that the Unicode was invalid. Anyone who 
_did_ care could call isValidUnicode to validate the Unicode first, and those 
who didn't wouldn't need to worry about UTFException being thrown, because 
everything would still work even if the string was invalid Unicode.
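
To make the idea concrete, here is a minimal sketch of a decode that returns 
0xFFFD on failure instead of throwing. The names decodeOrReplace and 
countCodePoints are hypothetical and not part of Phobos; the sketch just wraps 
the existing throwing std.utf.decode. Skipping exactly one code unit after a 
failure is an assumption made for illustration, not a settled design.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper (not part of Phobos): decodes one code point from
/// `str` starting at `index`. On invalid UTF-8 it returns U+FFFD instead
/// of letting UTFException escape. Requires index < str.length.
dchar decodeOrReplace(const(char)[] str, ref size_t index)
{
    immutable start = index;
    try
    {
        return decode(str, index);
    }
    catch (UTFException)
    {
        // Assumption: skip exactly one code unit so the caller always
        // makes progress past the bad input.
        index = start + 1;
        return '\uFFFD';
    }
}

/// Counts code points without caring whether the input is valid UTF-8.
size_t countCodePoints(const(char)[] str)
{
    size_t n = 0;
    for (size_t i = 0; i < str.length; ++n)
        decodeOrReplace(str, i);
    return n;
}
```

Code like countCodePoints never has to catch UTFException or inspect the 
return value; an invalid byte simply shows up as one more code point.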

So, if that's indeed what 0xFFFD does, and that's what Dmitry meant by 
proposing that we return that rather than throwing, then I rescind my 
assessment that throwing was the best way to go and have to agree that 
returning 0xFFFD would be better. I was responding under the assumption that 
you had to check for 0xFFFD and respond to it in order to avoid having your code 
be buggy, in which case throwing would be far better. But if 0xFFFD is 
considered valid Unicode, then returning that would be a fantastic solution. 
And if that's the case, we only need two functions, not three:

1. decode, which returns 0xFFFD on decode failure

2. isValidUnicode, which returns whether the string is valid
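
As a rough sketch, isValidUnicode could be written today on top of the 
existing std.utf.validate by turning its exception into a bool. This is only 
an illustration of the proposed behavior under that assumption; a real 
implementation would presumably check the string directly rather than pay for 
a throw and catch.

```d
import std.utf : validate, UTFException;

/// Hypothetical function (proposed in this thread, not in Phobos):
/// reports whether `str` is valid UTF-8 without throwing.
bool isValidUnicode(const(char)[] str)
{
    try
    {
        validate(str); // throws UTFException on invalid input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Code that does care about validity would call this once up front and then 
use decode freely.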

And I actually really like the idea that we could just operate on invalid 
Unicode as valid Unicode this way, making it so that most code doesn't need to 
care, and code that _does_ need to care, can validate the strings first. Right 
now, pretty much all string code needs to care in order to avoid processing 
invalid Unicode, which is much messier.

- Jonathan M Davis

