Thin UTF8 string wrapper

Jonathan M Davis newsgroup.d at jmdavisprog.com
Sat Dec 7 15:57:14 UTC 2019


On Saturday, December 7, 2019 5:23:30 AM MST Joseph Rushton Wakeling via 
Digitalmars-d-learn wrote:
> On Saturday, 7 December 2019 at 03:23:00 UTC, Jonathan M Davis wrote:
> > The module to look at here is std.utf, not std.encoding.
>
> Hmmm, docs may need updating then -- several functions in
> `std.encoding` explicitly state they are replacements for
> `std.utf`.  Did you mean `std.uni`?
>
> It is honestly a bit confusing which of these 3 modules to use,
> especially as they each offer different (and useful) tools.  For
> example, `std.utf.validate` is less useful than
> `std.encoding.isValid`, because it throws rather than returning a
> bool and giving the user the choice of behaviour.  `std.uni`
> doesn't seem to have any equivalent for either.
>
> Thanks in any case for the as-ever characteristically detailed
> and useful advice :-)

There may have been some tweaks to std.encoding here and there, but for the
most part, it's pretty ancient. Looking at the history, it's Seb who marked
some of it as being a replacement for std.utf, which is just plain wrong.
Phobos in general uses std.utf for dealing with UTF-8, UTF-16, and UTF-32,
not std.encoding. std.encoding is an old module that's had some tweaks done
to it but which probably needs a pretty serious overhaul. The only thing
that I've ever used it for is BOM stuff.

std.utf.validate does need a replacement, but doing so gets pretty
complicated. And looking at std.encoding.isValid, I'm not sure that what it
does is any better than simply wrapping std.utf.validate and returning a
bool based on whether an exception was thrown. Depending on the string, it
would actually be faster to use validate, because validate stops at the
first invalid sequence, whereas std.encoding.isValid iterates through the
entire string regardless. The way it checks validity is
also completely different from what std.utf does. Either way, some of the
std.encoding internals do seem to be an alternate implementation of what
std.utf has, but outside of std.encoding itself, std.utf is what Phobos uses
for UTF-8, UTF-16, and UTF-32, not std.encoding.
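
For illustration, here is a minimal sketch of the "wrap std.utf.validate and
return a bool" approach mentioned above. The helper name isValidUTFSketch is
hypothetical (nothing by that name exists in Phobos), and this is not the
implementation from the unmerged isValidUTF PR; it only shows the
exception-to-bool wrapping, assuming UTF-8 input passed as const(char)[].

```d
import std.utf : validate, UTFException;

// Hypothetical helper, not part of Phobos: reports validity as a bool
// instead of letting std.utf.validate's exception propagate.
bool isValidUTFSketch(const(char)[] str)
{
    try
    {
        // validate throws UTFException on the first invalid sequence.
        validate(str);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUTFSketch("hello"));

    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x68, 0x69, 0xFF];
    assert(!isValidUTFSketch(cast(string) raw));
}
```

The obvious cost of this approach is that invalid input pays for an
exception being thrown and caught, which is part of why a proper
non-throwing replacement is still wanted.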

I did do a PR at one point to add isValidUTF to std.utf so that we could
replace std.utf.validate, but Andrei didn't like the implementation, so it
didn't get merged, and I haven't gotten around to figuring out how to
implement it more cleanly.

- Jonathan M Davis