std.uni, std.ascii, std.encoding, std.utf ugh!
WebFreak001
d.forum at webfreak.org
Tue May 5 19:24:41 UTC 2020
On Tuesday, 5 May 2020 at 18:41:50 UTC, learner wrote:
> Good morning,
>
> Trying to do this:
>
> ```
> bool foo(string s) nothrow { return s.all!isDigit; }
> ```
>
> I realised that the conversion from char to dchar could throw.
>
> I need to validate and operate over ascii strings and utf8
> strings, possibly in separate functions, what's the best way to
> transition between:
>
> ```
> immutable(ubyte)[] -> validate utf8 -> string -> nothrow usage
> -> isDigit etc
> immutable(ubyte)[] -> validate ascii -> AsciiString? -> nothrow
> usage -> isDigit etc
> string -> validate ascii -> AsciiString? -> nothrow
> usage -> isDigit etc
> ```
>
> Thank you

If you want nothrow operations on the sequence of characters
(bytes) of the strings, use `str.representation` (from std.string)
to get `immutable(ubyte)[]` and work on that. This is useful, for
example, for indexOf (countUntil), startsWith, endsWith, etc. Make
sure at least one of your inputs is validated, though, so that you
don't end up matching or cutting in the middle of a code point. I
think this is the best way to go if you only want to do simple things.
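
Roughly what I mean, as a minimal sketch that assumes ASCII digits are all you care about. `allDigits` and `hasPrefix` are names I made up for illustration, they are not Phobos functions:

```d
import std.algorithm.searching : all, startsWith;
import std.ascii : isDigit;
import std.string : representation;

// Hypothetical helper: true if every byte of s is an ASCII digit.
// Iterating immutable(ubyte)[] never decodes, so nothing here can throw.
bool allDigits(string s) nothrow
{
    return s.representation.all!(b => isDigit(b));
}

// Hypothetical helper: byte-wise prefix check without any decoding.
bool hasPrefix(string s, string prefix) nothrow
{
    return s.representation.startsWith(prefix.representation);
}
```

Note that `allDigits("")` is true, just like `all` on any empty range, so check the length separately if an empty string should be rejected.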

If your algorithm is complex enough that you still want to decode
but must not throw on bad input, you can also call `decode` (from
std.utf) manually with `UseReplacementDchar.yes` to make it emit
\uFFFD for invalid sequences.
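
A sketch of the manual loop, under the assumption that you only want to know how much of the input was broken. `countReplacements` is a hypothetical helper name. I have not marked it nothrow here, so check what your compiler and Phobos version infer before relying on that:

```d
import std.utf : decode, UseReplacementDchar;

// Hypothetical helper: decodes s code point by code point and counts how
// often the result was U+FFFD. A literal U+FFFD in the input is counted too.
size_t countReplacements(string s)
{
    size_t index = 0;    // decode advances this past each decoded sequence
    size_t replaced = 0;
    while (index < s.length)
    {
        immutable dchar c = decode!(UseReplacementDchar.yes)(s, index);
        if (c == '\uFFFD')
            ++replaced;
    }
    return replaced;
}
```

How many code units get swallowed per invalid sequence is up to the decoder, so I would not depend on the exact count, only on zero versus non-zero.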

To get the best of both worlds, use `.byUTF!dchar`, which gives
you an input range of decoded code points and uses the replacement
dchar by default. You can then call the various std.algorithm and
std.array functions on it.
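
Applied to the function from the question, that could look like the sketch below. `allDigitsLossy` is a made-up name, and I am assuming the default replacement behaviour of `byUTF` described above:

```d
import std.algorithm.searching : all;
import std.ascii : isDigit;
import std.utf : byUTF;

// Hypothetical helper: decodes lazily, invalid sequences become U+FFFD
// instead of raising UTFException, so the function can be nothrow.
bool allDigitsLossy(string s) nothrow
{
    return s.byUTF!dchar.all!isDigit;
}
```

Since U+FFFD is not a digit, a string containing invalid UTF-8 simply fails the check instead of throwing.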

Unless you are working with encodings other than UTF-8 (for
example when doing file or network operations), you shouldn't
need std.encoding.

Also, a short explanation of the different modules:

std.ascii - simple functions to check and modify ASCII characters
for various properties. It is small enough to memorize everything
inside it, and you could easily rewrite what you need from scratch
yourself. But of course it only handles the basic ASCII characters,
meaning it's only really useful for low-level, almost binary file
handling, and not good for user-facing parts that need to be
international.

std.utf - ONLY encoding/decoding of Unicode code points to and from
their UTF-8 / UTF-16 / UTF-32 representation. It has no idea what
the characters actually mean; it only checks the format and enforces
the limits on code point values. You could still reasonably rewrite
this from scratch if you ever chose to.

std.uni - all the categorization of every character into the
different Unicode categories, plus the algorithms for modifying /
combining / normalizing / etc. code points into other code points.
It doesn't do anything with UTF encoding. I honestly wouldn't want
to be the one who rewrites this or ports it to another language.
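
To make the split between the three modules a bit more concrete, here is a small sketch. All three helper names are hypothetical and only exist for this illustration:

```d
import std.ascii : isDigit;
import std.uni : isNumber;
import std.utf : UTFException, validate;

// std.utf: only answers "is this well-formed UTF-8?", nothing about meaning.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// std.ascii: only knows '0' .. '9'.
bool isAsciiDigit(dchar c) { return isDigit(c); }

// std.uni: knows the Unicode numeric categories, e.g. Arabic-Indic digits.
bool isUnicodeNumber(dchar c) { return isNumber(c); }
```

So U+0663 (ARABIC-INDIC DIGIT THREE) is a number for std.uni but not a digit for std.ascii, and std.utf doesn't care either way as long as the bytes are well formed.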