std.uni, std.ascii, std.encoding, std.utf ugh!

Wed May 6 11:14:15 UTC 2020

On Wednesday, 6 May 2020 at 10:57:59 UTC, learner wrote:
> On Tuesday, 5 May 2020 at 19:24:41 UTC, WebFreak001 wrote:
>> On Tuesday, 5 May 2020 at 18:41:50 UTC, learner wrote:
>>> Good morning,
>>>
>>> Trying to do this:
>>>
>>> ```
>>> bool foo(string s) nothrow { return s.all!isDigit; }
>>> ```
>>>
>>> I realised that the conversion from char to dchar could throw.
>>>
>>> I need to validate and operate over ascii strings and utf8 
>>> strings, possibly in separate functions, what's the best way 
>>> to transition between:
>>>
>>> ```
>>> immutable(ubyte)[] -> validate utf8 -> string -> nothrow 
>>> usage -> isDigit etc
>>> immutable(ubyte)[] -> validate ascii -> AsciiString? -> 
>>> nothrow usage -> isDigit etc
>>> string             -> validate ascii -> AsciiString? -> 
>>> nothrow usage -> isDigit etc
>>> ```
>>>
>>> Thank you
>
> Thank you WebFreak,
>
>>
>> if you want nothrow operations on the sequence of characters 
>> (bytes) of the strings, use `str.representation` to get 
>> `immutable(ubyte)[]` and work on that. This is useful for 
>> example for doing indexOf (countUntil), startsWith, endsWith, 
>> etc. Make sure at least one of your inputs is validated though 
>> to avoid potentially handling or cutting off unfinished code 
>> points. I think this is the best way to go if you want to do 
>> simple things.
>
> What I really want is a way to validate an immutable(ubyte)[] 
> sequence for UTF8 or ASCII, and from that point forward, apply 
> functions like isDigit in nothrow functions.
>
>> If your algorithm is sufficiently complex that you would like 
>> to still decode but not crash, you can also manually call 
>> .decode with UseReplacementDchar.yes to make it emit \uFFFD 
>> for invalid characters.
>
> I will simply reject invalid UTF8 input, that's coming from I/O
>
>> To get the best of both worlds, use `.byUTF!dchar` which gives 
>> you an input range to iterate over and defaults to using 
>> replacement dchar. You can then call the various algorithm & 
>> array functions on it.
>
> Can you explain better?
>
>> Unless you are working with different encodings than UTF-8 
>> (like doing file or network operations) you shouldn't be 
>> needing std.encoding.
>
> I'm expecting UTF8 and ASCII encoding from I/O
>
> Thank you!

Using .representation simply assumes the bytes are valid UTF-8 
without checking them, while .byUTF!dchar still decodes the input 
and replaces invalid sequences with \uFFFD instead of throwing.

If you want to check whether a string is valid UTF-8 beforehand, 
use `std.utf : validate` - it throws a UTFException on malformed 
UTF-8. However, this will not magically make your algorithms 
nothrow as far as the compiler is concerned, even though decoding 
will no longer actually throw once the input has been validated. 
If you want to put the nothrow attribute on your functions, you 
will need to work with .representation or .byUTF!dchar
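
As a rough sketch of that workflow: validate once where the bytes 
come in from I/O, then keep everything after that nothrow. The 
helper names `toValidatedString`, `allDigitsBytes` and 
`allDigitsDecoded` are made up for this example and are not Phobos 
functions; treat it as one possible arrangement, not the only one.

```d
import std.algorithm.searching : all;
import std.ascii : isDigit;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : byUTF, validate, UTFException;

// Hypothetical helper: the only place that may throw.
// Rejects malformed UTF-8 at the I/O boundary.
string toValidatedString(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw;
    validate(s); // throws UTFException on malformed input
    return s;
}

// Hypothetical helper: works on the raw code units, no decoding at all.
bool allDigitsBytes(string s) nothrow
{
    return s.representation.all!(b => isDigit(cast(char) b));
}

// Hypothetical helper: decodes lazily; invalid sequences become
// \uFFFD instead of raising an exception.
bool allDigitsDecoded(string s) nothrow
{
    return s.byUTF!dchar.all!isDigit;
}

void main()
{
    immutable(ubyte)[] raw = [0x31, 0x32, 0x33]; // the bytes of "123"
    auto s = toValidatedString(raw);
    assert(allDigitsBytes(s));
    assert(allDigitsDecoded(s));
    assert(!allDigitsBytes("12a"));

    immutable(ubyte)[] bad = [0xFF]; // never valid in UTF-8
    assertThrown!UTFException(toValidatedString(bad));
    // Without validation, byUTF still does not throw; the
    // replacement character is simply not a digit.
    assert(!allDigitsDecoded(cast(string) bad));
}
```

For pure ASCII checks like isDigit the byte-based version is 
enough, since every byte of a multi-byte UTF-8 sequence is >= 0x80 
and can never be mistaken for an ASCII digit.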