Make Simple Things Hard to Figure out

Mon Dec 21 10:02:55 PST 2015

On Monday, 21 December 2015 at 16:20:18 UTC, Adam D. Ruppe wrote:
> On Monday, 21 December 2015 at 13:51:57 UTC, default0 wrote:
>> The thing I was trying to do was dead simple: Receive a base64 
>> encoded text via a query parameter.
>
> So when I read this, I thought you might have missed another 
> little fact... there's more than one base64.

I am aware of this and I used Base64URL in my code, as does my 
frontend :-) Glad you pointed it out though, I really did write 
my post as if I missed that fact.
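
For anyone following along, here is a minimal sketch of how the two alphabets differ, using Phobos's std.base64. The helper name encodeBothWays is made up for illustration and is not part of any library.

```d
import std.base64 : Base64, Base64URL;

/// Hypothetical helper (not in Phobos): encodes the same bytes with
/// the standard alphabet and with the URL-safe alphabet.
string[2] encodeBothWays(const(ubyte)[] data)
{
    // Base64 uses '+' and '/' for indices 62 and 63;
    // Base64URL uses '-' and '_' for those two positions instead.
    return [Base64.encode(data).idup, Base64URL.encode(data).idup];
}
```

For the two bytes 0xFB 0xFF this yields "+/8=" with Base64 and "-_8=" with Base64URL, so only the second form can go into a query parameter without further URL encoding of the alphabet characters.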

> Yup, normal Base64 encoding uses + and / as characters, which 
> are special in URLs, so often (but not always!), base64 url 
> encoding uses - and _ instead.
>
> This isn't D specific, it is just part of the confusing mess 
> that is the real world of computer data.
>
> Normal base64 does work in urls, as long as it is properly url 
> encoded. (Got enough encoding yet?!)

Oh you can keep going, I'm not that easily scared :D

>> My first instinct was to use google.
>
> Tip I tell people at work too: yes, look for it yourself, but 
> if you don't see an answer within a few minutes, go ahead and ask 
> us, drop a quick question in the chatroom. D has one on IRC 
> freenode called #d.

I don't have an IRC client set up since I rarely use it, plus 
IRC is always kind of "out of the way". It's good to know, but 
if you're a beginner trying to learn the basics of a language, 
standalone tutorials and/or easy-to-understand documentation with 
examples are miles better :-)

>> There is a decode function, but I couldn't quite figure out 
>> what it did or how I was supposed to use it, if it did what I 
>> wanted it to - no examples.
>
> std.utf.decode will take a few chars and decode them into a 
> single wchar or dchar.
>
> Take the character “ for example, the double curly quote that 
> Microsoft Word likes to put in when you type " on your keyboard.
>
> “ has several different encodings as bytes.
>
> http://www.fileformat.info/info/unicode/char/201c/index.htm
>
> UTF-8 (hex) 	0xE2 0x80 0x9C (e2809c)
> UTF-16 (hex) 	0x201C (201c)
> UTF-32 (hex) 	0x0000201C (201c)
>
>
> UTF-8 is char in D. That curly quote takes up three chars:
>
> import std.utf : decode;
>
> char[] curlyQuote = [0xE2, 0x80, 0x9C];
> size_t idx = 0; // decode advances idx past the decoded sequence
> dchar curlyQuoteAsDchar = decode(curlyQuote[], idx);
>
> assert(curlyQuoteAsDchar == '\u201c');
>

Nice explanation, thanks. I wish the documentation could have 
taught me that information as clearly as you did :-)

> There's one big exception though... the validate function.
>
> http://dlang.org/phobos/std_utf.html#validate
>
> That works on a whole string and validates the whole sequence 
> of chars as being valid utf8, throwing an exception if it 
> isn't. (Weird behavior btw, I think I would have preferred 
> `isValid` returning bool, or `validate` taking bytes and 
> returning chars - which would be exactly what you wanted - but 
> it returns void and throws instead :( )

Well, a ubyte[] isn't exactly an array of UTF-8 code units, so just 
calling validate and casting is confusing (even though logical if 
you think about it for a second).
Having an API like bool tryDecode(ubyte[], char[] outBuf), except 
more rangified, and an analogous char[] decode(ubyte[]) (also 
rangified) would be much easier to understand (and, I would argue, 
to use). The task I'm trying to do is explicitly not "cast this 
byte array to code units" but "decode this byte array into code 
units". That an implementation of this functionality may simply 
cast the original array is an implementation detail, so going for 
cast(string)ubytes in the first place is kind of counter-intuitive 
(since I have had some D exposure for a while, I managed to figure 
that one out without too much hassle, though).
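
To make the idea concrete, here is a rough, non-rangified sketch of what such a tryDecode-style function could look like on top of std.utf.validate. The name tryDecodeUtf8 and its signature are my own invention, not an existing Phobos API, and it returns a slice of the input instead of filling an output buffer.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not in Phobos): reports whether `bytes` is
/// valid UTF-8 and, if so, hands back the same memory typed as chars.
bool tryDecodeUtf8(const(ubyte)[] bytes, out const(char)[] text)
{
    // Reinterpreting the bytes is the "implementation detail";
    // nothing is copied here.
    auto candidate = cast(const(char)[]) bytes;
    try
    {
        validate(candidate); // throws UTFException on invalid input
    }
    catch (UTFException)
    {
        return false;
    }
    text = candidate;
    return true;
}
```

The caller only sees "decode these bytes or tell me it failed"; the cast stays hidden inside the function.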

>
> This stuff btw is pretty confusing, there's an awful lot to 
> know about text encoding, so don't feel bad if it makes very 
> little sense to you. I spent like four pages in my book 
> introducing unicode as part of the discussion on D strings... 
> and still, that left out a lot of things too...

Text encoding in general makes sense to me - I don't usually have 
trouble dealing with it. It was just hard to navigate the 
information available on how to write the code to do the 
necessary things in D :-)

>> After that I moved on to std.string. It only had one function 
>> that seemed somewhat interesting - assumeUTF. After reading 
>> through the docs, it failed my criteria since it had no 
>> validation - as its name states, it simply assumes that 
>> whatever you give it is correctly encoded. I didn't expect 
>> much here anyways, it would have been an odd place to put this 
>> functionality.
>
> Ooooh you're close though.
>
> If you did
>
> ---
> import std.base64, std.string, std.utf;
>
> auto utf = assumeUTF(Base64.decode(it));
> validate(utf);
> ---
>
> you'd probably get what you wanted...

That plus some text explaining the details should be the answer 
to the SO question, which I asked here:
http://stackoverflow.com/questions/34401744/convert-ubyte-to-string-in-d
Would be awesome if you could respond there!
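
For reference, this is roughly how I would put the pieces together for my original task (Base64URL text in, validated string out). It is a sketch, not a polished answer: decodeBase64UrlText is a hypothetical name, and it uses a plain cast followed by validate, which I assume is equivalent to the assumeUTF variant quoted above.

```d
import std.base64 : Base64URL, Base64Exception;
import std.utf : validate, UTFException;

/// Hypothetical helper (not in Phobos): decodes Base64URL input and
/// makes sure the resulting bytes are valid UTF-8 before returning them.
string decodeBase64UrlText(const(char)[] encoded)
{
    // Throws Base64Exception if `encoded` is not valid Base64URL.
    ubyte[] raw = Base64URL.decode(encoded);

    // Reinterpret the freshly allocated bytes as chars; no copy is made.
    auto text = cast(char[]) raw;

    // Throws UTFException if the bytes are not valid UTF-8.
    validate(text);

    // Safe only because `raw` was allocated above and is not shared.
    return cast(string) text;
}
```

Both failure cases surface as exceptions (Base64Exception for malformed input, UTFException for bytes that are not UTF-8), so the calling code can catch them where the query parameter is read.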

>
>> Really inconvenient. It then goes on to state that it 
>> supersedes std.utf.decode, but I don't remember reading any 
>> notice in std.utf.decode that it actually was superseded and I 
>> shouldn't even really bother trying to learn about it, weird 
>> but okay.
>
> blargh I had to look at the source to understand what these 
> actually did

That sounds painful @_@

>> EncodingScheme.create("UTF-8").isValid(decodedBase64) followed 
>> by a type-system-ignoring cast from ubyte[] to char[] (since I 
>> now know it is valid so this cast is fine). All in all, 
>> including the explicit error handling required by isValid this 
>> has taken about an hour of research and 7 lines of code.
>
> yeah that works too
>
>> So with that in mind, any ideas to improve the situation (that 
>> do not require 500 man-decades of work)?
>
> We need a lot more examples, and not just of individual 
> functions. Examples on how to bring the functions together to 
> do real world tasks.

Yup, lots of things in D require composition of different parts 
of std. This is not easy to learn or understand unless you are 
quite familiar with std - or have a heap of examples for lots of 
different tasks somewhere.