Make Simple Things Hard to Figure out

Adam D. Ruppe via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Dec 21 08:20:18 PST 2015


On Monday, 21 December 2015 at 13:51:57 UTC, default0 wrote:
> The thing I was trying to do was dead simple: Receive a base64 
> encoded text via a query parameter.

So when I read this, I thought you might have missed another 
little fact... there's more than one base64.

Yup, normal Base64 encoding uses + and / as characters, which are 
special in URLs, so often (but not always!), base64 url encoding 
uses - and _ instead.

This isn't D specific, it is just part of the confusing mess that 
is the real world of computer data.

Normal base64 does work in urls, as long as it is properly url 
encoded. (Got enough encoding yet?!)

Anywho, if you are consuming this from some other source, make 
sure you are using the same kind of base64 as they are.

import std.base64;

// for normal base64
ubyte[] bytes = Base64.decode(your_string);

// for the url-optimized variant of base64
// (named differently so both lines can live in the same scope)
ubyte[] urlBytes = Base64URL.decode(your_string);
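
To make the difference between the two alphabets concrete, here is a minimal sketch. The two input bytes are picked only because they happen to encode to the characters that differ; nothing here comes from the original question. It also shows the other route mentioned above: normal base64 survives a URL if it is url-encoded first, here with std.uri.

```d
import std.base64 : Base64, Base64URL;
import std.uri : encodeComponent, decodeComponent;

void main()
{
    // Two bytes chosen so the encoded text lands on the characters
    // that differ between the two alphabets.
    ubyte[] data = [0xFB, 0xFF];

    auto standard = Base64.encode(data);    // uses + and /
    auto urlSafe  = Base64URL.encode(data); // uses - and _ instead

    assert(standard == "+/8=");
    assert(urlSafe == "-_8=");

    // Each decoder undoes its own encoder.
    assert(Base64.decode(standard) == data);
    assert(Base64URL.decode(urlSafe) == data);

    // Normal base64 can still travel in a URL once it is url-encoded.
    string escaped = encodeComponent(standard);
    assert(Base64.decode(decodeComponent(escaped)) == data);
}
```

Feeding text from one alphabet to the other decoder is the mismatch to watch for, so it pays to find out which variant the sender uses.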

> My first instinct was to use google.

Tip I tell people at work too: yes, look for it yourself, but if 
you don't see an answer within a few minutes, go ahead and ask us, 
drop a quick question in the chatroom. D has one on IRC freenode 
called #d.

We won't necessarily even see your question and might not know, 
so keep trying to figure it out yourself, but you might be able 
to save a lot of time by just picking our brains.

> There is a decode function, but I couldn't quite figure out 
> what it did or how I was supposed to use it, if it did what I 
> wanted it to - no examples.

std.utf.decode will take a few chars (or wchars) and decode them 
into a single dchar.

Take the character “ for example, the double curly quote that 
Microsoft Word likes to put in when you type " on your keyboard.

“ has several different encodings as bytes.

http://www.fileformat.info/info/unicode/char/201c/index.htm

UTF-8  (hex)    0xE2 0x80 0x9C (e2809c)
UTF-16 (hex)    0x201C (201c)
UTF-32 (hex)    0x0000201C (201c)


UTF-8 is char in D. That curly quote takes up three chars:

import std.utf;

char[] curlyQuote = [0xE2, 0x80, 0x9C];
size_t idx = 0; // decode advances this past the chars it consumed
dchar curlyQuoteAsDchar = decode(curlyQuote[], idx);

assert(curlyQuoteAsDchar == '\u201c');



The std.utf module mostly works on this level, chars to dchars 
and back.
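
As a rough sketch of that "chars to dchars and back" level, the loop below walks a short string with decode and then goes the other way with encode. The sample string and variable names are made up for illustration.

```d
import std.utf : decode, encode;

void main()
{
    // 'a', the curly quote, 'b': three characters, five UTF-8 chars.
    string s = "a\u201Cb";
    assert(s.length == 5);

    size_t idx = 0;
    size_t[] starts;   // index where each character begins
    dchar[] decoded;
    while (idx < s.length)
    {
        starts ~= idx;
        decoded ~= decode(s, idx); // advances idx past the sequence
    }
    assert(starts == [0, 1, 4]);
    assert(decoded == "a\u201Cb"d);

    // And back: one dchar into up to four chars.
    char[4] buf;
    size_t n = encode(buf, '\u201C');
    assert(n == 3);
    assert(buf[0 .. n] == "\xE2\x80\x9C");
}
```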

There's one big exception though... the validate function.

http://dlang.org/phobos/std_utf.html#validate

That works on a whole string and validates the whole sequence of 
chars as being valid utf8, throwing an exception if it isn't. 
(Weird behavior btw, I think I would have preferred `isValid` 
returning bool, or `validate` taking bytes and returning chars - 
which would be exactly what you wanted - but it returns void and 
throws instead :( )
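
If you would rather have the bool, a small wrapper is easy to write yourself. This is only a sketch; isValidUtf8 is a name made up here and is not a Phobos function.

```d
import std.utf : validate, UTFException;

// Hypothetical helper, not part of Phobos: turns validate's
// throw-on-failure behavior into a plain bool.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8("hello"));

    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    ubyte[] raw = [0xC3, 0x28];
    assert(!isValidUtf8(cast(const(char)[]) raw));
}
```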


This stuff btw is pretty confusing, there's an awful lot to know 
about text encoding, so don't feel bad if it makes very little 
sense to you. I spent like four pages in my book introducing 
unicode as part of the discussion on D strings... and still, that 
left out a lot of things too...

> After that I moved on to std.string. It only had one function 
> that seemed somewhat interesting - assumeUTF. After reading 
> through the docs, it failed my criteria since it had no 
> validation - as its name states, it simply assumes that 
> whatever you give it is correctly encoded. I didn't expect much 
> here anyways, it would have been an odd place to put this 
> functionality.

Ooooh you're close though.

If you did

---
import std.base64, std.string, std.utf;

auto utf = assumeUTF(Base64.decode(it));
validate(utf);
---

you'd probably get what you wanted...
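
Putting the pieces together for the original task, a sketch could look like the following. decodeBase64Text is a hypothetical helper name. It uses a plain cast rather than assumeUTF because, as far as I know, newer Phobos versions check validity inside assumeUTF in debug builds, which would change the kind of error you get there; the cast does the same retyping and leaves the checking to validate.

```d
import std.base64 : Base64;
import std.utf : validate;

// Hypothetical helper, not part of Phobos.
// Throws Base64Exception if the input is not valid base64 and
// UTFException if the decoded bytes are not valid UTF-8.
string decodeBase64Text(const(char)[] encoded)
{
    ubyte[] raw = Base64.decode(encoded);
    auto text = cast(char[]) raw; // same retyping assumeUTF does
    validate(text);               // throws if the bytes are not UTF-8
    return text.idup;
}

void main()
{
    import std.exception : assertThrown;
    import std.utf : UTFException;

    assert(decodeBase64Text("aGVsbG8=") == "hello");

    // "//4=" decodes to the bytes 0xFF 0xFE, which are not valid UTF-8.
    assertThrown!UTFException(decodeBase64Text("//4="));
}
```

Swap in Base64URL.decode if the sender uses the url variant.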


> Really inconvenient. It then goes on to state that it 
> supersedes std.utf.decode, but I don't remember reading any 
> notice in std.utf.decode that it actually was superseded and I 
> shouldn't even really bother trying to learn about it, weird 
> but okay.

blargh I had to look at the source to understand what these 
actually did

> EncodingScheme.create("UTF-8").isValid(decodedBase64) followed 
> by a type-system-ignoring cast from ubyte[] to char[] (since I 
> now know it is valid so this cast is fine). All in all, 
> including the explicit error handling required by isValid this 
> has taken about an hour of research and 7 lines of code.

yeah that works too

> So with that in mind, any ideas to improve the situation (that 
> do not require 500 man-decades of work)?

We need a lot more examples, and not just of individual 
functions. Examples on how to bring the functions together to do 
real world tasks.

