If you needed any more evidence that memory safety is the future...

Moritz Maxeiner via Digitalmars-d digitalmars-d at puremagic.com
Thu Mar 2 17:41:25 PST 2017


On Thursday, 2 March 2017 at 22:25:49 UTC, H. S. Teoh wrote:
> [...]
>> 
>> http://www.geekwire.com/2017/amazon-explains-massive-aws-outage-says-employee-error-took-servers-offline-promises-changes/
>
> Yes, which inevitably happens every now and then, because of 
> human fallibility.
>
> But again, the elephant in the room is that in the good ole 
> clear-weather days, such an error would at most take out one or 
> two (or a small handful) of related sites; whereas in today's 
> cloudy situation a single error in umbrella services like AWS 
> can mean the outage of thousands or maybe even millions of 
> otherwise-unrelated sites.

To me it seems like a lot of people - once again - gambled (and 
lost) on one of the primary criteria of reliable engineering: 
redundancy.
The relevant question now, I think, is why people keep doing 
this (as this is not a new phenomenon). My current favorite 
hypothesis (as I don't have enough reliable data) is that they 
simply don't *have* to care about a couple of hours of downtime, 
in the sense that whatever profits they may lose per year from 
those outages do not come close to what they save by not 
paying for redundancy.
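To make that hypothesis concrete, here is a back-of-envelope sketch. All the numbers are my own hypothetical assumptions, not figures from the post or from Amazon; the point is only the shape of the comparison:

```python
# Hypothetical back-of-envelope numbers (assumed for illustration, not
# from the post): compare expected annual revenue lost to outages against
# the yearly cost of running a redundant setup.
outage_hours_per_year = 4          # assumed downtime from provider incidents
revenue_per_hour = 5_000           # assumed revenue for a mid-size site
expected_loss = outage_hours_per_year * revenue_per_hour

redundancy_cost_per_year = 60_000  # assumed cost of a second provider/region

print(f"expected annual outage loss: ${expected_loss:,}")
print(f"annual redundancy cost:      ${redundancy_cost_per_year:,}")
print("redundancy pays off" if redundancy_cost_per_year < expected_loss
      else "accepting downtime is cheaper")
```

Under these assumed numbers, eating the outage ($20,000/year expected) is far cheaper than paying for redundancy ($60,000/year), which would rationally explain the gamble.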

>
> And thanks to the frightening notion of the Internet of Things, 
> one day all it will take is a single failure and society would 
> stop functioning altogether.

One of the primary reasons (for us all) to invest in 
technological heterogeneity, imho:
Multiple competing hardware platforms, operating systems, 
software stacks, etc.
The more entities we have that perform similar functions but 
don't necessarily work the same way, the higher our resistance 
to this kind of outcome (analogous to - IIRC - how diverse 
ecosystems tend to be more resistant to unforeseen changes).
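The resilience argument can be sketched with a toy probability model (my own simplification, not from the post): if N genuinely independent, diverse systems each fail with probability p, a total outage requires all of them to fail at once, with probability p**N, whereas a monoculture shares a single failure mode that takes everything down with probability p:

```python
# Toy independence model (assumed for illustration): with N diverse,
# independently failing systems, the chance that *all* of them are down
# simultaneously shrinks exponentially in N. A monoculture is N = 1.
p = 0.01  # assumed per-system failure probability
for n in (1, 2, 3):
    print(f"{n} diverse system(s): total-outage probability = {p**n:.0e}")
```

The independence assumption is exactly what heterogeneity buys: systems built on different hardware, operating systems, and software stacks are less likely to share the bug (or the operator error) that triggers a correlated failure.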