Apache Spark - not disk or network bound but CPU bound

Fri May 1 04:15:00 PDT 2015

http://radar.oreilly.com/2015/04/investigating-sparks-performance.html

"For many who use and deploy Apache Spark, knowing how to find 
critical bottlenecks is extremely important. In a recent O’Reilly 
webcast, Making Sense of Spark Performance, Spark committer and 
PMC member Kay Ousterhout gave a brief overview of how Spark 
works, and dove into how she measured performance bottlenecks 
using new metrics, including block-time analysis. Ousterhout 
walked through high-level takeaways from her in-depth analysis of 
several workloads, and offered a live demo of a new performance 
analysis tool and explained how you can use it to improve your 
Spark performance.

Her research uncovered surprising insights into Spark’s 
performance on two benchmarks (TPC-DS and the Big Data 
Benchmark), and one production workload. As part of our overall 
series of webcasts on big data, data science, and engineering, 
this webcast debunked commonly held ideas surrounding network 
performance, showing that CPU — not I/O — is often a critical 
bottleneck, and demonstrated how to identify and fix stragglers."
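
To make the two ideas in the quote concrete, here is a minimal sketch of my own, not code from the webcast or from Ousterhout's tool. It assumes you already have, for each task, its total run time and the time it spent blocked on disk or network. The task numbers, the function names, and the 1.5x-median straggler cutoff are all made up for illustration. The blocked fraction is only a rough upper bound on what faster I/O could buy, since it ignores how tasks are scheduled.

```python
from statistics import median

# Hypothetical per-task measurements in seconds:
# (total run time, time spent blocked on disk or network).
tasks = [
    (10.0, 1.0),
    (12.0, 2.0),
    (11.0, 0.5),
    (30.0, 1.5),  # much slower than its peers
]


def blocked_fraction(tasks):
    """Fraction of total task time spent blocked on I/O.

    Read as a rough upper bound on the time that infinitely fast
    disk and network could save; scheduling effects are ignored.
    """
    total = sum(run for run, _ in tasks)
    blocked = sum(b for _, b in tasks)
    return blocked / total


def find_stragglers(tasks, multiplier=1.5):
    """Indices of tasks whose run time exceeds multiplier x the median.

    The multiplier is an arbitrary assumption chosen for this example.
    """
    cutoff = multiplier * median(run for run, _ in tasks)
    return [i for i, (run, _) in enumerate(tasks) if run > cutoff]


print(f"blocked fraction: {blocked_fraction(tasks):.1%}")
print("stragglers:", find_stragglers(tasks))
```

If the blocked fraction comes out small, as it does with these made-up numbers, faster disks or a faster network cannot help much and the time is going to CPU, which is the point the quote makes.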