Summary
- MapReduce is ill-suited for iterative applications because of the heavy I/O cost of reading inputs from the distributed file system (DFS) and writing results back to it on every iteration.
- Spark is an optimized, in-memory framework suited for iterative, interactive, and streaming applications.
- Spark relies on resilient distributed datasets (RDDs), a distributed memory abstraction that supports fault-tolerant, in-memory computation on large clusters.
- Spark can run in standalone mode or on a cluster managed by either the Mesos or YARN resource manager.
- RDDs are in-memory read-only (immutable) objects partitioned across the cluster.
- RDDs achieve fault tolerance through lineage tracking: each RDD records the sequence of operations that transformed on-disk data into its current in-memory form, so a lost partition can be recomputed rather than replicated.
- Dependencies between RDDs are classified as either narrow (each output partition depends on a bounded set of input partitions, as in map or filter) or wide (each output partition may depend on all input partitions, as in groupByKey, which requires a shuffle).
- The Spark ecosystem includes Spark SQL, Spark Streaming, MLlib, and GraphX.
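The lineage idea summarized above can be sketched in plain Python. This is not the Spark API; `MiniRDD`, `map`, and `compute` are hypothetical names used only to show how recording transformations (instead of materialized results) lets data be recomputed from the original source after a failure.

```python
# Minimal sketch of lineage-based fault tolerance (NOT the Spark API).
# Each "RDD" stores only its parent and the transformation that derives
# it, so any partition can be rebuilt by replaying the lineage chain.

class MiniRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data (only the root RDD has this)
        self.parent = parent  # lineage pointer to the parent RDD
        self.fn = fn          # transformation applied to the parent

    def map(self, fn):
        # Immutability: record the dependency, return a new RDD,
        # and do not compute anything yet (lazy evaluation).
        return MiniRDD(parent=self, fn=fn)

    def compute(self):
        # Walk the lineage back to the source, then replay transformations.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

base = MiniRDD(source=[1, 2, 3])
result = base.map(lambda x: x * x).map(lambda x: x + 1)
print(result.compute())  # recomputed on demand from lineage
```

Because `result` holds only its lineage, discarding and re-running `compute()` yields the same values, which is exactly why Spark can recover lost in-memory partitions without checkpointing every intermediate result.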
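The narrow/wide distinction can also be illustrated without Spark. In this sketch (the function names `map_values` and `group_by_key` are illustrative, not Spark's), a narrow dependency processes each partition independently, while a wide dependency must gather records from every input partition, i.e., a shuffle.

```python
# Illustrative sketch (NOT the Spark API) of narrow vs. wide dependencies
# over partitioned key-value data.

from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

def map_values(parts, fn):
    # Narrow dependency: each output partition is computed from exactly
    # one input partition; no data moves between partitions.
    return [[(k, fn(v)) for k, v in part] for part in parts]

def group_by_key(parts, num_out=2):
    # Wide dependency: every output partition may need records from
    # every input partition, so all data is redistributed (a shuffle).
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:              # every input partition contributes
        for k, v in part:
            out[hash(k) % num_out][k].append(v)  # records cross partitions
    return [dict(d) for d in out]

doubled = map_values(partitions, lambda v: v * 2)
grouped = group_by_key(partitions)
```

Narrow dependencies allow pipelined, local execution and cheap recovery (only one parent partition must be recomputed); wide dependencies force network communication and make recovery costlier, which is why Spark schedules stage boundaries at shuffles.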