TLDR - Spark and RDD (2010)


Notes on the original RDD paper by Matei Zaharia et al.

Spark is a unified (distributed) engine for large-scale data processing and analytics. It generalizes the two-step map and reduce model into a general multi-step dataflow graph, which makes it a good fit for iterative applications where multiple map/reduce passes loop over the same data.

It's a unified engine that makes large workloads easy to express and fast to run.


Spark uses RDDs as its core data abstraction.

Spark does not execute each operation the moment the user submits it. Transformations are queued up and organized into a DAG (directed acyclic graph) of tasks, which only runs when an action is called.

Architecture


Resilient Distributed Dataset (RDD) - from the original paper

Why Spark?

MapReduce-style frameworks have two main inefficiencies for workloads that reuse data (iterative algorithms and interactive queries): intermediate results can only be shared between jobs by writing them to an external distributed file system, which costs disk I/O, replication, and serialization.

In MapReduce-style systems, the intermediate result of each job is written to a DFS and then read back from there by the next job.


How does Spark solve this?

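Spark keeps intermediate results in memory as RDDs, so data that is reused across steps does not have to be written to and re-read from the DFS every time. A minimal sketch of the idea (the dataset and the computation are made up): an iterative loop that reuses a cached RDD instead of re-reading its input on every pass.

from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-reuse-sketch")

# load once, keep the partitions in memory across iterations
points = sc.parallelize([1.0, 2.0, 3.0, 4.0]).cache()

total = 0.0
for i in range(10):
    # each pass reads the cached RDD from memory instead of going back to disk
    total += points.map(lambda x: x * (i + 1)).sum()

print(total)
sc.stop()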

Example operation:

"""
load error messages from a logfile stored in HDFS or any other file system
then interactively search for various patterns
"""

# base RDD of strings: a collection of lines from the logfile, represented
# as an RDD (here `spark` is the SparkContext, as in the paper)
lines = spark.textFile("hdfs://....")

# keep only the lines that start with ERROR
# this gives a new, transformed RDD: errors
errors = lines.filter(lambda s: s.startswith("ERROR"))

# extract just the message field from each error line, assuming tab separation
messages = errors.map(lambda s: s.split('\t')[2])

# until this step, everything is just a transformation. none of it has executed
# yet - transformations are evaluated lazily
messages.cache() # ask Spark to keep the messages RDD in memory once computed

# count how many errors are related to MySQL
# count() is an action - this kicks off the parallel execution
# tasks are sent from the Driver to the Workers; each worker runs its tasks,
# keeps its partitions of messages in memory (because we called cache()) and
# returns its result to the Driver
messages.filter(lambda s: "MySQL" in s).count()

# now if we want to count error messages about Redis, we can reuse the
# cached messages RDD instead of re-running the whole pipeline
messages.filter(lambda s: "Redis" in s).count()


RDD - Resilient Distributed Dataset

Fault tolerance in RDDs: an RDD tracks its lineage (the graph of transformations used to build it), so lost partitions can be recomputed from their parents instead of being restored from replicas.

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them with a rich set of operators.
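
A minimal sketch of those knobs (data and partition counts are made up): explicit persistence with a chosen storage level, explicit control of partitioning on a key-value RDD, and a look at the lineage Spark tracks for recovery.

from pyspark import SparkConf, SparkContext, StorageLevel

sc = SparkContext(conf=SparkConf().setAppName("rdd-knobs-sketch").setMaster("local[*]"))

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# control data placement explicitly: hash-partition into 4 partitions
partitioned = pairs.partitionBy(4)

# explicitly persist the intermediate result in memory
partitioned.persist(StorageLevel.MEMORY_ONLY)

# the lineage Spark records for fault tolerance; a lost partition is
# recomputed from this graph rather than restored from a replica
lineage = partitioned.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

sc.stop()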


Spark Runtime


Narrow & Wide Dependency

What happens when a worker node fails and the Spark job has a wide dependency? With a narrow dependency (e.g. map, filter), only the lost partitions need to be recomputed from their parent partitions. With a wide (shuffle) dependency (e.g. groupByKey), a single lost partition can depend on pieces of every parent partition, so a failure may force recomputing a large part of the lineage; for long lineages Spark can checkpoint the RDD to stable storage to bound recovery cost.
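
A minimal sketch (the data is made up) of a narrow and a wide dependency on the same data; the shuffle boundary introduced by reduceByKey shows up in the lineage output.

from pyspark import SparkContext

sc = SparkContext("local[*]", "dependency-sketch")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)

# narrow dependency: each output partition depends on exactly one parent partition
narrow = pairs.mapValues(lambda v: v * 10)

# wide (shuffle) dependency: each output partition may depend on all parent partitions
wide = narrow.reduceByKey(lambda a, b: a + b)

# the lineage shows a shuffle boundary between the two stages
lineage = wide.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

sc.stop()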


Spark Libraries

The Spark engine does all of this. On top of it sits a set of libraries: Spark SQL, MLlib (Spark ML), Spark Streaming, and GraphX.

These different processing models can be combined in a single application, as sketched below.
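
A minimal sketch of that combination (table name, columns, and data are made up): one SparkSession runs a Spark SQL query and then drops down to the RDD API on the result.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combined-sketch").getOrCreate()

logs = spark.createDataFrame(
    [("ERROR", "MySQL down"), ("INFO", "ok"), ("ERROR", "Redis timeout")],
    ["level", "message"],
)
logs.createOrReplaceTempView("logs")

# Spark SQL...
errors = spark.sql("SELECT message FROM logs WHERE level = 'ERROR'")

# ...combined with the low-level RDD API on the same data
print(errors.rdd.map(lambda row: row.message.upper()).collect())

spark.stop()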


Spark Context vs Spark Session

SparkContext:

  1. Origin: Introduced in early versions of Spark.
  2. Purpose: It's the entry point for Spark functionality, mainly for RDD-based operations.
  3. Scope: One SparkContext per JVM (Java Virtual Machine).
  4. Usage: Used primarily with the RDD API; considered the low-level API.
  5. Configuration: Requires explicit configuration via SparkConf (see the sketch after the SparkSession list below).

SparkSession:

  1. Origin: Introduced in Spark 2.0 as part of the unified API.
  2. Purpose: Provides a single point of entry to interact with Spark functionality.
  3. Scope: Multiple SparkSessions can exist in the same JVM.
  4. Usage: Used with the DataFrame and Dataset APIs, as well as SQL; the RDD API is still available through its underlying SparkContext.
  5. Configuration: Can be created with simpler builder methods.
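
A minimal sketch of both entry points (app names and config values are placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# SparkContext: the older, low-level entry point, configured explicitly via SparkConf
conf = SparkConf().setAppName("rdd-app").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())
sc.stop()

# SparkSession: the unified entry point since Spark 2.0, created with a builder
spark = (SparkSession.builder
         .appName("structured-app")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

df = spark.createDataFrame([(1,), (2,), (3,)], ["n"])
print(df.count())

# the RDD API is still reachable through the session's SparkContext
print(spark.sparkContext.parallelize([4, 5, 6]).count())
spark.stop()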

Why use structured Spark APIs like DataFrames & Datasets instead of RDDs?

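A minimal sketch of the same log-mining example using the DataFrame API (the file path and the tab-separated layout are carried over from the RDD example above). Because the query is expressed over named columns with a schema, Spark's Catalyst optimizer can analyze and optimize the whole plan, rather than treating each Python lambda as an opaque function.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("structured-logs").getOrCreate()

# DataFrame with a single "value" column, one row per line
lines = spark.read.text("hdfs://....")

errors = (lines
          .filter(col("value").startswith("ERROR"))
          .withColumn("message", split(col("value"), "\t").getItem(2)))

errors.cache()

# same MySQL count as before, but expressed over columns the optimizer understands
print(errors.filter(col("message").contains("MySQL")).count())

spark.stop()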

Tags: #distributed systems