# Overview

An [[in-memory]], distributed data processing engine. It improves on the approach of [[MapReduce]] by writing to disk only during input and output (as opposed to between each step). It also uses operators, which are more flexible than mappers and reducers. Additionally, it stores intermediate state in [[Resilient Distributed Datasets (RDDs)]], which are in-memory.

## Apache Spark Runtime Architecture

![[Apache Spark 2025-12-08 11.15.58.excalidraw.svg]]
%%[[Apache Spark 2025-12-08 11.15.58.excalidraw|🖋 Edit in Excalidraw]]%%

- **Driver** - the brain of a Spark application, responsible for planning and coordinating execution. It analyzes the Spark application and constructs a [[Directed Acyclic Graph (DAG)]]
- **Cluster Manager / Master** - manages cluster resources and allocates them to the Driver
- **Workers** - nodes in the cluster that host Executors
- **Executors** - processes on Worker nodes that execute tasks assigned by the Driver

### Spark Application Execution

![[2025-12-08_Apache Spark.png]]

### DataFrames

A distributed collection of records, all sharing the same pre-defined structure (schema). DataFrames are evaluated as DAGs using "lazy evaluation" (a DataFrame is only evaluated when required), which provides lineage and fault tolerance.

Use the DataFrameReader (`spark.read`) and DataFrameWriter (`dataframe.write`) for data I/O.

Data is manipulated with Transformations and Actions. A Transformation creates a new DataFrame; it defines how data is transformed but does not trigger computation. An Action triggers computation and produces a result. Transformations include `select`, `filter`, `withColumns`, `groupBy`, and `agg`. Actions include `count`, `show`, `take`, `first`, and `write`.

[[PySpark]]

#### DataFrame Schemas

Each DataFrame has a defined schema that can be inferred from data or explicitly specified.
## Key Features

### Complex Data Types

- Arrays - ordered collections of elements
- Structs - nested structures with pre-defined named fields
- Maps - key-value pairs (keys are not pre-defined)

# Key Considerations

Spark addresses fault tolerance by classifying each operator as having either a:

- Narrow Dependency - all computation between two steps of a Spark job happens on a single node (e.g., counting the characters of each message)
- Wide Dependency - computation relies on data from other nodes

For narrow dependencies, if there is a fault, the affected data is moved to the online nodes and recomputed in parallel. For wide dependencies, the intermediate data is written to disk to allow for recovery.

# Implementation Details

Spark uses an in-memory, columnar execution engine called Tungsten. It is important to distinguish this in-memory representation from the on-disk storage format (CSV, Parquet, etc.). Tungsten works in conjunction with the Catalyst Optimizer, which converts the in-memory DataFrame operations into an optimized execution plan. The Catalyst Optimizer completes the following steps:

![[image-10.png]]

To optimize Spark, you should consider the primary factors influencing Spark job efficiency: #flashcard
- Resource utilization (CPU, memory, network, disk I/O)
- Data characteristics (size, format, distribution, cardinality, and skew)
- Configuration and sizing of the cluster

Some common bottlenecks in Spark include: #flashcard
- Data skew and uneven or inadequate partitioning
- Excessive shuffling and data movement
- Memory pressure and garbage collection

# Useful Links

# Related Topics

## Reference

#### Working Notes

#### Sources