Learn Apache Spark in 30 Minutes – Quick Introduction

Spark is a distributed processing engine capable of performing batch and streaming operations on massive data sets. It uses a variation of the MapReduce programming model to do this. Spark splits the input data into chunks and distributes those chunks across the cluster for parallel processing. In a nutshell, that is it. But it is not a simple task, and it involves many components coordinating well to produce the output.

At a high level, a Spark application consists of a driver program that runs the user's main function and workers executing various parallel operations in a cluster. We will look at the driver, worker, executor, etc. in more detail in later sections, after covering a few important concepts.

Spark Architecture Diagram

Challenges of Distributed Computing

There are many problems to be solved in any distributed computing system. The problems below apply to all distributed parallel processing frameworks, including Spark. You don't need to understand all of them to use Spark, but it is good to know that, behind the scenes, Spark handles all of this.

  • How to divide the input data
  • How to assign the divided data to machines in the cluster
  • How to check that a machine in the cluster is alive and has the resources to perform its duty
  • How to retry or reassign failed chunks to another machine or worker
  • If the computation involves any aggregation operation like a sum, how to collate results from many workers and compute the aggregation
  • Efficient use of memory, CPU, and network
  • Monitoring the tasks
  • Overall job coordination
  • Keeping a global time

There are many more challenges, but I have only listed a few to illustrate the point.

Spark Use Cases

  • ETL
  • Analytics
  • Machine Learning
  • Graph processing
  • SQL queries on large data sets
  • Batch processing
  • Stream processing

Features of Spark

  • In-Memory Computation – Uses a multi-stage, (mostly) in-memory computing engine that performs most computations in memory instead of writing the temporary results of long-running computations to the file system.
  • Performance and Speed – Provides better performance for certain workloads, e.g. iterative algorithms or interactive data mining. Spark focuses on speed and ease of use compared to earlier Hadoop systems. Refer to the Databricks benchmark of Spark for more details.
  • Fault Tolerance – Spark retries failed tasks and recomputes lost partitions.
  • Advanced, User-Friendly APIs and Data Structures – A rich set of data structures supporting tabular data, SQL queries, easy transformations, and a rich set of aggregate functions.
  • Supports Many Languages – Spark is mostly written in Scala but provides APIs for Java, Python, SQL, and R.
  • Lazy Evaluation – When we apply a transformation to an RDD, Spark doesn't execute it immediately. It records the step in its execution model (a DAG), and only when the driver requests data is the DAG actually executed. Do you see the benefit of this approach? Think about it. Spark can optimize and refine the execution plan once it has had a chance to look at the DAG as a whole, which would be impossible if everything were executed on the fly. A minimal sketch follows this list.
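
As a minimal sketch of lazy evaluation (assuming an existing JavaSparkContext named sc, with made-up sample data), the filter below is only recorded in the DAG; nothing runs until count is called:

// Lazy evaluation sketch – 'sc' is assumed to be an existing JavaSparkContext
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

// Transformation: only recorded in the DAG, nothing is computed yet
JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);

// Action: only now is the DAG executed and the result returned to the driver
long evenCount = evens.count();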

Before we go any further, let us understand what an RDD is. A basic understanding of RDDs is very important because we will refer to them often in the sections below.

What is an RDD

RDD, or resilient distributed dataset, is the fundamental data structure abstraction of Spark. Yes, you work with RDDs, or their higher-level variants, namely datasets and dataframes, when you code in Spark.

An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. For instance, you can create an RDD of integers, and these get partitioned and assigned to various nodes in the cluster for parallel processing.

Features of RDD

  1. Resilient – They are fault tolerant and able to detect and recompute missing or damaged partitions of an RDD due to node or network failures.
  2. Distributed – Data is partitioned and resides on multiple nodes depending on the cluster size, type and configuration
  3. In-Memory Data Structure – They are mostly kept in memory, so iterative operations run faster and perform much better than traditional Hadoop programs when executing iterative algorithms
  4. Dataset – It can represent any form of data, whether read from a CSV file, loaded from an RDBMS table, a text file, JSON, or XML

Create an RDD from a List

// Example – how to create an RDD from a list
List<Integer> employeeNumbers = Arrays.asList(178, 122, 300, 675, 15);
JavaRDD<Integer> empRDD = sc.parallelize(employeeNumbers);

By default, Spark automatically splits the RDD into partitions depending on the cluster. But if you want to control this manually, you can pass the partition count to sc.parallelize, for instance sc.parallelize(employeeNumbers, 6). A quick sketch follows.
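
As a quick sketch (reusing the employeeNumbers list and sc from above), you can check how many partitions an RDD ended up with:

// Create the RDD with an explicit partition count of 6
JavaRDD<Integer> empRDD6 = sc.parallelize(employeeNumbers, 6);

// getNumPartitions() reports how many partitions Spark created
System.out.println("Partitions: " + empRDD6.getNumPartitions());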

Okay, now we have enough information about RDDs and are ready to explore the other sections.

Components of a Spark Program

  • Driver Program
  • Cluster Manager
  • Executors on Workers

Spark Architecture
  • What is a Driver – Don't get confused with the JDBC driver :) It is not that driver. In practical terms, the driver program is the code you write that applies various transformations and actions on RDDs. The entire program that you package as a jar and execute is the Spark application.
  • Put another way, the driver is the program containing the SparkContext – your main program.
  • Spark uses a master/worker architecture. The driver talks to a single coordinator called the master, which manages the worker nodes on which executors run. Don't worry, I will explain this more clearly in the sections below.
  • Workers – Machines on which distributed tasks are scheduled for execution
  • Executor – A separate Java process on a worker that executes tasks

Spark Cluster Managers

Spark supports three types of cluster managers at present: 1) Standalone, 2) Apache Mesos, and 3) Hadoop YARN.

The most commonly used Spark cluster manager in the production environments I have worked with is Hadoop YARN. A rough sketch of choosing the cluster manager is shown below.
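
As a rough illustration (the host name below is a placeholder, not a real cluster), the cluster manager is chosen through the master setting, either in code or via spark-submit:

// Standalone cluster: point the master at the Spark master URL (placeholder host)
SparkSession sparkSession = SparkSession.builder()
        .appName("MyApp")
        .master("spark://master-host:7077")
        .getOrCreate();

// On YARN, the master is simply "yarn"; in practice it is usually supplied
// via spark-submit --master yarn rather than hard-coded in the application.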

Executors

In Spark, an executor is an agent or process running on a worker machine that is capable of executing tasks; as soon as it finishes executing a task, it sends the results to the driver. The executor process also sends heartbeats and other metrics to the driver. An executor typically runs for the entire duration of the Spark application. This is called static allocation of executors. But if you want more elasticity and better utilization of resources, you can opt for dynamic allocation of executors (a configuration sketch follows below).

Spark Executor – Driver Heartbeat Notification

Each executor is a separate process and is dedicated to a single Spark application, providing complete isolation when multiple applications run in the same Spark cluster. Hence, there is no direct application-to-application communication possible, even when they are running in the same cluster.
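
A small sketch of enabling dynamic allocation (mentioned above) through SparkConf. The min/max values are arbitrary examples; depending on your cluster you may also need the external shuffle service or shuffle tracking enabled:

SparkConf conf = new SparkConf()
        .setAppName("DynamicAllocationExample")
        // Let Spark add and remove executors based on the workload
        .set("spark.dynamicAllocation.enabled", "true")
        // Arbitrary example bounds for the number of executors
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10");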

More On a Spark Program and Key Terms

  • Application jar – Your program containing the application code. Usually this will be a fat jar with all dependencies included.
  • Driver program – The program containing the main method (the entry point) that creates the SparkContext.
  • Worker Node – A node capable of running a spark task
  • Executor – A process launched on the worker node that is responsible for executing the task of an application. Depending on the resources on the worker node, many executors can be started in a node
  • Task – A unit of work that is sent to an executor for processing
  • Job – A unit of processing made up of multiple tasks, triggered by an action
  • Stage – A job is divided into a set of stages that depend on each other (see the small sketch after this list)
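
To see how a job breaks into stages, here is a small sketch (assuming an existing sc): toDebugString() prints the lineage of an RDD, and a shuffle operation such as reduceByKey is what introduces a new stage boundary.

JavaPairRDD<String, Integer> pairs = sc
        .parallelize(Arrays.asList("a", "b", "a", "c"))
        .mapToPair(s -> new Tuple2<>(s, 1))   // narrow transformation, stays in the same stage
        .reduceByKey((x, y) -> x + y);        // shuffle: a new stage starts here

// Prints the RDD lineage; indentation changes roughly mark the stage boundaries
System.out.println(pairs.toDebugString());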

Recap – So Far

What we have learned so far

  • Node – a machine in the cluster
  • Worker – the machine where one or more executors are started for task execution
  • Executor – a process on a worker node where one or more tasks are executed. A worker node may contain many executors
  • RDD – the basic data abstraction provided by Spark, with characteristics such as fault tolerance, partitioning, resilience, distribution, and in-memory storage

Hello World Spark Program – Word Count

package com.stackrules.spark.examples;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

import java.util.Arrays;

/**
 * WordCount Example using SparkSession
 * Calculates the word count from a file in HDFS and writes the result back to another 
 * HDFS file. Provide a valid fileName for input and a proper hdfs file path for output
 */
public final class WordCountExample {

	public static void main(String[] args) throws Exception {

		 String inputFile = "hdfs://..."; 
		 String outputFile = "hdfs://...";
		 SparkSession spark = SparkSession
			  .builder()
			  .appName("JavaWordCount")
			  .getOrCreate();
			  
		 JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
			  
		// Word count: read the file, split lines into words, pair each word with 1, and sum the counts
		 JavaRDD<String> dataFile = sc.textFile(inputFile);
		 JavaPairRDD<String, Integer> wordCounts = dataFile
			.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
			.mapToPair(word -> new Tuple2<>(word, 1))
			.reduceByKey((a, b) -> a + b);
		 wordCounts.saveAsTextFile(outputFile);

	}
}
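
Once compiled and packaged as a jar, a program like this is typically launched with spark-submit, which takes the master URL, the main class, and the jar path; the exact command depends on your cluster manager.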

Understanding RDD more

Spark RDDs support two types of operations:

  • Transformations
  • Actions

A transformation results in the creation of another RDD from the given RDD. Remember, an RDD is immutable, so whatever transformation you apply results in a new RDD, and the original RDD is left untouched. This is by design in Spark, and it is the right thing to do. A transformation takes an RDD as input and produces one or more RDDs as output.

Note that transformations are not executed immediately; they are executed only when there is a need, such as when an action is applied. We will look at actions next.

Examples of Spark Transformations

  • map
  • filter
  • flatMap
  • sample
  • union
  • join
  • distinct
  • intersection

A Spark action results in a non-RDD output. The values produced by actions are either returned to the driver or stored in an external storage system such as HDFS, Hive, or Cassandra. A small sketch combining transformations and actions follows the list of examples below.

Examples of Spark Actions

  • count
  • collect
  • top
  • reduce
  • take(n)
  • fold
  • aggregate
  • foreach
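
Putting the two together, here is a small sketch with made-up sample data: map and filter are transformations that lazily build new RDDs, while collect and reduce are actions that actually run the job and bring values back to the driver.

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// Transformations: build new RDDs lazily, nothing executes yet
JavaRDD<Integer> squares = nums.map(n -> n * n);
JavaRDD<Integer> bigSquares = squares.filter(n -> n > 5);

// Actions: trigger execution and return results to the driver
List<Integer> values = bigSquares.collect();    // [9, 16, 25]
int total = bigSquares.reduce((a, b) -> a + b); // 50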
