
Saturday, November 9, 2019

Spark Intro

Spark is a general-purpose, in-memory compute engine.

Hadoop:
HDFS: storage
MR (MapReduce): computation
YARN: resource management


So Spark is a replacement for MR.

Spark is a pluggable compute engine; it can work with Mesos, Kubernetes, or YARN.

In MR processing, the output of each MR job has to go to disk; in Spark it stays in memory.
So Spark reduces I/O, as it can keep intermediate results in memory.
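
For example (a minimal sketch; the path and the "ERROR" filter are assumptions for illustration): an RDD that is reused across several actions can be cached, so the intermediate result stays in memory instead of being recomputed from disk.

val logs   = sc.textFile("hdfspath")
val errors = logs.filter(_.contains("ERROR"))   // hypothetical filter
errors.cache()                                  // keep the result in memory after the first computation

errors.count()    // first action: reads from disk, computes, and caches
errors.take(10)   // second action: served from memory, no re-read from disk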


General purpose: in the Hadoop ecosystem, a separate tool such as Pig is used for cleaning data. In Spark, the same style of coding can be used for cleaning data, joining, filtering, and ML, which is why we call it a general-purpose framework.


The basic unit that holds data in Spark is called an RDD.

RDD (Resilient Distributed Dataset):

When we move content from disk to memory, that set of memory blocks is called an RDD.
The data lies spread across machines; RDDs are in memory.



Loading an RDD:

val rdd1 = sc.textFile("hdfspath")

The RDD created first, straight from the source, is called the base RDD.

val rdd2 = rdd1.filter(line => line.nonEmpty)   // any filter condition goes here

A new RDD is created for every operation.

val rdd3 = rdd2.map(line => line.toUpperCase)   // one more transformation (the map is assumed for illustration)
rdd3.collect()

Only when we execute rdd3.collect() do the operations actually run, because evaluation is lazy.

All transformations are lazy, and actions are eager.

textFile and filter are transformations; collect is an action.
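
Putting the fragments above together (a minimal, self-contained sketch; the path, filter, and map are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

    // Transformations only build up the DAG; nothing is read or computed yet
    val rdd1 = sc.textFile("hdfspath")    // base RDD
    val rdd2 = rdd1.filter(_.nonEmpty)
    val rdd3 = rdd2.map(_.toUpperCase)

    // The action triggers execution of the whole DAG
    val result = rdd3.collect()
    result.take(5).foreach(println)

    sc.stop()
  }
}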


Even if we give an incorrect path in the line val rdd1 = sc.textFile("hdfspath"), we will not
get an error immediately. As each line gets executed, Spark only builds a DAG from the metadata.

DAG = Directed Acyclic Graph, i.e. the execution plan. Only when an action is invoked is the DAG executed, and only then would the bad path fail.
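
To see the DAG/lineage Spark has built (without triggering a job), every RDD has toDebugString:

println(rdd3.toDebugString)   // prints the chain of RDDs (lineage) that the plan is built from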


What benefits do we get from lazy evaluation?
Spark can do additional optimizations, because it has the full execution plan before running anything. Example: from a 4 GB file, if I only want the first line, Spark does not need to read and process the entire file.

In SQL we usually filter data first and then process it; Spark performs similar optimizations
on the DAG automatically.
Spark's map works element by element (a one-to-one mapping).



Predicate Pushdown: moving filters as early as possible in the DAG (closer to the data source) is called predicate pushdown.
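
With the DataFrame API you can actually see this in the plan (a sketch; the Parquet path and column name are assumptions, and spark is a SparkSession):

// spark: org.apache.spark.sql.SparkSession
val df = spark.read.parquet("hdfs:///data/events")   // hypothetical dataset
val filtered = df.filter("country = 'IN'")           // hypothetical column

// explain() prints the physical plan; for Parquet the filter appears in the
// scan as PushedFilters, i.e. it was pushed down to the data source
filtered.explain()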

RDDs are immutable. Every time we do a transformation, a new RDD is created. This makes
recovery easy: Spark internally keeps the RDD lineage and reconstructs an RDD when needed.


Immutability helps in resiliency.
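
In practice (a small sketch): a transformation never modifies its input RDD; it returns a new one, so every RDD earlier in the lineage stays valid for recomputation.

val nums    = sc.parallelize(Seq(1, 2, 3))
val doubled = nums.map(_ * 2)   // new RDD; nums itself is unchanged

nums.collect()      // Array(1, 2, 3)  -- the original is intact
doubled.collect()   // Array(2, 4, 6)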


To check: can we configure the memory size so that contents are written to disk when memory is not enough?
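
Answering the note above: when persisting an RDD, the storage level StorageLevel.MEMORY_AND_DISK spills partitions that do not fit in memory to disk (executor memory itself is set via spark.executor.memory). A minimal sketch:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfspath")
// MEMORY_AND_DISK: partitions that do not fit in memory are spilled to disk
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()   // the first action materializes the RDD and persists it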

Transformation: converts an RDD from one form to another.
Action: requests results.
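
The type signatures make the distinction concrete (a small sketch; the numbers are just an example): a transformation returns another RDD, while an action returns a plain value to the driver.

val rdd = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: RDD -> RDD, lazy
val doubled: org.apache.spark.rdd.RDD[Int] = rdd.map(_ * 2)

// Action: RDD -> plain value, eager
val total: Int = doubled.reduce(_ + _)   // 20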


Spark applications can be written in Java, Scala, Python, and R.
Scala is the best-supported language; Spark itself is written in Scala.
