Spark Parallelization Key Factors

Sivaprasad Mandapati · Published in The Startup · Jan 30, 2021


Spark is a unified analytics engine for big data processing, with built-in modules for ETL, Streaming, SQL, Machine Learning and Graph processing.

Why Spark?

Rather than having a zoo of products, one for each module, why can't we just have a single core engine that serves all of them? I like the picture below from Databricks, which demonstrates how Spark can replace a wide variety of tools across various streams and businesses.

Spark provides in-memory parallel processing across multiple nodes in a cluster. This parallel processing is controlled by a few factors:

  1. Hardware
  2. Logical/Application

Hardware:

One simple statement: more nodes and more cores give greater parallelization. To elaborate it further, the total number of executor cores available to the application caps how many tasks Spark can run at the same time.
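A minimal PySpark sketch of those hardware-level knobs (the executor counts and sizes below are purely illustrative, and in practice they are usually passed to spark-submit rather than set in code):

    from pyspark.sql import SparkSession

    # Purely illustrative sizing: 5 executors x 4 cores each
    # -> up to 20 tasks can run at the same time
    spark = (SparkSession.builder
             .appName("parallelism-demo")
             .config("spark.executor.instances", "5")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "8g")
             .getOrCreate())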

Logical/Application:

I would like to concentrate more on this topic. As developers/programmers we need to understand a few settings in Spark and how Spark performs its computations.

We can divide this parallelization into three types:

  1. Reading
  2. Writing
  3. Shuffling

Reading:

The key factor while reading a file is the partition size defined in the Spark application. By default it is 128 MB, but we are allowed to change this setting to best suit our requirements.

Let's examine how the partition size, controlled by spark.sql.files.maxPartitionBytes, affects parallel processing.
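A hedged sketch of that check, assuming a hypothetical ~1 GB CSV at /data/sales.csv (the path and file size are illustrative, not taken from the article):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parallelism").getOrCreate()

    # Default split size: 128 MB (134217728 bytes)
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # Hypothetical ~1 GB CSV -> about 8 input partitions at the default setting
    df = spark.read.csv("/data/sales.csv", header=True)
    print(df.rdd.getNumPartitions())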

As you can see, with the default setting the input file is split into 8 partitions (at 128 MB each, that is roughly a 1 GB file), and each partition is distributed to a node or core in the cluster, depending on the cluster size.

Let's change the partition size.
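Continuing the same sketch, halving the setting roughly doubles the number of input partitions:

    # Halve the split size to 64 MB -> roughly twice as many input partitions
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

    df = spark.read.csv("/data/sales.csv", header=True)
    print(df.rdd.getNumPartitions())   # ~16 for the same hypothetical ~1 GB file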

It is clearly evident that changing the partition size has a direct effect on the number of partitions and therefore on parallel processing.

Writing:

Once the data has been processed, the ultimate goal is to write the results as efficiently and quickly as possible. The key factors to consider are:

  1. DataFrame partitions
  2. Number of records per partition

DataFrame partitions can either be shrunk or divided further: coalesce() is used to reduce the number of partitions, while repartition() is used to increase or redistribute them, as in the sketch below.
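A minimal sketch of both calls, reusing the df from the reading example and writing to illustrative output paths:

    # df still has 8 partitions here -> 8 Parquet part files are written
    df.write.mode("overwrite").parquet("/output/sales_parquet")

    # coalesce(2) shrinks to 2 partitions (no full shuffle) -> only 2 ORC part files
    df.coalesce(2).write.mode("overwrite").orc("/output/sales_orc")

    # repartition(16) increases/redistributes partitions, at the cost of a shuffle
    df.repartition(16).write.mode("overwrite").parquet("/output/sales_repartitioned")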

The result: the first write produces 8 Parquet part files, while the coalesced write produces only 2 ORC part files.

The other deciding factor while writing is the number of records per partition (output file).
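One common way to control this is the maxRecordsPerFile write option; a short sketch, assuming the 500-record cap from the example and an illustrative output path:

    # Cap every output file at 500 records; Spark splits larger partitions
    # into extra part files to honour the limit
    (df.write
       .option("maxRecordsPerFile", 500)
       .mode("overwrite")
       .parquet("/output/sales_capped"))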

As you can see, many part files are created, and each one holds no more than 500 records.

Shuffle:

As we all know, shuffling is a very costly operation that requires nodes to exchange data over the network. But there is no way around it when performing aggregations that involve grouping.

By default, Spark sets the number of shuffle partitions (spark.sql.shuffle.partitions) to 200. This must (and I do mean must) be changed when dealing with large data sets.

Databricks shares an equation and example for sizing this: roughly, the number of shuffle partitions should be the largest shuffle stage input divided by a target partition size (around 200 MB per partition is a commonly cited target).
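A rough PySpark sketch of that sizing approach, reusing spark and df from the earlier sketches; the 210 GB shuffle input, the 200 MB target and the country column are assumptions for illustration, not figures from the article:

    # Illustrative numbers only: assume the largest shuffle stage reads ~210 GB
    # and we target roughly 200 MB per shuffle partition
    shuffle_input_gb = 210
    target_partition_mb = 200

    num_partitions = int(shuffle_input_gb * 1024 / target_partition_mb)   # 1075
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))

    # Wide operations such as groupBy/join now use ~1075 shuffle partitions
    result = df.groupBy("country").count()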
