What’s New in Spark 3

Nabarun Chakraborti
Jun 28, 2020

Spark 3.0 introduces a number of new features that make execution faster and more efficient.

Approximately 3,400 Jira tickets have been resolved in this release, the largest share of them (46%) in the Spark SQL area.

The major focus areas of the release are:

1. Performance:

Adaptive Query Execution, Dynamic Partition Pruning, Query Compilation Speed Up, Join Hints

2. Built-in Data Sources:

Parquet/ORC Nested Column Pruning, CSV Filter Pushdown,
Parquet Nested Column Filter Pushdown, New Binary Data Source

3. Richer APIs:

Built-in Functions, Pandas UDF Enhancements

4. SQL Compatibility:

Overflow Checking, ANSI Store Assignment, Reserved Keywords

5. Extensibility and Ecosystem:

Data Source V2 API + Catalog Support, Hadoop 3 Support,
Hive 3.X Metastore, Hive 2.3 Execution, JDK 11 support

6. Monitoring and Debuggability:

Structured Streaming UI, DDL/DML Enhancements,
Event Log Rollover, Observable Metrics

With these changes, Spark 3.0 aims to:

  1. Deliver higher performance for batch, interactive, streaming and ML workloads
  2. Enable new use cases and simplify Spark application development through richer APIs and built-in functions (approx. 32 new built-in functions)
  3. Make monitoring and debugging of Spark applications more comprehensive and stable, with an all-new Structured Streaming UI and more readable query execution plans
  4. Enhance the performance and functionality of the built-in data sources
  5. Improve the plug-in interface and extend the supported deployment environments

Below are a few of the major areas that make Spark 3 considerably faster in execution.

Adaptive Query Execution

Spark query optimizer by version:
Spark 1.x: Rule-based
Spark 2.x: Rule + Cost-based
Spark 3.0: Rule + Cost + Runtime (adaptive)

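  • Based on the statistics of the finished plan nodes, re-optimize the execution plan of the remaining part of the query
  • Convert a sort-merge join to a broadcast hash join when one side turns out to be small enough at runtime
  • Shrink the number of reducers by coalescing small shuffle partitions
  • Handle skewed joins by splitting oversized partitions into smaller ones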
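
The three runtime decisions listed above can be sketched in plain Python. This is only a toy model of the idea, not Spark's implementation: the function names, the thresholds and the skew rule are assumptions chosen for illustration. In Spark itself, the feature is controlled by the `spark.sql.adaptive.enabled` setting.

```python
from typing import List

MB = 1024 * 1024
BROADCAST_THRESHOLD = 10 * MB     # illustrative value, not read from Spark
TARGET_PARTITION_SIZE = 64 * MB   # illustrative value, not read from Spark
SKEW_FACTOR = 5                   # illustrative value, not read from Spark


def choose_join(left_bytes: int, right_bytes: int) -> str:
    """Pick a join strategy from sizes measured after the upstream stages finished."""
    if min(left_bytes, right_bytes) <= BROADCAST_THRESHOLD:
        return "broadcast_hash_join"
    return "sort_merge_join"


def coalesce_partitions(sizes: List[int]) -> List[List[int]]:
    """Greedily merge adjacent shuffle partitions up to the target size.

    Returns groups of original partition indices, one group per reducer.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_bytes + size > TARGET_PARTITION_SIZE:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int]) -> List[int]:
    """Flag partitions much larger than the median as candidates for splitting."""
    ordered = sorted(sizes)
    median = ordered[len(ordered) // 2]
    return [i for i, size in enumerate(sizes) if size > SKEW_FACTOR * median]
```

For example, `choose_join(500 * MB, 2 * MB)` picks the broadcast hash join because one side is tiny once its real size is known, and `coalesce_partitions([10 * MB, 20 * MB, 30 * MB, 50 * MB, 5 * MB])` reduces five shuffle partitions to two reducer groups.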

Dynamic Partition Pruning

  • Skip scanning partitions that cannot match, based on the results of other fragments of the same query (for example, a filter on a joined dimension table)
  • Can significantly speed up execution, because less data is read
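
To make the idea concrete, here is a small pure-Python sketch of partition pruning driven by another query fragment. The tables, column names and the `holiday_sales` function are hypothetical and only model the behavior; Spark applies the same idea inside its query planner.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical fact table, stored as one list of rows per partition key (the date).
SALES_BY_DATE: Dict[str, List[Tuple[str, int]]] = {
    "2020-06-01": [("tv", 2), ("phone", 5)],
    "2020-06-02": [("tv", 1)],
    "2020-06-03": [("laptop", 4)],
}

# Hypothetical dimension table: (date, is_holiday).
DATES: List[Tuple[str, bool]] = [
    ("2020-06-01", False),
    ("2020-06-02", True),
    ("2020-06-03", False),
]


def holiday_sales() -> Tuple[List[Tuple[str, str, int]], List[str]]:
    """Join sales with holiday dates, scanning only the partitions that can match."""
    # Step 1: evaluate the filtered dimension side first and collect its join keys.
    wanted: Set[str] = {date for date, is_holiday in DATES if is_holiday}
    # Step 2: use those keys as a runtime partition filter on the fact table.
    scanned = [date for date in SALES_BY_DATE if date in wanted]
    # Step 3: read only the surviving partitions and produce the join output.
    rows = [(date, item, qty) for date in scanned for item, qty in SALES_BY_DATE[date]]
    return rows, scanned
```

In this sketch only one of the three sales partitions is read; the other two are skipped without being scanned, which is where the speed-up comes from.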

Optimizer Hints

Join hints influence the optimizer to choose one of the following join strategies:

Broadcast hash join: requires one side to be small; no shuffle, no sort, very fast
(SELECT /*+BROADCAST(a)*/ id FROM a JOIN b ON a.key = b.key)

Sort-merge join: robust, can handle any data size; needs a shuffle and a sort, so it is slower when the tables are small
(SELECT /*+MERGE(a,b)*/ id FROM a JOIN b ON a.key = b.key)

Shuffle hash join: needs a shuffle but no sort; can handle large tables, but may run out of memory (OOM) if the data is skewed
(SELECT /*+SHUFFLE_HASH(a,b)*/ id FROM a JOIN b ON a.key = b.key)

Shuffle-and-replicate nested loop join: no join keys required
(SELECT /*+SHUFFLE_REPLICATE_NL(a,b)*/ id FROM a JOIN b)
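
A hint changes how a join is executed, not what it returns. The following single-machine sketch contrasts the core of a broadcast hash join with the core of a sort-merge join. It is a toy model under simplifying assumptions (rows are `(key, value)` tuples, no shuffle is modeled, and the function names are made up for illustration), not Spark's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]          # (key, value)
Joined = Tuple[int, str, str]  # (key, left value, right value)


def broadcast_hash_join(small: List[Row], large: List[Row]) -> List[Joined]:
    """Build a hash table on the small side, then stream the large side past it."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # No sorting is needed: each large-side row is matched with one lookup.
    return [(key, s, value) for key, value in large for s in table.get(key, [])]


def sort_merge_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Sort both sides by key, then merge them with two cursors."""
    left, right = sorted(left), sorted(right)
    out: List[Joined] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out
```

Both functions produce the same joined rows for the same inputs. The hash version avoids sorting but must hold one whole side in memory, which is why that side has to be small; the merge version pays for the sort but only needs to keep one run of equal keys in view at a time.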

