What’s New in Spark 3

1. Performance:

Adaptive Query Execution, Dynamic Partition Pruning, Query Compilation Speed Up, Join Hints

2. Built-in Data Sources:

Parquet/ORC Nested Column Pruning, CSV Filter Pushdown,
Parquet Nested Column Filter Pushdown, New Binary File Data Source (sketch below)
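
For the new binary file data source, here is a minimal PySpark sketch (the directory path and glob pattern are placeholders): each matched file becomes one row carrying its path, modification time, length and raw bytes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binary-source-demo").getOrCreate()

# Read every PNG under a placeholder directory as binary records; each row
# exposes path, modificationTime, length and the raw content bytes.
images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.png")
    .load("/data/images")  # placeholder path
)
images.select("path", "length").show(truncate=False)
```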

3. Richer APIs:

Built-in Functions, Pandas UDF Enhancements (type-hint example below),
DELETE/UPDATE/MERGE in Catalyst
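
The Pandas UDF enhancements centre on Python type hints: Spark 3 infers the UDF variant from the function signature instead of the older PandasUDFType constants. A minimal sketch (function and column names are illustrative; pandas and pyarrow must be installed):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Series -> Series Pandas UDF; the variant is inferred from the type hints.
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```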

4. SQL Compatibility:

Overflow Checking, ANSI Store Assignment, Reserved Keywords (ANSI mode sketch below)
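
A small sketch of the overflow-checking difference using the spark.sql.ansi.enabled flag (the session-level toggling here is only for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Legacy behaviour: integer overflow silently wraps around.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(2147483647 AS INT) + 1 AS v").show()   # -2147483648

# ANSI mode: the same query fails fast with an arithmetic error.
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT CAST(2147483647 AS INT) + 1 AS v").show()
except Exception as e:
    print("Overflow rejected under ANSI mode:", type(e).__name__)
```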

5. Extensibility and Ecosystem:

Data Source V2 API + Catalog Support, Hadoop 3 Support,
Hive 3.X Metastore, Hive 2.3 Execution, JDK 11 support

6. Monitoring and Debuggability:

Structured Streaming UI, DDL/DML Enhancements,
Event Log Rollover (configuration sketch below), Observable Metrics
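
Event log rollover is driven by startup configuration, so the relevant settings go on the session builder. A hedged sketch (the event-log directory is a placeholder):

```python
from pyspark.sql import SparkSession

# Event log rollover splits a long-running application's event log into
# multiple files so older ones can be compacted or cleaned up.
spark = (
    SparkSession.builder
    .appName("event-log-rollover-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")        # placeholder directory
    .config("spark.eventLog.rolling.enabled", "true")
    .config("spark.eventLog.rolling.maxFileSize", "128m")
    .getOrCreate()
)
```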

  1. Achieve high performance for batch, interactive, streaming and ML workloads
  2. Enable new use cases and simplify Spark application development with richer APIs and built-in functions (approx. 32 new built-in functions)
  3. Make monitoring and debugging Spark applications more comprehensive and stable: an all-new Structured Streaming UI and more readable query execution plans
  4. Enhance the performance and functionality of the built-in data sources
  5. Improve the plug-in interfaces and extend the supported deployment environments

Adaptive Query Execution

  • Re-optimizes the execution plan of the remaining query stages based on runtime statistics from the already finished plan nodes
  • Converts sort-merge joins to broadcast hash joins when one side turns out to be small at runtime
  • Shrinks the number of reducers by coalescing small shuffle partitions
  • Handles skewed joins by splitting oversized partitions (see the configuration sketch after this list)
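
A minimal sketch of turning these AQE behaviours on in PySpark (all three flags are standard Spark 3 configurations; the toy join is only there to have a plan to explain):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-optimize the remaining stages from runtime shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce small shuffle partitions, i.e. shrink the number of reducers.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed partitions so one oversized task does not dominate a join.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

fact = spark.range(1_000_000).withColumnRenamed("id", "k")
dim = spark.range(100).withColumnRenamed("id", "k")
# If runtime statistics show the build side is small enough, AQE can replace
# a planned sort-merge join with a broadcast hash join after the shuffle stage.
fact.join(dim, "k").explain()
```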

Dynamic Partition Pruning

  • Avoids scanning partitions of a large table by reusing the runtime results of other query fragments, typically the filtered dimension side of a join
  • Delivers a significant execution speed-up on partitioned tables (see the sketch after this list)
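
A sketch of dynamic partition pruning on a toy star schema (the sales/dates tables are created here purely for illustration; the pruning flag is on by default in Spark 3):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # default: true

# Toy fact table partitioned by day, plus a small dimension table.
spark.range(1000).selectExpr("id % 10 AS day", "id AS item", "rand() AS amount") \
    .write.mode("overwrite").partitionBy("day").saveAsTable("sales")
spark.createDataFrame([(d, d in (0, 6)) for d in range(10)], ["day", "is_holiday"]) \
    .write.mode("overwrite").saveAsTable("dates")

# Spark can evaluate the filter on the dimension side at runtime and use it to
# prune the partitions of `sales`, scanning only the matching `day` directories.
spark.sql("""
    SELECT s.item, s.amount
    FROM sales s
    JOIN dates d ON s.day = d.day
    WHERE d.is_holiday = true
""").explain()
```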

Optimizer Hints

Join hints influence the optimizer to choose one of the following join strategies: BROADCAST, MERGE (sort-merge), SHUFFLE_HASH, or SHUFFLE_REPLICATE_NL.
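
A short PySpark sketch of these hints, via both the DataFrame API and the SQL comment syntax (the DataFrames and view names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.range(1_000_000).withColumnRenamed("id", "k")
b = spark.range(1_000).withColumnRenamed("id", "k")

# DataFrame API: request a specific strategy on one side of the join.
a.join(b.hint("broadcast"), "k").explain()       # broadcast hash join
a.join(b.hint("shuffle_hash"), "k").explain()    # shuffle hash join
a.join(b.hint("merge"), "k").explain()           # sort-merge join

# Equivalent SQL hint syntax over temporary views.
a.createOrReplaceTempView("big")
b.createOrReplaceTempView("small")
spark.sql("SELECT /*+ BROADCAST(small) */ * FROM big JOIN small ON big.k = small.k").explain()
```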

Nabarun Chakraborti, Big Data Solution Architect and pySpark Developer