Before discussing various optimization techniques, let's take a quick look at how Spark runs.

How does Spark run:

The user submits an application using spark-submit in cluster mode (there are local and client modes too, but here we consider a production situation).

The spark-submit utility then communicates with the Resource Manager to start the Application Master on one of the data nodes.

The Driver program is launched inside the Application Master container.
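As a sketch, a cluster-mode submission on YARN looks something like the command below. The application file, resource sizes, and queue are placeholders, not values from any particular setup:

```shell
# Hypothetical spark-submit invocation in cluster mode on YARN.
# File name and resource settings are illustrative placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_app.py
```

With `--deploy-mode cluster`, the driver runs inside the Application Master container on the cluster rather than on the machine that ran the command.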

You can consider this a reference notebook in which we will cover the topics below:

  1. Spark SQL Execution Plan
  2. Lineage vs. DAG
  3. Narrow vs. Wide Dependency
  4. How Does Spark Read a Large File


Whenever we create a DataFrame, a Spark SQL query, or a Hive query, Spark will:

i. Generate an Unresolved Logical Plan.

ii. Then apply Analysis rules and the Schema Catalog to convert it into a Resolved Logical Plan.

What’s happening here? Spark resolves the sources of the datasets and the types of the columns.

iii. Finally, the Optimization rules produce an Optimized Logical Plan.

The above-mentioned 3 steps…

We all know that caching means keeping important or frequently used data in memory rather than on disk for faster execution. But someone has to be sensible enough to decide what needs to be cached, and how. There is a common tendency to start keeping any data in memory without realizing that memory is expensive; moreover, caching unwanted data compromises execution time.

There may be different reasons for caching data, but below are the two most important:

1. To reduce I/O (network calls) as much as possible.

2. To avoid re-computation.

Once we decide…

There are scenarios where we use an application password directly in our code. This is a completely insecure practice: passwords should be hidden.

Here I’ve used Python (the cryptography package) to encrypt a password and later decrypt it for further use.

> pip install cryptography

Mostly, we have seen that application passwords do not change very frequently. Hence, I have taken the approach below:

Place a simple text file containing ONLY the password string.
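A minimal sketch of the encrypt/decrypt round trip with cryptography's Fernet API. The password literal stands in for the contents of that text file, so the sketch is self-contained:

```python
from cryptography.fernet import Fernet

# Generate a key once and store it somewhere restricted; anyone who holds
# this key can decrypt the password.
key = Fernet.generate_key()
fernet = Fernet(key)

# In practice this would be read from the text file described above.
password = b"s3cr3t-app-password"

token = fernet.encrypt(password)   # store this encrypted token on disk
recovered = fernet.decrypt(token)  # decrypt later, when the app needs it

assert recovered == password
```

The key, not the token, is now the secret to protect, so keep it out of the code and the repository (for example in a file with restricted permissions).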

This is a very basic but powerful Python program to track the live position of the ISS. My son used to run it every 30–40 minutes to see its position :-)

Prerequisites: Python, an IDE, and a few packages — pandas, plotly, google.

Reference: We are going to use an existing API to get the location and other information in our code. You can refer to the link below in case you are curious:

Here, you will find the file below, which registers the latitude and longitude of the ISS.

## Content of the above JSON file:

{"iss_position": {"longitude": "-41.8150", "latitude"…
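A minimal sketch of parsing that payload. The endpoint shown is Open Notify's public iss-now.json API, which returns JSON in the format above; the live call is left commented out since it needs network access:

```python
import json
from urllib.request import urlopen  # for the live call sketched below

ISS_URL = "http://api.open-notify.org/iss-now.json"  # Open Notify public API

def parse_position(payload):
    """Extract (latitude, longitude) as floats from the JSON shown above."""
    pos = json.loads(payload)["iss_position"]
    return float(pos["latitude"]), float(pos["longitude"])

# Live call (uncomment when online):
#   with urlopen(ISS_URL) as resp:
#       lat, lon = parse_position(resp.read().decode())
#       print(f"ISS is currently at lat={lat}, lon={lon}")
```

Once you have the latitude/longitude pair, plotting it with plotly on a world map is the fun part for repeat runs.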

JSON (JavaScript Object Notation) is a lightweight format to store and exchange data. The input JSON may come in different forms —

  • simple,
  • multi-line with a complex structure,
  • an HTTP link,
  • a CSV that contains a JSON column.

Below, we will cover all of the above scenarios:

1. Simple JSON:

Input JSON file (Simple.json) -

Spark is no doubt a powerful processing engine and a distributed cluster-computing framework for fast processing. It is being enriched day by day with additional cool features. Unfortunately, there are a few areas where Spark struggles, but if we combine Spark with Delta Lake, we can overcome all of those challenges. A few of the drawbacks are —

  1. Spark is not ACID compliant.
  2. Issues with small-file processing.
  3. Lack of schema enforcement.

1. What is ACID?

Data is all around us, and Twitter is one of the golden sources of data for any kind of sentiment analysis. There are lots of ways to read Twitter live data and process it. In this article I will demonstrate how easily we can create a connection with a Twitter account to get the live feed, and then transform the data using Spark Structured Streaming. This article is not about applying machine-learning algorithms or running any predictive analysis.

What are we planning to do?

This is basic and simple documentation for those who have never done any kind of video processing to detect different kinds of objects, like cars, humans, buses, etc. If you have free time and are interested in playing around, please follow this documentation. I hope it will give you some joy as a beginner.

What is OpenCV?

We all know that OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library, used mainly for computer vision, machine learning, and image processing. …

The focus here is to analyse a few use cases and design an ETL pipeline with the help of Spark Structured Streaming and Delta Lake.

Why Delta Lake? Because this open-source storage layer brings ACID transactions to big data workloads. It also combines the best of the data warehouse with the best of the data lake.

Nabarun Chakraborti

Big Data Engineer and pySpark Developer
