This is a very basic but powerful Python program to track the live position of the ISS. My son used to run it every 30–40 minutes to see its position :-)

Prerequisites: Python, an IDE, and a few installed packages — pandas, plotly, google.

Reference: We are going to use an existing API to get the location and other information in our code. You can refer to the link below in case you are curious:

http://open-notify.org/Open-Notify-API/ISS-Location-Now/

There, you will find the file below, which reports the latitude and longitude of the ISS.

http://api.open-notify.org/iss-now.json

## Content of the above JSON file:

{"iss_position": {"longitude": "-41.8150", "latitude"…

JSON (JavaScript Object Notation) is a lightweight format for storing and exchanging data. The input JSON may come in different formats —

  • simple,
  • multi line with complex format,
  • HTTP link,
  • a CSV which contains a JSON column.

Below, we will cover all of the above scenarios:

1. Simple JSON:

Input JSON file (Simple.json) -
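The contents of Simple.json appear as an image in the original post. As a hedged sketch (the file path is illustrative), reading a simple single-line JSON file into a DataFrame in PySpark might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadSimpleJson").getOrCreate()

# Read the single-line JSON file; Spark infers the schema automatically
df = spark.read.json("Simple.json")
df.printSchema()
df.show(truncate=False)
```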


Spark is no doubt a powerful processing engine and a distributed cluster computing framework for faster processing. It is getting enriched day by day with additional cool features. Unfortunately, there are a few areas where Spark struggles, but if we combine Spark with Delta Lake, it can overcome all of those challenges. A few of the drawbacks are listed below (a short Delta Lake sketch follows the list) —

  1. Spark is not ACID compliant.
  2. Issue with Small File processing.
  3. Lack of Schema enforcement.
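As a quick illustration of the first point, here is a minimal sketch of writing a table with ACID guarantees (assuming the delta-spark package is on the classpath; the session configuration keys below are Delta's documented ones, and the data is toy data):

```python
from pyspark.sql import SparkSession

# Assumption: the delta-spark package is available, and the session is
# configured with Delta's SQL extensions
spark = (SparkSession.builder.appName("DeltaAcidDemo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
         .getOrCreate())

df = spark.range(5)  # toy data for illustration

# The write is one atomic transaction: concurrent readers either see the
# old version of the table or the new one, never partial files
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.read.format("delta").load("/tmp/delta/events").show()
```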

1. What is ACID?


Data is all around us, and Twitter is one of the golden sources of data for any kind of sentiment analysis. There are lots of ways to read live Twitter data and process it. In this article I will demonstrate how easily we can create a connection with a Twitter account to get the live feed and then transform the data using Spark Structured Streaming. This article is not about applying machine learning algorithms or running any predictive analysis.
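Before the walkthrough, here is a minimal sketch of the Spark side of such a pipeline. A common pattern (an assumption here, not necessarily this article's exact approach) is to forward live tweets to a local socket and read them with Spark's socket source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwitterStream").getOrCreate()

# Assumption: a separate helper script forwards live tweets to localhost:5555;
# Spark then receives each tweet as one row in the streaming 'value' column
tweets = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 5555)
          .load())

# Print the incoming tweets to the console as they arrive
query = (tweets.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```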

What are we planning to do?


This is basic and simple documentation for those who have never done any kind of video processing to detect different kinds of objects like cars, humans, buses, etc. If you have free time and are interested in playing around, then please follow this documentation. I hope it will give you some joy as a beginner.

What is OpenCV?

As we all know, OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library, used mainly for computer vision, machine learning, and image processing. …
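As a taste of what is to come, here is a minimal sketch of reading a video frame by frame with OpenCV (the file name is hypothetical; the detection logic would come later):

```python
import cv2  # pip install opencv-python

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical input video

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:          # end of video (or read error)
        break
    # Object detection (cars, humans, buses, ...) would be applied to 'frame' here
    cv2.imshow("frame", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit early
        break

cap.release()
cv2.destroyAllWindows()
```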


The focus here is to analyse a few use cases and design an ETL pipeline with the help of Spark Structured Streaming and Delta Lake.

Why Delta Lake? Because this open-source storage layer brings ACID transactions to big data workloads. It also combines the best of the data warehouse with the best of the data lake.
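As a minimal sketch of such a pipeline (the schema, paths, and checkpoint location are all illustrative assumptions, and a Delta-enabled session is assumed), a streaming read that lands data in a Delta table could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("StreamingDeltaETL").getOrCreate()

# Hypothetical schema for incoming JSON events
input_schema = StructType([
    StructField("id", StringType()),
    StructField("event", StringType()),
])

stream = (spark.readStream
          .format("json")
          .schema(input_schema)
          .load("/data/landing"))

# The Delta sink plus a checkpoint gives the pipeline ACID guarantees
# and exactly-once processing across restarts
(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/data/checkpoints/events")
 .outputMode("append")
 .start("/data/delta/events"))
```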


You can consider this one of the reference notebooks, where we will cover the below topics:

  1. Spark SQL Execution Plan
  2. Lineage vs. DAG
  3. Narrow vs. Wide Dependency
  4. How Does Spark Read a Large File

1. SPARK SQL EXECUTION PLAN:

Whenever we create a DataFrame, a Spark SQL query, or a Hive query, Spark will:

i. Generate an Unresolved Logical Plan.

ii. Apply analysis rules and the schema catalog to convert it into a Resolved Logical Plan.

What’s happening here? Spark resolves the sources of the datasets and the types of the columns.

iii. Apply optimization rules to finally create an Optimized Logical Plan.

The above-mentioned 3 steps…
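A quick way to see all of these plans from PySpark is DataFrame.explain (the DataFrame below is toy data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExplainDemo").getOrCreate()

df = spark.range(10).withColumnRenamed("id", "n").filter("n > 5")

# extended=True prints the parsed (unresolved) logical plan, the analyzed
# (resolved) logical plan, the optimized logical plan, and the physical plan
df.explain(extended=True)
```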


A few new features available in Spark 3.0 will make it more efficient and faster in execution.

Approximately 3,400 JIRA tickets have been resolved, the majority in the Spark SQL area (around 46%).

From a deployment point of view, below are the major focus areas (a short sketch of a couple of them follows the list):

1. Performance :

Adaptive Query Execution, Dynamic Partition Pruning, Query Compilation Speed Up, Join Hints

2. Built-in Data Sources:

Parquet/ORC Nested Column Pruning, CSV Filter Pushdown,
Parquet Nested Col Filter Pushdown, New Binary Data Source

3. Richer APIs:

Built-in Functions, Pandas UDF Enhancements,
DELETE/UPDATE/MERGE in Catalyst

4. SQL Compatibility:

Overflow Checking, ANSI Store Assignment, Reserved Keywords

5. Extensibility and Ecosystem:

Data Source V2 API + Catalog Support, Hadoop 3 Support,
Hive 3.X Metastore, Hive 2.3 Execution…
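As a small illustration, a couple of these features can be tried directly from PySpark (the configuration keys below are Spark 3.0's documented ones; the tables are toy data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark3Features").getOrCreate()

# Adaptive Query Execution and Dynamic Partition Pruning switches
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Join hints can now be given in SQL as well (BROADCAST shown here)
spark.range(100).createOrReplaceTempView("small")
spark.range(1_000_000).createOrReplaceTempView("large")
spark.sql(
    "SELECT /*+ BROADCAST(small) */ * FROM large JOIN small USING (id)"
).explain()
```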


Before discussing various optimization techniques, let's have a quick review of how Spark runs.

How does Spark run:

A user submits an application using spark-submit in cluster mode (there are local and client modes too, but we are considering a production situation).

The spark-submit utility will then communicate with the Resource Manager to start the Application Master on one of the data nodes.

The Driver program will be launched inside the Application Master container.
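For reference, a typical cluster-mode submission looks like this (the resource numbers and application file are placeholders, not the article's own values):

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_app.py
```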


To process unstructured data, we can either use Spark's built-in functions or create our own functions to transform the unstructured data into a structured form, based on the requirements.

Example: let's say we have the below unstructured input data file, and we need to pull out the highlighted part and create a DataFrame for further processing.
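Since the sample file itself appears as an image in the original post, here is a hedged sketch of the general technique with a made-up raw line, using regexp_extract to pull fields out of unstructured text:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("UnstructuredDemo").getOrCreate()

# Illustrative raw line standing in for the article's input file
lines = spark.createDataFrame(
    [("user=alice status=OK time=12:30",)], ["value"]
)

# Pull the needed fields out with regular expressions
df = lines.select(
    regexp_extract("value", r"user=(\w+)", 1).alias("user"),
    regexp_extract("value", r"status=(\w+)", 1).alias("status"),
)
df.show()
```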

Nabarun Chakraborti

Big Data Engineer and pySpark Developer
