Read JSON using PySpark
The JSON (JavaScript Object Notation) is a lightweight format to store and exchange data. The input JSON may be in different format —
- simple,
- multi line with complex format,
- HTTP link,
- a CSV which contains JSON column.
Below will cover all the above scenarios:
1. Simple JSON:
Input JSON file (Simple.json) -
Code to read —
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("readJSON").getOrCreate()
readJSONDF = spark.read.json('Simple.json')
readJSONDF.show(truncate=False)
Output —
2. Multi-Line Complex JSON:
Input JSON file (ComplexJSON.json) — we will read Name and Location
Code to read-
To read multi-line JSON file we need to use option(“multiLine”,”true”). Also, in the above picture, the “details” tag is an array, so to read the content inside an array element we need to Explode it first. The below code will show how we can read Location and Name from the above input file
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("readJSON").getOrCreate()
# use multiline = true to read multi line JSON file
readComplexJSONDF = spark.read.option("multiLine","true").json('ComplexJSON.json')
# Explode Array to Structure
explodeArrarDF = readComplexJSONDF.withColumn('Exp_RESULTS',F.explode(F.col('details'))).drop('details')
# Read location and name
dfReadSpecificStructure = explodeArrarDF.select("Exp_RESULTS.user.location.*","Exp_RESULTS.user.name.*")
dfReadSpecificStructure.show(truncate=False)
Output —
3. Read from HTTP link:
Input JSON file (online data) — let’s read all the users name from the file
Schema:
Code to read-
To read online JSON we will use urllib.request, open the URL content using utf-8 decoding. After that convert into an RDD and finally create a Dataframe. Same like above will Explode the array and then read the struct ‘name’. Remember if there is no Array element then no need to explode.
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import functions as F
from urllib.request import Request, urlopen
sc = SparkContext(master="local[*]", appName= "readJSON")
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("readJSON").getOrCreate()
# Online data source
onlineData = 'https://randomuser.me/api/0.8/?results=10'
# read the online data file
httpData = urlopen(onlineData).read().decode('utf-8')
# convert into RDD
rdd = sc.parallelize([httpData])
# create a Dataframe
jsonDF = spark.read.json(rdd)
# read all the users name:
readUser = jsonDF.withColumn('Exp_Results',F.explode('results')).select('Exp_Results.user.name.*')
readUser.show(truncate=False)
Output —
4. CSV file contains JSON column:
Input CSV file (mixJSON.csv) — the input file is a CSV where “Info” column contains JSON like data. Will read Student name, Nationality and Subject details.
Code to read-
Will read the CSV file with option(“multiLine”,”true”) to get the multi line JSON format and option(‘escape’,”\””) to ignore “ inside JSON content (Info column in the above pic). Then we will use JSON_TUPLE to read the required details from the JSON content. JSON_TUPLE has two parameters, 1st is the column name and next are required tag values we are interested in.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("readJSON").getOrCreate()
# escape all " in the JSON content to read properly
readCSVFileDF = spark.read.option("multiLine","true").option('escape',"\"").option('header',True).csv('mixJSON4.csv')
readCSVFileDF.printSchema()
readCSVFileDF.show(truncate=False)
# use JSON_TUPLE to read required contents : student, nationality and subject from Info column
readJSONContentDF1 = readCSVFileDF.select("*",F.json_tuple("Info","student","nationality","subject")).drop('Info')
readJSONContentDF1.show(truncate=False)
# use JSON_TUPLE to read student details - gender, city, email
readJSONContentDF2 = readJSONContentDF1.select('*',F.col('c1').alias('Nationality'),F.json_tuple('c0','gender','city','email').alias('Gender','City','MailId')).drop('c0','c1')
readJSONContentDF2.show(truncate=False)
# use JSON_TUPLE to read subject details - mainsubject and optional
finalDF = readJSONContentDF2.select('*',F.json_tuple('c2','main','optional').alias('MainSubject','OptionalSubject')).drop('c2')
finalDF.show(truncate=False)
# another way : using withcolumn
readJSONDF = readCSVFileDF.select("*",F.json_tuple("Info","student","nationality","subject")).drop('Info')
readJSONDF.show(truncate=False)
finalDF1 = readJSONDF.withColumn('Gender',F.json_tuple('c0','gender')).\
withColumn('City',F.json_tuple('c0','city')).\
withColumn('MailId',F.json_tuple('c0','email')).\
withColumn('MainSubject',F.json_tuple('c2','main')).\
withColumn('OptionalSubject',F.json_tuple('c2','optional')).\
withColumn('Nationality',F.col('c1')).drop('c0','c1','c2')
finalDF1.show(truncate=False)
Output —
# another way : using withcolumn
Happy Reading !!