
HOW TO PARSE XML DATA TO A SPARK DATAFRAME

Purpose:-

In one of my projects, I had a ton of XML data to parse and process. XML is an excellent tag-based format, organized much like key-value pairs. JSON is similar, but it is essentially a stripped-down version of XML, so JSON is very lightweight while XML is comparatively heavy.
Initially, we thought of using plain Python to parse the data and convert it to JSON for Spark to process. However, the challenge was the size of the data: parsing the entire 566 GB with Python alone would take a long time.
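As a sketch of that first idea, here is how a single record could be converted to JSON with the Python standard library (the <sms> tag and its fields are hypothetical, just for illustration):

```python
import json
import xml.etree.ElementTree as ET

# One hypothetical XML record: each child tag becomes a JSON key,
# and the tag's text becomes the value.
record = "<sms><address>+15550100</address><body>hello</body></sms>"

element = ET.fromstring(record)
as_dict = {child.tag: child.text for child in element}
as_json = json.dumps(as_dict)
print(as_json)  # {"address": "+15550100", "body": "hello"}
```

This works fine for a handful of records, but looping over hundreds of gigabytes this way in a single Python process is exactly the bottleneck described above.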
So the obvious choice was PySpark. We wanted to parse the data against a schema into a DataFrame for post-processing. However, as far as I know, PySpark does not support the XML format out of the box.
This document will demonstrate how to work with XML in PySpark. The same method should work in Spark with Scala without significant changes.

Option 1:-

Use the spark-xml parser from Databricks.
Databricks provides two builds of the parser: one compiled with Scala 2.11 and another with Scala 2.12.
Please make sure you use the one matching the Scala version your Spark was compiled with.
I have Spark compiled with Scala 2.11.
We can include the dependency in a pom or sbt file, or open spark-shell (or spark-submit) with "--packages com.databricks:spark-xml_2.11:0.6.0".
Ref: https://stackoverflow.com/questions/50429315/read-xml-in-spark


Option 2:- 

If, like my company, you have difficulties compiling the code with Maven or sbt because dependencies cannot be downloaded from the internet, you can download the JAR files directly from Maven Central:

Scala 2.11: https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.6.0/spark-xml_2.11-0.6.0.jar
Scala 2.12: https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.6.0/spark-xml_2.12-0.6.0.jar

While launching spark-shell or submitting a job, include the JAR with --jars full/path/of/the/jar:
pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar

How to parse the data:-

It is very simple: read the XML with the format option, and Spark should infer the schema.

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "SmsRecords") \
    .option("rowTag", "sms") \
    .load("full/path/of/the/xml")

Note:- the "rootTag" is the outermost (root) tag of the XML document, while the "rowTag" marks the element that is treated as one row of the DataFrame.
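To make the two options concrete, here is a tiny standalone sketch (using only the Python standard library, not spark-xml) of a document whose rootTag is SmsRecords and whose rowTag is sms; the field names inside each record are hypothetical:

```python
import xml.etree.ElementTree as ET

# <SmsRecords> is the rootTag (outermost element); each <sms> element
# is one rowTag occurrence, i.e. one row of the resulting DataFrame.
sample = """<SmsRecords>
  <sms><address>+15550100</address><body>hello</body></sms>
  <sms><address>+15550101</address><body>world</body></sms>
</SmsRecords>"""

root = ET.fromstring(sample)
rows = root.findall("sms")

print(root.tag)   # SmsRecords -> the rootTag
print(len(rows))  # 2 -> two rows in the DataFrame
print([e.find("body").text for e in rows])  # ['hello', 'world']
```

With the spark-xml options above, each <sms> element would become one DataFrame row, with columns such as address and body inferred from the child tags.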


Screen capture of my code and the resulting data frame
