Purpose :- In one of my projects, I had a ton of XML data to parse and process. XML is an excellent format with tags, much like key-value pairs. JSON is similar, but more of a stripped-down version of XML, so JSON is very lightweight while XML is heavy. Initially, we thought of using Python to parse the data and convert it to JSON for Spark to process. However, the challenge was the size of the data: parsing the entire 566GB in plain Python alone would take a long time. So the obvious choice was PySpark. We want to parse the data against a schema into a DataFrame for post-processing. However, as far as I know, PySpark does not support the XML format out of the box. This document will demonstrate how to work with XML in PySpark. The same method should work in Spark with Scala without significant changes.

Option 1:- Use the spark-xml parser from Databricks

Databricks provides two spark-xml builds: one compiled with Scala 2.11 and another with Scala 2.12. Please make sure yo...
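To illustrate the pure-Python approach we initially considered, here is a minimal sketch that parses XML rows and emits one JSON document per record, the shape Spark can later ingest with `spark.read.json`. The `record` row tag and the sample fields (`id`, `name`) are hypothetical placeholders, not the real schema from the project.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real 566GB dataset would be streamed, not inlined.
xml_data = """<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>"""

def xml_to_json_lines(xml_text, row_tag="record"):
    """Parse XML text and return one JSON string per row element."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter(row_tag):
        # Flatten each row's child elements into a simple tag -> text dict.
        rows.append({child.tag: child.text for child in elem})
    return [json.dumps(row) for row in rows]

lines = xml_to_json_lines(xml_data)
for line in lines:
    print(line)
```

At hundreds of gigabytes, looping over records in a single Python process like this is exactly the bottleneck described above, which is why the discussion below moves to parsing the XML directly in PySpark instead.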