
Cloud Computations - Quick data analysis with AWS Athena, Glue and Databricks spark


Throughout my career, I have repeatedly had to fix failing production jobs. Most of the time, the debugging involved analyzing the input data to find the errors in the raw data.

For the last ten years, I have also been doing data analysis to provide quick business insights. This often involves running a complex query on a large data set.


Most of the time, we do not have access to the production environment to debug a job or install the required packages. It's also advisable not to debug jobs in the production environment as it might have a negative performance impact or completely break the job.


We have been using a few tools to debug, mainly Hive, Presto, Tableau, etc. These tools are not always the best option, as debugging jobs that fail because of data issues often requires custom code, SerDes, or packages. I like to use Spark; however, Spark does not have a pretty GUI, which data analysts love. We can certainly use an IPython (Jupyter) notebook or Zeppelin, but both tools require a considerable amount of time to set up, and they are often not available in production.


So I was looking for an alternative that can be set up quickly, has a nice GUI, and scales easily.

I was looking into the cloud (AWS) and came across Athena and Glue, so I tried them on my use case. For input data, I used Twitter data (5.5 GB, and later 20 GB for scalability testing) that I had collected last year. This data is complex enough to serve our test purpose.


Athena: 

      • Serverless, so no need to set up and manage the infrastructure
      • Supports standard SQL, so no need to learn a new language
      • Stores the data in S3 and uses that same data for analysis in Athena; no need to pre-process the data
      • Hive in the backend, so it supports all Hive properties and a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
      • We can easily use a Hive serializer/deserializer (SerDe) to parse custom data formats
      • As this is a managed service, the performance might not be the same over time
      • External table - by default it creates an external table. You cannot create an internal (managed) table, even if you change the CREATE TABLE statement. This is done to avoid accidentally dropping a table and deleting the underlying data from S3
      • It saves the results of every query we run in Athena in a temp/work dir under the user-specified S3 location. We need some mechanism to delete these dirs/files at regular intervals; otherwise, this will spin into a monster eating the entire S3 budget (see the sketch after this list)
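
    One way to keep that query-result directory under control is an S3 lifecycle rule that expires objects under the results prefix. Below is a minimal sketch using boto3; the bucket name, prefix, and 7-day retention are assumptions for illustration, not values from this post.

import boto3

# Hypothetical bucket/prefix used for Athena query results - replace with your own
BUCKET = "my-athena-results-bucket"
RESULT_PREFIX = "athena-query-results/"

s3 = boto3.client("s3")

# Expire Athena query-result objects after 7 days (assumed retention period)
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-query-results",
                "Filter": {"Prefix": RESULT_PREFIX},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)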

    To test this service, I kept the input data in an S3 bucket and then created an Athena table from it. I chose JSON as the data format, and the console asked me to enter the column names manually. This is a challenging job for my use case, as the JSON has over 300 columns, and entering 300 columns by hand is painful. I had hoped the UI could parse the file (or use the first row as column names for CSV) to infer the schema; that would have been an excellent facility for creating a table with many columns. However, scanning files to infer the schema can take longer and cost more if you have a sizeable amount of data. I've found that writing the DDL manually is a better option than using the GUI.
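
    If a Spark session is available (for example, the Databricks cluster used later in this post), one workaround is to let Spark infer the JSON schema from a small sample and print the column list, which can then be adapted into the Athena DDL. A rough sketch, where the sample path is a placeholder and Spark's type names still need to be reviewed against Athena/Hive types:

# Infer the schema from a small JSON sample and print "name type" pairs
# that can be pasted into an Athena CREATE TABLE statement after review.
sample = spark.read.json("s3a://my-bucket/corona/sample.json")  # hypothetical sample file

for field in sample.schema.fields:
    # simpleString() returns Spark type names (string, bigint, struct<...>, ...),
    # which mostly line up with Hive/Athena types but are not guaranteed to.
    print(f"{field.name} {field.dataType.simpleString()},")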


    I also noticed that the GUI cannot run multiple queries together, like a script. This is ridiculous: if I want to run a script, I must run one query at a time.
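
    If you do need to run several statements back to back, one workaround is to drive Athena through its API instead of the console. A minimal sketch with boto3, where the database, result location, and queries are placeholders (a real script would also poll get_query_execution and wait for each query to finish before starting the next):

import boto3

athena = boto3.client("athena")

# Hypothetical list of statements to run in order
queries = [
    "MSCK REPAIR TABLE corona_tweets",
    "SELECT count(*) FROM corona_tweets",
]

for q in queries:
    athena.start_query_execution(
        QueryString=q,
        QueryExecutionContext={"Database": "default"},  # assumed database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # assumed location
    )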

            

Fig 1: Creating the table columns manually in Athena



    I started looking for an alternative solution and came across AWS GLUE.


GLUE:

      • Serverless, so no need to set up and manage the infrastructure
      • It is a data integration/ETL/metadata management tool, so it can be used in a wide range of use cases
      • Can read data from S3, parse it, and make it available to Athena
      • Supports a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
      • Can schedule a crawler to pick up new data from a pre-defined location, then parse and load it into the proper Hive table or partition (see the sketch after this list)
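
Crawlers can be set up from the console, but the same thing can be scripted. A rough sketch with boto3, where the crawler name, IAM role, database, S3 path, and cron schedule are all placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler over the raw JSON location (names and paths are assumptions)
glue.create_crawler(
    Name="corona-json-crawler",
    Role="GlueServiceRole",            # assumed IAM role with S3 and Glue access
    DatabaseName="corona_db",          # assumed Glue/Athena database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/corona/"}]},
    Schedule="cron(0 2 * * ? *)",      # run daily at 02:00 UTC
)

# Run it once immediately instead of waiting for the schedule
glue.start_crawler(Name="corona-json-crawler")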


So I set up a GLUE crawler to load the JSON data into a Hive table and then queried that table in Athena. However, GLUE did not parse the nested JSON the way I was expecting. I could have written a Hive query to parse the nested data and flatten the structure, but that would have taken more time and would quickly have become unmanageable. Also, querying the table in Athena produced the error below. It looks like the data was not parsed correctly (I know the file is valid, as I had parsed this data before and used it for machine learning).


    

Fig 2: Schema problem with a table created by GLUE



    I had used Databricks before and wanted to try it for this use case. Databricks has a free Community Edition that can be configured to read data from S3. The Community Edition has a data limit of 5 GB, but if we keep our data in S3, this limitation does not apply.


    Once you log into the Community Edition, create a cluster. I chose the configuration below. This step takes 5-10 minutes to finish.

    



Fig 3: Creating Databricks community edition cluster



    For the Databricks cluster to access S3, we need an IAM user/role with read and write access to S3. We need both read and write because we want Databricks to write the output data back to S3.
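
    For reference, a minimal policy giving read and write access to a single bucket looks roughly like the sketch below. The policy name is a placeholder, and the same JSON document can be pasted into the IAM console's policy editor instead of being created through boto3:

import json
import boto3

BUCKET = "sandipan-atena"  # the bucket used in this post

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # list the bucket itself
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {   # read and write objects under the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="databricks-s3-read-write",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)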



Fig 4: IAM role page in AWS


    

    

Fig 5: IAM - policies in AWS




    We can use the access key ID and secret key to read/write data from S3. Here is the code to do this.



import urllib.parse

# unmount if already mounted
# dbutils.fs.unmount("/mnt/corona_s3")

# mount the S3 bucket to a local Databricks dir
access_key = "xxxxxxxxx"
secret_key = "xxxxxxxx"
# URL-encode the secret key so special characters do not break the mount URI
encoded_secret_key = urllib.parse.quote(secret_key).replace("/", "%2F")
aws_bucket_name = "sandipan-atena"
local_mount_name = "corona_s3"
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % local_mount_name)

# display the files
display(dbutils.fs.ls("/mnt/corona_s3/corona/"))

# run a few SQL queries
data_loc = "/mnt/corona_s3/corona/*.json"
data = spark.read.json(data_loc)
data.createOrReplaceTempView("data")
data.printSchema()

# Query to extract the data points we need
extract_data = spark.sql("""select user.screen_name as user_screen_name,
                         user.location as user_location,
                         extended_tweet.full_text as extended_tweet_full_text,
                         retweeted_status.user.screen_name as retweeted_status_user_screen_name,
                         retweeted_status.text as retweeted_status_text,
                         retweeted_status.extended_tweet.full_text as retweeted_status_extended_tweet_full_text,
                         quoted_status.user.screen_name as quoted_status_user_screen_name,
                         quoted_status.text as quoted_status_text,
                         quoted_status.extended_tweet.full_text as quoted_status_extended_tweet_full_text,
                         place.country as place_country,
                         place.country_code as place_country_code,
                         timestamp_ms
              from data""")

# show some sample data
extract_data.show(10, False)

# Write the final data
extract_data.coalesce(2).write.mode("overwrite").parquet("/mnt/corona_s3/corona_extract/")
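
    As a quick, optional sanity check before handing the output to the GLUE crawler, the Parquet files can be read back and counted:

# Optional sanity check: read the Parquet output back and count the rows
print(spark.read.parquet("/mnt/corona_s3/corona_extract/").count())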



    

Fig 6: Running the query




Fig 7: Query running with stages





Fig 8: Writing the output data




Fig 9: Final output





    Once the output is ready, we can use an AWS GLUE crawler to crawl the data and create/update an Athena Hive table that is exposed to business users.


    I have used Databricks's Community Edition, which does not guarantee consistent performance because it is free. However, the paid version can easily replace the free one.


The entire code is available at: https://github.com/ghoshm21/spark_book/tree/main/ch06_Clud_computation



Please contact contact@sandipan.tech for any suggestions or help.







    
