
Cloud Computations - Quick data analysis with AWS Athena, Glue and Databricks spark


Throughout my career, I have repeatedly had to fix failing production jobs. Most of the time, the debugging involved analyzing the input data to find the errors in the raw data.

For the last ten years, I have also been doing data analysis to provide quick business insights. This often involves running a complex query on a large data set.


Most of the time, we do not have access to the production environment to debug a job or install the required packages. It's also advisable not to debug jobs in the production environment as it might have a negative performance impact or completely break the job.


We have been using a few tools to debug, mainly Hive, Presto, Tableau, etc. These tools are not always the best option, as debugging jobs that fail because of data issues often requires custom code, SerDes, or packages. I like to use Spark; however, Spark does not have a pretty GUI, which data analysts love. We can certainly use an IPython (Jupyter) notebook or Zeppelin, but both tools require a considerable amount of time to set up, and they are often not available in production.


So I was looking for an alternative that can be set up quickly, has a nice GUI, and scales easily.

I was looking into the cloud (AWS) and came across Athena and Glue, so I tried them on my use case. For input data, I used Twitter data (5.5 GB, and later 20 GB for scalability testing) that I had collected last year. This data is complex enough to serve our test purpose.


Athena: 

      • Serverless, so no need to set up and manage the infrastructure
      • Supports standard SQL, so no need to learn a new language
      • Stores the data in S3 and uses that same data for analysis in Athena; no need to pre-process the data
      • Hive in the backend, so it supports all Hive properties and a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
      • We can easily use a Hive serializer/deserializer (SerDe) to parse custom data formats
      • As this is a managed service, the performance might not be the same over time
      • External table - by default it creates an external table. You cannot create an internal (managed) table, even if you change the CREATE TABLE statement. This is done to avoid accidentally dropping a table and deleting the underlying data from S3
      • It saves the results of every query we run in Athena in a temp/work dir under the user-specified S3 location. We need some mechanism to delete these dirs/files at regular intervals; otherwise, this will spin into a monster eating the entire S3 budget (see the sketch after this list)
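
    One way to keep that query-result directory under control is an S3 lifecycle rule that expires objects under the results prefix. Below is a minimal sketch using boto3; the bucket name, prefix, and 7-day retention are assumptions for illustration, not values from this post.

import boto3

# Hypothetical bucket/prefix used for Athena query results - replace with your own
BUCKET = "my-athena-results-bucket"
RESULT_PREFIX = "athena-query-results/"

s3 = boto3.client("s3")

# Expire Athena query-result objects after 7 days (assumed retention period)
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-query-results",
                "Filter": {"Prefix": RESULT_PREFIX},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)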

    To test this service, I kept the input data in an S3 bucket and then created an Athena table from it. I chose JSON as the data format, and the console asked me to enter the column names manually. This is a challenging job for my use case, as the JSON has over 300 columns, and entering 300 columns by hand is painful. I had hoped the UI could parse the file (or use the first row as column names for CSV) to infer the schema; that would have been an excellent facility for creating a table with many columns. However, scanning files to infer the schema can take longer and cost more if you have a sizeable amount of data. I've found that writing the DDL manually is a better option than using the GUI.
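
    If a Spark session is available (for example, the Databricks cluster used later in this post), one workaround is to let Spark infer the JSON schema from a small sample and print the column list, which can then be adapted into the Athena DDL. A rough sketch, where the sample path is a placeholder and Spark's type names still need to be reviewed against Athena/Hive types:

# Infer the schema from a small JSON sample and print "name type" pairs
# that can be pasted into an Athena CREATE TABLE statement after review.
sample = spark.read.json("s3a://my-bucket/corona/sample.json")  # hypothetical sample file

for field in sample.schema.fields:
    # simpleString() returns Spark type names (string, bigint, struct<...>, ...),
    # which mostly line up with Hive/Athena types but are not guaranteed to.
    print(f"{field.name} {field.dataType.simpleString()},")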


    I also noticed that the GUI cannot run multiple queries together, like a script. This is ridiculous: if I want to run a script, I must run one query at a time.
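
    If you do need to run several statements back to back, one workaround is to drive Athena through its API instead of the console. A minimal sketch with boto3, where the database, result location, and queries are placeholders (a real script would also poll get_query_execution and wait for each query to finish before starting the next):

import boto3

athena = boto3.client("athena")

# Hypothetical list of statements to run in order
queries = [
    "MSCK REPAIR TABLE corona_tweets",
    "SELECT count(*) FROM corona_tweets",
]

for q in queries:
    athena.start_query_execution(
        QueryString=q,
        QueryExecutionContext={"Database": "default"},  # assumed database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # assumed location
    )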

            

Fig 1: Creating the table columns manually in Athena



    I started looking for an alternative solution and came across AWS GLUE.


GLUE:

      • Serverless, so no need to set up and manage the infrastructure
      • It is a data integration/ETL/metadata management tool, so it can be used in a wide range of use cases
      • Can read data from S3, parse it, and make it available to Athena
      • Supports a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
      • Can schedule a crawler to pick up new data from a pre-defined location, then parse and load it into the proper Hive table or partition (see the sketch after this list)
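
Crawlers can be set up from the console, but the same thing can be scripted. A rough sketch with boto3, where the crawler name, IAM role, database, S3 path, and cron schedule are all placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler over the raw JSON location (names and paths are assumptions)
glue.create_crawler(
    Name="corona-json-crawler",
    Role="GlueServiceRole",            # assumed IAM role with S3 and Glue access
    DatabaseName="corona_db",          # assumed Glue/Athena database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/corona/"}]},
    Schedule="cron(0 2 * * ? *)",      # run daily at 02:00 UTC
)

# Run it once immediately instead of waiting for the schedule
glue.start_crawler(Name="corona-json-crawler")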


So I set up a GLUE crawler to load the JSON data into a Hive table and then queried that table in Athena. However, GLUE did not parse the nested JSON the way I was expecting. I could have written a Hive query to parse the nested data and flatten the structure, but that would have taken more time and would quickly have become unmanageable. Also, querying the table in Athena produced the error below. It looks like the data was not parsed correctly (I know the file is valid, as I had parsed this data before and used it for machine learning).


    

Fig 2: Schema problem with a table created by GLUE



    I had used Databricks before and wanted to try it for this use case. Databricks has a free Community Edition that can be configured to read data from S3. The Community Edition has a data limit of 5 GB, but if we keep our data in S3, this limitation does not apply.


    Once you log into the Community Edition, create a cluster. I chose the configuration below. This step takes 5-10 minutes to finish.

    



Fig 3: Creating Databricks community edition cluster



    For the Databricks cluster to access S3, we need an IAM user/role with read and write access to S3. We need both read and write because we want Databricks to write the output data back to S3.
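
    For reference, a minimal policy giving read and write access to a single bucket looks roughly like the sketch below. The policy name is a placeholder, and the same JSON document can be pasted into the IAM console's policy editor instead of being created through boto3:

import json
import boto3

BUCKET = "sandipan-atena"  # the bucket used in this post

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # list the bucket itself
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {   # read and write objects under the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="databricks-s3-read-write",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)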



Fig 4: IAM role page in AWS


    

    

Fig 5: IAM - policies in AWS




    We can use the access key ID and secret key to read/write data from S3. Here is the code to do this.



import urllib.parse

# unmount if already mounted
# dbutils.fs.unmount("/mnt/corona_s3")

# mount the S3 bucket to a local Databricks dir
access_key = "xxxxxxxxx"
secret_key = "xxxxxxxx"
# URL-encode the secret key so special characters do not break the mount URI
encoded_secret_key = urllib.parse.quote(secret_key).replace("/", "%2F")
aws_bucket_name = "sandipan-atena"
local_mount_name = "corona_s3"
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % local_mount_name)

# display the files
display(dbutils.fs.ls("/mnt/corona_s3/corona/"))

# run a few SQL queries
data_loc = "/mnt/corona_s3/corona/*.json"
data = spark.read.json(data_loc)
data.createOrReplaceTempView("data")
data.printSchema()

# Query to extract the data points we need
extract_data = spark.sql("""select user.screen_name as user_screen_name,
                         user.location as user_location,
                         extended_tweet.full_text as extended_tweet_full_text,
                         retweeted_status.user.screen_name as retweeted_status_user_screen_name,
                         retweeted_status.text as retweeted_status_text,
                         retweeted_status.extended_tweet.full_text as retweeted_status_extended_tweet_full_text,
                         quoted_status.user.screen_name as quoted_status_user_screen_name,
                         quoted_status.text as quoted_status_text,
                         quoted_status.extended_tweet.full_text as quoted_status_extended_tweet_full_text,
                         place.country as place_country,
                         place.country_code as place_country_code,
                         timestamp_ms
              from data""")

# show some sample data
extract_data.show(10, False)

# Write the final data
extract_data.coalesce(2).write.mode("overwrite").parquet("/mnt/corona_s3/corona_extract/")
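
    As a quick, optional sanity check before handing the output to the GLUE crawler, the Parquet files can be read back and counted:

# Optional sanity check: read the Parquet output back and count the rows
print(spark.read.parquet("/mnt/corona_s3/corona_extract/").count())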



    

Fig 6: Running the query




Fig 7: Query running with stages





Fig 8: Writing the output data




Fig 9: Final output





    Once the output is ready, we can use an AWS GLUE crawler to crawl the data and create/update an Athena Hive table that is exposed to business users.


    I have used Databricks's Community Edition, which does not guarantee consistent performance because it is free. However, the paid version can easily replace the free one.


The entire code is available at: https://github.com/ghoshm21/spark_book/tree/main/ch06_Clud_computation



Please contact contact@sandipan.tech for any suggestions or help.







    
