Posts

Cloud Computations - Quick data analysis with AWS Athena, Glue and Databricks spark

Throughout my career, I have repeatedly had to fix failing production jobs. Most of the time, debugging meant analyzing the input data to find the error in the raw data. For the last ten years, I have also been doing data analysis to provide quick business insights, which often involves running a complex query on a large data set. Most of the time, we do not have access to the production environment to debug a job or install the required packages. It's also advisable not to debug jobs in the production environment, as doing so might hurt performance or break the job completely. We have been using a few tools to debug, mainly Hive, Presto, Tableau, etc. These tools are not always the best option, because debugging jobs that fail on data issues often requires custom code, SerDes, or packages. I like to use Spark; however, ...
Recent posts

How to Install Spark 3 on Windows 10

I have been using Spark for a long time. It is an excellent distributed computation framework. I use it regularly at work, and I also have it installed on my local desktop and laptop. This document shows the steps for installing Spark 3+ on Windows 10 in pseudo-distributed mode.
Steps:-
Install WSL2: https://docs.microsoft.com/en-us/windows/wsl/install-win10
Install Ubuntu 20.04 LTS from the Microsoft Store.
Install Windows Terminal from the Microsoft Store. This step is optional; you can use PowerShell or MobaXterm instead.
Fire up Ubuntu from WSL.
Once logged in, go to the home directory: cd ~
For Spark, we need: Python 3, Java, the latest Scala, and Spark with Hadoop (zip file).
Let's download and install all the prerequisites.
Install Python:
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
Install Java (OpenJDK):
sudo apt-get install openjdk-8-jdk
Check the java and javac versions:
java -version
javac -version
Install Scala ...
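Before moving on to the Spark download itself, it can help to confirm that the prerequisites listed above are actually on the PATH. The following is only a minimal sketch and not part of the original post: the helper name check_prerequisites and the exact binary names (python3, java, javac, scala) are assumptions based on the list above.

```python
import shutil

# Binary names assumed from the prerequisite list in the post (Python 3, Java, Scala).
PREREQUISITES = ("python3", "java", "javac", "scala")


def check_prerequisites(commands=PREREQUISITES):
    """Return a mapping of command name -> True if the command is found on PATH."""
    return {name: shutil.which(name) is not None for name in commands}


if __name__ == "__main__":
    # Print one line per prerequisite so missing tools are easy to spot.
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Running the script inside the WSL Ubuntu shell after each install step shows which tools are still missing.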

How to download really big data sets for big data testing

For a long time, I have been working with big data technologies like MapReduce, Spark, and Hive, and very recently I have started working on AI/ML. Testing different big data frameworks and doing text analysis requires me to process large amounts of data. We have a Hadoop cluster where we usually do this. Recently, however, I had to crunch 100 GB of data on my laptop. I couldn't put this data on our cluster, since that would have required a lot of approvals, working with the admins to get space, opening up the firewall, etc. So I took up the challenge of getting it done on my laptop. My system has only 16 GB of RAM and an i5 processor. Another challenge was that I do not have admin access, so I cannot install any required software without approval. Luckily, however, I had Docker installed. To process the data I can use Spark in local mode, as Spark supports parallel processing using CPU cores. As i5 has 4 cores and 4 threads, the sp...
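To make the local-mode idea concrete: Spark's master URL accepts local[N] to run with N worker threads, or local[*] to use every logical core. The sketch below is an illustration under assumptions, not the author's setup: the helper names local_master and build_session and the app name are hypothetical, and build_session only works where pyspark is installed.

```python
import os


def local_master(cores=None):
    """Build a Spark master URL for local mode.

    cores=None asks Spark to use every logical core ("local[*]").
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


def build_session(app_name="laptop-crunch", cores=None):
    """Create a local SparkSession; requires pyspark to be installed."""
    # Imported lazily so local_master() stays usable without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(cores))
        .appName(app_name)
        .getOrCreate()
    )


if __name__ == "__main__":
    print("logical cores reported by the OS:", os.cpu_count())
    print("master URL for 4 cores:", local_master(4))
```

On a 4-core machine, local[4] and local[*] should behave the same; a smaller number leaves cores free for other work on the laptop.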

HOW TO PARSE XML DATA TO A SPARK DATAFRAME

Purpose:- In one of my projects, I had a ton of XML data to parse and process. XML is an excellent format built on tags, somewhat like key-value pairs. JSON is almost the same, but more like a stripped-down version of XML, so JSON is very lightweight while XML is heavy. Initially, we thought of using Python to parse the data and convert it to JSON for Spark to process. However, the challenge is the size of the data: parsing the entire 566 GB with Python alone would take a long time. So the obvious choice was PySpark. We want to parse the data with a schema into a data frame for post-processing. However, as far as I know, PySpark does not support the XML format out of the box. This document will demonstrate how to work with XML in PySpark. The same method should work in Spark with Scala without significant changes. Option 1:- Use the spark-xml parser from Databricks. Databricks has two XML parsers: one compiled with Scala 2.11 and another with Scala 2.12. Please make sure yo...
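For context, the plain-Python approach mentioned above (parse the XML, then emit JSON for Spark) can be sketched with the standard library alone. This is a minimal illustration, not the author's code: the sample document, the book tag, and the helper names are hypothetical, and it runs on a single machine, which is exactly why it would be too slow for hundreds of gigabytes.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample; the real data's tags are not shown in the post.
SAMPLE_XML = """
<books>
  <book id="1"><title>Spark Basics</title><price>10.5</price></book>
  <book id="2"><title>XML at Scale</title><price>20</price></book>
</books>
""".strip()


def xml_to_records(xml_text, row_tag):
    """Turn each <row_tag> element into a flat dict of its attributes and child text."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter(row_tag):
        record = dict(element.attrib)
        for child in element:
            record[child.tag] = child.text
        records.append(record)
    return records


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    print(to_json_lines(xml_to_records(SAMPLE_XML, "book")))
```

Newline-delimited JSON is the layout Spark's JSON reader expects by default, so the output of to_json_lines could be written to a file and loaded into a data frame. All values come out as strings here; applying a schema would be a separate step.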

How to Install Docker in Windows 10 Pro

This document will show how to install Docker in Windows 10 Pro. Why the Pro version? Because Windows Pro supports Hyper-V. Hyper-V is the new virtualization technology from Microsoft. It's actually not that simple. Hyper-V is not just another VM platform; it's a type-1 hypervisor that starts before the host OS, so the host OS itself runs as a VM. After installing Hyper-V, you will not be able to use any other VM technologies, like VirtualBox or VMware. So keep in mind: with Hyper-V, please do not install VirtualBox or VMware. Docker for Windows uses Hyper-V to spin up Docker containers. Before installing Docker for Windows, please make sure you have Windows 10 Pro (updated) and that Hyper-V is enabled. Sometimes, you might need to enable virtualization in the BIOS as well. 1. We will download Docker Community Edition for Windows from "store.docker.com"; the link I am using is " https://store.docker.com/editions/community/docker-ce-desktop-wi...

Install and Configure MySQL on Ubuntu 16.04

I have been working on a project for some time. This tutorial is part of that project. I will talk about the project in another post. Let's get back to the topic. MySQL is a free, general-purpose RDBMS. It is a very popular database in the open-source world. I am using Ubuntu 16.04 for this tutorial. You can check your OS by using the command below.
lsb_release -a
The stable MySQL package is available in the Ubuntu repository. Let's update the OS before we install the MySQL package.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
The commands above will update the OS. Once that is done, you might need to restart the system, depending on which packages were updated. If you are prompted to restart, then reboot your system. In the next step, we will install the MySQL package using the command below.
sudo apt-get install mysql-server
During the installation, you will be prompted to set the root password. Please remember the password. You wi...