
How to download really big data sets for big data testing


For a long time I have been working with big data technologies like MapReduce, Spark, and Hive, and very recently I have started working on AI/ML. For different kinds of big data framework testing and text analysis, I have to do a large amount of data processing. We have a Hadoop cluster where we usually do this.

Recently, however, I had a situation where I had to crunch about 100 GB of data on my laptop. I couldn't put this data on our cluster, since that would require a lot of approvals, working with the admins to get space, opening up the firewall, and so on.

So I took up the challenge of getting it done on my laptop. My system only has 16 GB of RAM and an i5 processor. Another challenge was that I don't have admin access, so I cannot install any software without approval. Luckily, though, I had Docker installed.
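
Since I couldn't install anything directly, running Spark from a Docker image was the way around it. As a sketch (assuming the public jupyter/pyspark-notebook image, which bundles Spark and Python; substitute whatever image is approved in your environment):

docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook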

For processing the data, I can use Spark in local mode, since Spark supports parallel processing across CPU cores. My i5 has 4 cores, so Spark can spread the work across parallel tasks, one per available hardware thread.
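
Here is a minimal sketch of that setup (assuming pyspark is available in the environment or Docker image; the app name and the 8 GB driver-memory setting are just illustrative):

from pyspark.sql import SparkSession

# local[*] gives Spark one task slot per available CPU thread.
spark = (SparkSession.builder
         .appName("laptop-big-data-testing")
         .master("local[*]")
         .config("spark.driver.memory", "8g")  # leave headroom out of 16 GB RAM
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)  # number of parallel task slots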

How to get the data: Yellow cab data


Now to the real topic: where do we get really big open-source data, around 100 GB in size? We need both structured (CSV) and semi-structured (JSON) data.

Source 1: After a little research, I found that we can download the entire yellow cab trip data from the NYC government open data site. Here is the link.

This takes a little effort, as the data is split into monthly CSV files, each around 2 GB in size. So I wrote a Python program (shown below) that downloads each month's CSV from the website into a local directory and shows a small progress bar on the screen.


JSON data

How about the semi-structured data? Well, we can use the 'Open Library' data.
The Open Library is an initiative intended to create “one web page for every book ever published.” You can download their dataset, which is about 20 GB of compressed data.

We can download the data easily using wget, and since the file is gzip-compressed, decompress it with gunzip (not unzip):

wget --continue http://openlibrary.org/data/ol_cdump_latest.txt.gz
gunzip ol_cdump_latest.txt.gz

Code to download yellow cab data

import os
import urllib.request
from tqdm import tqdm

# Progress bar that plugs into urlretrieve's reporthook callback.
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize           # total file size, once the server reports it
        self.update(b * bsize - self.n)  # advance to bytes downloaded so far

def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

# One CSV per month, 2009 through 2019.
years = list(range(2009, 2020))
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
out_dir = '/home/sandipan/Documents/yellow_taxi'
os.makedirs(out_dir, exist_ok=True)  # create the target directory if missing
for x, y in [(x, y) for x in years for y in months]:
    print("fetching data for %s, %s" % (x, y))
    link = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_%s-%s.csv" % (x, y)
    file_name = '%s/yellow_taxi_data_%s-%s.csv' % (out_dir, x, y)
    print(link, file_name)
    download_url(link, file_name)
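
Once everything is downloaded, a quick smoke test is to point Spark at the whole directory and count rows (a sketch; the yellow cab schema changed a few times over the years, so selecting specific columns may need per-year handling):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Read all monthly CSVs into one DataFrame; header=true uses the first row
# of each file as the column names.
df = (spark.read
      .option("header", "true")
      .csv("/home/sandipan/Documents/yellow_taxi/yellow_taxi_data_*.csv"))
print(df.count())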

