
How to Install Spark 3 on Windows 10

I have been using Spark for a long time. It is an excellent distributed computation framework. I use it regularly at work, and I also have it installed on my local desktop and laptop.

This document shows the steps for installing Spark 3+ on Windows 10 in pseudo-distributed mode.


Steps:-

  1. Install WSL2
    1. https://docs.microsoft.com/en-us/windows/wsl/install-win10
  2. Install Ubuntu 20.04 LTS from the Microsoft Store.
  3. Install Windows Terminal from the Microsoft Store. This step is optional; you can use PowerShell or MobaXterm instead.
  4. Fire up Ubuntu from WSL



  5. Once logged in, go to the home directory
    1. cd ~
  6. For Spark, we need
    1. Python3
    2. Java
    3. Latest Scala
    4. Spark pre-built with Hadoop (tgz file)
  7. Let's download and install all the prerequisites
    1. Install Python (Ubuntu 20.04 already ships with Python 3)
    2. sudo apt-get install software-properties-common
    3. sudo apt-get install python-software-properties
  8. Install Java (OpenJDK)
    1. sudo apt-get install openjdk-8-jdk
  9. Check the java and javac versions
    1. java -version
    2. javac -version

  10. Install Scala
  11. Get the Scala binary for Unix
    1. wget https://downloads.lightbend.com/scala/2.13.3/scala-2.13.3.tgz
    2. tar xvf scala-2.13.3.tgz 
  12. Edit the .bashrc file to add Scala
    1. vi ~/.bashrc
  13. Add these lines at the end
    1. export SCALA_HOME=/path/to/scala-2.13.3    # e.g. /root/scala-2.13.3
    2. export PATH=$PATH:$SCALA_HOME/bin
    3. source ~/.bashrc
    4. scala -version
  14. Get Spark
    1. I downloaded the pre-built Spark binary from the Apache archive
    2. wget "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz"
    3. tar xvf spark-3.1.1-bin-hadoop3.2.tgz
    4. vi ~/.bashrc
    5. export SPARK_HOME="/home/sandipan/spark-3.1.1-bin-hadoop3.2"
    6. export PATH=$PATH:$SPARK_HOME/bin
    7. source ~/.bashrc
  15. Start Spark Services
    1. cd $SPARK_HOME
    2. Start the master server
      1. ./sbin/start-master.sh
    3. Once you start the master server, you will get a message saying it has started.
      1. You can see the Spark status in the master's web console at http://localhost:8080
    4. There you will see the master URL
      1. Mine looks like:- "spark://LAPTOP-7DUT93OF.localdomain:7077"
    5. We can start workers using the command below
      1. SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=7G ./sbin/start-worker.sh spark://LAPTOP-7DUT93OF.localdomain:7077
    6. SPARK_WORKER_INSTANCES = how many worker instances you want to start
    7. SPARK_WORKER_CORES = how many cores per instance you want to give. Generally, I give 1 core.
    8. SPARK_WORKER_MEMORY = memory per worker. Be very careful with this parameter. My laptop has 32 GB of memory, so I keep 3 GB to 4 GB for Windows, 2 GB for the driver program, and the rest for the worker nodes (see the memory breakdown after these steps).
    9. Open the pyspark shell (or the Scala spark-shell); a short verification snippet follows these steps
      1. $SPARK_HOME/bin/pyspark --master spark://LAPTOP-7DUT93OF.localdomain:7077 --executor-memory 6500mb
      2. $SPARK_HOME/bin/spark-shell --master spark://LAPTOP-7DUT93OF.localdomain:7077 --executor-memory 6500mb
    10. To stop all the workers
      1. SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./sbin/stop-worker.sh spark://LAPTOP-7DUT93OF.localdomain:7077
      2. Or try
        1. kill -9 $(jps -l | grep spark | awk -F ' ' '{print $1}')
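
A quick check on the memory math above, using my numbers (yours will differ): out of 32 GB, keeping about 4 GB for Windows and 2 GB for the driver leaves roughly 26 GB, and three workers at 7 GB each take 21 GB of that. The 6500 MB of executor memory requested when opening the shells also stays just under the 7 GB each worker offers.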
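
To confirm the workers are actually doing the computation, a tiny job is enough. The snippet below is a minimal sketch: the master URL is the one from my laptop, so replace it with yours, and the file name verify_cluster.py is just an example name, not something the Spark download provides. Save it and run $SPARK_HOME/bin/spark-submit verify_cluster.py, or just run the two RDD lines inside the pyspark shell, where the spark session already exists.

# verify_cluster.py -- minimal sanity check for the standalone cluster
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("install-check")
         .master("spark://LAPTOP-7DUT93OF.localdomain:7077")   # replace with your master URL
         .getOrCreate())

# Spread a small range across the workers and sum it.
rdd = spark.sparkContext.parallelize(range(1000000), 6)
print("sum =", rdd.sum())            # expected: 499999500000
print("master =", spark.sparkContext.master)

spark.stop()

If this prints the expected sum and the application shows up in the master's web console at http://localhost:8080, the cluster is working.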
    
