Apache Spark is a general-purpose big data processing engine. It is a powerful cluster computing framework that can scale from a single machine to thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or by Spark’s standalone cluster manager. To read more about the Spark big data processing framework, visit the post “Big Data processing using Apache Spark – Introduction“. In this post, we will learn how to install Apache Spark on a local Windows machine in pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).
Install Spark on Local Windows Machine
To install Apache Spark on a local Windows machine, we need to follow the steps below:
Step 1 – Download and install Java JDK 8
Java JDK 8 is required as a prerequisite for the Apache Spark installation. We can download JDK 8 from the official Oracle website.
We need to download the 32-bit or 64-bit JDK 8 installer, whichever matches our system. Click the download link, and once the file has downloaded, double-click the executable binary to start the installation process and follow the on-screen instructions.
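Once the JDK is installed, we can optionally confirm from Python that Java is reachable before moving on. Below is a minimal sketch, assuming java is already available on the PATH; it simply shells out to java -version and prints whatever version string is reported:

import subprocess

# "java -version" writes its output to stderr rather than stdout,
# so capture both streams and print whichever one has the text.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip() or result.stdout.strip())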
Step 2 – Download and install the latest version of Apache Spark
Now we need to download the latest Spark build from the Apache Spark home page. The latest available version (at the time of writing) is Spark 2.4.3. The default package type is pre-built for Apache Hadoop 2.7 and later, which works fine. Next, click the download link “spark-2.4.3-bin-hadoop2.7.tgz” to get the .tgz file.
After downloading the Spark build, we need to extract the archive and copy the “spark-2.4.3-bin-hadoop2.7” folder to the Spark installation folder, for example C:\Spark\. Note that the .tgz file contains a .tar archive inside, so with most Windows archive tools it has to be extracted twice to reach the innermost “spark-2.4.3-bin-hadoop2.7” folder.
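If you prefer not to rely on a separate archiver, the extraction can also be scripted in Python. Below is a minimal sketch, assuming the archive was saved to C:\Downloads (a hypothetical location; adjust both paths to your machine). Python's tarfile module reads the gzip-compressed archive directly, so a single extractall() produces the Spark folder:

import tarfile

# Hypothetical paths; change them to where the archive was downloaded
# and where Spark should be installed.
archive_path = r"C:\Downloads\spark-2.4.3-bin-hadoop2.7.tgz"
install_dir = r"C:\Spark"

# tarfile handles .tgz (gzip-compressed tar) in one pass, producing
# C:\Spark\spark-2.4.3-bin-hadoop2.7.
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall(install_dir)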
Step 3 – Set the environment variables
Now, we need to set a few environment variables that are required to set up Spark on a Windows machine. Also, note that in these paths we need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2“.
- Set SPARK_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7“
- Set HADOOP_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7“
- Set JAVA_HOME = “C:\Progra~1\Java\jdk1.8.0_212“
Step 4 – Update existing PATH variable
- Modify the PATH variable to add:
- C:\Progra~1\Java\jdk1.8.0_212\bin
- C:\Spark\spark-2.4.3-bin-hadoop2.7\bin
Note: We need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2“.
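These values are normally set as Windows system environment variables; changes to them typically take effect only in newly opened command prompts (or after a reboot). For quick experimentation, they can also be set for the current Python process only, before Spark is started. Below is a minimal sketch using the installation paths assumed in the steps above:

import os

# Paths assumed in this post; adjust them to match your installation.
os.environ["SPARK_HOME"] = r"C:\Spark\spark-2.4.3-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\Spark\spark-2.4.3-bin-hadoop2.7"
os.environ["JAVA_HOME"] = r"C:\Progra~1\Java\jdk1.8.0_212"

# Prepend the Java and Spark bin folders to PATH for this process only.
os.environ["PATH"] = (
    os.environ["JAVA_HOME"] + r"\bin;"
    + os.environ["SPARK_HOME"] + r"\bin;"
    + os.environ["PATH"]
)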
Step 5 – Download and copy winutils.exe
Next, we need to download the winutils.exe binary from the Git repository “https://github.com/steveloughran/winutils“. To download it:
- Open the Git link given above.
- Navigate to the hadoop-2.7.1 folder (we need to use the Hadoop version folder that matches the package type selected while downloading the Spark build).
- Go to the bin folder and download the winutils.exe binary file (a small Python sketch for automating this download is shown after this list). The direct link for the Hadoop 2.7 and later Spark build is “https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe”.
- Copy this file into the bin folder of the Spark installation folder, which is “C:\Spark\spark-2.4.3-bin-hadoop2.7\bin” in our case.
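As mentioned above, this download step can also be scripted. Below is a minimal sketch, assuming the Hadoop 2.7.1 build of winutils.exe and the installation path used above; it uses the raw.githubusercontent.com form of the URL so that the binary itself is downloaded rather than the GitHub HTML page:

import urllib.request

# Raw-content URL for the Hadoop 2.7.1 winutils.exe (assumed here);
# the repository "blob" URL would return an HTML page instead.
url = ("https://raw.githubusercontent.com/steveloughran/winutils/"
       "master/hadoop-2.7.1/bin/winutils.exe")
destination = r"C:\Spark\spark-2.4.3-bin-hadoop2.7\bin\winutils.exe"

urllib.request.urlretrieve(url, destination)
print("winutils.exe saved to", destination)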
Step 6 – Create hive temp folder
In order to avoid Hive-related errors, we need to create an empty directory at “C:\tmp\hive“.
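This folder can be created in File Explorer or from the command prompt; the equivalent in Python is a one-liner. A minimal sketch, using the C:\tmp\hive path from this step:

import os

# Create C:\tmp\hive; exist_ok avoids an error if it is already there.
os.makedirs(r"C:\tmp\hive", exist_ok=True)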
Step 7 – Change winutils permission
Once we have downloaded and copied winutils.exe to the desired path and created the required Hive folder, we need to give winutils the appropriate permissions. To do so, open the command prompt as an administrator and execute the commands below:
winutils.exe chmod -R 777 C:\tmp\hive
winutils.exe ls -F C:\tmp\hive
Step 8 – Download and install the latest version of Python
Now, we are good to download and install the latest version of Python. It can be downloaded from the official Python website: https://www.python.org/downloads/.
Step 9 – pip install pyspark
Next, we need to install the pyspark package to start Spark programming using Python. To do so, open a command prompt window and execute the command below:
pip install pyspark
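Once the installation completes, a quick sanity check confirms that the package can be imported. A minimal sketch:

import pyspark

# If the install succeeded, this prints the installed PySpark version,
# e.g. 2.4.3 for the build used in this post.
print(pyspark.__version__)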
Step 10 – Run Spark code
Now, we can use any code editor or IDE, or Python’s built-in editor (IDLE), to write and execute Spark code. Below is a sample Spark program written in a Jupyter notebook:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Configure Spark to run locally under the standalone cluster manager
conf = SparkConf()
conf.setMaster("local").setAppName("My app")

# Create (or reuse) a SparkContext and wrap it in a SparkSession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

print("Current Spark version is : {0}".format(spark.version))
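If the version prints without an error, Spark is up and running. As a further smoke test, the short sketch below reuses the spark session created above to build a tiny DataFrame and display it; the sample rows and column names are purely illustrative:

# Build a small in-memory DataFrame and display it.
data = [("Alice", 34), ("Bob", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()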
Thanks for reading. Please share your input in the comments section.
Man, am I lucky or what to run into this post? You’re my savior of the day! Your instructions are precise and exactly what I needed to solve my problem running PySpark locally.
Just one minor thing: I needed to reboot my machine in order for those Env variables to take effect.
After I followed the steps you outlined here, step by step, I was able to run PySpark code in my VS Code and it returned results without errors.
Thanks a ton!
Just a heads up, the line with “sc = SparkContext.getOrCreate(conf=conf)” was giving me errors. It worked when I changed it to “sc = SparkContext.getOrCreate(conf)”. In case any of your readers are scratching their heads as I was.