Apache Spark is a general-purpose big data processing engine. It is a powerful cluster computing framework that can scale from a single machine to thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or by Spark’s standalone cluster manager. To read more about the Spark big data processing framework, visit the post “Big Data processing using Apache Spark – Introduction“. In this post, we will learn how to install Apache Spark on a local Windows machine in pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).
Install Spark on Local Windows Machine
To install Apache Spark on a local Windows machine, we need to follow the steps below:
Step 1 – Download and install Java JDK 8
Java JDK 8 is required as a prerequisite for the Apache Spark installation. We can download JDK 8 from the official Oracle website.
We need to download the 32-bit or 64-bit JDK 8 installer, whichever matches our system. Click the download link, and once the file has downloaded, double-click the executable binary to start the installation process and follow the on-screen instructions.
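Once the JDK is installed, we can optionally confirm from Python that Java is reachable before moving on. Below is a minimal sketch, assuming java is already available on the PATH; it simply shells out to java -version and prints whatever version string is reported:

import subprocess

# "java -version" writes its output to stderr rather than stdout,
# so capture both streams and print whichever one has the text.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip() or result.stdout.strip())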
Step 2 – Download and install the latest version of Apache Spark
Now we need to download the latest Spark build from the Apache Spark home page. The latest available version (at the time of writing) is Spark 2.4.3. The default package type is pre-built for Apache Hadoop 2.7 and later, which works fine. Next, click the download link “spark-2.4.3-bin-hadoop2.7.tgz” to get the .tgz file.
After downloading the Spark build, we need to extract the archive and copy the “spark-2.4.3-bin-hadoop2.7” folder to the Spark installation folder, for example C:\Spark\. Note that the .tgz file contains a .tar archive inside, so with most Windows archive tools it has to be extracted twice to reach the innermost “spark-2.4.3-bin-hadoop2.7” folder.
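If you prefer not to rely on a separate archiver, the extraction can also be scripted in Python. Below is a minimal sketch, assuming the archive was saved to C:\Downloads (a hypothetical location; adjust both paths to your machine). Python's tarfile module reads the gzip-compressed archive directly, so a single extractall() produces the Spark folder:

import tarfile

# Hypothetical paths; change them to where the archive was downloaded
# and where Spark should be installed.
archive_path = r"C:\Downloads\spark-2.4.3-bin-hadoop2.7.tgz"
install_dir = r"C:\Spark"

# tarfile handles .tgz (gzip-compressed tar) in one pass, producing
# C:\Spark\spark-2.4.3-bin-hadoop2.7.
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall(install_dir)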
Step 3 – Set the environment variables
Now, we need to set a few environment variables that are required to set up Spark on a Windows machine. Also, note that in these paths we need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2“.
- Set SPARK_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7“
- Set HADOOP_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7“
- Set JAVA_HOME = “C:\Progra~1\Java\jdk1.8.0_212“
Step 4 – Update existing PATH variable
- Modify the PATH variable to add:
- C:\Progra~1\Java\jdk1.8.0_212\bin
- C:\Spark\spark-2.4.3-bin-hadoop2.7\bin
Note: We need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2“.
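These values are normally set as Windows system environment variables; changes to them typically take effect only in newly opened command prompts (or after a reboot). For quick experimentation, they can also be set for the current Python process only, before Spark is started. Below is a minimal sketch using the installation paths assumed in the steps above:

import os

# Paths assumed in this post; adjust them to match your installation.
os.environ["SPARK_HOME"] = r"C:\Spark\spark-2.4.3-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\Spark\spark-2.4.3-bin-hadoop2.7"
os.environ["JAVA_HOME"] = r"C:\Progra~1\Java\jdk1.8.0_212"

# Prepend the Java and Spark bin folders to PATH for this process only.
os.environ["PATH"] = (
    os.environ["JAVA_HOME"] + r"\bin;"
    + os.environ["SPARK_HOME"] + r"\bin;"
    + os.environ["PATH"]
)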
Step 5 – Download and copy winutils.exe
Next, we need to download the winutils.exe binary from the Git repository “https://github.com/steveloughran/winutils“. To download it:
- Open the Git link given above.
- Navigate to the hadoop-2.7.1 folder (we need to use the Hadoop version folder that matches the package type selected while downloading the Spark build).
- Go to the bin folder and download the winutils.exe binary file (a small Python sketch for automating this download is shown after this list). The direct link for the Hadoop 2.7 and later Spark build is “https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe”.
- Copy this file into the bin folder of the Spark installation folder, which is “C:\Spark\spark-2.4.3-bin-hadoop2.7\bin” in our case.
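As mentioned above, this download step can also be scripted. Below is a minimal sketch, assuming the Hadoop 2.7.1 build of winutils.exe and the installation path used above; it uses the raw.githubusercontent.com form of the URL so that the binary itself is downloaded rather than the GitHub HTML page:

import urllib.request

# Raw-content URL for the Hadoop 2.7.1 winutils.exe (assumed here);
# the repository "blob" URL would return an HTML page instead.
url = ("https://raw.githubusercontent.com/steveloughran/winutils/"
       "master/hadoop-2.7.1/bin/winutils.exe")
destination = r"C:\Spark\spark-2.4.3-bin-hadoop2.7\bin\winutils.exe"

urllib.request.urlretrieve(url, destination)
print("winutils.exe saved to", destination)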
Step 6 – Create hive temp folder
In order to avoid Hive-related errors, we need to create an empty directory at “C:\tmp\hive“.
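This folder can be created in File Explorer or from the command prompt; the equivalent in Python is a one-liner. A minimal sketch, using the C:\tmp\hive path from this step:

import os

# Create C:\tmp\hive; exist_ok avoids an error if it is already there.
os.makedirs(r"C:\tmp\hive", exist_ok=True)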
Step 7 – Change winutils permission
Once we have downloaded and copied winutils.exe to the desired path and created the required Hive folder, we need to give winutils the appropriate permissions. To do so, open the command prompt as an administrator and execute the commands below:
winutils.exe chmod -R 777 C:\tmp\hive
winutils.exe ls -F C:\tmp\hive
Step 8 – Download and install the latest version of Python
Now, we are good to download and install the latest version of Python. It can be downloaded from the official Python website: https://www.python.org/downloads/.
Step 9 – pip install pyspark
Next, we need to install the pyspark package to start Spark programming using Python. To do so, open a command prompt window and execute the command below:
pip install pyspark
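Once the installation completes, a quick sanity check confirms that the package can be imported. A minimal sketch:

import pyspark

# If the install succeeded, this prints the installed PySpark version,
# e.g. 2.4.3 for the build used in this post.
print(pyspark.__version__)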
Step 10 – Run Spark code
Now, we can use any code editor or IDE, or Python’s built-in editor (IDLE), to write and execute Spark code. Below is a sample Spark program written in a Jupyter notebook:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Configure Spark to run locally under the standalone cluster manager
conf = SparkConf()
conf.setMaster("local").setAppName("My app")

# Create (or reuse) a SparkContext and wrap it in a SparkSession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

print("Current Spark version is : {0}".format(spark.version))
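If the version prints without an error, Spark is up and running. As a further smoke test, the short sketch below reuses the spark session created above to build a tiny DataFrame and display it; the sample rows and column names are purely illustrative:

# Build a small in-memory DataFrame and display it.
data = [("Alice", 34), ("Bob", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()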
Thanks for reading. Please share your input in the comments section.
Man, am I lucky or what to run into this post? You’re my savior of the day! Your instructions are precise and exactly what I needed to solve my problem running PySpark locally.
Just one minor thing: I needed to reboot my machine in order for those Env variables to take effect.
After I followed the steps you outlined here, step by step, I was able to run PySpark code in my VS Code and it returned results without errors.
Thanks a ton!
Just a heads up, the line with “sc = SparkContext.getOrCreate(conf=conf)” was giving me errors. It worked when I changed it to “sc = SparkContext.getOrCreate(conf)”. In case any of your readers are scratching their heads as I was.