Getting started with PySpark on Windows

Paritosh Louhan
Last updated on November 12, 2022

    In recent years, as the amount of data being collected has grown, so has the demand for processing very large datasets. To address this, a framework was designed to help you process large amounts of data in a distributed manner. This framework is known as Apache Spark.

    Apache Spark is an open-source framework that supports several languages, including Java, Scala, Python, and R. PySpark is the Python API for Spark: it lets you write Spark applications using Python and run Python applications that leverage Spark's capabilities. In addition, PySpark provides users with a PySpark shell to analyze data in a distributed environment. It supports the core features of Spark, namely Spark SQL, DataFrames, Streaming, MLlib, and Spark Core.

    In this blog post, we shall guide you on getting started with PySpark on Windows.

    Prerequisites of PySpark

    To install PySpark on your Windows system, make sure you have the following installed first:

    • Java version 8 or later
    • Python version 3.7 or later

    (These are the minimum versions required by Spark 3.x releases, such as the Spark 3.3.0 build used later in this guide.)

    Java Setup

    You can download the JDK from Oracle's official website, choosing the build that matches your system. Then run the installer to set up the Java environment on your system.

    Java SE Development Kit

    After the Java installation is complete, check that the JAVA_HOME environment variable is defined, pointing to the Java installation location, and that it has been added to the PATH variable.

    We can achieve that with the below steps:

    1) Define JAVA_HOME

    New User Variable

    2) Add JAVA_HOME to the PATH variable

    Edit Environment Variable

    3) After the above step, restart your system, or at least restart all open terminals, command prompts, and IDEs so that they pick up the new environment settings and can locate the Java installation. To validate, open the command prompt and check the Java version.

    Check Java Version
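    If you prefer to check from Python rather than reading the command prompt output, the short sketch below prints the JAVA_HOME value, looks up the java executable on the PATH, and runs java -version. It only uses the standard library and assumes Python is already installed and on the PATH.

    import os
    import shutil
    import subprocess

    # Show the JAVA_HOME value picked up from the environment (None means it is not set).
    print("JAVA_HOME =", os.environ.get("JAVA_HOME"))

    # shutil.which returns the full path of the java executable if it is on the PATH.
    print("java on PATH:", shutil.which("java"))

    # Run "java -version"; the JDK prints its version information (to stderr).
    subprocess.run(["java", "-version"], check=True)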

    Python Setup

    To get started with PySpark, you first need to set up Python on your system. If you don't have Python installed already, you can follow this guide: install-python-on-windows.

    You can simply type the “python” command in the Command Prompt to check whether it is already installed.

    Check Python
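    Once you are inside the Python shell, you can also confirm programmatically which interpreter you are running; this is a minimal sketch using only the standard library.

    import sys

    # Print the interpreter version and its location on disk, so you know exactly
    # which Python installation PySpark will be installed into.
    print(sys.version)
    print(sys.executable)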

    Installing Python does not give us Spark; we need additional steps to get it installed on our local system. To confirm this, we can try importing PySpark in the Python shell and see what happens:

    Check PySpark

    The above screenshot shows that PySpark is not available, and we need to install it.
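    For reference, here is a small sketch of the same check in code form; before installation, the import fails with a ModuleNotFoundError.

    try:
        import pyspark  # before installation this raises ModuleNotFoundError
        print("PySpark is available, version", pyspark.__version__)
    except ModuleNotFoundError:
        print("PySpark is not installed yet")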

    PySpark Installation

    To install PySpark, we need to exit the Python shell by typing exit() and then use the pip command to install it. Before doing that, let us look at our Python packages:

    This is how the site-packages folder looks before the PySpark installation:
    Python Packages

    To install PySpark, we can use the following command:

    pip install pyspark

    Install PySpark

    To validate the installation, redo the above steps: open the Python shell and try to import PySpark again; this time it should not result in an error.

    Import PySpark

    The above screenshot confirms that the PySpark installation has succeeded.

    Python site-packages before installation:

    Python site-packages

    Python site-packages after installation:

    Python site-packages after installation

    Here, we can see the new folders that support PySpark.
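    You can also locate the installation from within Python itself; the sketch below prints the installed version and the path of the package, which should point inside site-packages.

    import pyspark

    # The version string of the installed PySpark package.
    print(pyspark.__version__)

    # The location of the package on disk, inside site-packages.
    print(pyspark.__file__)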

    Let us see if we are able to run PySpark programs using the current setup.

    Note that for demonstration purposes, we are going to use the same Spark examples that come with the setup.

    So, let’s now try running one of the programs:

    Windows PowerShell

    The above run has failed for two reasons:

    1. It still needs a full Spark and Hadoop setup on the local system to cover all use cases.
    2. PySpark needs to know which Python interpreter to use, which can be established by setting the environment variable below (see the sketch after this list):
    PYSPARK_PYTHON=python
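    As an alternative to setting PYSPARK_PYTHON system-wide, a script can set it for itself before the SparkSession is created. This is a minimal sketch; the app name "EnvCheck" is just an illustrative label, and "python" assumes the interpreter you want is on the PATH.

    import os
    from pyspark.sql import SparkSession

    # Tell PySpark which Python interpreter the worker processes should use.
    # Use a full path instead of "python" if the interpreter is not on the PATH.
    os.environ["PYSPARK_PYTHON"] = "python"

    spark = SparkSession.builder.appName("EnvCheck").getOrCreate()
    print(spark.version)
    spark.stop()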

    Spark and Hadoop Setup

    To install Spark, please refer to the official Spark download page.

    Apache Spark

    1. You can select the distribution that you want to download. I am choosing the latest version.
    2. After you make a choice in steps 1 and 2, step 3 gives you a link to download the distribution.
    3. You can extract the package either by using WinRAR or the command shown below:

      Extract Package
    4. This will extract the package into a folder with the same name. You can move this folder under the C:\ directory.
    5. For Spark on Windows, we also need winutils. winutils provides Windows binaries for the various Hadoop versions, and you can download them from the winutils page. Choose the version that matches the Hadoop version bundled with your Spark distribution. So, I will download winutils.exe for Hadoop 3.
    6. Now, we want to set up the environment variables that support Hadoop and Spark. Please refer to the HADOOP_HOME and SPARK_HOME variables in the below screenshot, and update them to match the locations of your Spark and winutils setup.

    As per my setup, I have placed winutils.exe inside “C:\spark-3.3.0-bin-hadoop3\bin\hadoop\bin”. (Note: you need to create the hadoop\bin folders inside spark\bin yourself.)

    Environment Variables

    7. Once you have created the above variables, add hadoop\bin and spark\bin to the PATH environment variable, as shown in the screenshot below:

    Environment Variables Path

    After we have made the environment variable settings, we need to restart our system.
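    After the restart, a quick way to double-check the settings from Python is the sketch below; it assumes the folder layout described above, where winutils.exe lives under %HADOOP_HOME%\bin.

    import os

    # Print the Spark- and Hadoop-related variables picked up from the environment.
    for var in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON"):
        print(var, "=", os.environ.get(var))

    # Confirm that winutils.exe is where Hadoop expects it: %HADOOP_HOME%\bin\winutils.exe
    hadoop_home = os.environ.get("HADOOP_HOME", "")
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe found:", os.path.exists(winutils))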

    Now, we can again try and execute the same program wordcount.py to see if the setup is ready.

    Before doing that, let us take a look at the two files which we are using for this demo:

    1) wordcount.py

    import sys
    from operator import add
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: wordcount <file>", file=sys.stderr)
            sys.exit(-1)

        # Create (or reuse) a SparkSession, the entry point of a PySpark application.
        spark = SparkSession\
            .builder\
            .appName("PythonWordCount")\
            .getOrCreate()

        # Read the input file as a DataFrame of lines and drop down to an RDD of strings.
        lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

        # Split each line into words, pair each word with 1, and sum the counts per word.
        counts = lines.flatMap(lambda x: x.split(' ')) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(add)

        # collect() is an action: it triggers execution and returns the results to the driver.
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))

        spark.stop()

    2) sampleFile.txt

    sampleFile

    We will try to run the above program as shown below, and this time it worked for me.

    Program Output

    You can now also open the PySpark shell by running the pyspark command in your terminal, as shown below:

    pyspark command
    As per the logs above, the Spark UI is now available at the link printed in the output and will display the state of running jobs and tasks, as shown below:

    PySparkShell Spark Jobs

    Now, you can play around by writing simple PySpark programs and see how they behave on the Spark UI. Please see one such example:

    In this example, I will try to read a file. Please note that a submitted job becomes visible on the Spark UI only after you perform an action, since Spark uses lazy evaluation.

    In the example below, show() is the action.
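    Here is a minimal sketch of that kind of snippet, assuming sampleFile.txt sits in the directory where the shell was started; the app name "LazyEvalDemo" is just an illustrative label. Reading the file only defines a transformation, and the job appears on the Spark UI once show() runs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

    # read.text only records the plan; nothing is executed yet (lazy evaluation).
    df = spark.read.text("sampleFile.txt")

    # show() is an action, so this line triggers a job that shows up on the Spark UI.
    df.show(truncate=False)

    spark.stop()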

    Test PySpark setup
    After the show method is invoked, you can see the running job on the Spark UI.


    Spark Jobs

    Conclusion

    That’s all about the installation of PySpark on Windows. After reading this article, you should have a clear idea of how to set up PySpark and get started with it. Once you are done with the setup, you are all set to create Spark applications using Python.

    Stay tuned for more blogs to learn the language and some interesting use cases.

    Happy Reading!

    FAQs


    Why does PySpark require Java?

    Java is essential because Spark itself is written in Scala, which runs on the Java Virtual Machine (JVM). For example, the file SparkContext.scala is responsible for the creation of the Spark context, which is the first thing created when you start a PySpark program.
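    As a small illustration (a sketch, not part of the setup steps above; the app name "JvmCheck" is hypothetical), creating a SparkContext from Python starts a JVM behind the scenes, which is why a working Java installation is mandatory:

    from pyspark import SparkContext

    # Creating a SparkContext launches a JVM under the hood; without a working
    # Java installation this call fails.
    sc = SparkContext(master="local[*]", appName="JvmCheck")
    print(sc.version)
    sc.stop()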

    Why did we need the full Spark distribution in addition to the pip package?

    The Python packaging for Spark is not intended to replace all of the use cases of Spark. This pip-installed version of Spark is only suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos), but it does not contain the tools required to set up your own standalone Spark cluster. This is the reason we were getting errors initially and had to install Spark as a whole.

    What is the difference between PySpark and Python?

    PySpark is a Python API that exposes Spark's capabilities through the Python language. Spark itself is the big data engine, whereas Python is a general-purpose programming language.
