Using Pandas on Spark

Pandas is one of the most popular Python libraries used by data scientists and data engineers for data wrangling and analysis. Pandas provides the DataFrame, a table-like structure that stores data in rows and columns, for working with structured datasets. These DataFrames are very similar to Spark's DataFrames. However, a pandas DataFrame is limited to a single machine and does not support the distributed computing needed to solve Big Data problems.

Given the popularity and flexibility of pandas, the pandas API was integrated into Spark in the 3.2 release. This integration brings the power of pandas to Spark's cluster-computing framework: we can now use most pandas functionality on distributed datasets as well.

Pandas-on-Spark using pyspark.pandas API

The integration of the pandas API into Apache Spark lets Python developers start using Spark with a very small learning curve. It also lets developers write unified application code that handles both small and big datasets. Previously, reading a large amount of data into pandas would fail with an out-of-memory error whenever the data exceeded the memory of that single machine. With pandas on Spark, we no longer need to bring all the data onto one machine, so we can work with large datasets by leveraging the distributed computing power of the Spark framework.
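One way to picture this unified-code benefit is a snippet whose only backend-specific part is the import. The `USE_SPARK` toggle and the alias `xd` below are made up for illustration; everything after the import is identical pandas code:

```python
# Hypothetical toggle for illustration -- the same wrangling code can
# target either backend just by switching the import.
USE_SPARK = False

if USE_SPARK:
    import pyspark.pandas as xd   # distributed backend (needs an active Spark session)
else:
    import pandas as xd           # single-machine backend

# Identical code from here on, regardless of backend
df = xd.DataFrame({"name": ["Ann", "Bob", "Cal"], "score": [85, 92, 78]})
print(df["score"].mean())  # 85.0 with either backend
```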

We also know that the pandas library has built-in plotting integration with matplotlib, which we use to plot and visualize charts and other eye-catching visuals. Using pandas on Spark, we can now produce these visuals over large, distributed datasets as well.

Create a Pandas on Spark DataFrame using Pandas API

In order to use pandas on Spark, we first import the pandas API module using the command below:

import pyspark.pandas as ps

We can use the following code to create a pandas-on-Spark DataFrame:

#Import the pandas pyspark api
import pyspark.pandas as ps

#Create pandas dataframe on spark
psDF = ps.DataFrame(range(10))

#Display the sample data
psDF.head()

Read a CSV File using Pandas on Spark API

To read a CSV or text file, we can use the read_csv method, just as in pandas. When this method is called from the pyspark.pandas module, it creates a pandas-on-Spark DataFrame. Below is sample code to read a CSV file and display the resulting DataFrame.

#Import the pandas pyspark api
import pyspark.pandas as ps

#Create pandas-on-spark dataframe
sample_psDF = ps.read_csv("file:///Users/admin/Downloads/pandas-api-on-spark-sample-csv-file.csv")

#Display sample data
sample_psDF.head()

Output:

[Screenshot: read_csv method to read sample data using the pandas API on Spark]

Convert Spark dataframe to a pandas-on-spark dataframe

With pyspark.pandas imported as shown above, we can convert an existing Spark DataFrame into a pandas-on-Spark DataFrame using the code below. (Note: in Spark 3.4 and later, to_pandas_on_spark() is deprecated in favor of DataFrame.pandas_api().)

from pyspark.sql.types import StringType

#Create a sample Spark dataframe (assumes an active SparkSession named "spark")
df = spark.createDataFrame(["John","Alex","Bob","Adam","Bruce"], StringType()).toDF("Emp")
df.show()

#Convert spark df to pandas on spark df
pdf = df.to_pandas_on_spark()
pdf.head()

Convert pandas-on-spark dataframe to a Spark dataframe

Conversely, if we want to convert an existing pandas-on-Spark DataFrame into a Spark DataFrame, we can use the sample code below.

import pyspark.pandas as ps

#Create pandas on spark dataframe
pdf = ps.DataFrame(["John", "Alex", "Bob", "Adam", "Bruce"], columns=["Employee"])
pdf.head()

#Convert pandas on spark dataframe to spark dataframe
df = pdf.to_spark()
df.show()

Thanks for reading. Please share your inputs in the comments section.
