Read and write data to a Hive table from Spark using PySpark

In this post, we will learn how to read data from a Hive table into a Spark dataframe and write it back. Once the Hive table data is read into a dataframe, we can apply Spark transformations to it. Finally, we can write the transformed data back to a Hive table. We can achieve this using multiple approaches; here we will discuss a few common methods. To learn how to read and write data from an RDBMS table like SQL Server, read this post.

Read data from Hive table into Spark DataFrame

As mentioned, we can read and write Hive table data in multiple ways, but in this demo we will cover only the most common approaches. We are using an on-prem Hadoop cluster setup for the demonstration; however, the same methods apply to almost any environment, whether you are using an on-prem or cloud version of Hadoop. You can also use this link to try Hue and Hadoop for free for learning purposes.

Read Hive table data using HiveContext class

Spark provides the HiveContext class to access Hive tables directly in Spark. First, we need to import this class with the statement “from pyspark.sql import HiveContext“. Then, we can use this class to create a Hive context and read Hive tables into a Spark dataframe. Note that since Spark 2.0, HiveContext is the older API, kept for compatibility; the newer approach is a SparkSession with Hive support, which we cover in the next section. Below is the complete code, which is mostly self-explanatory; I have added some comments to explain it more precisely. For this demo, suppose we want to read an existing Hive table named “hive_read_write_demo“, created in the default database, into a Spark dataframe.

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

#Main module to execute spark code
if __name__ == '__main__':
    conf = SparkConf() #Declare the spark conf variable
    conf.setAppName("Read-and-write-data-to-Hive-table-spark")
    sc = SparkContext.getOrCreate(conf=conf)

    #Instantiate hive context class to get access to the hive sql method  
    hc = HiveContext(sc)

    #Read hive table in spark using .sql method of hivecontext class
    df = hc.sql("select * from default.hive_read_write_demo")

    #Display the spark dataframe values using show method
    df.show(10, truncate = False)

Assuming that we have saved this file on the Hadoop cluster as “read-write-demo-hivecontext.py”, we can use the below spark-submit command to execute it.

spark-submit hdfs://mynameservice/user/admin/read-write-demo-hivecontext.py

Below is the output:

[Image: Spark dataframe output showing the rows read from the Hive table]
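
As a side note, instead of passing a SQL string, HiveContext also exposes a table() method that reads a Hive table directly by name. Below is a minimal sketch of this alternative, assuming the same hive_read_write_demo table in the default database (the app name here is just an illustrative choice):

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

#Main module to execute spark code
if __name__ == '__main__':
    conf = SparkConf()
    conf.setAppName("Read-hive-table-by-name")
    sc = SparkContext.getOrCreate(conf=conf)

    #Instantiate hive context class to get access to the hive table method
    hc = HiveContext(sc)

    #Read the hive table directly by name instead of passing a select query
    df = hc.table("default.hive_read_write_demo")

    #Display the spark dataframe values using show method
    df.show(10, truncate=False)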

Read Hive table data by enabling the enableHiveSupport option on the Spark session

Alternatively, we can enable Hive support by calling the enableHiveSupport() option while creating the Spark session in our PySpark executable.

from pyspark.sql import SparkSession

#Main module to execute spark code
if __name__ == '__main__':
    #Build the spark session and set the spark app name. Also, enable hive support using the enableHiveSupport option of the spark builder.
    spark = SparkSession \
        .builder \
        .appName("Read-and-write-data-to-Hive-table-spark") \
        .enableHiveSupport() \
        .getOrCreate()

    #Read hive table in spark using sql method of spark session class
    df = spark.sql("select * from hive_read_write_demo")

    #Display the spark dataframe values using show method
    df.show(10, truncate = False)


Assuming that we have uploaded this file as “read-write-demo.py”, we can use the below spark-submit command to execute it.

spark-submit hdfs://mynameservice/user/admin/read-write-demo.py

Output:

[Image: Spark dataframe output showing the rows read from the Hive table]
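
Similarly, with Hive support enabled we do not have to go through a SQL string at all: the Spark session's DataFrameReader exposes a table() method that loads a Hive table by name. A minimal sketch, assuming the Hive-enabled spark session from the example above:

    #Read the hive table by name through the DataFrameReader
    df = spark.read.table("default.hive_read_write_demo")

    #Display the spark dataframe values using show method
    df.show(10, truncate=False)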

Save a Spark dataframe as a Hive table

To write the data back to a Hive table, we can use the below code. We read the above table “hive_read_write_demo” and save it as a new table named “hive_read_write_demo_new“.

from pyspark.sql import SparkSession

#Main module to execute spark code
if __name__ == '__main__':
    #Build the spark session and set the spark app name. Also, enable hive support using the enableHiveSupport option of the spark builder.
    spark = SparkSession \
        .builder \
        .appName("Read-and-write-data-to-Hive-table-spark") \
        .enableHiveSupport() \
        .getOrCreate()

    #Read hive table in spark using sql method of spark session class
    df = spark.sql("select * from default.hive_read_write_demo")

    #Save the spark dataframe as a new hive table
    df.write.saveAsTable("hive_read_write_demo_new")

To execute this, we can use the below spark-submit command.

spark-submit hdfs://mynameservice/user/admin/read-write-demo-saveastable.py

To verify, we can query the Hive table like this:

SELECT * FROM default.hive_read_write_demo_new;
[Image: Query output verifying the Spark dataframe written back to the new Hive table]
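
Note that saveAsTable() uses the “error” save mode by default, so re-running the job fails if the target table already exists. We can pass an explicit save mode to control this behavior. A minimal sketch, assuming the same dataframe df from the code above:

    #Overwrite the existing table with the dataframe contents
    df.write.mode("overwrite").saveAsTable("hive_read_write_demo_new")

    #Or append the dataframe rows to the existing table instead
    #df.write.mode("append").saveAsTable("hive_read_write_demo_new")

Here “overwrite” replaces the table data, while “append” adds the new rows to it; pick the mode based on whether the job should replace the table on each run or load data incrementally.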

Thanks for reading. Please share your inputs in the comment section.
