In this post, we will learn how to read data from a Hive table into a Spark dataframe and write it back. Once the Hive table data has been read into a dataframe, we can apply Spark transformations on that data. Finally, we can write the data back to the Hive table. We can achieve this using multiple approaches; here, we will discuss a few common methods. To learn how we can read and write data from an RDBMS table like SQL Server, read this post.
Read data from Hive table into Spark DataFrame
As mentioned, we can read and write data from a Hive table in multiple ways; however, in this demo we will cover only the most common approaches. We are using an on-prem Hadoop cluster setup for the demonstration, but the same methods apply to almost any environment, whether you are using an on-prem or a cloud version of Hadoop. You can also use this link to try Hue and Hadoop for free for learning purposes.
Read Hive table data using HiveContext class
Spark provides the HiveContext class to access Hive tables directly in Spark. First, we need to import this class with the statement “from pyspark.sql import HiveContext“. Then, we can use it to create a Hive context and read Hive tables into a Spark dataframe. Below is the complete code, which is mostly self-explanatory; I have added some comments to explain it more precisely. For this demo, suppose we want to read an existing Hive table named “hive_read_write_demo“, created in the default database, into a Spark dataframe.
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

#Main module to execute spark code
if __name__ == '__main__':
    #Declare spark conf variable and set the spark app name
    conf = SparkConf()
    conf.setAppName("Read-and-write-data-to-Hive-table-spark")
    sc = SparkContext.getOrCreate(conf=conf)

    #Instantiate the HiveContext class to get access to its sql method
    hc = HiveContext(sc)

    #Read the hive table into a spark dataframe using the sql method of the HiveContext class
    df = hc.sql("select * from default.hive_read_write_demo")

    #Display the spark dataframe values using the show method
    df.show(10, truncate=False)
Assuming that we have saved this file on the Hadoop cluster as “read-write-demo-hivecontext.py”, we can use the below spark-submit command to execute it.
spark-submit hdfs://mynameservice/user/admin/read-write-demo-hivecontext.py
Below is the output:
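Since the table is now available as a dataframe, we can apply standard Spark transformations to it before writing it back. Below is a minimal sketch; the column names “id” and “status” are hypothetical placeholders, so replace them with columns that actually exist in your table.

from pyspark.sql.functions import upper

#Minimal transformation sketch (assumes hypothetical columns "id" and "status")
#Keep only active rows and select a subset of columns
df_filtered = df.filter(df.status == 'active').select('id', 'status')

#Add a derived column using a built-in function and display the result
df_transformed = df_filtered.withColumn('status_upper', upper(df_filtered.status))
df_transformed.show(10, truncate=False)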
Read Hive table data by enabling the enableHiveSupport option of the Spark session
Alternatively, we can set the enableHiveSupport option while creating the Spark session in our PySpark executable.
from pyspark.sql import SparkSession

#Main module to execute spark code
if __name__ == '__main__':
    #Instantiate the spark session builder and set the spark app name. Also, enable hive support using the enableHiveSupport option of the builder.
    spark = SparkSession.builder \
        .appName("Read-and-write-data-to-Hive-table-spark") \
        .enableHiveSupport() \
        .getOrCreate()

    #Read the hive table into a spark dataframe using the sql method of the spark session
    df = spark.sql("select * from hive_read_write_demo")

    #Display the spark dataframe values using the show method
    df.show(10, truncate=False)
Assuming that we have uploaded this file as “read-write-demo.py”, we can use the below spark-submit command to execute it.
spark-submit hdfs://mynameservice/user/admin/read-write-demo.py
Output:
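As a side note, once Hive support is enabled, we do not have to write a SQL query at all. The session's table method reads the Hive table directly by name; the short sketch below is equivalent to the spark.sql call above.

#Equivalent alternative: read the hive table directly by name instead of running a SQL query
df = spark.table("default.hive_read_write_demo")
df.show(10, truncate=False)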
Save Spark dataframe as a Hive table
In order to write the data back to a Hive table, we can use the below code. We are reading the above table “hive_read_write_demo” and saving it as a new table named “hive_read_write_demo_new“.
from pyspark.sql import SparkSession

#Main module to execute spark code
if __name__ == '__main__':
    #Instantiate the spark session builder and set the spark app name. Also, enable hive support using the enableHiveSupport option of the builder.
    spark = SparkSession.builder \
        .appName("Read-and-write-data-to-Hive-table-spark") \
        .enableHiveSupport() \
        .getOrCreate()

    #Read the hive table into a spark dataframe using the sql method of the spark session
    df = spark.sql("select * from default.hive_read_write_demo")

    #Save the spark dataframe as a new hive table
    df.write.saveAsTable("hive_read_write_demo_new")
Assuming that we have saved this file as “read-write-demo-saveastable.py”, we can execute it with the below spark-submit command.
spark-submit hdfs://mynameservice/user/admin/read-write-demo-saveastable.py
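Note that saveAsTable raises an error by default if the target table already exists. If we need to re-run the job, we can control this behavior with the write mode; below is a small sketch of the two most common choices.

#Overwrite the table if it already exists
df.write.mode("overwrite").saveAsTable("hive_read_write_demo_new")

#Or append the dataframe rows to the existing table
df.write.mode("append").saveAsTable("hive_read_write_demo_new")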
To verify, we can query the Hive table like this:
SELECT *
FROM default.hive_read_write_demo_new;
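If we prefer to verify from PySpark instead of the Hive shell, the same check can be run through a Hive-enabled Spark session, as created above:

#Verify the new table from a hive-enabled spark session
spark.sql("select * from default.hive_read_write_demo_new").show(10, truncate=False)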
Thanks for reading. Please share your inputs in the comment section.