Get HDFS file location of Hive table records as column

In this post, we will learn how to extract the physical HDFS file path of each Hive table record as a column, along with the other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. Hive tables can be created as internal (managed) or external tables. If we create an internal table, the table data is stored under the default Hive warehouse path. However, if the data already exists at an HDFS location and we don’t want to move it into the warehouse path physically, we can create an external Hive table on top of that data. The main difference between internal and external Hive tables is that dropping an internal table deletes both the schema and the data, whereas dropping an external table deletes only the table’s schema (metadata) and leaves the data untouched.
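To make the difference concrete, below is a minimal PySpark sketch of creating and dropping an external table. The table name tbl_demo_ext and the HDFS path /data/demo/tbl_demo_ext are hypothetical, and a Hive-enabled SparkSession named spark is assumed (a sketch of creating one follows in the next paragraph).

# A minimal sketch, assuming comma-delimited data already exists under
# the hypothetical HDFS directory /data/demo/tbl_demo_ext.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS default.tbl_demo_ext (
        RowId INT,
        Name  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/demo/tbl_demo_ext'
""")

# Dropping the external table removes only its metadata; the data files
# under /data/demo/tbl_demo_ext stay on HDFS.
spark.sql("DROP TABLE default.tbl_demo_ext")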

We know that Apache Spark is a general-purpose big data processing engine. It reads data into memory and processes it in an optimized way. If we want, we can read and write Hive table data from a Spark DataFrame directly. To learn how to connect to and read Hive tables in Spark, visit this post. And to learn how to read and write a DataFrame to an RDBMS database like SQL Server using a JDBC driver, read this article.
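The examples below assume a SparkSession named spark that can see the Hive metastore. As a minimal sketch (the application name is just a placeholder):

from pyspark.sql import SparkSession

# Enable Hive support so that spark.sql() and spark.table() can access
# the tables registered in the Hive metastore.
spark = (
    SparkSession.builder
    .appName("hive-file-location-demo")  # hypothetical app name
    .enableHiveSupport()
    .getOrCreate()
)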

Get HDFS file location of Hive table records as column using HiveQL

In order to get the physical file path of the Hive table data rows, we can use the virtual column named INPUT__FILE__NAME (exposed in Spark as the input_file_name() function). We can select this virtual column along with the other columns of the table, which gives us the HDFS file location that physically stores each data row. Remember that HDFS is a distributed file system, and the files stored on it may be distributed across multiple machines. When we define a schema on top of HDFS data files using a Hive table, we read all the data as one single object, regardless of the data being distributed across a cluster of machines.

Let’s create a Hive table with 2 partitions and then extract the physical file name of the table data rows as a column using HiveQL. For example, below is the code to create a sample Hive table with some dummy data in two partitions.

CREATE TABLE default.tbl_demo
(
RowId INT,
Name STRING
)
PARTITIONED BY (RowCreatedOn STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

--Add sample data in two partitions

--Insert data into partition '2021-11-25' 
INSERT INTO default.tbl_demo PARTITION (RowCreatedOn='2021-11-25') 
SELECT 1, 'Smith'
UNION ALL
SELECT 2, 'Adam'
UNION ALL 
SELECT 3, 'Ken'
UNION ALL 
SELECT 4, 'Nathan';

--Insert data into partition '2021-11-26' 
INSERT INTO default.tbl_demo PARTITION (RowCreatedOn='2021-11-26') 
SELECT 5, 'John'
UNION ALL
SELECT 6, 'Jason'
UNION ALL 
SELECT 7, 'Joe'
UNION ALL 
SELECT 8, 'James';

We can use the SHOW PARTITIONS default.tbl_demo command to verify the partitions of the Hive table. Now, to get the input file path of the Hive data rows along with the other columns of the table, we can use the below query.

SELECT *,
INPUT__FILE__NAME AS FileLocation 
FROM default.tbl_demo;

Output:

Get HDFS file using HiveQL
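Notice that each returned path contains the partition directory of the row (rowcreatedon=2021-11-25 or rowcreatedon=2021-11-26), followed by the name of the underlying data file (something like 000000_0 for a text-format table). So the virtual column tells us not only the physical file but also the partition each row belongs to.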

Get physical HDFS file location of Hive table records using PySpark

Suppose we want to extract the physical file location of each row of the above table tbl_demo using PySpark. To get that, we can call the input_file_name() function as below.

df_hive_table = spark.sql("select *, input_file_name() AS filePath from default.tbl_demo")
df_hive_table.show(truncate=False)
Get HDFS file using PySpark
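Instead of embedding the function in the SQL string, we can also use the DataFrame API. Here is a minimal sketch with the input_file_name() function from pyspark.sql.functions, assuming the same spark session:

from pyspark.sql.functions import input_file_name

# Read the Hive table directly and attach each row's source file path
# as a new column.
df_hive_table = (
    spark.table("default.tbl_demo")
         .withColumn("filePath", input_file_name())
)
df_hive_table.show(truncate=False)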

Get physical HDFS file location of Hive table records using Scala

Now, if we want to get the physical file path in Scala, we can use the below command.

val df_hive_table = spark.sql("select *, input_file_name() AS filePath from default.tbl_demo")
df_hive_table.show(truncate = false)
Get HDFS file using Scala
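The same DataFrame-style alternative shown in the PySpark section is available in Scala as well, via the input_file_name function in org.apache.spark.sql.functions.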

Thanks for reading. Please share your inputs in the comment section.
