In case we are reading a text file in a Hive table which contains non-English characters and we are not using the appropriate text encoding, these non-English characters might be loaded as junk symbols (like boxes – �). To get these characters in their original form, we need to use the correct character encoding. In this post “Handling special characters in Hive (using encoding properties)“, we are going to learn that how we can read special characters in Hive using encoding properties available with TBLPROPERTIES clause.
To demonstrate it, we will be using a dummy text file which is in ANSI text encoding format and contains Spanish characters. Also, we will be using Microsoft Azure cloud platform to instantiate an on-demand HDInsight cluster that makes it easy to write Hive queries. We will upload the dummy text file to an Azure Data Lake Storage and then we will read it using HiveQL. Let’s have a look at the content of the text file.
The above text file contains four columns which are as below:
- UserName – Name of the user,
- Gender – Gender of the user,
- Age – Age of the user,
- About – A brief summary of the user which contains Spanish characters (Contains special characters)
In the above text file, we can see that the data in “About” column contains few non-English characters. This file (ANSI encoding) can be downloaded for practice purpose from here – Click here to download the sample file.
Let’s have a look at the below Hive query which creates a database named testDB followed by a table named tbl_user_raw inside the testDB database. Also, notice that we are not using any encoding setting in the CREATE TABLE statement while creating the table in the below script.
CREATE DATABASE IF NOT EXISTS testdb; USE testDB; DROP TABLE IF EXISTS testDB.tbl_user_raw; CREATE EXTERNAL TABLE IF NOT EXISTS testdb.tbl_user_raw ( username string, gender string, age string, about string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/';
When we display the data from the tbl_user_raw (Hive table without appropriate encoding setting), it looks as below:
SELECT * FROM tbl_user_raw;
In the above image, we can see that the non-English characters have been converted into junk characters in the tbl_user_raw table.
Handling special characters in Hive
To read this file with these special characters in their original form, first, we need to find the original text encoding of the text file. To do this, we can simply open this file in Notepad++ editor and it will display the actual file encoding at the bottom-right corner as below:
Next, we can write a query with TBLPROPERTIES clause by defining the serialization.encoding setting in order to interpret these special characters in their original form in Hive table. Below, we are creating a new Hive table tbl_user to read the above text file with all the special characters:
DROP TABLE IF EXISTS testDB.tbl_user; CREATE EXTERNAL TABLE IF NOT EXISTS testdb.tbl_user ( username string, gender string, age string, about string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/' TBLPROPERTIES('serialization.encoding'='windows-1252');
Output:
In the above query, we have just added the TBLPROPERTIES clause to define the original file encoding while reading the file in the Hive table. TBLPROPERTIES provides “serialization.encoding” setting which can be used to set the required character set while reading the data into a Hive table.
We have added “TBLPROPERTIES(‘serialization.encoding’=’windows-1252’)” line to define the actual text encoding which is ANSI encoding. We know that ANSI encoding is a generic term used to refer to the standard code page on a Windows system. Also, it is referred to as Windows-1252 code page. To read more about, Windows-1252 code page, click here.
Thanks for the reading. Please share your input.
Pingback: Spark read file with special characters using PySpark - SQLRelease
Amazing information. Thank you very much for posting this has made our customers exceptionally happy.
Nice, Have to test myself but simple and good tip.
Had similar issue the Cent special char as Question mark.
20.5 ¢each shows as 20.5 �each
>>> TBLPROPERTIES(‘serialization.encoding’=’windows-1252’);
Thanks for your information. It helps me to solve same issue.