Suppose we have a CSV file that contains some non-English characters (Spanish, Japanese, etc.) and we want to read it into a Spark data frame. If we read the file without the right character encoding, we end up with junk characters (like �) in the data frame. So, files that store non-English characters need a special option to be set when reading them with PySpark. Similarly, if we try to read such a file into a Hive table, we get junk symbols instead of the actual characters. To learn how to read this file correctly in Hive, visit this post. In this post, we will discuss how to read a CSV file with its original file encoding in Spark.
Understanding the Sample CSV file (with Spanish characters)
Let’s use a dummy CSV file that contains some Spanish characters and is saved in ANSI text encoding. This CSV file has a header row followed by four data rows, and it contains the four columns below:
- UserName – stores the name of the user,
- Gender – stores the gender of the user,
- Age – stores the age of the user,
- About – contains a short summary of the user, including some Spanish characters.
The content of this text file looks like this:
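The following is an illustrative sample in the same shape; the names and descriptions are made-up placeholders, not the exact rows from the file:

"UserName","Gender","Age","About"
"Lucía","Female","28","Le gusta la música y el cine español"
"José","Male","35","Trabaja como ingeniero en Málaga"
"Íñigo","Male","22","Estudia informática en la universidad"
"Ana","Female","30","Enseña español a niños pequeños"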
The above CSV file uses a comma (,) as the column delimiter and the newline character (\n) as the row delimiter. In addition, the file uses double quotes (") as the text qualifier. Most importantly, note the Spanish characters (such as í, é, and ñ) throughout the text.
Checking original file encoding using Notepad++
We can check the encoding of a file by opening it in the Notepad++ application and looking at the encoding value in the bottom-right corner of the status bar. This tells us the actual encoding used by the file.
For our sample file, Notepad++ reports the encoding as ANSI, which on most Windows systems corresponds to the windows-1252 code page. However, if we do not use this encoding while reading the file, we will get garbage characters in the output.
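If Notepad++ is not available, we can also guess the encoding programmatically. Below is a minimal sketch using the third-party chardet package (assumed to be installed with pip install chardet); the local path is a hypothetical copy of the file, since chardet works on raw bytes rather than HDFS paths:

import chardet

# Read a sample of raw bytes from a local copy of the file
# (hypothetical local path used for illustration)
with open("/tmp/CSV_with_special_characters.csv", "rb") as f:
    raw_bytes = f.read(100_000)  # a sample is usually enough for detection

# detect() returns a dict with 'encoding' and 'confidence' keys,
# typically something like {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(chardet.detect(raw_bytes))

Note that encoding detection is a statistical guess, so it is best treated as a hint to confirm against a tool like Notepad++.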
Read CSV file without the character encoding option in PySpark
Let’s read the above CSV file with the default character encoding (UTF-8), i.e., without specifying the original file encoding. Below is the code we use to read this file into a Spark data frame and then display it on the console.
# Read the CSV file with Spark's default encoding (UTF-8)
df = spark.read \
    .option("delimiter", ",") \
    .option("header", "true") \
    .csv("hdfs:///user/admin/CSV_with_special_characters.csv")

df.show(5, truncate=False)
Output:
In the above output, we can clearly see junk characters in place of the original Spanish characters in the data frame. This happens because Spark decoded the file as UTF-8, while the file is actually encoded as windows-1252, so the non-ASCII bytes cannot be decoded and are replaced with the � replacement character.
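We can reproduce this mismatch in plain Python, independently of Spark. The snippet below is a standalone illustration, not part of the Spark job; it encodes the word Español as windows-1252 bytes and then decodes those bytes as UTF-8:

# 'ñ' is the single byte 0xF1 in windows-1252
raw = "Español".encode("windows-1252")
print(raw)  # b'Espa\xf1ol'

# Decoding those bytes as UTF-8 fails on 0xF1; with errors="replace",
# the invalid byte becomes the U+FFFD replacement character (�),
# which is exactly the junk character we see in the data frame.
print(raw.decode("utf-8", errors="replace"))  # Espa�ol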
Read CSV file using the character encoding option in PySpark
To read the above file correctly in PySpark, we need to add the file encoding option to the Spark read method. That is to say, we add one extra option to the previous read code. The final version of the code looks like this:
# Read the CSV file with its original encoding (ANSI / windows-1252)
df_with_encoding = spark.read \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("encoding", "windows-1252") \
    .csv("hdfs:///user/admin/CSV_with_special_characters.csv")

df_with_encoding.show(5, truncate=False)
Output:
We can see that the Spanish characters are now displayed correctly; they are no longer replaced with junk characters. In conclusion, we are able to read this file correctly into a Spark data frame by adding option("encoding", "windows-1252") to the PySpark code.
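As a side note, Spark's CSV reader also accepts charset as an alias for the encoding option, and on recent Spark versions (3.0+) the encoding option works on the write side as well. A minimal sketch (the output path below is a hypothetical example):

# "charset" is an alias for "encoding" in Spark's CSV reader
df_alias = spark.read \
    .option("header", "true") \
    .option("charset", "windows-1252") \
    .csv("hdfs:///user/admin/CSV_with_special_characters.csv")

# On Spark 3.0+, the encoding option also applies when writing,
# e.g. to save the data frame back out as windows-1252 text
df_alias.write \
    .option("header", "true") \
    .option("encoding", "windows-1252") \
    .csv("hdfs:///user/admin/CSV_output_windows1252")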
Thanks for reading. Please share your thoughts in the comments section.