Fill null with the previous non-null value – Spark DataFrame

In the previous post, we discussed how to extract the non-null values per group from a Spark DataFrame. In this post, we will learn how to fill the null values with the previous non-null value in a Spark DataFrame, a technique often called forward fill (or "last observation carried forward"). To demonstrate this with an example, let's create a sample table with some dummy data.

//Create sample table named empCTCDetail
spark.sql("""CREATE TABLE testdb.empCTCDetail
(
  empName STRING,
  dtYear INTEGER,
  CTC_USD DECIMAL(9,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'empCTCDetail/' """)
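
Note that the DDL above uses Hive-style syntax (ROW FORMAT DELIMITED ... STORED AS TEXTFILE), so it assumes a SparkSession with Hive support. In spark-shell on a Hive-enabled cluster the session already has it; if you are building the session yourself, a minimal sketch looks like this (the app name is just a placeholder):

//Sketch: SparkSession with Hive support enabled (needed for the Hive DDL above)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FillNullWithPreviousValue")  //placeholder app name
  .enableHiveSupport()
  .getOrCreate()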

//Insert some sample data into the table
spark.sql("""INSERT INTO TABLE testdb.empCTCDetail
(empName, dtYear, CTC_USD)
SELECT 'Adam' AS empName, 2015 AS dtYear, 50000 AS CTC_USD
UNION ALL SELECT 'Adam' AS empName, 2016 AS dtYear, NULL AS CTC_USD
UNION ALL SELECT 'Adam' AS empName, 2017 AS dtYear, NULL AS CTC_USD
UNION ALL SELECT 'Adam' AS empName, 2018 AS dtYear, 60000 AS CTC_USD
UNION ALL SELECT 'David' AS empName, 2015 AS dtYear, 80000 AS CTC_USD
UNION ALL SELECT 'David' AS empName, 2016 AS dtYear, 90000 AS CTC_USD
UNION ALL SELECT 'David' AS empName, 2017 AS dtYear, NULL AS CTC_USD
UNION ALL SELECT 'David' AS empName, 2018 AS dtYear, NULL AS CTC_USD""")

//Verify the sample table and data
spark.sql("SELECT * FROM testdb.empCTCDetail").show(false)

Sample data output

+-------+------+--------+
|empName|dtYear|CTC_USD |
+-------+------+--------+
|Adam   |2015  |50000.00|
|Adam   |2016  |null    |
|Adam   |2017  |null    |
|Adam   |2018  |60000.00|
|David  |2015  |80000.00|
|David  |2016  |90000.00|
|David  |2017  |null    |
|David  |2018  |null    |
+-------+------+--------+
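
If you prefer to skip the Hive table altogether, the same sample data can be built directly as a DataFrame. This is just a convenience sketch; note that CTC_USD is typed here as a nullable DOUBLE rather than DECIMAL(9,2):

//Alternative (sketch): build the sample data as a DataFrame without a Hive table
import spark.implicits._

val empCTC = Seq(
  ("Adam",  2015, Option(50000.0)),
  ("Adam",  2016, None),
  ("Adam",  2017, None),
  ("Adam",  2018, Option(60000.0)),
  ("David", 2015, Option(80000.0)),
  ("David", 2016, Option(90000.0)),
  ("David", 2017, None),
  ("David", 2018, None)
).toDF("empName", "dtYear", "CTC_USD")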

Problem statement

The above table contains the employee name, the year, and the CTC value in USD. We need to fill each null CTC value with the most recent non-null CTC value for the same employee. For example, Adam's CTC is null for 2016 and 2017, so both of those years should show his 2015 value of 50000.

Query to fill null with the previous non-null value

Below is the query that we can use to fill the null values with the previous non-null value within each group.

import org.apache.spark.sql.functions.{col, last, when}
import org.apache.spark.sql.expressions.Window

val df = spark.sql("SELECT * FROM testdb.empCTCDetail")

//Frame: all rows from the start of the partition up to the current row,
//ordered by year within each employee
val window = Window.partitionBy("empName")
  .orderBy("dtYear")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

//If CTC_USD is null, take the last non-null value seen so far in the frame
df.withColumn("runningCTC_USD",
    when(col("CTC_USD").isNull, last("CTC_USD", ignoreNulls = true).over(window))
      .otherwise(col("CTC_USD")))
  .orderBy("empName", "dtYear")
  .show(false)

Output

+-------+------+--------+--------------+
|empName|dtYear|CTC_USD |runningCTC_USD|
+-------+------+--------+--------------+
|Adam   |2015  |50000.00|50000.00      |
|Adam   |2016  |null    |50000.00      |
|Adam   |2017  |null    |50000.00      |
|Adam   |2018  |60000.00|60000.00      |
|David  |2015  |80000.00|80000.00      |
|David  |2016  |90000.00|90000.00      |
|David  |2017  |null    |90000.00      |
|David  |2018  |null    |90000.00      |
+-------+------+--------+--------------+

In the above query, we partition the data by the empName column because the previous non-null CTC must come from the same employee. Within each partition, the rows are ordered by the dtYear column in ascending order (the default). The frame rowsBetween(Window.unboundedPreceding, Window.currentRow) restricts the window to the current row and everything before it, and last("CTC_USD", ignoreNulls = true) picks the most recent non-null value within that frame.
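
As a side note, the when/otherwise wrapper is not strictly required: because the frame ends at the current row, last with ignoreNulls = true already returns the current CTC_USD whenever it is non-null. And if you ever need the opposite direction, a true backward fill that takes the next non-null value, flip the frame and use first. Both variants below are sketches reusing the df and window defined in the query above:

//Shorter equivalent of the query above
df.withColumn("runningCTC_USD", last("CTC_USD", ignoreNulls = true).over(window))
  .orderBy("empName", "dtYear")
  .show(false)

//Backward fill for comparison: take the NEXT non-null value instead
import org.apache.spark.sql.functions.first
val nextWindow = Window.partitionBy("empName")
  .orderBy("dtYear")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)
df.withColumn("nextCTC_USD", first("CTC_USD", ignoreNulls = true).over(nextWindow))
  .orderBy("empName", "dtYear")
  .show(false)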

Thanks for reading. Please share your inputs in the comments section.
