Resampling time series data in SQL Server using Python’s pandas library
In this post, we are going to learn how we can use the power of Python in SQL Server 2017 to resample time series data using Python’s pandas library. Sometimes, we get the sample data (observations) at a different frequency (higher or lower) than the required frequency level. In such kind of scenarios, we need to modify the frequency of the given samples as per the frequency of the required outcome. Modifying the frequency of time series data using T-SQL query becomes a tedious task especially when we need to perform upsampling as we need to generate more rows than what we have in the sample dataset. The Python’s pandas module has in-built capabilities for frequency conversion. With the help of pandas resample method, we can increase or decrease the time series observation frequencies with only few lines of code.
Time Series Data Resampling
Time series data resampling is the process of increasing or decreasing the frequency of the time series data using interpolation schemes or by applying statistical methods. When we apply resampling method on time series data, it generates the fine grained (up or down) sampling distribution on the basis of actual sample. For example, if we have an observation at month level frequency and want to generate samples for each day, we can use resampling techniques. Time series data can be resampled in two ways:
- Upsampling and
- Downsampling
To understand the upsampling and downsampling, let’s have a look at this image.
Let’s discuss each of the time series resampling methods in more detail.
Upsampling
Upsampling is a process where we generate observations at more granular level than the current observation frequency. In upsampling, we increase the date-time frequency of the given sample. For example, changing the frequency from:
- Yearly to quarterly
- Quarterly to monthly
- Monthly to weekly
- Weekly to daily, and so on.
We can upsample the data from any upper level frequency to a more fine grained frequency level. As we move to a more granular level frequency, we get some empty observations also. To fill these empty observations with meaningful values, we can use interpolation techniques like linear interpolation, forward fill, backward fill, and etc.
Upsampling Example
Suppose, we have a table named tbl_sale having saledate and amount columns which stores:
saledate – Date of sale and,
amount – Amount of sale for a given date
To create the sample table with dummy data, use this script.
IF OBJECT_ID('dbo.tbl_sale') IS NOT NULL DROP TABLE dbo.tbl_sale GO CREATE TABLE dbo.tbl_sale ( saledate DATETIME NOT NULL, amount NUMERIC(18, 2) NOT NULL ) GO INSERT INTO dbo.tbl_sale (saledate, amount) VALUES('20180107', 553.25), ('20180114', 412.12), ('20180121', 550.22), ('20180128', 222.45), ('20180204', 521.10) GO
Weekly to daily frequency conversion
Let’s assume that we need to create daily samples from the given weekly samples. We can use pandas resample method to change the frequency from weekly to daily. Also, we can apply an interpolation scheme to fill the empty observations generated as a result of moving the frequency to a more granular level.
We can use the below code to generate the daily samples from the weekly data.
EXEC sp_execute_external_script @language = N'Python' ,@script = N' #Line 1 & 2 - Import pandas and datetime modules import pandas as pd from datetime import datetime #Line 3 - Convert datatype of saledate column from string to datetime df["saledate"] = pd.to_datetime(df["saledate"]) #Line 4 - Set saledate column as index column of the dataframe and remove the duplicate column df = df.set_index("saledate", drop = True) #Line 5 - Resample the data at daily basis frequency and fill the empty observations using lineat interpolation df = df.resample("D").interpolate("linear") #Line 6 - Now, reset the index to range index (For SQL Server compatibility) df = df.reset_index() #Line 7 - Change the datatype of saledate column from datetime to string (For SQL Server compatibility) df["saledate"] = df["saledate"].astype(str) #Line 8 - Assign the dataframe df to the output dataframe OutputDataSet (Specific to SQL Server) OutputDataSet = df ' ,@input_data_1 = N'SELECT CAST(saledate AS VARCHAR(50)) AS saledate, CAST(amount as FLOAT) AS amount FROM dbo.tbl_sale' ,@input_data_1_name = N'df' WITH RESULT SETS(("Sale Date" DATETIME, "Amount" DECIMAL(18,2))) GO
Output
Here, we can see that we have samples for each date starting from the first date of the sample in the dataset (07-Jan-2018) to the last date in the dataset (04-Feb-2018). As we have used linear interpolation, the newly generated observations have been filled with the equally distributed values lying between the previous and the next week amounts.
Now, let’s discuss each line of code in detail.
#Line 1 and 2
In line 1 and line 2, we are importing the pandas and datetime modules being used in the python script.
#Line 3
As of now, SQL Server does not support datetime data types in input datasets. That is why we have converted the saledate column as string values while passing it to the external script. However, we need to convert the saledate values as datetime data type in order to apply resample method on this column.
#Line 4
In this line, we move the saledate column to index column of the dataframe and remove the duplicate column from the regular columns of the dataframe.
#Line 5
In this line, we have used the dataframe’s resample method to generate daily samples from the weekly samples. Also, we have used linear interpolation technique to fill the newly generated empty rows.
#Line 6
As of now, SQL Server does not support dataframe’s index column and that is why we need to move the saledate column from the index column to a regular dataframe column.
#Line 7
As of now, SQL Server does not support datetime data type in output dataset. That is why we need to convert the saledate column into string values before returning it back to the SQL Server.
#Line 8
Finally, we are assigning the dataframe df to an output dataframe named OutputDataSet.
Downsampling – Time series data
Downsampling is a process where we generate observations at more aggregate level than the current observation frequency. In downsampling, we decrease the date-time frequency of the given sample. For example, changing the frequency from:
- Daily to weekly
- Weekly to monthly
- Monthly to quarterly
- Quarterly to yearly, and so on.
We can use aggregation methods to aggregate the data at a lower level. We can apply SUM, MIN, MAX, Count, MEAN, MEDIAN, MODE, and other statistical techniques to aggregate the observations.
Downsampling Examples
Assume that we get daily samples and we need to generate weekly samples from it. The output data which we have generated above in this blog using upsampling method, have been pushed into a table named tbl_sale_daily. Now, let’s change the frequency from daily to weekly using resample method.
Daily to weekly frequency conversion
To generate the weekly samples from the daily samples, we can use below code.
EXEC sp_execute_external_script @language = N'Python' ,@script = N' #Line 1 & 2 - Import pandas and datetime modules import pandas as pd from datetime import datetime #Line 3 - Convert datatype of saledate column from string to datetime df["saledate"] = pd.to_datetime(df["saledate"]) #Line 4 - Set saledate column as index column of the dataframe and remove the duplicate column df = df.set_index("saledate", drop = True) #Line 5 - Resample the data at daily basis frequency and fill the empty observations using lineat interpolation df = df.resample("W").mean() #Line 6 - Now, reset the index to range index (For SQL Server compatibility) df = df.reset_index() #Line 7 - Change the datatype of saledate column from datetime to string (For SQL Server compatibility) df["saledate"] = df["saledate"].astype(str) #Line 8 - Assign the dataframe df to the output dataframe OutputDataSet (Specific to SQL Server) OutputDataSet = df ' ,@input_data_1 = N'SELECT CAST(saledate AS VARCHAR(50)) AS saledate, CAST(amount as FLOAT) AS amount FROM dbo.tbl_sale_daily' ,@input_data_1_name = N'df' WITH RESULT SETS(("Sale Date" DATETIME, "Amount" DECIMAL(18,2))) GO
Output
In above image, we can see that, we get the original dataset back when we changed the frequency of the dataset from daily to weekly using downsampling method. We have used similar Python code as we have used in upsampling while performing the downsampling. The main difference is that in upsampling we have used interpolation technique to fill the newly created gaps in the datetime frequency. However, we have used mean() method to aggregate the data on weekly basis.
Thanks for the reading. Please share your inputs in comments.