In this post, we are going to learn how we can leverage the power of Python’s pandas module in SQL Server 2017. pandas is an open source Python library providing data frame as data structure similar to the SQL table with the vectorized operation support for high performance. To know more about pandas, you can click here.
Let’s discuss the problem we face while using the SQL UNPIVOT clause especially when we have a large number of columns. We can use UNPIVOT clause in SQL Server to convert the columns as row values and normalize the output result set. To use the UNPIVOT command, we need to specify each column name as a fixed value while writing the T-SQL query. However, this becomes annoying if we need to specify a large number of columns in the UNPIVOT clause. Also, if the column names are not fixed (dynamic in nature), we cannot specify the column names as fixed values while designing the query.
To overcome this issue, we can use the power of dynamic SQL to create a T-SQL query dynamically with all the columns and then we can use EXEC or sp_EXECUTESQL statements to execute it. We already have a blog on “Dynamic PIVOT query in SQL Server“. Creating a dynamic T-SQL query is not that simple and needs extensive testing as the query will be created and executed dynamically where some unseen issues might occur. Also, the maintenance of the dynamic SQL is tedious as compared to the static SQL.
With the integration of Python in SQL Server 2017, Microsoft has opened up possibilities of using Python’s libraries which are very powerful, efficient, and can accomplish the most complex tasks in just a few line of codes. In this blog, we will be using melt method from the pandas library to dynamically unpivot the table data.
We can use the below code to create a dummy table with sample data for demonstration purpose.
IF OBJECT_ID('dbo.tbl_unpivot_demo') IS NOT NULL DROP TABLE dbo.tbl_unpivot_demo GO --Create dummy table CREATE TABLE dbo.tbl_unpivot_demo ( EmpName NVARCHAR(512), [Bib Shorts] float, [Bike Racks] float, [Bike Stands] float, [Bottles and Cages] float, [Bottom Brackets] float, [Brakes] float, [Caps] float, [Chains] float, [Cleaners] float, [Cranksets] float ) GO --Insert sample data in dummy table INSERT INTO dbo.tbl_unpivot_demo (EmpName, [Bib Shorts], [Bike Racks], [Bike Stands], [Bottles and Cages], [Bottom Brackets], Brakes, Caps, Chains, Cleaners, Cranksets) VALUES ('John', 98583.63, 50060.56, 27877.11, 44454.45, 48582.3 , 55460.04, 92488.6 , 61331.34, 62704.1 , 212.9 ), ('Adam', 80061.15, 60420.61, 3208.4 , 69232.6 , 91672.85, 80702.13, 18307.98, 21814.2 , 91096.07, 33044.65), ('Michal', 40932.12, 78804.44, 20560.96, 28836.26, 16754.98, 601.77, 27582.42, 70861.79, 80285.37, 40892.41), ('Smith', 65244.43, 76834.61, 45700.16, 56240.19, 97608.88, 44787.54, 8769.13, 47577.39, 27627.93, 12692.64), ('David', 86557.44, 31916.65, 57052.24, 15513.05, 49318.12, 95908.1, 13421.06, 22238.98, 83941.94, 74022.22) GO
Let’s have a look at the table data.
Above, we have a denormalized table which contains the details of the employees with their total sale amount for each product type. To keep it simple, we are using only a subset of columns from the actual table. Now, we need to normalize the above table by transposing the name of the product types into rows with their corresponding values for each employee. The output that we need is this.
To achieve the above output in SQL Server 2017, we can use the power of pandas melt method. The Python script we are using to transpose the product types (currently placed on the columns) into rows is this.
EXEC sp_execute_external_script @language = N'Python', @script = N' #Line 1 df = SalesData #Line 2 import pandas as pd #Line 3 OutputDataSet = pd.melt(df, id_vars = "EmpName", var_name = "Product Type", value_name = "SaleAmount") ' ,@input_data_1 = N'SELECT * FROM dbo.tbl_unpivot_demo' ,@input_data_1_name = N'SalesData' WITH RESULT SETS(("EmployeeName" varchar(512), "Product Type" nvarchar(256), "SalesAmount" numeric(18, 2)))
How it is working
We already have a blog which explains the parameters of the system stored procedure “sp_execute_external_script“. Let’s discuss the python code used above in detail.
Python Script – #Line 1
In the first line “df = SalesData“, we have assigned the values of table data to a data frame named df.
Python Script – #Line 2
In the second line, “import pandas as pd“, we are importing the pandas module under the alias pd.
Python Script – #Line 3
The third and the final line of code is “OutputDataSet = pd.melt(df, id_vars = “EmpName”, var_name = “Product Type”, value_name = “SaleAmount”)“.
Here, we are have used the pd.melt method to dynamically transpose the columns as row values. Finally, we have assigned the output of the melt method to a data frame named OutputDataSet.
Quick performance summary
We have compared the performance of the pandas melt method with the static SQL UNPIVOT query and found that the Python script is running faster than the SQL UNPIVOT query in case we have a large number of input rows. This is the performance summary.
Conclusion
The Python script has reduced the line of codes drastically and has also increased the performance if the input number of rows is large. It is easy to understand and maintain.
Thanks for reading. Please share your inputs in the comments.
Pingback: Tidy Data in Python - First Step in Data Science and Machine Learning - SQLRelease