We have discussed that how we can use Jupyter Lab/Jupyter Notebook to do Interactive Data Analysis with SQL Server using Jupyter Notebooks. Jupyter Notebook is a very powerful and useful tool for any Data Analyst/Data Scientist. The Jupyter Lab is the next generation tool for the Jupyter Notebooks. It provides an interface where we can open multiple Jupyter Notebooks in different tabs side by side. Also, it provides many useful plugins that can be used to get a better user experience while working with a data analysis task. In this post “Interactive Data Analysis with HANA using Jupyter Notebook/Jupyter Lab”, we will discuss how we can set up an environment that can be used to do interactive data analysis with HANA using Jupyter Notebooks.
Installing required packages
In order to connect the Jupyter notebooks to HANA, we need to install these packages:
- jupyterlab – This module installs the Jupyter lab (which includes Jupyter Notebooks also) environment on your local machine. We can use the Jupyter lab command to start it only if it is installed on the machine. Below is the pip command to install the Jupyter lab on your machine:
pip install jupyterlab
- hdbcli – This package enables libraries that can be used to access/change the information in an SAP HANA database from Python. To install it, use the below pip command:
pip install hdbcli
- ipython-sql – This package is required to enable SQL magic commands in Jupyter notebooks. Once this package is installed, we can use %sql and %%sql commands to execute SQL queries from the notebook. In order to install it, use the below pip command:
pip install ipython-sql
- sqlalchemy-hana – This module is used by sqlalchemy to communicate with the HANA database. We need this module to be installed because the IPython magic commands internally use sqlalchemy libraries to execute the given queries. This the pip command to install it:
pip install sqlalchemy-hana
Once we have installed all the required libraries, we can start using the Jupyter lab for interactive data analysis with HANA databases system.
Querying HANA from Jupyter Notebook
First, we need to start the Jupyter lab from the command prompt. If you have installed the Anaconda on your machine, you can start the Jupyter lab from Anaconda Navigator also. In order to start the Jupyter lab from the command prompt, follow below steps:
- Open command prompt window.
- Type “jupyter lab” and hit Enter key.
- It will open jupyter lab in the default web browser like this:
To know how we can change the Jupyter notebook startup folder, visit this post. Now, go to the web browser and add a new notebook. Add the below commands to your notebook:
#We need to use this command in order to load the IPython's sql module which enables the macgic command in this Jupyter notebook %load_ext #Let's define a connection string for HANA. The connection string URL looks like this. %sql hana+hdbcli://testuser:p@ssword_12345@hanaserver:instancenumber #Sample HANA query using %sql. %sql SELECT id AS EmpID, fullname as EmpName from "TESTDB"."TESTTABLE"; #Sample multiline HANA query using %%sql %%sql SELECT id AS EmpID, fullname as EmpName FROM "TESTDB"."TESTTABLE";
This is a screenshot from the Jupyter Lab to show how we can use these commands in the notebook and execute them to get the output.
We have used %load_ext to load the IPython’s magic command in our notebook. Then, we have created a connection string using “%sql hana+hdbcli://testuser:p@ssword_12345@hanaserver:00“. Finally, we have used a single line and multiline HANA queries to fetch some data from the TESTTABLE.
Querying multiple data sources (HANA and SQL Server) from a single Jupyter Notebook
We can create multiple connections for various RDBMS databases or for the various database environments. Once these connections are created, we can use the name of the connection along with the %sql or %%sql magic commands in order to execute queries against these connections. However, we need to install the required packages to access different RDBMS systems. For example, to access SQL Server from the Jupyter notebook, we need to install the pyodbc module.
- pyodbc – This package is required to access an ODBC database. To install it, we can use the below pip command:
pip install pyodbc
This example demonstrates how we can read a table from a SQL Server database and from a HANA database in a single Jupyter notebook. The ability to connect with multiple data sources in the same notebook makes the Jupyter notebooks a powerful tool for data analysis. We can bring the data from various data sources and can store them into pandas data frames. Then, we can easily perform join, filter, sort, and other data analysis methods on these data frames.
#We need to use this command in order to load the IPython's sql module which enables the macgic command in this Jupyter notebook %load_ext sql #Let's define connection strings for HANA and SQL Server both and store it into variables. The connection strings look like this. hanaservercon = 'hana+hdbcli://testuser:p@ssword_12345@hanaserver:00' sqlservercon = 'mssql+pyodbc://testuser@localhost/testDB?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server&MARS_Connection=Yes' #Sample HANA query using hana connection string variable with %%sql command. %%sql $hanaservercon SELECT id AS EmpID, fullname as EmpName FROM "TESTDB"."TESTTABLE"; #Sample SQL Server query using SQL Server connection string variable with %%sql command. %%sql $sqlservercon SELECT id AS EmpID, fullname as EmpName FROM "TESTDB"."TESTTABLE";
This is a screenshot from the Jupyter notebook with the above queries along with the output.
Similarly, we can define any number of connections and use these connection string variables with %sql or %%sql commands in order to execute the SQL queries against these servers. We can do a data analysis to compare the data from different environments or different servers also. This Jupyter notebook can be exported to PDF or any other format in order to be shared with other users.
Thanks for the reading. Please share your inputs in the comment section.