data analysis

Fill null with the next not null value – Spark Dataframe

In the previous post, we discussed how to fill a null value with the previous non-null value in a Spark DataFrame. We also discussed how to extract the non-null values per group from a Spark DataFrame. In this post, we will learn how to fill a null value with the next available non-null value […]
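
As a rough illustration of the idea named in this excerpt, the sketch below fills each null with the next non-null value in a plain Python list. It is not the Spark DataFrame code from the post: the function name `fill_with_next` and the sample values are hypothetical, and an ordinary list stands in for an ordered column.

```python
from typing import List, Optional


def fill_with_next(values: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each None with the next non-None value (a backward fill)."""
    filled: List[Optional[int]] = []
    next_seen: Optional[int] = None
    # Walk from the end so the latest non-null value seen is the "next" one.
    for value in reversed(values):
        if value is not None:
            next_seen = value
        filled.append(next_seen)
    filled.reverse()
    return filled


print(fill_with_next([None, 3, None, None, 7, None]))
# [3, 3, 7, 7, 7, None]
```

A trailing null stays null because no later value exists to fill it with.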

Fill null with the previous not null value – Spark Dataframe

In the previous post, we discussed how to extract the non-null values per group from a Spark DataFrame. In this post, we will learn how to fill null values with the previous non-null value in a Spark DataFrame using the forward-fill method. To demonstrate this with an example, we will […]

Create jar in IntelliJ IDEA for sbt-based Scala + Spark project

Like the Maven build tool, sbt is a tool for managing the project development lifecycle. It helps us build, test, and package Scala- and Java-based projects into a .jar file. This jar file can be used as a package in another application/project, or it can be simply used […]

Create jar in IntelliJ IDEA for Maven-based Scala + Spark project

In this post, we will learn how to create a jar in IntelliJ IDEA for a Maven-based Scala + Spark project. We will use the Maven build tool to create the jar file from the sample Scala project. Maven is a project management tool that can be used to manage […]

Create scala sbt project using IntelliJ IDEA – Step by step

In the previous post, we discussed how to set up a Maven-based Scala project. In this post, we will learn how to create an sbt-based Scala project using IntelliJ IDEA. Like Maven and Ant, sbt is an open-source build tool for Scala and Java projects. If you need to install IntelliJ […]

Create scala maven project using IntelliJ IDEA – Step by step

In this post, we will learn how to create a Maven-based Scala project from scratch using IntelliJ IDEA. Apache Spark is an open-source, unified, general-purpose big data processing framework written in the Scala programming language. It is a multi-language data processing engine that supports SQL, Java, Python, R, and Scala. However, most of […]

Introduction to k-fold Cross-Validation in Python

This post briefly explains how to use k-fold cross-validation to evaluate the performance of a machine learning model using the Scikit-learn library in Python. The performance of a machine learning model depends on the training dataset. If the training dataset has a peculiarity, the model created with that dataset will not work […]
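
To make the splitting idea concrete, here is a minimal sketch of how k-fold cross-validation partitions sample indices into train and test sets. It uses plain Python rather than the Scikit-learn code from the post, and the helper name `k_fold_indices` is hypothetical. It assumes contiguous folds without shuffling, with any leftover samples going to the earliest folds.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first `extra` folds receive one additional sample each.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample lands in the test set exactly once, so a model trained and scored on every fold is evaluated against the whole dataset.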

Create pair plots using scatter_matrix method in pandas

Exploratory data analysis is a very important step in a data science project. It helps us visualize the data and identify hidden trends that might not be visible from summary statistics alone. We can use the matplotlib and seaborn libraries to create visuals in Python. However, the pandas.plotting module of the […]

Plot ECDF in Python

EDA (Exploratory Data Analysis) is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. We have already discussed various methods of performing EDA on a dataset, along with their pros and cons. An ECDF plot is another visual method of performing […]
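
For readers new to the term, the sketch below computes the points of an ECDF (empirical cumulative distribution function) in plain Python. It assumes the common definition where the i-th smallest of n values is paired with the proportion i / n. The function name `ecdf` and the sample data are hypothetical, and plotting is left out.

```python
from typing import List, Sequence, Tuple


def ecdf(data: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sorted values x and cumulative proportions y, y[i] = (i + 1) / n."""
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    x = sorted(data)
    y = [(i + 1) / n for i in range(n)]
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 4.0])
print(x)  # [1.0, 2.0, 3.0, 4.0]
print(y)  # [0.25, 0.5, 0.75, 1.0]
```

Plotting `y` against `x` as points or steps gives the ECDF plot, where each y value is the share of observations less than or equal to the matching x value.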

Interactive Data Analysis with HANA using Jupyter Notebook/Jupyter Lab

We have already discussed how to use Jupyter Notebook/Jupyter Lab for interactive data analysis with SQL Server. Jupyter Notebook is a very powerful and useful tool for any data analyst or data scientist. Jupyter Lab is the next-generation interface for Jupyter Notebooks. It provides an interface where we can […]
