2021

Create Scala Maven project using IntelliJ IDEA – Step by step

In this post, we will learn how to create a Maven-based Scala project from scratch using IntelliJ IDEA. Apache Spark is an open-source, unified, general-purpose big data processing framework written in the Scala programming language. It is a multi-language data processing engine that supports SQL, Java, Python, R, and Scala. However, most of […]


Get HDFS file location of Hive table records as column

In this post, we will learn how to extract the physical HDFS file path of each Hive table record as a column, alongside the other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. We can create Hive tables as internal or external tables. So, if we create an


Read and write data into Hive table from Spark using PySpark

In this post, we will learn how to read data from a Hive table into a Spark dataframe and write it back. Once the Hive table data has been read into a dataframe, we can apply Spark transformations to it. Finally, we can write the data back to the Hive table. We


Hyperparameter tuning using GridSearchCV and RandomizedSearchCV in Python

In the previous post, we briefly discussed GridSearchCV and RandomizedSearchCV. In this post, we will demonstrate how to use the GridSearchCV and RandomizedSearchCV methods available in the scikit-learn library for hyperparameter tuning in Python. We will use the built-in sklearn diabetes dataset in this demo. However, if


An introduction to GridSearchCV and RandomizedSearchCV

In the previous post, we discussed how to assess the performance of a machine learning model using k-fold cross-validation. In this post, we will discuss how to leverage the GridSearchCV and RandomizedSearchCV methods to find the optimal hyperparameter values. The hyperparameter value is the value that is required before

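To make the difference between the two search strategies concrete, here is a minimal sketch in plain Python rather than scikit-learn. The parameter names (`alpha`, `max_iter`), their values, and the `toy_score` function are hypothetical stand-ins for a real model and its cross-validated score.

```python
import itertools
import random

# Hypothetical search space: every name and value here is an assumption.
param_grid = {"alpha": [0.01, 0.1, 1.0], "max_iter": [100, 500]}


def toy_score(params):
    """Hypothetical stand-in for a cross-validated model score (higher is better)."""
    return -abs(params["alpha"] - 0.1) - abs(params["max_iter"] - 500) / 1000


# Grid search idea: evaluate every combination of the listed values.
names = list(param_grid)
grid_candidates = [
    dict(zip(names, combo)) for combo in itertools.product(*param_grid.values())
]
best_grid = max(grid_candidates, key=toy_score)

# Randomized search idea: evaluate only a fixed number of sampled candidates.
rng = random.Random(0)
random_candidates = rng.sample(grid_candidates, k=3)
best_random = max(random_candidates, key=toy_score)

print("grid search tried", len(grid_candidates), "candidates, best:", best_grid)
print("random search tried", len(random_candidates), "candidates, best:", best_random)
```

The exhaustive search always finds the best candidate in the grid, while the sampled search trades that guarantee for fewer evaluations.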

Introduction to k-fold Cross-Validation in Python

This post briefly explains how we can use k-fold cross-validation to evaluate the performance of a machine learning model using the scikit-learn library in Python. We know that the performance of a machine learning model depends on the training dataset. Also, if the training dataset has a peculiarity, the model created with that dataset will not work

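As a rough illustration of the idea, the sketch below splits sample indices into k consecutive folds in plain Python, so that each fold serves as the test set exactly once. It is not the scikit-learn implementation, and the function name `k_fold_indices` is an assumption made for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k consecutive folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Split 10 samples into 3 folds and show which indices are held out each time.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained on each `train` split and scored on the matching `test` split, and the k scores would then be averaged.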

Get minimum value from multiple columns in SQL Server

This post will discuss how we can extract the minimum value from multiple columns in SQL Server. For example, suppose we have a table that stores the temperature of multiple cities, with each city's temperature data in a separate column. However, we have to select the minimum temperature value throughout all

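The problem itself can be sketched with Python's built-in `sqlite3` module. Note that this uses SQLite, not SQL Server: SQLite's multi-argument `min()` is only a stand-in to show the expected result, and the table name `temps`, its columns, and the sample values are all made up for this example.

```python
import sqlite3

# In-memory SQLite database with hypothetical temperature data, one column per city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temps (day TEXT, city_a REAL, city_b REAL, city_c REAL)")
conn.executemany(
    "INSERT INTO temps VALUES (?, ?, ?, ?)",
    [("Mon", 21.5, 19.0, 23.1), ("Tue", 18.2, 20.4, 17.9)],
)

# Minimum across the city columns for each row (SQLite scalar min with several arguments).
row_minimums = conn.execute(
    "SELECT day, min(city_a, city_b, city_c) FROM temps ORDER BY day"
).fetchall()

# Minimum across all city columns and all rows.
overall_minimum = conn.execute(
    "SELECT min(min(city_a, city_b, city_c)) FROM temps"
).fetchone()[0]

conn.close()
print(row_minimums)
print(overall_minimum)
```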

Show full column content in Spark

This post briefly explains how we can display the full contents of data frame columns in Apache Spark. By default, Spark truncates column values that are longer than 20 characters. However, sometimes we need to display the full values rather than the truncated data. Having truncated data might not be useful in


Spark read file with special characters using PySpark

Suppose we have a CSV file that contains some non-English characters (Spanish, Japanese, etc.) and we want to read this file into a Spark data frame. If we read this file without using the right character encoding, we will end up with some junk characters (like �) in the data frame. So, the files

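The effect of a wrong character encoding can be reproduced in a few lines of plain Python, without Spark. This is only a sketch of the underlying problem; the sample text is made up for the example. Spark's CSV reader exposes an `encoding` option for the same purpose.

```python
# Sample text with Spanish and Japanese characters (hypothetical data).
text = "Señor, 日本"

# Bytes as they would be stored in a UTF-8 encoded file.
raw = text.encode("utf-8")

# Decoding with the wrong encoding replaces every unknown byte with U+FFFD (�).
wrong = raw.decode("ascii", errors="replace")

# Decoding with the encoding the file was written in restores the text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```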

Read CSV file with Newline character in PySpark

Apache Spark is a big data cluster computing framework that can run on Standalone, Hadoop, Kubernetes, and Mesos clusters, or in the cloud. We can read and write data from various data sources using Spark. For example, we can use CSV (comma-separated values) and TSV (tab-separated values) files as an input source to a Spark application.

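To see why a newline inside a field is a problem, here is a minimal sketch using Python's standard `csv` module instead of Spark. The sample data is made up. Splitting the text on line breaks cuts the quoted field in two, while a CSV-aware parser keeps it as a single value. In Spark, the CSV reader's `multiLine` option addresses the same situation.

```python
import csv
import io

# Hypothetical CSV content: the second record has a newline inside a quoted field.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Naive line splitting treats the embedded newline as a record boundary.
naive_lines = data.splitlines()

# A CSV-aware parser respects the quotes and keeps the field intact.
rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines), "lines when split naively")
print(len(rows), "rows when parsed as CSV")
print(rows[1])
```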