Big Data/Cloud

Create Scala sbt project using IntelliJ IDEA – Step by step

In the previous post, we discussed how to set up a Maven-based Scala project. Now, in this post, we will learn how to create an sbt-based Scala project using the IntelliJ IDEA IDE. sbt is an open-source build tool for Scala and Java projects, similar to Maven and Ant. If you need to install IntelliJ […]

Create Scala Maven project using IntelliJ IDEA – Step by step

In this post, we will learn how to create a Maven-based Scala project using IntelliJ IDEA from scratch. Spark is an open-source, unified, general-purpose Big Data processing framework written in the Scala programming language. Apache Spark is a multi-language data processing engine that supports SQL, Java, Python, R, and Scala. However, most of […]

Get HDFS file location of Hive table records as column

In this post, we will learn how to extract the physical HDFS file path of a Hive table's records as a column alongside the other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. We can create Hive tables as either internal or external tables. So, if we create an […]
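
As a rough PySpark sketch, the built-in input_file_name() function returns the HDFS file that each record was read from; the table name sales_db.orders used below is only a placeholder and is not taken from the original post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

# Spark session with Hive support so that Hive tables are visible to Spark
spark = (
    SparkSession.builder
    .appName("hdfs-file-location")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the Hive table and add the underlying HDFS file path as an extra column
# (sales_db.orders is a placeholder table name)
df = spark.table("sales_db.orders").withColumn("hdfs_file_path", input_file_name())

df.select("hdfs_file_path").distinct().show(truncate=False)
```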

Read and write data into Hive table from Spark using PySpark

In this post, we will learn how to read and write data in a Hive table from a Spark dataframe. Once the Hive table data has been read into a dataframe, we can apply Spark transformations on that data. Finally, we can write the data back to the Hive table. We […]
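
A minimal PySpark sketch of that flow might look like the following; the database, table, and column names are placeholders rather than the ones used in the post.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() is required for Spark to read from and write to Hive tables
spark = (
    SparkSession.builder
    .appName("hive-read-write")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a Hive table into a dataframe (placeholder table name)
orders_df = spark.sql("SELECT * FROM sales_db.orders")

# Apply an example transformation (placeholder column name)
high_value_df = orders_df.filter(orders_df.amount > 1000)

# Write the transformed data back to a Hive table
high_value_df.write.mode("overwrite").saveAsTable("sales_db.high_value_orders")
```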

Show full column content in Spark

This post describes how we can display the full contents of dataframe columns in Apache Spark. By default, Spark truncates column values that are longer than 20 characters. However, sometimes we need to display the full values rather than the truncated data. Having truncated data might not be useful in […]
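
For example, the truncate argument of show() controls this behaviour in PySpark; the sample dataframe below is made up purely for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("show-full-content").getOrCreate()

# A made-up dataframe with a value longer than 20 characters
df = spark.createDataFrame(
    [(1, "This description is definitely longer than twenty characters")],
    ["id", "description"],
)

df.show()                 # default: long values are cut off at 20 characters
df.show(truncate=False)   # print the full column content
df.show(truncate=40)      # or truncate at a custom width instead
```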

Spark read file with special characters using PySpark

Suppose we have a CSV file that contains some non-English characters (Spanish, Japanese, etc.) and we want to read this file into a Spark dataframe. If we read this file without using the right character encoding, we will end up with some junk characters (like �) in the dataframe. So, the files […]
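
A hedged sketch of reading such a file in PySpark is shown below; the file path and the ISO-8859-1 encoding are assumptions, and the encoding option should match whatever the file was actually saved with.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-encoded-csv").getOrCreate()

# The path and encoding are placeholders; pick the encoding the file was
# actually written with (Spark's default for CSV is UTF-8)
df = (
    spark.read
    .option("header", True)
    .option("encoding", "ISO-8859-1")
    .csv("/data/customers_latin1.csv")
)

df.show(truncate=False)
```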

Read CSV file with Newline character in PySpark

Apache Spark is a Big Data cluster computing framework that can run on Standalone, Hadoop, Kubernetes, or Mesos clusters, or in the cloud. We can read and write data from various data sources using Spark. For example, we can use CSV (comma-separated values) and TSV (tab-separated values) files as an input source to a Spark application.
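
When a quoted CSV field itself contains newline characters, the multiLine option (together with matching quote and escape characters) is one way to read it correctly in PySpark; the file path below is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-multiline-csv").getOrCreate()

# multiLine lets a quoted field span several lines; quote/escape tell Spark
# how those fields are delimited (the path is a placeholder)
df = (
    spark.read
    .option("header", True)
    .option("multiLine", True)
    .option("quote", '"')
    .option("escape", '"')
    .csv("/data/comments_with_newlines.csv")
)

df.show(truncate=False)
```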

Sort By, Order By, Distribute By, and Cluster By in Hive

This post briefly discusses the differences and similarities between Sort By, Order By, Distribute By, and Cluster By in Hive queries. This is one of the most frequently asked questions in Big Data/Hadoop interviews. The Sort By, Order By, Distribute By, and Cluster By clauses are available in the Hive query language and […]
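
As a rough illustration, the four clauses can be issued through spark.sql() against a Hive table; sales_db.orders and its columns are placeholders, not the table used in the post.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-sorting-clauses")
    .enableHiveSupport()
    .getOrCreate()
)

# ORDER BY: a total ordering of the whole result set (a single reducer in Hive)
spark.sql("SELECT id, amount FROM sales_db.orders ORDER BY amount DESC").show()

# SORT BY: rows are sorted within each reducer only, not globally
spark.sql("SELECT id, amount FROM sales_db.orders SORT BY amount DESC").show()

# DISTRIBUTE BY: controls which reducer each row is sent to, without sorting
spark.sql("SELECT id, amount FROM sales_db.orders DISTRIBUTE BY id").show()

# CLUSTER BY: shorthand for DISTRIBUTE BY + SORT BY on the same column(s)
spark.sql("SELECT id, amount FROM sales_db.orders CLUSTER BY id").show()
```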

Access git repository using SSH key in PyCharm on Windows and Mac machine

In this post, we are going to discuss how to set up Git Bash, SSH keys, and the PyCharm IDE to access a git repository using the command line on a Windows or Mac machine. First, we will set it up on a Windows machine, followed by a Mac machine. The setup process is very […]

Continuous Integration and Continuous Deployment (CI/CD) – SQL Server Database testing using tSQLt – Part 4

In the previous posts, we created a Continuous Integration and a Continuous Deployment pipeline for a SQL Server database using the Azure DevOps server. We also demonstrated how to set up a cross-database dependency for a SQL Server database project in the Azure DevOps pipeline. Below are the links in case you […]
