Spark

Create jar in IntelliJ IDEA for sbt-based Scala + Spark project

Just like the Maven build tool, sbt is another tool that can be used to manage the project development lifecycle. It helps us to build, test, and package the Scala and Java-based projects into a .jar file. This jar file can be used as a package in another application/project, or it can be simply used […]

Create jar in IntelliJ IDEA for sbt-based Scala + Spark project Read More »

Create jar in IntelliJ IDEA for Maven-based Scala + Spark project

In this post, we will learn how we can create a jar in IntelliJ IDEA for a Maven-based Scala + Spark project. We will use the maven build tool to create the jar file from the sample Scala project. We know that the Maven is a project management tool that can be used to manage

Create jar in IntelliJ IDEA for Maven-based Scala + Spark project Read More »

Create scala sbt project using IntelliJ IDEA – Step by step

In the previous post, we discussed how to set up a maven-based Scala project. Now, in this post, we will learn how we can create an sbt-based Scala project using IntelliJ IDEA IDE. The sbt is an open-source build tool for Scala and Java projects like Maven and Ant. If you need to install IntelliJ

Create scala sbt project using IntelliJ IDEA – Step by step Read More »

Create scala maven project using IntelliJ IDEA – Step by step

In this post, we will learn how to create a Maven-based Scala project using IntelliJ IDEA from scratch. Spark is an open-source unified general-purpose Big Data Processing Framework that is written in Scala programming language. Apache Spark is a multi-language data processing engine that supports SQL, Java, Python, R, and Scala languages. However, most of

Create scala maven project using IntelliJ IDEA – Step by step Read More »

Get HDFS file location of Hive table records as column

In this post, we will learn how we can extract the physical HDFS file location path of the Hive table as a column along with other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. We can create the Hive tables as internal or external tables. So, if we create an

Get HDFS file location of Hive table records as column Read More »

Read and write data into Hive table from Spark using PySpark

In this post, we will learn how we can read and write the data to a Hive table from a Spark dataframe. Once we have the Hive table data being read into a dataframe, we can apply Spark transformations on that data. Finally, we can write back the data to the the Hive table. We

Read and write data into Hive table from Spark using PySpark Read More »

Show full column content in Spark

This post briefs how we can display the full contents of data frame columns in Apache Spark. The default behavior of Spark truncates the column values if it is more than 20 characters. However, sometimes we need to display the full values rather than the truncated data. Having truncated data might not be useful in

Show full column content in Spark Read More »

Spark read file with special characters using PySpark

Suppose, we have a CSV file that contains some non-English characters (Spanish, Japanese, and etc.) and we want to read this file into a Spark data frame. If we read this file without using the right character encoding, we will end up with some junk characters (like �) in the data frame. So, the files

Spark read file with special characters using PySpark Read More »

Read CSV file with Newline character in PySpark

Apache Spark is a Big Data cluster computing framework that can run on Standalone, Hadoop, Kubernetes, Mesos clusters, or in the cloud. We can read and write data from various data sources using Spark. For example, we can use CSV (comma-separated values), and TSV (tab-separated values) files as an input source to a Spark application.

Read CSV file with Newline character in PySpark Read More »

Read and write data to SQL Server from Spark using pyspark

Apache Spark is a very powerful general-purpose distributed computing framework. It provides a different kind of data abstractions like RDDs, DataFrames, and DataSets on top of the distributed collection of the data. Spark is highly scalable Big data processing engine which can run on a single cluster to thousands of clusters. To follow this exercise,

Read and write data to SQL Server from Spark using pyspark Read More »