big data processing Archives - Page 2 of 3

Scala Option, Some, None – Exception and Null handling

Scala / Gopal Krishna Ranjan / Oct 5, 2022 / big data processing, Hadoop, scala

In the previous post, we discussed the Try, Success, Failure exception handling method. In this post, we will discuss Scala’s Option, Some, None pattern and how to use it. Scala is a high-level programming language that combines object-oriented and functional programming. It is a very powerful programming language that can be […]

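As a rough illustration of the Option, Some, None pattern named in the excerpt above, here is a minimal sketch. The object name `OptionSketch`, the `ports` map, and the helper functions are hypothetical and are not taken from the original post; only `Option`, `Some`, `None`, `Map#get`, and `getOrElse` come from the Scala standard library.

```scala
object OptionSketch {
  // Hypothetical lookup table, used only for illustration.
  private val ports: Map[String, Int] = Map("http" -> 80, "https" -> 443)

  // Map#get returns Option[Int]: Some(port) when the key exists, None otherwise.
  def portFor(scheme: String): Option[Int] = ports.get(scheme)

  // Option(x) wraps a possibly-null reference: a null argument becomes None.
  def safeLength(s: String): Option[Int] = Option(s).map(_.length)

  // Pattern matching forces both the "present" and "absent" cases to be handled.
  def describe(scheme: String): String = portFor(scheme) match {
    case Some(port) => s"$scheme uses port $port"
    case None       => s"no known port for $scheme"
  }

  def main(args: Array[String]): Unit = {
    println(describe("https"))            // https uses port 443
    println(describe("ftp"))              // no known port for ftp
    println(portFor("ftp").getOrElse(-1)) // -1
    println(safeLength(null))             // None
  }
}
```

The point of the sketch is that a missing value is represented in the return type, so the caller cannot forget the empty case the way a null check can be forgotten.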

Scala Try, Success, Failure – Functional error handling

Scala / Gopal Krishna Ranjan / Sep 26, 2022 / big data processing, Hadoop, scala

In this post, we will discuss Scala’s functional error handling method using Try, Success, Failure. Scala is a high-level programming language that combines object-oriented and functional programming. It runs on the JVM, so it can be mixed seamlessly with Java. Scala’s static types help to identify bugs at

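The Try, Success, Failure approach mentioned in the excerpt above can be sketched roughly as follows. The object name `TrySketch` and its helpers are hypothetical and are not the original post's code; `Try`, `Success`, and `Failure` are from `scala.util`.

```scala
import scala.util.{Failure, Success, Try}

object TrySketch {
  // Try(...) captures a thrown (non-fatal) exception as a Failure value
  // instead of letting it propagate.
  def parseInt(s: String): Try[Int] = Try(s.trim.toInt)

  // A for-comprehension chains the steps and stops at the first Failure.
  def divide(a: String, b: String): Try[Int] =
    for {
      x <- parseInt(a)
      y <- parseInt(b)
      q <- Try(x / y) // integer division by zero throws ArithmeticException
    } yield q

  // Pattern matching handles the success and failure cases explicitly.
  def describe(result: Try[Int]): String = result match {
    case Success(value) => s"ok: $value"
    case Failure(error) => s"failed: ${error.getClass.getSimpleName}"
  }

  def main(args: Array[String]): Unit = {
    println(describe(divide("10", "2"))) // ok: 5
    println(describe(divide("10", "0"))) // failed: ArithmeticException
    println(describe(divide("x", "2")))  // failed: NumberFormatException
  }
}
```

Compared with a try/catch block, the error here is an ordinary value that can be returned, chained, or recovered from later.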

Execute Scala file in Spark without creating a jar

Scala, Spark / Gopal Krishna Ranjan / Aug 17, 2022 / big data processing, Hadoop, scala

This post shows how to execute a Scala file in Spark without creating a jar file. A Scala source code file has the extension .scala. Usually, we need to package the source code into a jar file to execute an application written in Scala. We can create


Using Pandas on Spark

Python, Spark / Gopal Krishna Ranjan / Jul 31, 2022 / big data processing, pyspark, python

Pandas is one of the most popular Python libraries used by data scientists and data engineers for data wrangling and data analysis. Pandas provides DataFrames (a table-like structure that stores data in rows and columns) to deal with structured datasets. These DataFrames are very similar to Spark’s DataFrames. However, Pandas DataFrames are limited to a single


Create jar in IntelliJ IDEA for sbt-based Scala + Spark project

Scala, Spark / Gopal Krishna Ranjan / Mar 31, 2022 / big data processing, data analysis, scala, step by step

Like the Maven build tool, sbt can be used to manage the project development lifecycle. It helps us build, test, and package Scala- and Java-based projects into a .jar file. This jar file can be used as a package in another application/project, or it can be simply used


Create jar in IntelliJ IDEA for Maven-based Scala + Spark project

Scala, Spark / Gopal Krishna Ranjan / Feb 28, 2022 / big data processing, data analysis, scala, step by step

In this post, we will learn how to create a jar in IntelliJ IDEA for a Maven-based Scala + Spark project. We will use the Maven build tool to create the jar file from the sample Scala project. Maven is a project management tool that can be used to manage


Create scala sbt project using IntelliJ IDEA – Step by step

Scala, Spark / Gopal Krishna Ranjan / Jan 31, 2022 / big data processing, data analysis, scala, step by step

In the previous post, we discussed how to set up a Maven-based Scala project. In this post, we will learn how to create an sbt-based Scala project using IntelliJ IDEA. sbt is an open-source build tool for Scala and Java projects, similar to Maven and Ant. If you need to install IntelliJ


Create scala maven project using IntelliJ IDEA – Step by step

Scala, Spark / Gopal Krishna Ranjan / Dec 29, 2021 / big data processing, data analysis, scala, step by step

In this post, we will learn how to create a Maven-based Scala project from scratch using IntelliJ IDEA. Spark is an open-source, unified, general-purpose big data processing framework written in the Scala programming language. Apache Spark is a multi-language data processing engine that supports SQL, Java, Python, R, and Scala. However, most of


Get HDFS file location of Hive table records as column

Hive, Spark / Gopal Krishna Ranjan / Nov 30, 2021 / big data processing, Hadoop, HiveQL, pyspark, python, scala

In this post, we will learn how to extract the physical HDFS file path of a Hive table’s records as a column, alongside the table’s other columns. We will demonstrate this using HiveQL, PySpark, and Scala. Hive tables can be created as internal or external tables. So, if we create an


Read and write data into Hive table from Spark using PySpark

Hive, Spark / Gopal Krishna Ranjan / Oct 31, 2021 / big data processing, Hadoop, HiveQL, pyspark

In this post, we will learn how to read data from a Hive table into a Spark dataframe and write it back. Once the Hive table data has been read into a dataframe, we can apply Spark transformations to it. Finally, we can write the data back to the Hive table. We
