Big Data Processing

Reading Data from Cosmos DB in Databricks: A Comprehensive Guide

In today’s data-driven world, organizations leverage various data storage solutions to manage and analyze their data effectively. Cosmos DB, a globally distributed NoSQL database service from Microsoft Azure, is widely used for building highly scalable and responsive applications. In this blog post, we will explore how to read data from Cosmos DB in Databricks, a […]
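As a minimal sketch of what such a read can look like with the Azure Cosmos DB Spark connector (the `cosmos.oltp` format), assuming the connector library is already installed on the cluster; the endpoint, key, database, and container values below are placeholders, and the full post may take a different approach:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details; replace with your Cosmos DB account values.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account-name>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<account-key>",
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>",
}

# Read the container into a Spark dataframe using the Cosmos DB OLTP connector.
df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()
df.show(5)
```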


PySpark Dataframes: Adding a Column with a List of Values

PySpark is a tool that lets you work with large amounts of data in Python. It is part of Apache Spark, which is known for handling very large datasets. A common task when organizing data is adding a new piece of information to a table, which in the world of […]
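As an illustrative sketch (not necessarily the exact approach from the full post), one common way to attach a plain Python list as a new column is to index both the dataframe and the list and join on that index; the column names and sample values here are made up:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])
values = [10, 20, 30]  # one value per row, in the desired order

# Give every row a sequential index, build a small dataframe from the list,
# and join the two on that index.
w = Window.orderBy(F.monotonically_increasing_id())
df_indexed = df.withColumn("idx", F.row_number().over(w))
values_df = spark.createDataFrame([(i + 1, v) for i, v in enumerate(values)], ["idx", "value"])

result = df_indexed.join(values_df, on="idx").drop("idx")
result.show()
```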


Dynamically Create Spark DataFrame Schema from Pandas DataFrame

Apache Spark has become a powerful tool for processing large-scale data in a distributed environment. One of its key components is the Spark DataFrame, which offers a higher-level abstraction over distributed data and enables efficient manipulation of large datasets. When working within […]
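A hedged sketch of the general idea: map each pandas dtype kind to a Spark type and build a StructType from it. The helper name and the mapping below are illustrative, not the post's exact code:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType, LongType,
                               DoubleType, BooleanType, TimestampType)

# Illustrative mapping from pandas dtype kinds to Spark types;
# anything unrecognized falls back to StringType.
_type_map = {"i": LongType(), "f": DoubleType(), "b": BooleanType(),
             "M": TimestampType(), "O": StringType()}

def schema_from_pandas(pdf: pd.DataFrame) -> StructType:
    """Build a Spark schema dynamically from a pandas DataFrame's dtypes."""
    fields = [StructField(col, _type_map.get(pdf[col].dtype.kind, StringType()), True)
              for col in pdf.columns]
    return StructType(fields)

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7], "name": ["a", "b"]})
sdf = spark.createDataFrame(pdf, schema=schema_from_pandas(pdf))
sdf.printSchema()
```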


Optimize Spark dataframe write performance for JDBC

Apache Spark is a popular engine designed to handle large-scale data processing tasks. When it comes to writing data over JDBC, Spark provides a built-in JDBC connector that lets users write data to various relational databases easily. We can write a Spark dataframe to SQL Server, MySQL, Oracle, Postgres, and other relational databases.
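For illustration, a typical tuned write might look like the sketch below. The connection details are placeholders, and `batchsize` / `numPartitions` are the standard knobs for batching rows and capping concurrent connections; the full post may cover additional options:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")  # sample data

# Assumes the PostgreSQL JDBC driver is available on the Spark classpath;
# the connection details below are placeholders.
(
    df.repartition(8)                    # write parallelism; tune to what the database can absorb
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)          # rows sent per batch insert
    .option("numPartitions", 8)          # upper bound on concurrent JDBC connections
    .mode("append")
    .save()
)
```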


Create requirements.txt file in Python automatically

In this post, we will learn how to create a requirements.txt file for a Python project. The requirements.txt file contains the list of all the packages needed to execute the Python project. It is very helpful, especially during deployment. Using the requirements.txt file, we can automate the deployment of the project to a different […]
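The most common route is simply running `pip freeze > requirements.txt` from the command line; as a minimal Python sketch of the same idea (assuming pip is available on the PATH):

```python
import subprocess

# Capture everything installed in the current environment with pip freeze
# and write the pinned package list to requirements.txt.
frozen = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout

with open("requirements.txt", "w") as f:
    f.write(frozen)
```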


Difference between Hadoop 1.x, Hadoop 2.x and Hadoop 3.x

We know that Apache Hadoop is a framework that allows us to process very large datasets in a distributed way using commodity computers. That is why this framework is highly scalable and can scale up from a single machine to thousands of machines. Most importantly, Hadoop is open source and provides […]


Fill null with the next not null value – Spark Dataframe

In a previous post, we discussed how to fill a null value with the previous not-null value in a Spark dataframe. We have also discussed how to extract the non-null values per group from a Spark dataframe. Now, in this post, we will learn how to fill a null value with the next available not-null value […]
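As a hedged sketch of the usual window-function approach (the column names and ordering key are made up for the example), `first(..., ignorenulls=True)` over a frame running from the current row to the end of the partition returns the next available not-null value:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, None), (2, None), (3, "x"), (4, None), (5, "y")],
    ["id", "value"],
)

# Frame from the current row to the end of the partition, ordered by id:
# first(..., ignorenulls=True) then picks the next available not-null value.
w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
filled = df.withColumn("value_filled", F.first("value", ignorenulls=True).over(w))
filled.show()
```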


Fill null with the previous not null value – Spark Dataframe

In the previous post, we discussed how to extract the non-null values per group from a Spark dataframe. Now, in this post, we will learn how to fill the null values with the previous not-null value in a Spark dataframe using the forward-fill method. To demonstrate this with the help of an example, we will […]
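The mirror image of the previous sketch (again illustrative, with made-up column names): `last(..., ignorenulls=True)` over a frame from the start of the partition up to the current row carries the previous not-null value forward:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "x"), (2, None), (3, None), (4, "y"), (5, None)],
    ["id", "value"],
)

# Frame from the start of the partition up to the current row, ordered by id:
# last(..., ignorenulls=True) propagates the previous not-null value.
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
filled = df.withColumn("value_filled", F.last("value", ignorenulls=True).over(w))
filled.show()
```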


ERROR Utils: Aborting task java.io.IOException: Failed to connect to – Local Spark

In this post, we will discuss the error/warning message “java.io.IOException: Failed to connect to”. This error keeps appearing when we try to execute a Hive query from spark-shell using Spark SQL. It occurs when Spark tries to execute a task in local mode (pseudo-distributed mode) and is caused by a connection exception. […]
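One commonly suggested mitigation, offered here only as a hedged sketch and not necessarily the fix this post arrives at: in local setups the failure is often tied to hostname resolution, so pinning the driver to localhost when building the session can help:

```python
from pyspark.sql import SparkSession

# Pin the driver to localhost; in local/pseudo-distributed setups the
# "Failed to connect to" exception is frequently a hostname-resolution issue.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "localhost")
    .enableHiveSupport()   # the post runs Hive queries through Spark SQL
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
```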


Get the first non-null value per group Spark dataframe

Suppose we need to get the first non-null value from a dataframe for each partition. Certainly, we want to get only the first not-null value from each column regardless of the rows. That means a not-null value of column A from row 5 can be stitched with another not-null value of column B from […]
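A minimal sketch of how that per-group, per-column aggregation can look (group and column names are placeholders): `first(..., ignorenulls=True)` inside a `groupBy` picks each column's first not-null value independently, so the results for different columns may come from different rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", None, 10), ("g1", "a", None), ("g2", None, None), ("g2", "b", 20)],
    ["grp", "col_a", "col_b"],
)

# Each column is aggregated independently, so col_a and col_b for a group
# can be stitched together from different rows. Note that without an
# explicit sort, the row order within a group is not guaranteed.
result = df.groupBy("grp").agg(
    F.first("col_a", ignorenulls=True).alias("col_a"),
    F.first("col_b", ignorenulls=True).alias("col_b"),
)
result.show()
```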
