Apache Spark is a general-purpose distributed computing engine used for big data processing, both batch and stream. It provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX for interacting with Spark's core functionality. Spark also provides several core data abstractions on top of distributed collections of data: RDDs, DataFrames, and DataSets. In this post, we are going to discuss these core data abstractions.
Spark Data Abstraction
The data abstractions in Spark represent logical structures over the underlying data distributed across different nodes of the cluster. The data abstraction APIs provide a wide range of transformation methods (like map(), filter(), etc.) which are used to perform computations in a distributed way. However, these transformations do not execute until we call an action method like show() or collect().
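As a quick illustration, here is a minimal sketch of the transformation/action pattern, assuming a local SparkSession (the app name and data are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("TransformationsVsActions") // placeholder app name
  .master("local[*]")                  // local mode, for illustration only
  .getOrCreate()

// Transformations only build a logical plan; nothing runs yet.
val evens = spark.range(1, 1000)        // distributed range of longs
  .filter(col("id") % 2 === 0)          // keep even ids
  .withColumn("doubled", col("id") * 2)

// Only an action such as show() or collect() triggers execution.
evens.show(5)
```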
Let’s now have a deeper look at each of Spark's core data abstractions.
Spark RDD (Resilient Distributed Dataset)
Spark RDD stands for Resilient Distributed Dataset. It is the core data abstraction API and has been available since the very first release of Spark (Spark 1.0). RDD is a lower-level API for manipulating distributed collections of data. The RDD API exposes some extremely useful methods that give very tight control over the underlying physical data layout. An RDD is an immutable (read-only) collection of partitioned data distributed across different machines. RDDs enable in-memory computation on large clusters, speeding up big data processing in a fault-tolerant manner.
To enable fault tolerance, an RDD tracks its lineage as a DAG (Directed Acyclic Graph) consisting of a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations applied to them. The transformations defined on an RDD are lazy and execute only when an action is called. Let’s have a quick look at the features and limitations of RDD (a short sketch follows the lists below):
RDD Features:
- Immutable collection: RDD is an immutable, partitioned collection distributed across different nodes. A partition is the basic unit of parallelism in Spark, and immutability helps achieve fault tolerance and consistency.
- Distributed data: RDD is a collection of distributed data, which helps in big data processing by spreading the workload across different nodes in the cluster.
- Lazy evaluation: The defined transformations do not get evaluated until an action is called. This lets Spark optimize the overall chain of transformations in one go.
- Fault tolerant: A lost partition can be recomputed after any failure using the DAG (Directed Acyclic Graph) of transformations defined for that RDD.
- Multi-language support: The RDD API supports the Python, R, Scala, and Java programming languages.
RDD Limitations:
- No optimization engine: RDD does not have a built-in optimization engine. Programmers need to hand-tune their own code to minimize memory usage and improve execution performance.
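To make the RDD concepts above concrete, here is a minimal sketch, assuming local mode; the data and partition count are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RddExample") // placeholder app name
  .setMaster("local[*]")    // local mode, for illustration only
val sc = new SparkContext(conf)

// Parallelize a local collection into an RDD with 4 partitions.
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy; these lines only record lineage.
val squaredEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

// toDebugString prints the lineage (the DAG) that Spark would use
// to recompute lost partitions after a failure.
println(squaredEvens.toDebugString)

// The action triggers actual distributed execution.
println(squaredEvens.reduce(_ + _))
```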
Spark DataFrame
Spark 1.3 introduced a new data abstraction API – the DataFrame. The DataFrame API organizes data into named columns, like a table in a relational database, and enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is an object of type Row. Like a SQL table, every column in a DataFrame has the same number of rows. In short, a DataFrame is a lazily evaluated plan that specifies the operations to be performed on the distributed collection of data. Like RDD, DataFrame is also an immutable collection. Let’s look at a quick example first, followed by the features and limitations of DataFrame.
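Here is a minimal sketch of building a DataFrame from a local collection, assuming local mode; the column names and data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameExample") // placeholder app name
  .master("local[*]")
  .getOrCreate()

// Needed for the toDF() conversion on local collections.
import spark.implicits._

// Data is organized into named columns, like a relational table.
val df = Seq(
  ("Alice", 30),
  ("Bob", 25)
).toDF("name", "age")

df.printSchema() // prints column names and their inferred types
df.show()        // action: triggers execution and displays the rows
```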
DataFrame Features:
- In-built optimization: DataFrame uses the Catalyst optimizer, which improves data processing performance significantly. When an action is called on a DataFrame, Catalyst analyzes the code, resolves the references, and creates a logical plan. The logical plan is then translated into an optimized physical plan, which finally gets executed on the cluster (see the sketch after this list).
- Hive compatible: DataFrame is fully compatible with the Hive Query Language (HiveQL). Using Spark SQL, we can access Hive data, queries, UDFs, etc. from the Hive metastore and execute queries against those Hive databases.
- Structured and semi-structured data support: The DataFrame API supports manipulation of all kinds of data, from semi-structured data files to highly structured Parquet files.
- Multi-language support: DataFrame APIs are available in Python, R, Scala, and Java.
- Schema support: We can define a schema manually, or read one from a data source; the schema specifies the column names and their data types (see the sketch after this list).
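The schema support and Catalyst optimization described above can be seen together in a short sketch: we define a schema manually with StructType and then ask Spark to print the plans Catalyst builds. The file path people.csv is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("SchemaAndCatalyst") // placeholder app name
  .master("local[*]")
  .getOrCreate()

// Define the column names and data types manually instead of
// relying on schema inference.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// "people.csv" is a hypothetical input file.
val people = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("people.csv")

// explain(true) prints the plans Catalyst produces: the parsed and
// analyzed logical plans, the optimized logical plan, and the
// physical plan that will run on the cluster.
people.filter("age >= 18").explain(true)
```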
DataFrame Limitations:
- No compile-time type safety: Each row in a DataFrame is an object of the generic type Row, so columns are not strongly typed. As a result, DataFrame does not provide compile-time type safety.
Spark DataSet
As an extension of the DataFrame API, Spark 1.6 introduced the DataSet API, which provides a strongly typed, object-oriented programming interface in Spark. A DataSet is an immutable, type-safe collection of distributed data. Like DataFrame, the DataSet API uses the Catalyst optimizer to enable execution optimization. Let’s look at a quick example, followed by the features and limitations of DataSet.
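Here is a minimal sketch of a typed DataSet, assuming local mode; the Person case class and its fields are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// The case class defines both the schema and the compile-time type
// of each record in the DataSet.
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("DatasetExample") // placeholder app name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Create a strongly typed DataSet from a local collection.
val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// Functional-style transformation: p is a Person, so p.age is checked
// at compile time (a typo like p.agee would not compile).
val adults = people.filter(p => p.age >= 18)

adults.show()
```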
DataSet Features:
- Combination of RDD and DataFrame: DataSet enables functional programming like the RDD API, along with relational queries and execution optimization like the DataFrame API. Thus, it provides the best of both worlds – RDDs and DataFrames.
- Type-safe: Unlike DataFrame, the DataSet API provides compile-time type safety. It enforces the schema at compile time using defined case classes (for Scala) or Java beans (for Java).
DataSet Limitations:
- Limited language support: DataSet is available only to JVM-based languages like Java and Scala. Python and R do not support DataSet because they are dynamically typed languages.
- High garbage collection: Working with JVM objects can incur high garbage collection and object instantiation costs.
RDD vs DataFrame vs DataSet
The table below gives a quick comparison between RDD, DataFrame, and DataSet:

| | RDD | DataFrame | DataSet |
| --- | --- | --- | --- |
| Introduced in | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Optimization engine | None (manual tuning) | Catalyst | Catalyst |
| Compile-time type safety | Yes | No | Yes |
| Language support | Python, R, Scala, Java | Python, R, Scala, Java | Scala, Java |
Thanks for reading. Please share your inputs in the comments section.