Understanding Map join in Hive

Apache Hive is a big data query language which is used to read, transform and write large datasets in a distributed environment. It has a SQL like syntax which gets translated into a MapReduce job in order to execute on Hadoop clusters. In Hadoop ecosystem, we use Hive for batch processing to extract, transform and load the data into a data warehouse system or in a file system which can be HDFS, Amazon S3, Azure Blob or Azure DataLake. However, Hive is not meant for OLTP tasks as it has high latency. In this post, we are going to learn Map Join which can be used to improve the performance of a hive query. We will also discuss the parameters required in order to enable the Map join along with its limitations.

What is Map join in Hive

Join clause in hive is used to combine records from two tables based on the given join condition. The default join type in hive is Common join which is also known as Shuffle join or Distributed join or Sort Merge join. The common join is performed during Shuffle phase (Reduce side). Mappers read the tables and generate the intermediate file with join key-value pair. During the shuffle phase, the data gets sorted and merged based on the keys so that the same key will be sent to the same reducer. In reduce phase, the reducer performs the join operation on the sorted input data. The shuffle and reduce phases are resource intensive operations which eventually degrade the query performance. The shuffle and reduce phases can be skipped in a MapReduce task if the join operation can be performed in the Map phase.

The Map join is performed by loading the smaller table into memory and matching the join keys with the larger table to perform the join operation on each mapper. Map join is also known as broadcast join or Map side join. Before starting the original MapReduce task, a Map join starts a local task which performs below steps: 

Local task:

  1. Read smaller table from HDFS source to build an in-memory hashtable.
  2. Serialize the in-memory hashtable into a hashtable file.
  3. Add hashtable file to Hadoop distributed cache to populate this file to each mapper’s local disk.

Next, it performs below steps in the map phase.

Map phase:

  1. Read hashtable file into memory
  2. Match and combine records based on the join key and write it to the output

In a Map join no reduce task is required that is why it improves the query performance significantly in comparison of a common join/reduce join.

Map join in hive
Map join in hive

Map join can be very helpful in joining a fact table with a dimension table as dimension tables are significantly smaller in compared to the fact tables.

Parameters required to enable Map join

Below are the parameters which need to be enabled in order to facilitate map side join in hive.

hive.auto.convert.join

We need to set this option true to convert the join into a map join automatically if the table size is smaller than the defined size of the parameter hive.mapjoin.smalltable.filesize (25MB).

hive.auto.convert.join.noconditionaltask

In case of three or more tables, hive generates three or more map side joins. To combine all these map joins in a single task, we need to set this option as true. If the combined size of n-1 tables is less than the defined size of the parameter hive.auto.convert.join.noconditionaltask.size (10MB), it will convert the join into a map join automatically.

Types of Map join

  1. Plain map join – Smaller table fits into memory to enable map join.
  2. Bucket map join – Both tables are bucketed tables and the number of buckets in one table is a multiple of number of buckets in another table.
  3. Sort merge bucket join – Both tables are bucketed and sorted. The number of buckets in one table is a multiple of number of buckets in another table.

Syntax for specifying Map Join

Below is the syntax to specify map join using query hint in hive.

SELECT /*+ MAPJOIN(Product)*/ Product.*, Sales.* 
FROM Sales 
INNER JOIN Product ON Sales.ProductId = Product.ProductId;

/*+ MAPJOIN(Product)*/ is used to specify a map side join in the above query.

Limitations of Map join

Below are the limitations of map join in hive:

  1. Full outer join is not supported since both the tables need to be streamed to perform a full outer join.
  2. A left join can only be converted into a map join if the right table is small enough to fit into memory.
  3. A right join can only be converted into a map join if the left table is small enough to fit into memory.

Thanks for the reading. Please share your inputs in comment.

Rate This
[Total: 3 Average: 5]

Leave a Comment

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.

This site uses Akismet to reduce spam. Learn how your comment data is processed.