Sort By, Order By, Distribute By, and Cluster By in Hive

This post will briefly discuss the difference and similarity between Sort By, Order By, Distribute By, and Cluster By in hive queries. This is one of the most important questions being asked in Big data/Hadoop interviews. These Sort By, Order By, Distribute By, and Cluster By clauses are available in the hive query language and we can use them to distribute and order the output data in different ways. The SORT BY and ORDER BY clauses are used to define the order of the output data. However, DISTRIBUTE BY and CLUSTER BY clauses are used to distribute the data to multiple reducers based on the key columns. We can use Sort by or Order by or Distribute by or Cluster by clauses in a hive SELECT query to get the output data in the desired order.

We will use the below sample table data as an input dataset to understand the difference between these four hive clauses (SORT BY, ORDER BY, DSITRIBUTE BY, and CLUSTER BY).

SalesYearAmount
2017100
2018200
2019300
2020400
2017500
2018600
2019700
2020800
Sample input dataset – Table tbl_Sales

In the above dataset, table tbl_Sales contains two columns:

  1. SalesYear– Year of the sale as an Integer value in YYYY format.
  2. Amount – Total Sales Amount for the given year.

Now, let’s start using the Sort by, Order by, Distribute by, and Cluster by clauses on the above table data to understand the difference and similarity among these clauses.

SORT BY

The SORT by clause sorts the data per reducer. As a result, if we have N number of reducers, we will have N number of sorted files in the output. These files can have overlapping data ranges. Also, the output data is not globally sorted because the hive sorts the rows before feeding them to reducers based on the key columns used in the SORT BY clause. The syntax of the SORT BY clause is as below:

SELECT Col1, Col2,……ColN FROM TableName SORT BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, …. ColN <ASC | DESC>

SELECT SalesYear, Amount
FROM tbl_Sales 
SORT BY SalesYear;

Now, suppose we have two reducers to execute this query. So, in the final output, we will get two locally sorted files based on the SalesYear column values from each of the reducers. That means the files are sorted per reducer basis only, and not globally. Note that the final output data is not globally sorted and may have overlapping data ranges.

The output of the above query with 2 reducers will look like this:

How SORT BY works in hive

ORDER BY

ORDER BY clause orders the data globally. Because it ensures the global ordering of the data, all the data need to be passed from a single reducer only. As a result, the order by clause outputs one single file only. Bringing all the data on one single reducer can become a performance killer, especially if our output dataset is significantly large. So, we should always avoid the ORDER BY clause in the hive queries. However, if we need to enforce a global ordering of the data, and the output dataset is not that big, we can use this hive clause to order the final dataset globally.

The syntax of the ORDER BY clause in hive is as below:

SELECT Col1, Col2,……ColN FROM TableName ORDER BY Col1 <ASC | DESC>, Col2 <ASC | DESC>, …. ColN <ASC | DESC>

SELECT SalesYear, Amount
FROM tbl_Sales 
SORT BY SalesYear;

The output of the above query will look like this:

How ORDER BY works in hive

DISTRIBUTE BY

DISTRIBUTE BY clause is used to distribute the input rows among reducers. It ensures that all rows for the same key columns are going to the same reducer. So, if we need to partition the data on some key column, we can use the DISTRIBUTE BY clause in the hive queries. However, the DISTRIBUTE BY clause does not sort the data either at the reducer level or globally. Also, the same key values might not be placed next to each other in the output dataset.

As a result, the DISTRIBUTE BY clause may output N number of unsorted files where N is the number of reducers used in the query processing. But, the output files do not contain overlapping data ranges.

The syntax of the DISTRIBUTE BY clause in hive is as below:

SELECT Col1, Col2,……ColN FROM TableName DISTRIBUTE BY Col1, Col2, ….. ColN

SELECT SalesYear, Amount
FROM tbl_Sales 
DISTRIBUTE BY SalesYear;

The output of the above query is as below:

How DISTRIBUTE BY works in hive

CLUSTER BY

CLUSTER BY clause is a combination of DISTRIBUTE BY and SORT BY clauses together. That means the output of the CLUSTER BY clause is equivalent to the output of DISTRIBUTE BY + SORT BY clauses. The CLUSTER BY clause distributes the data based on the key column and then sorts the output data by putting the same key column values adjacent to each other. So, the output of the CLUSTER BY clause is sorted at the reducer level. As a result, we can get N number of sorted output files where N is the number of reducers used in the query processing. Also, the CLUSTER by clause ensures that we are getting non-overlapping data ranges into the final outputs. However, if the query is processed by only one reducer the output will be equivalent to the output of the ORDER BY clause.

The syntax of the CLUSTER BY clause is as below:

SELECT Col1, Col2,……ColN FROM TableName CLUSTER BY Col1, Col2, ….. ColN

SELECT SalesYear, Amount
FROM tbl_Sales 
CLUSTER BY SalesYear;

The output of the above query look like this:

How CLUSTER BY works in hive

Feature-wise comparison of Sort By, Order By, Distribute By, and Cluster By in hive queries

Comparison of SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

Thanks for the reading. Please share your inputs in the comment section of this post.

Rate This
[Total: 7 Average: 5]

Leave a Comment

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.

This site uses Akismet to reduce spam. Learn how your comment data is processed.