In this article, we will discuss two important concepts “Partitioning and Bucketing” in Hive. These are used to improve query performance and it is important to understand them so that you can apply them efficiently. So let’s start with Partitioning.
Partitioning in Hive
Partitioning is a technique which is used to enhance query performance in hive. It is done by restructuring data into sub directories. Let us understand this concept with an example.
Suppose we have a large file of 10 GB having geographical data for a customer. Now we want to extract a record for a particular country and for a particular employeId. In order to do so, It will perform a table scan to read all the rows and then pick only those records that satisfy the given predicate.
Now if we partition that table by country and run the query, it will not scan the entire table and will look into that particular country sub directory only. We can see the execution plan of the query to verify that it looked for only one filter predicate i.e. employeeId. it will directly look into sub directory /Country=’India’ and search for employee in that sub directory. This technique is known as partition pruning. The query will filter out partitions it does not need to scan and is hence very efficient. Also, the columns that are used for partitioning are not a part of create statement but can be used in query. They are called virtual partition columns. Whenever we use these virtual partition columns as a filter, the queries are served much faster than the non partitioned tables especially if the data size is large.
Let’s have a look at how we can create hive internal (managed tables) and external partitioned table and load data into these tables.
For hive managed table:
Create portioned table:
Create table tbl_managed ( employeeId STRING, name STRING ) Partitioned By(country STRING) Stored as TextFile;
To load data into partitioned table:
Load Data Inpath '/mydata/Employee/India' into Table tbl_managed Partition(country='India')
For External Table:
Create portioned table:
Create External table tbl_external ( employeeid STRING, name STRING ) Partitioned By(country STRING) Stored as TextFile;
Add data into external table:
Alter Table tbl_external Add Partition(country='India') Location '/mydata/Employee/India';
Things to keep in mind while partitioning:
- The column you are using for partitioning should have low cardinality i.e. low distinct values for that column. The reason being that with high cardinality we will end up with many sub directories or files. Since the number of mappers is dependent on input size and block size, creating many partitions would end up using many mappers and it would lead to wastage of resources in most cases.
- One more thing that would happen with many partitions is that the Name Node which keeps track of meta data of file system in memory will also have unnecessary overhead since it has to track many partitions now.
- When we create partition on external tables, location is optional. But we should always provide the location (like root/a/b) as it can be used to sync with hive metastore later on. So if you had provided the location and then added subdirectories like root/a/b/country=’India’ and when we run command, MSCK Repair Table Tablename. It would automatically add this partition.
So we can say that partitioning is useful when:
- We have limited number of partitions
- All partitions are equally distributed
Bucketing in Hive
When we do not get query improvement with partitioning because of unequal partitions or many number of partitions, we can try bucketing. Bucketing concept is based on hashing function on bucketed column. The records which generate same hash will always be in the same bucket.
To divide a table into buckets we use Clustered by clause. Each bucket is just like a file in directory and all files are equally distributed.
The advantages of bucketing are:
- Map-Side joins are faster on bucketed tables because they are of similar size.
- We can keep the records sorted in each bucket
- When the data is sorted by a column in buckets and they are used for join, Map-side joins are even more faster
- They also offer efficient sampling over non-bucketed tables
- With bucketing we can always define number of buckets to be formed which is not the case in partitioning
- Bucketing can be used with or without partitioning
How to create a bucketed table:
Create Table t1(a INT,b STRING,c STRING) CLUSTERED BY (b) BUCKETS 128 BUCKETS
Hive does not enforce bucketing on data loaded into table. We have to manage that after we have created the table. There are two ways to achieve that:
- Set mapred.reduce.tasks=64(number of buckets)
Insert Overwrite Table t1
Select a,b,c from table2 cluster by b
- Set hive.enforce.bucketing=true;
Insert Overwrite Table t1
Select a,b,c from table2
This will automatically figure out the number of buckets and bucketed column from Create Table statement.
Thanks for reading. Please share your inputs in comments.