In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values. Generally, Hive users know about the domain of the data that they deal with. With this knowledge they can identify common columns that are frequently queried in order to identify columns with low cardinality which can be used to organize data using the partitioning feature of Hive. In non-partitioned tables, Hive would have to read all the files in a table’s data directory and subsequently apply filters on it. This is slow and expensive—especially in cases of large tables.
The concept of partitioning is not new for folks who are familiar with relational databases. Partitions are essentially horizontal slices of data which allow larger sets of data to be separated into more manageable chunks. In Hive, partitioning is supported for both managed and external tables in the table definition as seen below.
As you can see, multi-column partition is supported (REGION/COUNTRY). You do not need to include the partition columns in the table definition and you can still use them in your query projections. The partition statement lets Hive alter the way it manages the underlying structures of the table’s data directory. If you browse the location of the data directory for a non-partitioned table, it will look like this: <DATABASE_NAME>.db/<TABLE_NAME>. All the data files are directly written to this directory. In case of partitioned tables, subdirectories are created under the table’s data directory for each unique value of a partition column. In case the table is partitioned on multiple columns, then Hive creates nested subdirectories based on the order of partition columns in the table definition. For instance, from the above example of the registration data table the subdirectories will look like the example below.
When a partitioned table is queried with one or both partition columns in criteria or in the “where” clause, what Hive effectively does is partition elimination by scanning only those data directories that are needed. If no partitioned columns are used, then all the directories are scanned (full table scan) and partitioning will not have any effect.
A few things to keep in mind when using partitioning:
- It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data and put strain on the name node to manage all the underlying structures in HDFS.
- Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
- Partitioning columns should be selected such that it results in roughly similar size partitions in order to prevent a single long running thread from holding up things.
- If hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic.
- If your partitioned table is very large, you could block any full table scan queries by putting Hive into strict mode using the below command. In this mode, when users submit a query that would result in a full table scan (i.e. queries without any partitioned columns) an error is issued.
In the second part of this blog, I will talk about utilizing the bucketing feature in Hive to improve query performance. Stay tuned.View all Blog Posts