How does bucketing work at HIVE

Optimizing Apache Hive queries in Azure HDInsight

This article describes some of the most common performance tweaks that you can use to improve the performance of your Apache Hive queries.

Selection of the cluster type

In Azure HDInsight, you can apply Apache Hive queries to different types of clusters.

Choose the appropriate cluster type to optimize performance for your workload needs:

  • Choose e.g. B. the cluster type Interactive query for the optimization of interactive queries.
  • Choose the Apache HadoopCluster type to optimize for Hive batch processing queries.
  • The cluster types Spark and HBase can also run Hive queries and may be suitable if you are running these workloads.

For more information about running Hive queries on different types of HDInsight clusters, see What are Apache Hive and HiveQL in Azure HDInsight.

Scaling up the worker nodes

In an HDInsight cluster with more worker nodes available, more mappers and reducers can be run in parallel for work. In HDInsight, you can increase the upscaling in two ways:

  • When deploying a cluster, you can specify the number of worker nodes in the Azure portal, Azure PowerShell, or the command line interface. For more information, see Creating HDInsight Clusters. The following screenshot shows the configuration of the worker nodes in the Azure portal:

  • Once created, you can also edit the number of worker nodes to further scale up a cluster without repeating the creation:

For more information about scaling HDInsight, see Scaling HDInsight Clusters.

Using Apache Tez instead of MapReduce

Apache Tez is an alternative to the MapReduce execution engine. Linux-based HDInsight clusters have Tez enabled by default.

However, Tez is faster for the following reasons:

  • Execution of a directed acyclic graph as a single job in the MapReduce engine. The directed acyclic graph requires that each group of mappers be followed by a group of reducers. Because of this requirement, several MapReduce jobs are started for each Hive query. This restriction does not apply to Tez. Tez can also process a complex directed acyclic graph in one job, so that fewer jobs have to be started.
  • Avoid unnecessary writing. Several jobs are used to process the Hive query in the MapReduce engine. The output of the individual MapReduce jobs is written to the HDFS buffer. Tez, on the other hand, minimizes the number of jobs for each Hive query and thus avoids unnecessary write operations.
  • Minimizing start delays. Tez minimizes startup delays by reducing the number of mappers required for startup and by improving overall optimization.
  • Reuse of containers. Tez tries to reuse containers as much as possible, thereby reducing latency due to container starts.
  • Continuous Optimization Techniques. So far, optimization has usually been done in the compilation phase. Now, however, more information is available about the inputs, which also enables optimization during runtime. Tez uses ongoing optimization techniques that allow the plan to be optimized well into the runtime phase.

For more information on these concepts, see Apache Tez.

You can make any Hive query Tez compatible by prefixing the query with the following set command:

Partitioning in Hive

I / O operations are the main performance bottleneck when executing Hive queries. You can improve performance by reducing the amount of data to be read. By default, Hive queries search entire Hive tables. However, for queries that only have to search through an excerpt of the data (e.g. for queries with filters), this means unnecessary additional work. Partitioning in Hive limits access to Hive queries to the required subsets of Hive tables.

Hive partitioning is implemented by reorganizing the raw data into new directories. Each partition has its own directory. The partitioning is defined by the user. The following diagram illustrates the partitioning of a Hive table by column year. A new directory is created for each year.

Partitioning considerations:

  • Partition generously: If you partition on columns with only a few values, you will get only a few partitions. For example, if you partition according to gender, you will only get two partitions (male, female), so that the latency is halved at most.
  • But not too generous: In the other extreme case, multiple partitions are created when a partition is created using a column with a unique value (e.g. userid). Too generous partitioning puts a lot of strain on the NameNode of the cluster, as it has to cope with a large number of directories.
  • Avoid partition sizes that are too different - Choose your partition key so that the partitions are roughly the same size. For example, partitioning by column State may skew the distribution of the data. Because the state of California has a population nearly 30 times that of Vermont, the partition size may be skewed and performance can vary significantly.

To create a partition table, use the clause Partitioned By:

You can choose static or dynamic partitioning for the created partition table.

  • Static partitioning means that horizontally partitioned data already exists in the corresponding directories. With static partitions, you manually add Hive partitions based on the location of the directory. The following snippet of code is an example.

  • Dynamic partitioning means that you want Hive to automatically create and adjust the partitions. Since you have already created the partition table based on the staging table, you now only have to fill the table with data:

For more information, see Partitioned Tables.

Using the ORC file format

Hive supports various file formats. Example:

  • text: The standard file format that works in most scenarios.
  • Avro: This file format is particularly suitable for interoperability scenarios.
  • ORC / Parquet: This file format is performance-oriented.

The ORC (Optimized Row Columnar) format is a very efficient storage method for Hive data. Compared to other formats, ORC has the following advantages:

  • Support for complex types, including DateTime, and complex and semi-structured types
  • Compression up to 70%
  • Indexes every 10,000 lines, which makes it possible to skip lines
  • Much faster run-time execution

To enable ORC, first create a table with the clause Stored as ORC:

Then add data from the staging table to the ORC table. Example:

For more information on the ORC format, see the Apache Hive Language Guide.

Vectorization

With vectorization, Hive can process batches of 1024 lines at a time instead of individual lines. Simple operations are faster because there is less internal code to run.

To activate vectorization, precede your Hive query with the following setting:

For more information, see Vectorized Query Execution.

Further optimization methods

There are other optimization methods that are worth considering, for example the following:

  • Hive bucketing: A method with which large amounts of data are summarized or segmented to optimize query performance.
  • Join optimization: Optimizing the Hive query execution plan to make joins more efficient. In addition, this should make user instructions largely unnecessary. For more information, see Join Optimization.
  • Increase reducer.

Next Steps

In this article, you learned about several common Hive methods for optimizing queries. For more information, see the following articles: