Breyta

Deila með


What is Apache Hive and HiveQL on Azure HDInsight?

Apache Hive is a data warehouse system for Apache Hadoop. Hive enables data summarization, querying, and analysis of data. Hive queries are written in HiveQL, which is a query language similar to SQL.

Hive allows you to project structure on largely unstructured data. After you define the structure, you can use HiveQL to query the data without knowledge of Java or MapReduce.

HDInsight provides several cluster types, which are tuned for specific workloads. The following cluster types are most often used for Hive queries:

Cluster type Description
Interactive Query A Hadoop cluster that provides Low Latency Analytical Processing (LLAP) functionality to improve response times for interactive queries. For more information, see the Start with Interactive Query in HDInsight document.
Hadoop A Hadoop cluster that is tuned for batch processing workloads. For more information, see the Start with Apache Hadoop in HDInsight document.
Spark Apache Spark has built-in functionality for working with Hive. For more information, see the Start with Apache Spark on HDInsight document.
HBase HiveQL can be used to query data stored in Apache HBase. For more information, see the Start with Apache HBase on HDInsight document.

How to use Hive

Use the following table to discover the different ways to use Hive with HDInsight:

Use this method if you want... ...interactive queries ...batch processing ...from this client operating system
HDInsight tools for Visual Studio Code Linux, Unix, macOS X, or Windows
HDInsight tools for Visual Studio Windows
Hive View Any (browser based)
Beeline client Linux, Unix, macOS X, or Windows
REST API   Linux, Unix, macOS X, or Windows
Windows PowerShell   Windows

HiveQL language reference

HiveQL language reference is available in the language manual.

Hive and data structure

Hive understands how to work with structured and semi-structured data. For example, text files where the fields are delimited by specific characters. The following HiveQL statement creates a table over space-delimited data:

CREATE EXTERNAL TABLE log4jLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data. For more information, see the How to use a custom JSON SerDe with HDInsight document.

For more information on file formats supported by Hive, see the Language manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual)

Hive internal tables vs external tables

There are two types of tables that you can create with Hive:

  • Internal: Data is stored in the Hive data warehouse. The data warehouse is located at /hive/warehouse/ on the default storage for the cluster.

    Use internal tables when one of the following conditions applies:

    • Data is temporary.
    • You want Hive to manage the lifecycle of the table and data.
  • External: Data is stored outside the data warehouse. The data can be stored on any storage accessible by the cluster.

    Use external tables when one of the following conditions applies:

    • The data is also used outside of Hive. For example, the data files are updated by another process (that doesn't lock the files.)
    • Data needs to remain in the underlying location, even after dropping the table.
    • You need a custom location, such as a non-default storage account.
    • A program other than hive manages the data format, location, and so on.

For more information, see the Hive Internal and External Tables Intro blog post.

User-defined functions (UDF)

Hive can also be extended through user-defined functions (UDF). A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL. For an example of using UDFs with Hive, see the following documents:

Example data

Hive on HDInsight comes pre-loaded with an internal table named hivesampletable. HDInsight also provides example data sets that can be used with Hive. These data sets are stored in the /example/data and /HdiSamples directories. These directories exist in the default storage for your cluster.

Example Hive query

The following HiveQL statements project columns onto the /example/data/sample.log file:

DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log'
    GROUP BY t4;

In the previous example, the HiveQL statements perform the following actions:

Statement Description
DROP TABLE If the table already exists, delete it.
CREATE EXTERNAL TABLE Creates a new external table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format.
ROW FORMAT Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.
STORED AS TEXTFILE LOCATION Tells Hive where the data is stored (the example/data directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory.
SELECT Selects a count of all rows where the column t4 contains the value [ERROR]. This statement returns a value of 3 because there are three rows that contain this value.
INPUT__FILE__NAME LIKE '%.log' Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that we should only return data from files ending in .log.

Note

External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process, or MapReduce operation.

Dropping an external table does not delete the data, it only deletes the table definition.

To create an internal table instead of external, use the following HiveQL:

CREATE TABLE IF NOT EXISTS errorLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
STORED AS ORC;
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7 
    FROM log4jLogs WHERE t4 = '[ERROR]';

These statements perform the following actions:

Statement Description
CREATE TABLE IF NOT EXISTS If the table doesn't exist, create it. Because the EXTERNAL keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive.
STORED AS ORC Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.
INSERT OVERWRITE ... SELECT Selects rows from the log4jLogs table that contains [ERROR], and then inserts the data into the errorLogs table.

Note

Unlike external tables, dropping an internal table also deletes the underlying data.

Improve Hive query performance

Apache Tez

Apache Tez is a framework that allows data intensive applications, such as Hive, to run much more efficiently at scale. Tez is enabled by default. The Apache Hive on Tez design documents contains details about the implementation choices and tuning configurations.

Low Latency Analytical Processing (LLAP)

LLAP (sometimes known as Live Long and Process) is a new feature in Hive 2.0 that allows in-memory caching of queries.

HDInsight provides LLAP in the Interactive Query cluster type. For more information, see the Start with Interactive Query document.

Scheduling Hive queries

There are several services that can be used to run Hive queries as part of a scheduled or on-demand workflow.

Azure Data Factory

Azure Data Factory allows you to use HDInsight as part of a Data Factory pipeline. For more information on using Hive from a pipeline, see the Transform data using Hive activity in Azure Data Factory document.

Hive jobs and SQL Server Integration Services

You can use SQL Server Integration Services (SSIS) to run a Hive job. The Azure Feature Pack for SSIS provides the following components that work with Hive jobs on HDInsight.

For more information, see the Azure Feature Pack documentation.

Apache Oozie

Apache Oozie is a workflow and coordination system that manages Hadoop jobs. For more information on using Oozie with Hive, see the Use Apache Oozie to define and run a workflow document.

Note

Phoenix Storage Handler for Hive is not supported in HDInsight

Next steps

Now that you've learned what Hive is and how to use it with Hadoop in HDInsight, use the following links to explore other ways to work with Azure HDInsight.