Create Iceberg Catalog in Apache Flink® on HDInsight on AKS

Note

We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.

Only basic support will be available until the retirement date.

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines like Apache Flink, using a high-performance table format that works just like a SQL table. Apache Iceberg supports both Apache Flink’s DataStream API and Table API.

In this article, we learn how to use an Iceberg table managed in the Hive catalog, with Apache Flink on an HDInsight on AKS cluster.

Prerequisites

  • You're required to have an operational Flink cluster with secure shell; learn how to create a cluster.
    • Refer to this article on how to use the CLI from Secure Shell in the Azure portal.

Add dependencies

Script actions

  1. Upload the hadoop-hdfs-client and iceberg-flink connector jars into the Flink cluster's Job Manager and Task Manager.

  2. Go to Script actions on the cluster page in the Azure portal.

  3. Upload the hadoop-hdfs-client jar.

    Screenshot showing how to add script action.

    Screenshot showing script action added successfully.

  4. Once you launch Secure Shell (SSH), download the dependencies required on the SSH node to illustrate the Iceberg table managed in the Hive catalog.

    # Download the Iceberg Flink runtime connector into the Flink lib directory
    wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.17/1.4.0/iceberg-flink-runtime-1.17-1.4.0.jar -P $FLINK_HOME/lib
    # Download the Parquet column jar into the Flink lib directory
    wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-column/1.12.2/parquet-column-1.12.2.jar -P $FLINK_HOME/lib
    # Download the Hadoop HDFS client jar and add it to the Hadoop classpath
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.3.4/hadoop-hdfs-client-3.3.4.jar -P $FLINK_HOME
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$FLINK_HOME/hadoop-hdfs-client-3.3.4.jar
    

A detailed explanation is given on how to get started with the Flink SQL Client using Secure Shell on the Azure portal. Start the SQL Client as described in that article by running the following command.

./bin/sql-client.sh

Create Iceberg Table managed in Hive catalog

With the following steps, we illustrate how you can create a Flink-Iceberg catalog using the Hive catalog.

  CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://hive-metastore:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='abfs://container@storage_account.dfs.core.windows.net/iceberg-output');

Note

  • In the above step, the container and storage account need not be the same as those specified during cluster creation.
  • If you want to specify another storage account, update core-site.xml with fs.azure.account.key.<account_name>.dfs.core.windows.net: <azure_storage_key> using configuration management.

Use the newly created catalog:

  USE CATALOG hive_catalog;
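
As an optional check, you can list the registered catalogs and the current catalog from the SQL client; both are standard Flink SQL statements and should show hive_catalog.

  SHOW CATALOGS;
  SHOW CURRENT CATALOG;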

Add dependencies to server classpath

ADD JAR '/opt/flink-webssh/lib/iceberg-flink-runtime-1.17-1.4.0.jar';
ADD JAR '/opt/flink-webssh/lib/parquet-column-1.12.2.jar';
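
To confirm the jars are registered in the current session, you can optionally list them; SHOW JARS is a standard Flink SQL client statement that returns the jars added with ADD JAR.

SHOW JARS;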

Create Database

  CREATE DATABASE iceberg_db_2;
  USE iceberg_db_2;

Create Table

    CREATE TABLE `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`
    (
    id BIGINT COMMENT 'unique id',
    data STRING
    )
    PARTITIONED BY (data);
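
Optionally, you can verify the table definition with a standard Flink SQL DESCRIBE statement before loading data.

    DESCRIBE `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`;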

Insert Data into the Iceberg Table

    INSERT INTO `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2` VALUES (1, 'a');
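
Before checking the files in storage, you can optionally read the row back from the SQL client; this is a standard Flink SQL query that runs as a job in the session.

    SELECT * FROM `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`;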

Output of the Iceberg Table

You can view the Iceberg table output in the ABFS container.

Screenshot showing output of the Iceberg table in ABFS.

Reference