Create Iceberg Catalog in Apache Flink® on HDInsight on AKS
Note
We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.
Only basic support will be available until the retirement date.
Important
This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include additional legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, submit a request on AskHDInsight with the details, and follow us for more updates on the Azure HDInsight Community.
Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines like Apache Flink, using a high-performance table format that works just like a SQL table. Apache Iceberg supports both Apache Flink’s DataStream API and Table API.
In this article, we learn how to use an Iceberg table managed in the Hive catalog, with Apache Flink on an HDInsight on AKS cluster.
Prerequisites
- You're required to have an operational Flink cluster with Secure Shell; learn how to create a cluster.
- Refer to this article on how to use the CLI from Secure Shell on the Azure portal.
Add dependencies
Script actions
Using script actions, upload the hadoop-hdfs-client and iceberg-flink connector JARs to the Flink cluster's Job Manager and Task Manager.
Go to Script actions on the cluster's Azure portal page.
Upload the hadoop-hdfs-client JAR.
Once you launch Secure Shell (SSH), download the required dependencies to the SSH node, to illustrate the Iceberg table managed in the Hive catalog.
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.17/1.4.0/iceberg-flink-runtime-1.17-1.4.0.jar -P $FLINK_HOME/lib
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-column/1.12.2/parquet-column-1.12.2.jar -P $FLINK_HOME/lib
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.3.4/hadoop-hdfs-client-3.3.4.jar -P $FLINK_HOME
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$FLINK_HOME/hadoop-hdfs-client-3.3.4.jar
Start the Apache Flink SQL Client
A detailed explanation is available on how to get started with the Flink SQL client using Secure Shell on the Azure portal. Start the SQL client as described in that article by running the following command.
./bin/sql-client.sh
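As a quick sanity check that the SQL client is up, you can list the available catalogs. Before any setup, only Flink's built-in default_catalog is expected in the output.

SHOW CATALOGS;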
Create Iceberg Table managed in Hive catalog
The following steps illustrate how to create a Flink-Iceberg catalog using the Hive catalog.
CREATE CATALOG hive_catalog WITH (
'type'='iceberg',
'catalog-type'='hive',
'uri'='thrift://hive-metastore:9083',
'clients'='5',
'property-version'='1',
'warehouse'='abfs://container@storage_account.dfs.core.windows.net/iceberg-output');
Note
- In the above step, the container and storage account need not be the same as those specified during cluster creation.
- If you want to specify another storage account, you can update core-site.xml with fs.azure.account.key.<account_name>.dfs.core.windows.net: <azure_storage_key> using configuration management.
USE CATALOG hive_catalog;
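To verify that the session switched to the new catalog, you can print the current catalog; this statement is standard Flink SQL and should return hive_catalog.

SHOW CURRENT CATALOG;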
Add dependencies to server classpath
ADD JAR '/opt/flink-webssh/lib/iceberg-flink-runtime-1.17-1.4.0.jar';
ADD JAR '/opt/flink-webssh/lib/parquet-column-1.12.2.jar';
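To confirm the JARs were registered with the session, you can list them; SHOW JARS is part of the Flink SQL client and should print the two paths added above.

SHOW JARS;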
Create Database
CREATE DATABASE iceberg_db_2;
USE iceberg_db_2;
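Optionally, confirm that the database was created and is now the active one; both statements are standard Flink SQL.

SHOW DATABASES;
SHOW CURRENT DATABASE;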
Create Table
CREATE TABLE `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`
(
id BIGINT COMMENT 'unique id',
data STRING
)
PARTITIONED BY (data);
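To inspect the resulting schema, you can describe the table; the output should list the id and data columns defined above.

DESCRIBE `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`;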
Insert Data into the Iceberg Table
INSERT INTO `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2` VALUES (1, 'a');
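To read the row back within the same session, you can run a bounded query over the table; a minimal sketch, assuming the SQL client's default execution settings on the cluster.

SELECT * FROM `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`;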
Output of the Iceberg Table
You can view the Iceberg table output in the ABFS container; under the warehouse path, Iceberg writes the table's files into data and metadata directories.
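Iceberg also exposes table metadata, such as snapshots, as queryable metadata tables. A hedged sketch, assuming the metadata-table syntax supported by recent Iceberg Flink runtimes (the exact form can vary by Iceberg version):

SELECT * FROM `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2$snapshots`;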
Reference
- Apache Flink Website
- Apache, Apache Hive, Hive, Apache Iceberg, Iceberg, Apache Flink, Flink, and associated open source project names are trademarks of the Apache Software Foundation (ASF).