Create and manage a Kubernetes supported self-hosted integration runtime

The new Kubernetes-based self-hosted integration runtime (SHIR) for Linux improves the underlying infrastructure to provide several benefits:

  • Scalability: Ability to scale to hundreds of machines.
  • Performance: Improved performance in scanning workloads.
  • Security (containerized): Ability to have containerized security on a Kubernetes cluster, instead of hosting SHIR directly on a Windows machine.

This article explains how to install and manage a Kubernetes supported self-hosted integration runtime.

Supported data sources

For a list of all supported sources, see the supported data sources for each integration runtime table.

Architecture

At a high level, when a Kubernetes-based SHIR is installed, several pods are automatically created on the nodes of your Kubernetes cluster. The installation can be triggered by a command line tool named IRCTL (described in the following sections). IRCTL connects to the Microsoft Purview service to register the SHIR, and connects to the Kubernetes cluster to install it.

During installation, SHIR images are downloaded from the Microsoft Container Registry (MCR) to the SHIR pods. After installation, the pods in your cluster connect to the Microsoft Purview service to pull scan jobs. When a pod pulls a scan job, it can connect to your on-premises data source to scan the data.

Infographic of the network architecture for the Kubernetes supported self-hosted integration runtime.

Prerequisites

  • A Microsoft Purview account using enterprise data governance solutions.

  • Kubernetes cluster: You need an existing Linux-based Kubernetes cluster, or you need to prepare one. The nodes that host the SHIR can be identified by a node selector, which follows the Kubernetes node selector definition. Minimum configuration:

    • Container type: Linux
    • Kubernetes version: 1.24.9 or above
    • Node OS: Linux based OS running on x86 architecture
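    • Node spec: a minimum of 8 CPU cores, 32 GB of memory, and at least 80 GB of available hard disk space
    • Node count: >= 1 (the count should be fixed; don't enable the cluster autoscaler)
    • Pods available per node: >= 20 (the node's maximum pod count minus the number of other pods that don't belong to the self-hosted IR; for example, a node limited to 110 pods that already runs 15 other pods has 95 available)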

    Note

    The folder /var/irstorage/ on each node is reserved for SHIR, which can read from and write to it. You can retrieve persisted logs from this folder or upload external drivers to it. SHIR creates the folder if it doesn't exist, and the folder isn't deleted when the SHIR is deleted. The container images used by SHIR are managed by Kubernetes garbage collection; SHIR doesn't clean them up. Configure an appropriate garbage collection threshold for your Kubernetes cluster.

  • Kubernetes cluster network: Your Kubernetes cluster must be able to connect to the endpoints listed in networking requirements.

  • Integration runtime command line tool: To manage your Microsoft Purview Kubernetes SHIR locally, you need a command line tool named IRCTL. You can download this tool during the SHIR creation process. For more information, see the IRCTL documentation.

  • Kubernetes context: A Kubernetes context, which contains the Kubernetes cluster information and the user's permissions and credentials for the cluster, is needed to talk to your Kubernetes cluster. To simplify configuring the user's permissions for SHIR management, you can start with the Kubernetes admin role. The context is generated when you set up your Kubernetes cluster and is saved in a config file. Where and how you get this file depends on how you set up the cluster.
    • If you used kubeadm init to set up the Kubernetes cluster, you can find the config file at /etc/kubernetes/admin.conf.
    • If you use AKS, follow the AKS guidance to use the Az PowerShell module command that gets the cluster's credentials onto your local machine. The context can be merged directly into the config file at $HOME/.kube/config.
    • If you used another tool to set up the Kubernetes cluster, refer to the Kubernetes documentation.
    • Once you have the config file of the Kubernetes context, merge it into the config file at $HOME/.kube/config on the machine where you want to run IRCTL commands. Alternatively, you can point an environment variable named KUBECONFIG at the config file of the Kubernetes context. For more information about Kubernetes contexts, see Configure Access to Multiple Clusters.
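The merge described in the last bullet can be sketched with kubectl as follows. This is a minimal sketch, not an official procedure: the function name merge_kubeconfig and both path arguments are placeholders introduced here, and it assumes kubectl is installed on the machine where you run IRCTL.

```shell
# Merge an extra kubeconfig file into the main one.
# Arguments (both are placeholders for paths on your machine):
#   $1  main config, for example "$HOME/.kube/config"
#   $2  config to merge in, for example a copy of /etc/kubernetes/admin.conf
merge_kubeconfig() {
    main_config="$1"
    extra_config="$2"
    merged_tmp="${main_config}.merged.$$"

    # Keep a backup so the original file can be restored if the merge goes wrong.
    cp "$main_config" "${main_config}.bak" || return 1

    # kubectl loads every file listed in KUBECONFIG (colon-separated on Linux);
    # --flatten prints the merged result as one self-contained config.
    KUBECONFIG="${main_config}:${extra_config}" \
        kubectl config view --flatten > "$merged_tmp" || return 1

    mv "$merged_tmp" "$main_config"
}
```

For example, merge_kubeconfig "$HOME/.kube/config" ./admin.conf merges a downloaded admin.conf into your default config. If you prefer not to modify $HOME/.kube/config, exporting KUBECONFIG with the path of the new file achieves the same result for the current shell session.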

Create Kubernetes supported self-hosted integration runtime

To control and manage a Kubernetes SHIR, you can download a command line tool named IRCTL. The following steps create your Kubernetes supported self-hosted integration runtime.

The steps will take you through downloading IRCTL, but for direct links, see the IRCTL documentation.

Set up a Kubernetes supported self-hosted integration runtime

  1. Open the Integration runtimes window in the Microsoft Purview Data Map

  2. Select the + New button

    Screenshot of the integration runtimes window in the Microsoft Purview Data Map.

  3. Select Self-hosted and then select Continue

    Screenshot of the new integration runtime window, with self-hosted selected.

  4. Give your runtime a name, then select the Kubernetes service support toggle to enable

    Screenshot of the new integration runtime window with the Kubernetes toggle enabled.

  5. Select Create

  6. Select Get registration key

    Screenshot of the view integration runtime page with the Get registration key button highlighted.

  7. Copy the key value. You need it to run commands in IRCTL later.

    Tip

    If needed, you can regenerate a key or revoke a generated key.

  8. Select the Download IRCTL and install integration runtime link to download the IRCTL tool. (You can also follow these steps to download IRCTL directly.)

  9. On the machine where you want to run the IRCTL command line, install IRCTL from the download. IRCTL connects to your Kubernetes cluster by using a context from the kubeconfig file. If a context isn't specified, IRCTL uses the current context. You can set the context in one of two ways:

    • Use kubectl to confirm the current context and, if needed, switch to another one:

      kubectl config get-contexts                     # List all contexts configured on the machine

      kubectl config current-context                  # Get the current context name

      kubectl config use-context <name of context>    # Switch the current context

    • Pass the --context option to IRCTL to specify a context from the kubeconfig file.

  10. Run this IRCTL command with the registration key you copied:

    ./irctl create --registration-key <registration key copied from the portal>
    

    Note

    If a node selector isn't specified, the SHIR uses all nodes of the Kubernetes cluster. For AKS, we suggest using the AKS node pool label as the node selector, or you can apply your own labels to the SHIR nodes.

  11. You'll see this printout:

    [Info] Start to create SHIR with Kubernetes context [your-context]......
    [Info] Environment validation passed!
    [Info] Registering SHIR[example-k8s-shir] for Microsoft Purview Account [yourpurviewaccount]......
    [Info] SHIR Registration done!
    [Info] Provisioning SHIR, it may take about 5-30 minutes......done!
    [Info] SHIR creation succeeded!  
    

    Tip

    If the installation is interrupted by Ctrl+C or for any other reason, you can use the following command to monitor the installation progress: ./irctl install status

  12. Once installation is complete, to check the current status of the SHIR, run this command:

    ./irctl describe
    
  13. You can also check the status of your SHIR in the Microsoft Purview portal, on the Integration runtimes page.
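If you want the SHIR to run on specific nodes only, the nodes need a label that a node selector can match. The following is a minimal kubectl sketch for applying and checking such a label. The function name label_shir_nodes and the label purview-shir=true are made up for illustration; the article doesn't show how the node selector is passed to IRCTL, so refer to the IRCTL documentation for that part.

```shell
# Label the nodes that should host the SHIR, then list the matching nodes.
#   $1       label in key=value form ("purview-shir=true" is a made-up example)
#   $2...    node names, as shown by `kubectl get nodes`
label_shir_nodes() {
    label="$1"
    shift
    for node in "$@"; do
        # --overwrite updates the value if the label key is already set
        kubectl label nodes "$node" "$label" --overwrite || return 1
    done
    # Show which nodes the label selector now matches
    kubectl get nodes -l "$label"
}
```

For example, label_shir_nodes purview-shir=true aks-shirpool-27141791-vmss000000 aks-shirpool-27141791-vmss000001 labels two nodes and lists them. To find labels that already exist, such as the node pool label on AKS nodes, run kubectl get nodes --show-labels.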

Set up a scan with external drivers

When scanning some data sources, you need to install the corresponding driver on the machine where the SHIR is installed so that Microsoft Purview can connect to the data source. The following example uses a Db2 scan. Refer to the respective connector article for specific prerequisites.

Note

Data sources that need these external drivers will have the information listed in their prerequisites.

In this example, we'll be installing the Db2 driver. Steps for other drivers will be similar.

  1. First, install the integration runtime.

  2. Download the driver (each source lists its own driver). For example, you can find the Db2 driver here: Connect to and manage Db2.

  3. Upload the driver to each node for your integration runtime. You can use a command like this:

    ./irctl storage upload --source jdbc_sqlj/db2_driver --destination driver/db2
    

    A successful upload confirmation will look like this:

    ========== Context ========== 
    Kubernetes Context             : k8s-shir-test-cluster 
    Purview Account                : test-purview-1 
    Self-hosted Intrgration Runtime: k8s-shir-demo 
    ========== Progress ========== 
    Processing 2/2 nodes... 
    aks-shirpool-27141791-vmss000000: SUCCEEDED 
    aks-shirpool-27141791-vmss000001: SUCCEEDED 
    ========== Results ========== 
    jdbc_sqlj/db2_driver -> /var/irstorage/driver/db2 
    

    Note

    If you replace nodes or scale out to new nodes, you'll need to upload the external driver again.

  4. Verify the files uploaded with this command:

    ./irctl storage list driver/db2
    

    You should see a response like this:

    ========== Context ========== 
    Kubernetes Context             : k8s-shir-test-cluster 
    Purview Account                : test-purview-1 
    Self-hosted Intrgration Runtime: k8s-shir-demo 
    ========== Progress ========== 
    Processing 2/2 nodes... 
    aks-shirpool-27141791-vmss000000: SUCCEEDED 
    aks-shirpool-27141791-vmss000001: SUCCEEDED 
    ========== Results ========== 
    Node: aks-shirpool-27141791-vmss000000 - Succeeded 
    /var/irstorage/driver/db2 
    total 9364 
    drwxr-xr-x    2 root     root          4096 May 15 14:23 . 
    drwxr-xr-x    3 root     root          4096 May 15 14:23 .. 
    -rwxrwxr-x    1 root     root       6568346 May 15 14:23 db2jcc4.jar 
    Node: aks-shirpool-27141791-vmss000001 - Succeeded 
    /var/irstorage/driver/db2 
    total 9364 
    drwxr-xr-x    2 root     root          4096 May 15 14:23 . 
    drwxr-xr-x    3 root     root          4096 May 15 14:23 .. 
    -rwxrwxr-x    1 root     root       6568346 May 15 14:23 db2jcc4.jar 
    
  5. Create a scan, and set DriverLocation to the destination value from step 3 (driver/db2 in this example).

    Screenshot of the scan set up window, showing the driver location listed as driver/db2.
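Because drivers have to be uploaded again whenever nodes are replaced or added, it can be convenient to wrap steps 3 and 4 in one helper. The sketch below only chains the two irctl storage commands shown above. The function name and the IRCTL variable (the path to the irctl binary) are introduced here for illustration, and the sketch assumes that irctl exits with a nonzero status on failure, which the article doesn't confirm.

```shell
# Upload a driver folder to every SHIR node, then list it to confirm.
#   $1  local source path (for example jdbc_sqlj/db2_driver)
#   $2  destination under /var/irstorage (for example driver/db2)
# IRCTL is a hypothetical variable holding the path to the irctl binary.
upload_and_verify_driver() {
    irctl_bin="${IRCTL:-./irctl}"
    source_path="$1"
    destination="$2"

    # Assumption: irctl returns a nonzero exit status when the upload fails.
    "$irctl_bin" storage upload --source "$source_path" --destination "$destination" || return 1

    # List the uploaded files on each node so they can be checked by eye.
    "$irctl_bin" storage list "$destination"
}
```

For the Db2 example, the call would be upload_and_verify_driver jdbc_sqlj/db2_driver driver/db2. Review the per-node results in the output rather than relying on the exit status alone.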

High availability and scalability

You can assign multiple nodes of the Kubernetes cluster to the SHIR for high availability by using the node selector when you install the Kubernetes supported self-hosted integration runtime. The benefits of having multiple nodes are:

  • Higher availability of the self-hosted integration runtime, so that it's no longer a single point of failure for scans.
  • More concurrent scans. Each node can run many scans at the same time. You can manually scale out the nodes of the Kubernetes cluster if you need more concurrent scans.
  • When scanning some sources like Azure Blob, Azure Data Lake Storage Gen2, and Azure Files, each scan run can use multiple nodes to boost the scan performance. For other sources, scans are executed on only one of the nodes.

The capacity of the Kubernetes supported self-hosted integration runtime can be changed by manually scaling the nodes of the Kubernetes cluster out or in.

Note

You must upload all necessary drivers for scanning on each new node.
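To see how many nodes are currently available to the SHIR, you can count the Ready nodes that match your node selector. This is a rough kubectl sketch: the function name count_ready_nodes and the example label are hypothetical, and it relies on the default column layout of kubectl get nodes.

```shell
# Count the Ready nodes that match a label selector.
#   $1  label selector, for example "purview-shir=true" (hypothetical label)
count_ready_nodes() {
    selector="$1"
    # With --no-headers each line is: NAME STATUS ROLES AGE VERSION.
    # STATUS can be a list such as "Ready,SchedulingDisabled", so only an
    # exact "Ready" is counted as available.
    kubectl get nodes -l "$selector" --no-headers \
        | awk '$2 == "Ready" { ready++ } END { print ready + 0 }'
}
```

A result of 1 means the SHIR node is still a single point of failure for scans. After you scale out, run the check again and upload the required drivers to the new nodes.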

Networking requirements

| Domain name | Outbound ports | Description |
| --- | --- | --- |
| Public cloud: <tenantID>-api.purview-service.microsoft.com; Azure Government: <tenantID>-api.purview-service.microsoft.us; China: <tenantID>-api.purview-service.microsoft.cn | 443 | Required to connect to Microsoft Purview service. If you use Microsoft Purview Private Endpoints, this endpoint is covered by account private endpoint. |
| Public cloud: <purview_account>.purview.azure.com; Azure Government: <purview_account>.purview.azure.us; China: <purview_account>.purview.azure.cn | 443 | Required to connect to Microsoft Purview service. If you use Microsoft Purview Private Endpoints, this endpoint is covered by account private endpoint. |
| Public cloud: <managed_storage_account>.blob.core.windows.net or <ingestion_storage_account>.*.blob.storage.azure.net; Azure Government: <managed_storage_account>.blob.core.usgovcloudapi.net or <ingestion_storage_account>.blob.core.usgovcloudapi.net; China: <managed_storage_account>.blob.core.chinacloudapi.cn or <ingestion_storage_account>.blob.core.chinacloudapi.cn | 443 | Required to connect to the Microsoft Purview managed Azure Blob storage account. |
| Public cloud: <managed_storage_account>.queue.core.windows.net or <ingestion_storage_account>.*.queue.storage.azure.net; Azure Government: <managed_storage_account>.queue.core.usgovcloudapi.net or <ingestion_storage_account>.queue.core.usgovcloudapi.net; China: <managed_storage_account>.queue.core.chinacloudapi.cn or <ingestion_storage_account>.queue.core.chinacloudapi.cn | 443 | Required to connect to the Microsoft Purview managed Azure Queue storage account. |
| Public cloud: *.compute.governance.azure.com; Azure Government: *.compute.governance.azure.us; China: *.compute.governance.azure.cn | 443 | Required to connect to the Microsoft Purview service. Currently wildcard is required as there's no dedicated resource. |
| mcr.microsoft.com | 443 | Required to download images. |
| *.data.mcr.microsoft.com | 443 | Required to download images. |
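Before installing, it can help to confirm that a machine on the cluster network reaches the fixed host names in this table. The sketch below is illustrative only: the function names and the cloud keywords public, usgov, and china are chosen here, and the wildcard and storage account rows are left out because their host names depend on your environment. It assumes curl is available.

```shell
# Print the fixed host names from the table for one cloud.
#   $1  cloud: public, usgov, or china (keywords chosen for this sketch)
#   $2  tenant ID
#   $3  Microsoft Purview account name
purview_hosts() {
    case "$1" in
        public) suffix="com" ;;
        usgov)  suffix="us" ;;
        china)  suffix="cn" ;;
        *) echo "unknown cloud: $1" >&2; return 1 ;;
    esac
    echo "${2}-api.purview-service.microsoft.${suffix}"
    echo "${3}.purview.azure.${suffix}"
    echo "mcr.microsoft.com"
}

# Succeed if an HTTPS request to the host on port 443 gets any response.
# Any HTTP status counts; connection or TLS failures make curl exit nonzero.
can_reach_443() {
    curl --silent --output /dev/null --connect-timeout 10 "https://$1/"
}
```

You can pipe the output of purview_hosts into a loop that calls can_reach_443 for each host name and prints the ones that fail. A pass here doesn't cover the storage or wildcard endpoints, which still need to be allowed separately.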

Note

Depending on the sources you want to scan, you also need to allow other domains and outbound ports for other Azure or external sources.

Version

Typically, we release one new minor version of self-hosted integration runtime every month, which includes features, enhancements, and bug fixes.

Each version of the self-hosted integration runtime expires in one year.

How to check the current version

You can check the version of your Kubernetes self-hosted integration runtime either on the portal, or with the IRCTL.

Portal

  1. In the Microsoft Purview portal, navigate to the Data Map.
  2. Select Integration runtimes
  3. The fourth column in your integration runtime's description line will be Version, and you can check the version there.

IRCTL (1.1.0 and above)

The describe command will return the integration runtime's version.

./irctl describe

Auto-update

Starting from version 1.1.0, the Kubernetes self-hosted integration runtime supports auto-update, which is enabled by default. This feature ensures your integration runtime is automatically upgraded to the latest Microsoft-managed version approximately once a month.
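If you script checks around the runtime version, for example to confirm that an installation is at least 1.1.0 before relying on auto-update, a version comparison like the following can be used. It's a generic sketch: the function name is made up, it assumes dotted numeric version strings, and you supply the current version yourself (for example, as read from the output of ./irctl describe, whose exact format isn't shown in this article).

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example, version_at_least 1.2.0 1.1.0 succeeds, while version_at_least 1.0.9 1.1.0 fails. Version sort is used instead of plain string comparison so that 1.10.0 is treated as newer than 1.9.0.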

Opt-out

We recommend keeping auto-update enabled to benefit from the newest features and enhancements. However, you can opt out of auto-update by using IRCTL. The auto-update configuration persists through reinstallation, so you don't need to disable it with each installation.

./irctl config set autoUpdate.enabled false
./irctl config view

Auto-update version vs latest version

To ensure stability, the auto-update version usually trails the latest version by about one month. The auto-update version is managed by Microsoft.

If you want to upgrade your integration runtime to a newer version, perform a manual upgrade by using the IRCTL of that specific version.
