Connect to Infoworks

Important

This feature is in Public Preview.

Infoworks DataFoundry is an automated enterprise data operations and orchestration system that runs natively on Azure Databricks and uses it to provide an easy solution for data onboarding, an important first step in operationalizing your data lake. DataFoundry automates not only data ingestion but also the key functionality that must accompany ingestion to establish a foundation for analytics. Data onboarding with DataFoundry automates:

  • Data ingestion: from all enterprise and external data sources
  • Data synchronization: change data capture (CDC) to keep data synchronized with the source
  • Data governance: cataloging, lineage, metadata management, audit, and history

Here are the steps for using Infoworks with Azure Databricks.

Step 1: Generate a Databricks personal access token

Infoworks authenticates with Azure Databricks using an Azure Databricks personal access token. To generate a personal access token, follow the instructions in Generate a personal access token.

Note

As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.
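
As a rough illustration of how a personal access token is presented to Azure Databricks, the sketch below builds (but does not send) a REST request that carries the token as a bearer credential. The workspace URL and token value are hypothetical placeholders, and the helper name `build_authenticated_request` is an assumption made for this example; it is not part of Infoworks or of any Databricks library.

```python
import urllib.request


def build_authenticated_request(workspace_url: str, token: str, api_path: str) -> urllib.request.Request:
    """Build (but do not send) a REST request that carries the token as a bearer credential."""
    url = workspace_url.rstrip("/") + "/" + api_path.lstrip("/")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical placeholder values; substitute your own workspace URL and token.
request = build_authenticated_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",
    "dapiEXAMPLETOKEN",
    "/api/2.0/clusters/list",
)

# urllib.request.urlopen(request) would send the request; it is omitted here
# so that the sketch runs without network access or a real token.
print(request.full_url)
```

In practice, read the token from a secret store or an environment variable rather than embedding it in source code.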

Step 2: Set up a cluster to support integration needs

Infoworks writes data to an Azure Data Lake Storage path, and the Azure Databricks integration cluster reads data from that location. Therefore, the integration cluster requires secure access to the Azure Data Lake Storage path.

Secure access to an Azure Data Lake Storage path

To secure access to data in Azure Data Lake Storage (ADLS), you can use an Azure storage account access key (recommended) or an Azure service principal.

Use an Azure storage account access key

You can configure a storage account access key on the integration cluster as part of the Spark configuration. Ensure that the storage account has access to the ADLS container and file system used for staging data and the ADLS container and file system where you want to write the Delta Lake tables. To configure the integration cluster to use the key, follow the steps in Access Azure Data Lake Storage Gen2 and Blob Storage.
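
The following sketch shows what the resulting Spark configuration line might look like. It assumes the ADLS Gen2 (ABFS) property name `fs.azure.account.key.<storage-account>.dfs.core.windows.net` and a Databricks secret reference in place of the raw key. The storage account, secret scope, and secret name are hypothetical placeholders, and the linked article remains the authoritative source for the exact property to set.

```python
def account_key_spark_conf(storage_account: str, secret_scope: str, secret_name: str) -> str:
    """Return one Spark configuration line that supplies the storage account access key.

    The key is referenced through a secret ({{secrets/<scope>/<name>}}) instead of
    being pasted into the cluster configuration in plain text.
    """
    prop = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    secret_ref = "{{secrets/" + secret_scope + "/" + secret_name + "}}"
    return f"{prop} {secret_ref}"


# Hypothetical names, used only for illustration.
conf_line = account_key_spark_conf("mystorageaccount", "infoworks-scope", "adls-access-key")
print(conf_line)
```

Paste the printed line into the Spark configuration field of the integration cluster.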

Use an Azure service principal

You can configure a service principal on the Azure Databricks integration cluster as part of the Spark configuration. Ensure that the service principal has access to the ADLS container used for staging data and the ADLS container where you want to write the Delta tables. To configure the integration cluster to use the service principal, follow the steps in Access ADLS Gen2 with service principal.
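
As a sketch of the service principal option, the helper below assembles the OAuth properties that the ABFS driver expects for client-credential authentication. The application ID, directory (tenant) ID, storage account, and secret names are hypothetical placeholders, and the function name is an assumption made for this example. Confirm the exact property set against the linked article before you use it.

```python
from typing import Dict


def service_principal_spark_conf(
    storage_account: str,
    application_id: str,
    directory_id: str,
    secret_scope: str,
    secret_name: str,
) -> Dict[str, str]:
    """Return the OAuth client-credential properties for one ADLS Gen2 storage account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": application_id,
        # The client secret is referenced through a secret, not stored in plain text.
        f"fs.azure.account.oauth2.client.secret.{host}":
            "{{secrets/" + secret_scope + "/" + secret_name + "}}",
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


# Hypothetical identifiers, used only for illustration.
sp_conf = service_principal_spark_conf(
    "mystorageaccount", "<application-id>", "<directory-id>", "infoworks-scope", "sp-client-secret"
)
for key, value in sp_conf.items():
    print(f"{key} {value}")
```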

Specify the cluster configuration

  1. Set Cluster Mode to Standard.

  2. Set Databricks Runtime Version to a Databricks runtime version.

  3. Enable Auto Optimize by adding the following properties to your Spark configuration:

    spark.databricks.delta.optimizeWrite.enabled true
    spark.databricks.delta.autoCompact.enabled true
    
  4. Configure your cluster depending on your integration and scaling needs.

For cluster configuration details, see Configure clusters.
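
If you create the cluster through the Clusters REST API rather than the UI, the settings above could be expressed as a request body along these lines. This is a minimal sketch: the cluster name, runtime version, node type, and worker count are hypothetical placeholders that you must replace, and only the two Auto Optimize properties come from this article.

```python
import json
from typing import Any, Dict


def integration_cluster_spec(
    cluster_name: str, spark_version: str, node_type_id: str, num_workers: int
) -> Dict[str, Any]:
    """Return a cluster specification with Auto Optimize enabled in the Spark configuration."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "spark_conf": {
            "spark.databricks.delta.optimizeWrite.enabled": "true",
            "spark.databricks.delta.autoCompact.enabled": "true",
        },
    }


# Placeholder values; choose the runtime version and sizing that fit your integration needs.
spec = integration_cluster_spec("infoworks-integration", "<runtime-version>", "<node-type>", 2)
print(json.dumps(spec, indent=2))
```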

Step 3: Obtain JDBC and ODBC connection details to connect to a cluster

For the steps to obtain the JDBC URL and HTTP path, see Retrieve the connection details. To connect an Azure Databricks cluster to Infoworks, you need the following JDBC/ODBC connection properties:

  • JDBC URL
  • HTTP Path
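
The HTTP path also appears inside the JDBC URL as the `httpPath` property, so you can check that the two values you copied agree. The sketch below splits a JDBC URL into its prefix and its semicolon-separated properties. The sample URL is a hypothetical placeholder in the style of a cluster JDBC URL; the exact format depends on the driver version, so treat it as an assumption rather than a reference.

```python
from typing import Dict, Tuple


def parse_jdbc_url(jdbc_url: str) -> Tuple[str, Dict[str, str]]:
    """Split '<prefix>;key=value;...' into the prefix and a dictionary of properties."""
    prefix, *pairs = jdbc_url.split(";")
    props: Dict[str, str] = {}
    for pair in pairs:
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        props[key] = value
    return prefix, props


# Hypothetical sample URL; the host, workspace ID, and cluster ID are made up.
sample_url = (
    "jdbc:spark://adb-1234567890123456.7.azuredatabricks.net:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcde123;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)
prefix, props = parse_jdbc_url(sample_url)
print(prefix)
print(props["httpPath"])
```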

Step 4: Get Infoworks for Azure Databricks

Go to Infoworks to learn more and get a demo.

Additional resources

Support