Integrate OneLake with Azure HDInsight

Azure HDInsight is a managed, cloud-based service for big data analytics that helps organizations process large amounts of data. This tutorial shows how to connect to OneLake with a Jupyter notebook from an Azure HDInsight cluster.

Using Azure HDInsight

To connect to OneLake with a Jupyter notebook from an HDInsight cluster:

  1. Create an HDInsight (HDI) Apache Spark cluster. Follow these instructions: Set up clusters in HDInsight.

    1. While providing cluster information, remember your Cluster login Username and Password, as you need them to access the cluster later.

    2. Create a user-assigned managed identity (UAMI) by following the instructions in Create for Azure HDInsight - UAMI, and select it as the identity in the Storage screen.

      Screenshot showing where to enter the user assigned managed identity in the Storage screen.

  2. Give this UAMI access to the Fabric workspace that contains your items. For help deciding which role to choose, see Workspace roles.

    Screenshot showing where to select an item in the Manage access panel.

  3. Navigate to your lakehouse and find the names of your workspace and lakehouse. You can find them in the URL of your lakehouse or in the Properties pane for a file.

  4. In the Azure portal, find your cluster and select the notebook.

    Screenshot showing where to find your cluster and notebook in the Azure portal.

  5. Enter the credential information you provided while creating the cluster.

    Screenshot showing where to enter your credential information.

  6. Create a new Apache Spark notebook.

  7. Copy the workspace and lakehouse names into your notebook and build the OneLake URL for your lakehouse. You can then read any file under this path.

    # Build the OneLake path; replace 'Workspace Name' and 'Lakehouse Name' with your own values.
    fp = 'abfss://' + 'Workspace Name' + '@onelake.dfs.fabric.microsoft.com/' + 'Lakehouse Name' + '/Files/'
    # Read a CSV file from the lakehouse into a Spark DataFrame.
    df = spark.read.format("csv").option("header", "true").load(fp + "test1.csv")
    df.show()
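
    The same path pattern also reaches the lakehouse's Tables section. As a minimal sketch, assuming Delta Lake support is available on your cluster and a table named test_table already exists in the lakehouse (both are assumptions, not part of this tutorial):

    # Hypothetical: read a Delta table from the lakehouse's Tables section.
    # Assumes Delta Lake is configured on the cluster and test_table exists.
    table_path = fp.replace('/Files/', '/Tables/') + 'test_table'
    table_df = spark.read.format("delta").load(table_path)
    table_df.show()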
    
  8. Try writing some data into the lakehouse.

    # Write the DataFrame back to OneLake; Spark saves out.csv as a directory of part files.
    df.write.format("csv").save(fp + "out.csv")
    
  9. Verify that your data was written successfully by checking your lakehouse or by reading the newly written file, as shown below.
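
    For example, a minimal read-back check, reusing fp and the out.csv path from the previous steps:

    # Read the output written in the previous step to confirm it landed in OneLake.
    verify_df = spark.read.format("csv").load(fp + "out.csv")
    verify_df.show()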

You can now read and write data in OneLake using your Jupyter notebook in an HDI Spark cluster.