
Is it possible to install OS libraries in Apache Spark pool nodes?

2024-01-04T10:12:07.0133333+00:00

Hello,

I want to install some R packages on Azure Synapse Spark pool nodes, and they require some operating system libraries that are not currently present. I am working with Azure Synapse Runtime for Apache Spark 3.3 (GA), which runs on Ubuntu 18.04.

In the documentation, I found it is possible to install additional Python modules or R packages, for example, but I did not find anything related to changes in the OS.

I wonder whether it is possible to install additional OS libraries on the nodes.

In my case, I need to install libgdal-dev and libudunits2-dev libraries.

If so, could you provide some guidelines on how to do it?

Thanks very much in advance

Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.


Answer accepted by question author

  1. shirivo 160 Reputation points
    2024-01-04T17:33:33.22+00:00

    Hello @Calabria Montero, Salvador (SGRE SE D FP&DC WEF)

    In Azure Synapse Analytics, you can install additional Python modules or R packages in Apache Spark pool nodes. However, the installation of Operating System libraries like libgdal-dev and libudunits2-dev is not directly supported.

    Azure Synapse Analytics provides built-in support for many popular open-source R packages, including tidyverse. You can add packages to a Spark pool or remove them from it. Pool-level libraries are available to all notebooks and jobs running on the pool.

    To install R packages, you can manage workspace packages. In Synapse, workspace packages can be custom or private R tar.gz files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. Once assigned, these workspace packages are installed automatically on all Spark pool sessions started on the corresponding pool.

    OS libraries cannot be installed directly because the environment is managed and does not provide the sudo access that such an installation requires, so you will need a workaround. One option is to look for an R package that bundles the necessary binaries, or for an alternative package that does not depend on these libraries.
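Before looking for a workaround, it can help to confirm whether the shared libraries are really missing from the node. The following is a minimal sketch, assuming it is run from a Python notebook cell on the pool. The helper name `find_shared_libraries` is hypothetical, and the short names `gdal` and `udunits2` are assumptions about how the runtime libraries behind libgdal-dev and libudunits2-dev are named.

```python
from ctypes.util import find_library


def find_shared_libraries(names):
    """Map each short library name to what the loader finds, or None if absent."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    # "gdal" and "udunits2" are assumed short names for the runtime libraries
    # behind libgdal-dev and libudunits2-dev.
    for name, found in find_shared_libraries(["gdal", "udunits2"]).items():
        print(f"{name}: {found if found else 'not found'}")
```

A hit only shows that the runtime shared object is visible to the loader. The `-dev` packages also ship headers, which an R package compiled from source would still need, so a positive result does not guarantee that the R package will build.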

    Important: Changes at the OS level could affect the stability and security of the Spark pool nodes, which is one of the reasons such operations are not permitted. If you have a specific requirement, I recommend reaching out to Azure support, who might be able to provide a solution or workaround for your use case.

    More info:

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages
    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries
    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-r-language

    Wishing you well,

    @shirivo
