Hi @John Li
Thanks for reaching out to Microsoft Q&A.
I understand your concerns about the security of Python libraries used in your Synapse Analytics Spark jobs. Here's how Synapse manages libraries and addresses your specific questions:
1. How is Synapse managing all the Python libraries it is using during runtime?
Synapse offers multiple ways to manage libraries for Spark jobs:
Pool-level Libraries: You can specify libraries to pre-install for all sessions in a Spark pool using (see the example below):
- requirements.txt: lists the packages to install and, optionally, their pinned versions.
- environment.yml: defines a Conda environment with specific dependencies.
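For illustration, a minimal requirements.txt and environment.yml might look like the following; the package names and versions are placeholders, not recommendations:

```text
# requirements.txt - one package per line, optionally pinned to a version
pandas==1.5.3
requests
```

```yaml
# environment.yml - Conda environment definition applied to the pool
name: synapse-pool-env
channels:
  - conda-forge
dependencies:
  - numpy=1.23
  - pip:
      - azure-storage-blob==12.19.0
```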
Workspace Packages: Upload custom or private libraries (.whl or .jar files) to your workspace and assign them to specific pools. These libraries are then available to all sessions in those pools.
Session-level Libraries: Install libraries for a specific notebook session only using:
- Conda environment.yml: upload this file within the notebook to create a temporary, session-scoped environment.
- pip magic (%pip): run %pip install within your notebook to install libraries dynamically, as shown in the sketch below.
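For example, a session-scoped install from a notebook cell might look like this; %pip is the notebook magic Synapse supports for session-level Python packages, and the version pin is only a placeholder:

```python
# Run in a Synapse notebook cell: installs the package for this session only,
# without changing the pool-level environment. The version pin is a placeholder.
%pip install requests==2.31.0
```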
Security Management:
- All uploaded libraries are scanned for potential vulnerabilities before installation.
- You can also configure additional security controls like whitelisting specific versions or blocking known-vulnerable packages.
2. Are the libraries getting changed (version change) over time? I am particularly interested in Spark 3.3, which is the version we are using.
- Spark Runtime: The runtime version is set per Spark pool, so you can confirm that your pool is on Spark 3.3 in the Spark pool settings under "Apache Spark version".
- Pool-level Libraries: The versions specified in your requirements.txt or environment.yml for the pool control the library versions. These remain fixed unless you update the file and redeploy the pool.
- Workspace Packages: Versions in uploaded packages are static unless you overwrite them with newer versions.
- Session-level Libraries: You control versions dynamically within each notebook session (see the sketch after this list for a way to snapshot what is installed).
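If you want to verify whether library versions drift over time, one lightweight option is to print the versions visible to a notebook session and compare the output across runs. A minimal sketch using only the Python standard library:

```python
# Minimal sketch: print every package visible to the session's Python
# environment so you can snapshot and diff versions over time.
import importlib.metadata

for dist in sorted(importlib.metadata.distributions(),
                   key=lambda d: (d.metadata["Name"] or "").lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```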
Recommendations:
- Regularly review and update your pool-level libraries and workspace packages to address vulnerabilities and benefit from new features.
- Leverage session-level libraries for experimentation or testing while maintaining control over dependencies.
- Utilize Synapse's security controls to minimize vulnerabilities.
Additional Resources:
- Manage libraries for Apache Spark pools: https://learn.microsoft.com/en-us/azure/synapse-analytics/
- Azure Synapse Analytics security: https://learn.microsoft.com/en-us/azure/synapse-analytics/guidance/security-white-paper-introduction
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".