Securing args of the Python script in Azure Data Factory Custom Azure Batch Activity

Adam Kupiec 196 Reputation points
2024-07-29T07:58:14.4633333+00:00

Hi,

Imagine running a simple ADF Pipeline that consists of 2 steps:


  1. Fetching a sensitive value, such as a Storage Account URL, from Azure Key Vault
  2. Executing a Python script that uses argparse, where this value is passed as a command-line argument with the expression
       @{activity('SAURL').output.value}
       
    

Both the Web Activity and the Custom Activity have secure input and secure output enabled.

THE PROBLEM:

Even though it seems fine from the ADF perspective, when I go to Batch Account -> Jobs -> Tasks, select the task and open the 'Properties' section, I can see the full command line, including the value of my SAURL.

QUESTIONS:

  1. How bad is it from a security perspective?
  2. What is the proper workaround? My current solution is to move the sensitive data into the code and fetch it with the azure.keyvault.secrets library, but that forces me to hard-code the name of the Key Vault, which is also bad. It would also require adjusting all scripts (from the argparse versions to ones calling AKV).
  3. Therefore, is there any option to hash the command line in the task (from the Batch or ADF side)?
Azure Data Factory

Accepted answer
  1. Amira Bedhiafi 24,556 Reputation points
    2024-07-30T07:35:28.83+00:00

    From a security perspective, exposing sensitive information such as a Storage Account URL in the command-line arguments of an Azure Batch task is a significant risk. The command line is stored with the task and can be logged, so anyone with access to the Batch account or the ADF logs can read the value.

    In practice:

    • Anyone with access to the Azure Batch account or logs could see the command-line arguments, exposing the Storage Account URL.
    • Exposing sensitive information may violate security policies or compliance requirements.

    I found a few possible workarounds:

    Using Azure Key Vault from within the Script:

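    This is generally the most secure method, because the sensitive value never appears in the command line. As you mentioned, it requires updating all scripts. The Key Vault name does not have to be hardcoded, though: it is not a secret, so it can still be passed as an ordinary argument while only the secret value is fetched inside the script.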

    Using Environment Variables:

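    Another approach is to pass sensitive data via environment variables instead of command-line arguments. Azure Batch allows you to specify environment variables for tasks, and they do not appear in the task's command line. Two caveats: environment settings are still part of the task definition, so check whether they are visible to users who can read the task in your Batch account, and verify that the environmentVariables property shown below is accepted by the Custom activity schema in your ADF version. The documented way to pass values to a Custom activity is extendedProperties, which ADF writes to an activity.json file in the task's working directory.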

    
    {
      "type": "Microsoft.DataFactory/factories/pipelines",
      "name": "YourPipelineName",
      "properties": {
        "activities": [
          {
            "name": "ExecuteBatchJob",
            "type": "Custom",
            "typeProperties": {
              "command": "python your_script.py",
              "environmentVariables": {
                "SAURL": "@{activity('SAURL').output.value}"
              }
            }
          }
        ]
      }
    }
    
    

    In your Python script, you can then access this environment variable using the os module:

    
    import os

    # Raises KeyError if SAURL is unset, so a misconfigured task fails fast.
    storage_account_url = os.environ["SAURL"]
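
    To avoid rewriting every argparse-based script at once, one option is a small resolver that still accepts the argument but falls back to other sources. The following is only a sketch under assumptions: the --saurl flag and the resolve_saurl and read_extended_property helper names are hypothetical, and the activity.json layout (typeProperties, then extendedProperties, with SecureString entries wrapped in an object that has a value field) should be checked against the Custom activity documentation for your environment.

```python
import argparse
import json
import os
from pathlib import Path


def read_extended_property(name, activity_file="activity.json"):
    """Return an extendedProperties entry from activity.json, or None.

    Assumption: ADF writes the Custom activity definition to activity.json
    in the task's working directory.
    """
    path = Path(activity_file)
    if not path.is_file():
        return None
    activity = json.loads(path.read_text(encoding="utf-8"))
    props = activity.get("typeProperties", {}).get("extendedProperties", {})
    value = props.get(name)
    # Assumption: SecureString entries look like {"type": ..., "value": ...}.
    if isinstance(value, dict):
        value = value.get("value")
    return value


def resolve_saurl(argv=None, environ=None, activity_file="activity.json"):
    """Resolve SAURL from --saurl, then the environment, then activity.json."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Hypothetical flag; kept optional so existing callers keep working.
    parser.add_argument("--saurl", default=None)
    args = parser.parse_args(argv)

    value = (
        args.saurl
        or environ.get("SAURL")
        or read_extended_property("SAURL", activity_file)
    )
    if not value:
        raise SystemExit("SAURL was not supplied by any source")
    return value
```

    A script would call resolve_saurl() once at startup. Existing pipelines that still pass the argument keep working, and they can be moved to one of the other sources one at a time.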
    
    

    Encrypting Arguments:

    Encrypt sensitive data before passing it as a command-line argument and then decrypt it within the script. This hides the plain value in the task properties, but the script still has to obtain the decryption key at runtime, so the key itself must be stored and delivered securely.

    
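    from cryptography.fernet import Fernet

    # Encryption (done once, outside the Batch task). Keep the key somewhere
    # safe: whoever holds it can decrypt the argument.
    key = Fernet.generate_key()
    cipher_suite = Fernet(key)
    encrypted_url = cipher_suite.encrypt(b"your_storage_account_url")

    # Decryption (inside the Batch script, using the same key)
    decrypted_url = cipher_suite.decrypt(encrypted_url).decode('utf-8')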
    
    

    Use Azure Managed Identity:

    Use Azure Managed Identity to access Azure Key Vault directly from your batch script without hardcoding credentials. This way, you can fetch the secrets at runtime securely.

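    Assign a managed identity to the Batch pool whose nodes run the script (Batch pools use user-assigned identities, so verify this against your pool configuration) and grant that identity access to the Key Vault. In your script, use the Azure SDK to read the secret.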

    
    from azure.identity import ManagedIdentityCredential
    from azure.keyvault.secrets import SecretClient

    # For a user-assigned identity, pass client_id="<identity-client-id>".
    credential = ManagedIdentityCredential()
    client = SecretClient(vault_url="https://your-key-vault-name.vault.azure.net/", credential=credential)

    # Fetch the secret at runtime; it never appears on the command line.
    secret = client.get_secret("your-secret-name")
    storage_account_url = secret.value
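
    The Key Vault name does not have to be hardcoded: it is not a secret, so it can stay a normal argparse argument while only the secret value is fetched at runtime. A minimal sketch, assuming hypothetical flag names (--vault-url, --secret-name, --identity-client-id) and hypothetical helper names; the Azure SDK is imported lazily so that the argument handling can be exercised without it.

```python
import argparse


def build_secret_client(vault_url, client_id=None):
    """Create a Key Vault SecretClient authenticated with a managed identity."""
    # Lazy imports: only needed when the script really talks to Key Vault.
    from azure.identity import ManagedIdentityCredential
    from azure.keyvault.secrets import SecretClient

    # client_id selects a user-assigned identity; None uses the default one.
    credential = ManagedIdentityCredential(client_id=client_id)
    return SecretClient(vault_url=vault_url, credential=credential)


def parse_args(argv=None):
    """Parse the non-sensitive locator arguments (hypothetical flag names)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--vault-url", required=True)
    parser.add_argument("--secret-name", required=True)
    parser.add_argument("--identity-client-id", default=None)
    return parser.parse_args(argv)


def fetch_secret(args, client_factory=build_secret_client):
    """Fetch the secret value; only its name and vault URL came from the CLI."""
    client = client_factory(args.vault_url, args.identity_client_id)
    return client.get_secret(args.secret_name).value
```

    With this shape, the Batch task properties would show only the vault URL and the secret name, for example python your_script.py --vault-url https://your-key-vault-name.vault.azure.net/ --secret-name your-secret-name, and never the secret value itself.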
    
