Azure to AWS

Sourav 80 Reputation points
2024-05-20T08:39:35.8366667+00:00

Hello

We need to transfer files in batches from ADLS to AWS (an S3 bucket) for a SAS application hosted by a third party, and we need to ensure data security and follow best practices. My understanding is that ADF can create a linked service for AWS S3, but it does not support S3 as a sink, which means we cannot choose AWS S3 as the destination. Instead, we plan to use Databricks: create a GPG encryption key, send the files encrypted in batches on a scheduled trigger, and decrypt them in AWS. ADF uses HTTPS for data in transit, and data in ADLS is encrypted at rest. What would you suggest as the best approach to ensure the files are encrypted both at rest and in transit?

Thanks,

Sourav

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Amira Bedhiafi 18,106 Reputation points
    2024-05-20T11:47:58.1733333+00:00

    Hello Sourav, from what you explained, you want to securely transfer files from ADLS to an AWS S3 bucket for your SAS application, with the data encrypted both at rest and in transit.

    You need to set up encryption at rest for both ADLS (using Azure Storage Service Encryption) and AWS S3 (using either Amazon S3 managed keys (SSE-S3) or AWS KMS-managed keys (SSE-KMS)).
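
On the S3 side, encryption at rest and in transit can also be enforced by the bucket itself rather than left to each uploader. The following is a minimal sketch, assuming the bucket owner (here the third party hosting the SAS application) is willing to attach a bucket policy. The function name `build_bucket_policy` and the bucket name `your-s3-bucket` are placeholders for illustration; the condition keys `aws:SecureTransport` and `s3:x-amz-server-side-encryption` are standard AWS policy condition keys.

```python
import json


def build_bucket_policy(bucket: str) -> dict:
    """Return an S3 bucket policy that rejects plain-HTTP requests and
    uploads that do not request SSE-KMS encryption."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any request that does not use TLS (encryption in transit)
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Deny uploads that do not ask for SSE-KMS (encryption at rest)
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


# Placeholder bucket name; print the policy as JSON for review
policy = build_bucket_policy("your-s3-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON could be reviewed and attached to the bucket by whoever administers the AWS account. If SSE-S3 is preferred over SSE-KMS, the expected header value in the second statement would be `AES256` instead of `aws:kms`.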

    Since ADF does not support AWS S3 as a sink directly, we can use the following method:

    • Create a linked service in ADF for your ADLS
    • Use Databricks to temporarily stage the data before transferring it to S3

    If you are opting for Databricks for batch processing and encryption:

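    • Create a GPG key pair for encryption and decryption. Store the public key securely in Databricks and the private key securely in AWS.
    • Create a Databricks notebook that reads data from ADLS, encrypts it, and saves the encrypted data to a staging area in Databricks. Note that the sample below uses symmetric Fernet encryption from the cryptography package rather than GPG, so the same key must be available on the AWS side to decrypt.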
    
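    from cryptography.fernet import Fernet

    # Generate a symmetric key for encryption. Do this once and store the key
    # in a secret store; without the same key the data cannot be decrypted.
    key = Fernet.generate_key()
    cipher_suite = Fernet(key)

    def encrypt_data(data: bytes) -> bytes:
        """Encrypt raw bytes and return a Fernet token."""
        return cipher_suite.encrypt(data)

    # Read the whole file from the ADLS mount through the /dbfs local path
    # (dbutils.fs.head returns only the first 64 KB by default)
    with open("/dbfs/mnt/adls/path/to/data.csv", "rb") as f:
        data = f.read()

    # Encrypt the data
    encrypted_data = encrypt_data(data)

    # Save the encrypted data (a Fernet token is URL-safe base64 text);
    # the final True overwrites an existing file
    dbutils.fs.put("/mnt/databricks/staging/encrypted_data.csv", encrypted_data.decode(), True)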
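
The notebook code above uses Fernet, which is symmetric. For the GPG public-key encryption described in the bullets, a rough sketch could call the `gpg` command line tool from the notebook instead. This assumes the `gpg` binary is installed on the cluster and the recipient's public key has already been imported (for example with `gpg --import`). The helper names, the paths, and the recipient address are hypothetical.

```python
import subprocess
from typing import List


def build_gpg_encrypt_cmd(src: str, dst: str, recipient: str) -> List[str]:
    """Build the gpg command that encrypts src for recipient and writes dst."""
    return [
        "gpg",
        "--batch",                  # never prompt; suitable for scheduled jobs
        "--yes",                    # overwrite the output file if it exists
        "--trust-model", "always",  # assumption: the imported key is trusted
        "--recipient", recipient,
        "--output", dst,
        "--encrypt", src,
    ]


def gpg_encrypt(src: str, dst: str, recipient: str) -> None:
    """Run gpg and raise CalledProcessError if encryption fails."""
    subprocess.run(build_gpg_encrypt_cmd(src, dst, recipient), check=True)


# Show the command that would run for one staged file (placeholder values)
cmd = build_gpg_encrypt_cmd(
    "/dbfs/mnt/adls/path/to/data.csv",
    "/dbfs/mnt/databricks/staging/data.csv.gpg",
    "sas-team@example.com",
)
print(" ".join(cmd))
```

With this approach only the public key lives in Databricks, so a compromised notebook cannot decrypt files that were already sent. Using `--trust-model always` skips the web-of-trust check and is only reasonable when the key was imported from a source you have verified.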
    
    

    Then upload the encrypted files from Databricks to S3. The example below copies the file with dbutils.fs.cp; installing the AWS CLI on the Databricks cluster is only needed if you prefer to upload with aws s3 cp.

    
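    # Read AWS credentials from a Databricks secret scope instead of hardcoding
    # them in the notebook (the scope and key names are placeholders)
    access_key = dbutils.secrets.get(scope="your-scope", key="aws-access-key-id")
    secret_key = dbutils.secrets.get(scope="your-scope", key="aws-secret-access-key")

    # Pass the credentials to the S3A filesystem; environment variables set from
    # Python at this point are typically not visible to it
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

    # Upload file to S3
    dbutils.fs.cp("dbfs:/mnt/databricks/staging/encrypted_data.csv", "s3a://your-s3-bucket/encrypted_data.csv")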
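
As an alternative to the copy step above, the upload could go through boto3 so that server-side encryption is requested explicitly for each object. This is a sketch under the assumption that boto3 is installed on the cluster and AWS credentials are available to it; the function names and the KMS key ARN format shown in the comment are illustrative. `ServerSideEncryption` and `SSEKMSKeyId` are existing `ExtraArgs` options of `upload_file`.

```python
from typing import Optional


def build_upload_args(kms_key_id: Optional[str] = None) -> dict:
    """Return ExtraArgs for upload_file: SSE-KMS when a key is given,
    otherwise SSE-S3 (AES256)."""
    if kms_key_id:
        return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}
    return {"ServerSideEncryption": "AES256"}


def upload_encrypted_file(local_path: str, bucket: str, key: str,
                          kms_key_id: Optional[str] = None) -> None:
    """Upload one staged file to S3 over HTTPS with server-side encryption."""
    import boto3  # imported lazily so build_upload_args works without boto3

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key, ExtraArgs=build_upload_args(kms_key_id))


# Example call (not executed here; values are placeholders):
# upload_encrypted_file(
#     "/dbfs/mnt/databricks/staging/encrypted_data.csv",
#     "your-s3-bucket",
#     "encrypted_data.csv",
#     kms_key_id="arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID",
# )
print(build_upload_args())
```

Requesting encryption on every upload keeps the transfer working even if the bucket later adopts a policy that rejects unencrypted objects.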
    

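    You can use ADF to create a pipeline that triggers the Databricks notebook on your desired schedule. Once the data is in S3, decrypt it on the AWS side with the GPG private key (this applies to files encrypted with GPG; files encrypted with Fernet need the same Fernet key instead).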

    
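    # Download the encrypted file from S3 first (gpg cannot read s3:// URLs)
    aws s3 cp s3://your-s3-bucket/encrypted_data.csv encrypted_data.csv

    # Decrypt the file using the GPG private key
    gpg --decrypt --output decrypted_data.csv encrypted_data.csv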
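
The scheduling part of this step could be expressed as an ADF schedule trigger attached to the pipeline that runs the notebook. The sketch below builds such a trigger definition as a Python dict, following the JSON layout ADF uses for schedule triggers. The pipeline name `AdlsToS3Batch`, the trigger name derived from it, and the start time are made-up placeholders.

```python
import json


def build_schedule_trigger(pipeline_name: str, start_time: str,
                           frequency: str = "Day", interval: int = 1) -> dict:
    """Return an ADF schedule trigger definition for one pipeline."""
    return {
        "name": f"{pipeline_name}-schedule",
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {
                "recurrence": {
                    "frequency": frequency,   # e.g. Minute, Hour, Day, Week, Month
                    "interval": interval,     # run every <interval> <frequency>
                    "startTime": start_time,  # ISO 8601 timestamp
                    "timeZone": "UTC",
                }
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "type": "PipelineReference",
                        "referenceName": pipeline_name,
                    }
                }
            ],
        },
    }


# Daily batch starting at 02:00 UTC (placeholder values)
trigger = build_schedule_trigger("AdlsToS3Batch", "2024-06-01T02:00:00Z")
print(json.dumps(trigger, indent=2))
```

The resulting JSON is the kind of definition that can be pasted into the trigger editor in ADF Studio or deployed with your usual tooling, after adjusting the names and schedule.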
    

    More links:

    https://www.geeksforgeeks.org/how-do-i-add-a-s3-bucket-to-databricks/

    https://community.boomi.com/s/article/Inserting-Data-into-Databricks-with-AWS-S3

    https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-atrest

    https://learn.microsoft.com/en-us/azure/storage/common/storage-service-encryption

    https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-overview


  2. Sumarigo-MSFT 44,586 Reputation points Microsoft Employee
    2024-05-28T04:54:59.46+00:00

    @Sourav Yes, you may use AWS Transfer Family SFTP with SSH keys, which uses port 22, to transfer files from Azure Data Lake Storage (ADLS) Gen2 to the AWS S3 bucket (I haven't reproduced this in my lab). SFTP is encrypted in transit by default, so GPG encryption can be optional. ADF uses HTTPS, so it is also encrypted in transit by default. Both ADLS and the AWS S3 bucket are encrypted at rest, and you can choose to use platform-managed keys or customer-managed keys.

    To use AWS Transfer Family SFTP with ADF, you can create an SFTP linked service with MFA authentication that connects to the AWS SFTP endpoint linked to the AWS S3 bucket. You can then use Azure Databricks to transfer files in batches, invoked from an ADF pipeline.

    Please note that you need to configure the network security settings to allow traffic between AWS and Azure. If you do not want data to be transferred over the public internet, you can use a private peering link between AWS Direct Connect and Azure ExpressRoute for higher security.
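
To make the linked service idea more concrete, here is a hedged sketch of what an ADF SFTP linked service definition for an AWS Transfer Family endpoint might look like, built as a Python dict. It uses the `SshPublicKey` authentication type with the private key held in Azure Key Vault; for the MFA option mentioned above, the ADF SFTP connector's `MultiFactor` type would be used instead and would also need a password. The linked service name, host, user name, Key Vault linked service, and secret name are all hypothetical.

```python
import json


def build_sftp_linked_service(host: str, user_name: str, key_vault_ls: str,
                              secret_name: str, host_key_fingerprint: str) -> dict:
    """Return an ADF SFTP linked service definition using SSH key auth."""
    return {
        "name": "AwsTransferSftp",  # hypothetical linked service name
        "properties": {
            "type": "Sftp",
            "typeProperties": {
                "host": host,
                "port": 22,
                "authenticationType": "SshPublicKey",
                "userName": user_name,
                # Private key is read from Azure Key Vault, not stored inline
                "privateKeyContent": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": key_vault_ls,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
                # Validate the server's host key to guard against spoofing
                "skipHostKeyValidation": False,
                "hostKeyFingerprint": host_key_fingerprint,
            },
        },
    }


# All values below are placeholders
linked_service = build_sftp_linked_service(
    host="your-transfer-endpoint.example.com",
    user_name="adf-transfer-user",
    key_vault_ls="YourKeyVaultLinkedService",
    secret_name="sftp-private-key",
    host_key_fingerprint="<server-host-key-fingerprint>",
)
print(json.dumps(linked_service, indent=2))
```

Keeping host key validation enabled means the pipeline fails if the endpoint presents an unexpected key, which is usually preferable for a transfer that crosses cloud providers.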

    Please let us know if you have any further queries. I’m happy to assist you further. 

