How to connect to an On-Prem FTP server using Azure Databricks

Subhadip Roy 31 Reputation points
2024-10-18T04:49:48.2866667+00:00

There are some csv/xlsx files on an On-Prem FTP server that Azure Databricks needs to connect to and load into a Delta table.

Please advise:

  1. What are the prerequisites for connecting to an On-Prem FTP server?
  2. Does any firewall/IP need to be whitelisted?
  3. Are there any other infrastructure/security settings to consider?
  4. A sample code snippet to connect to it and read the files.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2024-10-18T08:11:31.65+00:00

    @Subhadip Roy - Thanks for the question and for using the MS Q&A platform.

    To connect to an On-Prem FTP server using Azure Databricks, you can use the ftplib library in Python.

    Here are the steps to connect to an On-Prem FTP server and read the files:

    Step 1: Import the ftplib library:

    import ftplib
    

    Step 2: Connect to the FTP server using the FTP constructor:

    ftp = ftplib.FTP('ftp.server.com', 'username', 'password')
    
    Replace ftp.server.com with the hostname or IP address of your FTP server, and username and password with your FTP credentials.
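
    If the server requires FTPS (explicit FTP over TLS), ftplib also provides an FTP_TLS class. This is a minimal sketch with a hypothetical hostname and credentials; whether TLS is required depends on your server's configuration:

    from ftplib import FTP_TLS
    
    # Hypothetical hostname/credentials - replace with your own
    ftps = FTP_TLS('ftp.server.com')
    ftps.login('username', 'password')
    ftps.prot_p()  # switch the data connection to TLS as well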

    Step 3: Change to the directory where your files are located using the cwd method:

    ftp.cwd('/path/to/files')
    

    Replace /path/to/files with the path to the directory where your files are located.

    Step 4: List the files in the directory using the nlst method:

    files = ftp.nlst()
    

    This will return a list of filenames in the directory.
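
    Since the files of interest are csv/xlsx, you may want to filter the listing; a small sketch:

    # Keep only the CSV and Excel files from the directory listing
    files = [f for f in ftp.nlst() if f.lower().endswith((".csv", ".xlsx"))]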

    Step 5: Download the files using the retrbinary method:

    for filename in files:
        # Write each remote file to the local working directory in binary mode
        with open(filename, "wb") as local_file:
            ftp.retrbinary("RETR " + filename, local_file.write)
    

    This will download each file in the directory to the driver's local disk.

    Regarding the prerequisites for connecting to an On-Prem FTP server: the FTP server must be reachable over the network from the Azure Databricks cluster, which for an on-prem server typically means a VNet-injected workspace with ExpressRoute or VPN connectivity to your on-premises network. If the FTP server is behind a firewall, you may need to whitelist the outbound IP addresses of your Azure Databricks workspace; note that cluster node IPs are not stable, so routing egress through a NAT gateway or firewall with a fixed public IP makes whitelisting practical. You also need valid FTP credentials for the server.
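
    As a quick sanity check before wiring everything up, the sketch below (using a hypothetical hostname) tests raw TCP reachability of the FTP port from a Databricks notebook; a timeout or refusal usually points to a firewall or routing issue:

    import socket
    
    # Hypothetical host/port - replace with your FTP server's address
    ftp_host = "ftp.onprem.example.com"
    ftp_port = 21
    
    try:
        # A plain TCP connect is enough to verify network reachability
        with socket.create_connection((ftp_host, ftp_port), timeout=10):
            print(f"Reachable: {ftp_host}:{ftp_port}")
    except OSError as e:
        print(f"Not reachable: {e}")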

    OR

    In the code snippet below, we use Databricks secret scope credentials to retrieve the IP address, password, port, and username from Key Vault. We also specify the FTP site's file location, with the local path pointing to a Unity Catalog volume.

    # Import the FTP module from the ftplib library
    from ftplib import FTP
    import os
    
    # Retrieve FTP credentials from Key Vault using a Databricks secret scope
    ip_address = dbutils.secrets.get(scope='dev', key='ip-address')
    password = dbutils.secrets.get(scope='dev', key='password')
    port = int(dbutils.secrets.get(scope='dev', key='port'))
    username = dbutils.secrets.get(scope='dev', key='username')
    
    # Set FTP server path
    ftp_path = "/Path/To/FTP"
    
    # Set local path (a Unity Catalog volume) for storing downloaded files
    local_path = "/Volumes/bronze_dev/ftp-files/ftp_files"
    
    # Create an FTP object and connect using the provided credentials
    ftp = FTP()
    ftp.connect(ip_address, port)
    ftp.login(username, password)
    print("connected")
    
    # Change the working directory on the FTP server
    ftp.cwd(ftp_path)
    
    # List files in the current directory on the FTP server
    files_list = ftp.nlst()
    
    # Iterate through each file in the FTP server directory
    for file in files_list:
        # Display progress message
        print('Downloading file from remote server: ' + file)
        # Open a local file in the volume for writing in binary mode
        with open(os.path.join(local_path, file), "wb") as local_file:
            # Download the file from the FTP server and write it locally
            ftp.retrbinary("RETR " + file, local_file.write)
    
    # Close the FTP connection
    ftp.quit()
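
    Once the files land in the Unity Catalog volume, loading the CSVs into a Delta table is straightforward. A minimal sketch, assuming the volume path above and a hypothetical target table name bronze_dev.ftp.ftp_data:

    # Read all CSV files downloaded to the volume
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/Volumes/bronze_dev/ftp-files/ftp_files/"))
    
    # Append into a Delta table (the table name here is hypothetical)
    df.write.format("delta").mode("append").saveAsTable("bronze_dev.ftp.ftp_data")
    

    For the .xlsx files, Spark's CSV reader will not work; you can read them with pandas (pd.read_excel) or a library such as spark-excel and then convert to a Spark DataFrame before writing to Delta.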

    For more details, refer to the following articles, which explain how to connect to an On-Prem FTP server using Azure Databricks:

    Uploading Files from FTP Server to Databricks Unity Catalog.

    How to connect and process FTP Data from Azure Databricks.

    Disclaimer: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful". And, if you have any further queries, do let us know.

