ADF and azure-batch incorrectly upload batch task dependency files when stored inside a directory

Dave 1 Reputation point
2021-12-20T23:17:18.793+00:00

Does anyone know how to correctly preserve the directory structure of supporting files stored on an ADLS Gen2 storage account that are used in a custom ADF batch Python activity?

I have an ADF pipeline that executes a Python script (called run_script.py) on an Ubuntu Linux batch pool using the custom batch activity.

The script depends on multiple supporting .py modules that are stored in a lib directory.
The lib directory is located at the same level as the python script like so:
run_script.py
lib/lib_file1.py
lib/lib_file2.py

The same structure is uploaded to a location on my data lake storage, which gets linked to the batch activity in the ADF pipeline.

The problem is that the pipeline fails to run: Python throws a ModuleNotFoundError (No module named 'lib').
When I inspect the working directory for the job, the directory structure looks something like this:
run_script.py
lib_file1.py
lib_file2.py
lib (empty file)

The files are getting copied to the same level as the main script and the directory structure of the lib folder is not being preserved. Because of this, the script cannot find the corresponding lib module in the correct place, which throws the error.
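
The symptom can be reproduced locally with a minimal sketch like the one below. It assumes the script imports the modules as `lib.lib_file1`; the helper name `can_import_lib` and the temporary folder names are hypothetical and only for illustration. It builds both layouts (the flattened one seen on the Batch node and the nested one the script expects) and checks whether the import succeeds in each.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import_lib(workdir: Path) -> bool:
    """Return True if `lib.lib_file1` is importable with workdir on sys.path."""
    sys.path.insert(0, str(workdir))
    try:
        # Drop cached modules and finder caches so each layout is checked fresh.
        for name in [m for m in sys.modules if m == "lib" or m.startswith("lib.")]:
            del sys.modules[name]
        importlib.invalidate_caches()
        importlib.import_module("lib.lib_file1")
        return True
    except ModuleNotFoundError:
        return False
    finally:
        sys.path.remove(str(workdir))


with tempfile.TemporaryDirectory() as tmp:
    # Layout seen on the Batch node: modules flattened, `lib` is an empty file.
    flat = Path(tmp) / "flat"
    flat.mkdir()
    for name in ("run_script.py", "lib_file1.py", "lib_file2.py", "lib"):
        (flat / name).write_text("")

    # Layout the script expects: modules inside a `lib` directory.
    nested = Path(tmp) / "nested"
    (nested / "lib").mkdir(parents=True)
    (nested / "run_script.py").write_text("")
    for name in ("lib_file1.py", "lib_file2.py"):
        (nested / "lib" / name).write_text("")

    flat_ok = can_import_lib(flat)
    nested_ok = can_import_lib(nested)

print(flat_ok, nested_ok)
```

With the flattened layout the import fails, and with the nested layout it succeeds, which matches the error seen in the pipeline.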

I'd be very grateful if someone could help me resolve this issue or provide more insight. Is there a way to instruct azure-batch to preserve the file structure?

I've simplified the structure to make it easier to explain, but in reality, there are more files and they are also nested in sub-modules.
It is not easy to modify the main script to use a flat file structure.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator
    2021-12-22T00:11:37.45+00:00

    Hi @Dave ,

    Welcome to Microsoft Q&A forum and thanks for posting your query.

    Thanks for the details provided. I'm still trying to understand where the .py and dependency files (the lib directory) are copied from before they reach the ADLS Gen2 location (Azure Blob Storage, an on-premises location, or something else), and what process is used to copy the files and the lib folder.

    I'm assuming that you are using the ADF Copy activity to copy those files and the directory from some source to ADLS Gen2 and then pointing the ADF Custom Batch activity to the new ADLS Gen2 location. Please correct me if I have misunderstood your requirement.

    If my understanding is correct, I would suggest checking this doc, which has detailed information on how to preserve the directory structure using the Copy activity in ADF: ADF Copy activity - recursive and copyBehavior examples


    When you set the recursive property to true under the Copy activity source settings and set copyBehavior to preserveHierarchy under the sink settings, the target folder is created with the same structure as the source.
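
    As a rough sketch, the two settings would sit in the Copy activity definition roughly as shown below. This assumes Binary datasets on ADLS Gen2 for both source and sink, which the thread does not confirm, so the `type` values may differ for your stores. It is written as a Python dict (the variable name `copy_type_properties` is hypothetical) and dumped to JSON so it can be compared with the pipeline's JSON view.

```python
import json

# Assumption: Binary datasets on ADLS Gen2 for both source and sink.
# Only `recursive` and `copyBehavior` come from the answer above.
copy_type_properties = {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": True,  # walk subfolders of the source path
        },
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings",
            "copyBehavior": "PreserveHierarchy",  # keep relative paths at the target
        },
    },
}

print(json.dumps({"typeProperties": copy_type_properties}, indent=2))
```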

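    Copy source settings: recursive set to true (screenshot 159494-image.png not available).

    Copy sink settings: copyBehavior set to preserveHierarchy (screenshot 159495-image.png not available).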

    Hope this info helps. Do let us know if you have any further queries.
